Combinatorial Design of Universal DNA Tag Systems

CPSC 536A Course Project

Progress Report
March 6, 2001

Tim Braun, James Gauthier, Tam Huynh, Louie van de Lagemaat


Extending the Ben-Dor Approach
 

We began the project with the main goal of doing some serious theoretical work on the problem of generating universal DNA tags.  Based on the notion that the Ben-Dor, et. al. method contains an overly simple model of melting temperature (the so-called 2-4 rule), we decided to research the feasibility of using the more accurate Nearest Neighbour Model in the generation of DNA tags.  One of the intricate (and beautiful) results of the Ben-Dor paper is the proof of the near-optimality of the Universal DNA-Tag Generation Method (based on the 2-4 rule).  From the beginning, we were aware that any mathematical proof of optimality or near-optimality of our method would be difficult  --  particularly because the mathematical proof by Ben-Dor, et. al. relies on certain special properties of the model they use.  This makes a simple "adaptation" of Ben-Dor's proof to our method unlikely or impossible.

So far, we have made little progress on a theoretical bound of the "near-optimality" of our algorithm.  This has lead us in an empirical direction  --  to develop and implement the algorithm based on the Nearest Neighbour Model and to examine its behaviour.  We hope that continued experimentation with our Nearest Neighbour method will help us to explore its time/space complexity  --  so that we may, at the very least, hint at its performance.  We have a few ideas on how we might compare the efficiency of our algorithm with that of Ben-Dor's.  Also, we continue to research the properties of the Nearest Neighbour approach to DNA tag generation and have a few suggestions that will help us to further refine our algorithm.
 

Random Tag Generation

One version of our implementation uses a purely random method to generate tags.  This method carries out random tag generation as follows...
 


The melting temperature calculation relies on Needleman-Wunsch alignment.  It is perfomed like this...
 

 
* J. SantaLucia, Jr., H.T. Allawi, and P.A. Seneviratne.  Improved Nearest-Neighbor Parameters for Predicting DNA Duplex Stability. (1996) Biochemistry (35)  pp. 3555-3562.


Our melting temperature calculation is subject to two assumptions.  The first assumption inherent in this process is that identities found between two distinct tags by the global alignment of two tags correspond to physiological pairing between one tag and the other antitag.  Another assumption is that pairing happens without pseudoknots or loops whereby the 3' end of the tag binds upstream of the 5' end.
 

Implementation of Random Tag Generation

We have successfully implemented an algorithm for producing pseudo-random DNA tags.  Our algorithm generates tags as expected and preliminary results indicate that as more tags are generated, the rate of tag production decreases.  For higher allowed tag simliarities (higher melting temperature) more tags are generated.  Therefore, we have succeeded in implementing an algorithm that produces tags using a more accurate temperature model than prior work by Ben-Dor, et. al. (of course, they used the 2-4 rule).  We would like to directly compare the performance of our tag generator with that of Ben-Dor, et. al.  However, this may not be possible, as the temperature parameter in the Ben-Dor model does not straightforwardly correlate to a melting temperature on the absolute scale (degrees Kelvin).  This is too substantial a difference for direct comparison.  Despite this concern, we have attempted to contact Ben-Dor in order to obtain an implementation of his method and we are anticipating the results from the Universal DNA Tag Generation group.  Our thinking is that we can use the results of the Ben-Dor algorithm as input for our algorithm.  By looking at the Ben-Dor tags that "survive" our method, we will be able to comment on the relative performance of the two algorithms.
 

Other Approaches to Tag Generation

Although the random tag generation approach has produced reasonable results, we felt we should examine other approaches and look for potential improvements in tag generation.  One such approach is the idea of applying the group theoretic concept of "necklaces" to aid in efficient DNA tag generation.
 

Necklaces

In search for a more systematic way of generating tags, we came across a class of combinatorial objects called necklaces. Our goal being to generate distinct tags, these objects seemed promising at first glance.

A k-ary necklace is an equivalence class of k-ary strings under rotation.

The illustration below (taken from [1]) shows the the six binary necklaces with 4 "beads" and their equivalence classes of strings.

In [2] it is shown that necklaces can be generated in constant amortized time by the so-called FKM- Algorithm by Frederickson, Kessler, and Maiorana.  We implemented this algorithm in order to generate 4-ary necklaces, which can be regarded as a subset of DNA-Tags.

The number of k-ary necklaces of length n can be computed by the following formula, (x) being the Euler totient function:


We can use this formula to get a sense of how many 4-ary necklaces exist, compared to the total number of possible tags:
 
 
length of sequence
# of possible tags t # of necklaces n n/t
1 4 4 1
2 16 10 0.625
3 64 24 0.375
4 256 70 0.2734375
5 1024 208 0.203125
6 4096 700 0.170898438
7 16384 2344 0.143066406
8 65536 8230 0.125579834
9 262144 29144 0.111175537
10 1048576 104968 0.100105286
11 4194304 381304 0.090909958
12 16777216 1398500 0.083357096
13 67108864 5162224 0.076923132
14 268435456 19175140 0.071432963
15 1073741824 71582944 0.066666812
16 4294967296 268439590 0.062500963
17 17179869184 1010580544 0.05882353
18 68719476736 3817763740 0.055555774
19 274877906944 14467258264 0.052631579
20 1099511627776 54975633976 0.050000048
21 4398046511104 2.09431E+11 0.047619048
22 17592186044416 7.99645E+11 0.045454556
23 70368744177664 3.05951E+12 0.043478261
24 281474976710656 1.17281E+13 0.041666669

 

For n=10, approximately 10% of all possible tags are necklaces. If they were all distinct enough when evaluated with our implementation of the Nearest Neighbor-Model, this number would be suffiently high.
 

Comparison of Random Tag Approach and Necklace Approach

After implementing both the random tag approach and the necklace approach, we noticed immediately that the random method outperforms the necklace-based method.
Because new necklaces are systematically generated from the previous necklaces through slight modification, a new necklace often contains parts that haven't been altered
and therefore match perfectly with their ancestors. Such a tag will then of course be rejected. We believe this to be the main reason why the necklace algorithm has a 'slow start' and never performs better than the randomized algorithm: a) the necklaces are only a 10% fraction of the space of possible tags and b) the differences between subsequent necklaces are too small - much computational effort is wasted here. For future work, we are focussing on improving this situation.

To illustrate the previous discussion,  a graph is shown below,  plotting the number of found, valid tags ('tags accepted') against the number of 'tries' needed ('generated tags') for both algorithms and various critical Tmelt(). Note the smooth curve for the random algorithm and the jagged, stair-like apperance of the necklace alg. Every time a 'step' is reached there,  the necklaces have changed enough to yield a new group of tags - and after that, the necklaces are too similar again, so the number of found tags is low until the next 'step'. Also important to note is the logarithmic scale on both axis: although there seems to be only a slight advantage for the random tag algorithm in the end, it actually yields about 2-3 times more tags than the necklaces.
 

Legend: The different lines are labelled using the system : "algorithm"-"tag length"-"crit. Tmelt()".
                 Thus "random-10-20" means that this line shows the behaviour of the random algorithm with a tag length of 10 bases and a critical Tmelt() of 20 degree Celsius.
 

Continuing Work

There are several areas where we plan to direct our efforts.

First, we would like to refine our algorithm.  The current implementation of the melting temperature function is extremely inefficient  --  due to an extensive use of string operations.  This did not hurt us in the beginning of the project, but as we increase the tag lengths and implement more complex algorithms, we need to optimize the Tmelt() function to gain as much speed as possible.

Then, we will be able to do a more through evaluation of the impact of different tag lengths on the performance of the two exisiting algorithms  --  as well as for any new algorithm that we might create.  Because the search- (and necklace-) space grows very rapidly as the tag length goes up, eventually, the necklace algorithm will no longer be able to cover it completely.  However, a variation of this method might actually perform better there than the random method. are expected to change the relative

Early on, we abandoned the theoretical aspect of the project.  It appears as though a theoretical proof of optimality or "near optimality" of our method will be difficult  --  if, in fact, it is possible at all.  Still, we plan to revisit the theoretical work and see if we can make any progress on the "road" to proof.  Perhaps we will be able to at least hint in the direction of a proof of near optimality for one of our methods.

Also, we hope to obtain an implementation of the Ben-Dor algorithm (either from Ben-Dor himself, although so far we have not been able to correspond with him, or from the CPSC 536A group working on Universal DNA Tag Generation).  With a working version of the Ben-Dor method, we could do some interesting comparisons of that algorithm with our own algorithms.  For example, we would like to feed his results into our algorithms and see how many of his tags (if any) are disqualified by an implementation that uses the more accurate Nearest Neighbour Model of melting temperature.  This would give us a better idea of the potential benefits of a more complete melting temperature model.

We also want to investigate the "quality" of our melting temperature estimate, because it is vital to all our algorithms - if this measure is erroneous, all of our results are. So we want to use actual melting temperature measurements as test data for our prediction algorithm, comparing the measured melting temperature with our estimated prediction. Unfortunately, the amount of published measurements we found so far is small - but we will continue to search.

Coming forth from the observations of the necklace algorithm and its apparent weaknesses, we have the idea of a 'tag mutation' algorithm, a 'genetic-like' algorithm that actually generates new tag candidates from old ones by  'mutation' of bases. These mutations could be random, but we mainly want to investigate 'guided' mutations, which are deliberately placed in areas where the tag matches well with other tags - hopefully destroying those matches and generating more distinct tags. This might yield a better method than pure random tries, especially if the number of tags required is large.

We have decided that future versions of our algorithms will not allow the production of tag complements.  In practice, there is a high probability that tag complements in solution might bind to each other  --  thus eliminating both tags.  Therefore, tag complements are not really "valid" tags.  In an effort to produce the most effective set of tags possible, we will disallow tag complements and prevent this problem.
 

References:
 

[1]  The (Combinatorial) Object Server:
http://www.theory.csc.uvic.ca/~cos/inf/neck/NecklaceInfo.html
[2] F. Ruskey, C.D. Savage, and T. Wang, "Generating necklaces", J. Algorithms, 13 (1992) 414-430.

[3] J. SantaLucia, Jr., H.T. Allawi, and P.A. Seneviratne.  Improved Nearest-Neighbor Parameters for Predicting DNA Duplex Stability. (1996) Biochemistry (35)  pp. 3555-3562.


 

Updated March 6. 2001