CPSC 536A Course Project
Progress Report
March 6, 2001
Tim Braun, James Gauthier, Tam Huynh, Louie van de Lagemaat
Extending the Ben-Dor Approach
We began the project with the main goal of doing some serious theoretical work on the problem of generating universal DNA tags. Based on the notion that the Ben-Dor, et. al. method contains an overly simple model of melting temperature (the so-called 2-4 rule), we decided to research the feasibility of using the more accurate Nearest Neighbour Model in the generation of DNA tags. One of the intricate (and beautiful) results of the Ben-Dor paper is the proof of the near-optimality of the Universal DNA-Tag Generation Method (based on the 2-4 rule). From the beginning, we were aware that any mathematical proof of optimality or near-optimality of our method would be difficult -- particularly because the mathematical proof by Ben-Dor, et. al. relies on certain special properties of the model they use. This makes a simple "adaptation" of Ben-Dor's proof to our method unlikely or impossible.
So far, we have made little progress on a theoretical bound of the "near-optimality"
of our algorithm. This has lead us in an empirical direction
-- to develop and implement the algorithm based on the Nearest Neighbour
Model and to examine its behaviour. We hope that continued experimentation
with our Nearest Neighbour method will help us to explore its time/space
complexity -- so that we may, at the very least, hint at its
performance. We have a few ideas on how we might compare the efficiency
of our algorithm with that of Ben-Dor's. Also, we continue to research
the properties of the Nearest Neighbour approach to DNA tag generation
and have a few suggestions that will help us to further refine our algorithm.
Random Tag Generation
One version of our implementation uses a purely random method to generate
tags. This method carries out random tag generation as follows...
The melting temperature calculation relies on Needleman-Wunsch alignment.
It is perfomed like this...
* J. SantaLucia, Jr., H.T. Allawi, and P.A. Seneviratne. Improved Nearest-Neighbor Parameters for Predicting DNA Duplex Stability. (1996) Biochemistry (35) pp. 3555-3562.
Our melting temperature calculation is subject to two assumptions.
The first assumption inherent in this process is that identities found
between two distinct tags by the global alignment of two tags correspond
to physiological pairing between one tag and the other antitag. Another
assumption is that pairing happens without pseudoknots or loops whereby
the 3' end of the tag binds upstream of the 5' end.
Implementation of Random Tag Generation
We have successfully implemented an algorithm for producing pseudo-random
DNA tags. Our algorithm generates tags as expected and preliminary
results indicate that as more tags are generated, the rate of tag production
decreases. For higher allowed tag simliarities (higher melting temperature)
more tags are generated. Therefore, we have succeeded in implementing
an algorithm that produces tags using a more accurate temperature model
than prior work by Ben-Dor, et. al. (of course, they used the 2-4
rule). We would like to directly compare the performance of our tag
generator with that of Ben-Dor, et. al. However, this may
not be possible, as the temperature parameter in the Ben-Dor model does
not straightforwardly correlate to a melting temperature on the absolute
scale (degrees Kelvin). This is too substantial a difference for
direct comparison. Despite this concern, we have attempted to contact
Ben-Dor in order to obtain an implementation of his method and we are anticipating
the results from the Universal DNA Tag Generation group. Our thinking
is that we can use the results of the Ben-Dor algorithm as input for our
algorithm. By looking at the Ben-Dor tags that "survive" our method,
we will be able to comment on the relative performance of the two algorithms.
Other Approaches to Tag Generation
Although the random tag generation approach has produced reasonable
results, we felt we should examine other approaches and look for potential
improvements in tag generation. One such approach is the idea of
applying the group theoretic concept of "necklaces" to aid in efficient
DNA tag generation.
Necklaces
In search for a more systematic way of generating tags, we came across a class of combinatorial objects called necklaces. Our goal being to generate distinct tags, these objects seemed promising at first glance.
A k-ary necklace is an equivalence class of k-ary strings under rotation.
The illustration below (taken from [1]) shows the the six binary necklaces with 4 "beads" and their equivalence classes of strings.
In [2] it is shown that necklaces can be generated in constant amortized time by the so-called FKM- Algorithm by Frederickson, Kessler, and Maiorana. We implemented this algorithm in order to generate 4-ary necklaces, which can be regarded as a subset of DNA-Tags.
The number of k-ary necklaces of length n can be computed by the following
formula,
(x)
being the Euler totient function:
We can use this formula to get a sense of how many 4-ary necklaces
exist, compared to the total number of possible tags:
|
# of possible tags t | # of necklaces n | n/t | |
| 1 | 4 | 4 | 1 | |
| 2 | 16 | 10 | 0.625 | |
| 3 | 64 | 24 | 0.375 | |
| 4 | 256 | 70 | 0.2734375 | |
| 5 | 1024 | 208 | 0.203125 | |
| 6 | 4096 | 700 | 0.170898438 | |
| 7 | 16384 | 2344 | 0.143066406 | |
| 8 | 65536 | 8230 | 0.125579834 | |
| 9 | 262144 | 29144 | 0.111175537 | |
| 10 | 1048576 | 104968 | 0.100105286 | |
| 11 | 4194304 | 381304 | 0.090909958 | |
| 12 | 16777216 | 1398500 | 0.083357096 | |
| 13 | 67108864 | 5162224 | 0.076923132 | |
| 14 | 268435456 | 19175140 | 0.071432963 | |
| 15 | 1073741824 | 71582944 | 0.066666812 | |
| 16 | 4294967296 | 268439590 | 0.062500963 | |
| 17 | 17179869184 | 1010580544 | 0.05882353 | |
| 18 | 68719476736 | 3817763740 | 0.055555774 | |
| 19 | 274877906944 | 14467258264 | 0.052631579 | |
| 20 | 1099511627776 | 54975633976 | 0.050000048 | |
| 21 | 4398046511104 | 2.09431E+11 | 0.047619048 | |
| 22 | 17592186044416 | 7.99645E+11 | 0.045454556 | |
| 23 | 70368744177664 | 3.05951E+12 | 0.043478261 | |
| 24 | 281474976710656 | 1.17281E+13 | 0.041666669 |
For n=10, approximately 10% of all possible tags are necklaces. If they
were all distinct enough when evaluated with our implementation of the
Nearest Neighbor-Model, this number would be suffiently high.
Comparison of Random Tag Approach and Necklace Approach
After implementing both the random tag approach and the necklace approach,
we noticed immediately that the random method outperforms the necklace-based
method.
Because new necklaces are systematically generated from the previous
necklaces through slight modification, a new necklace often contains parts
that haven't been altered
and therefore match perfectly with their ancestors. Such a tag will
then of course be rejected. We believe this to be the main reason why the
necklace algorithm has a 'slow start' and never performs better than the
randomized algorithm: a) the necklaces are only a 10% fraction of
the space of possible tags and b) the differences between subsequent necklaces
are too small - much computational effort is wasted here. For future work,
we are focussing on improving this situation.
To illustrate the previous discussion, a graph is shown below,
plotting the number of found, valid tags ('tags accepted') against the
number of 'tries' needed ('generated tags') for both algorithms and various
critical Tmelt(). Note the smooth curve for the random algorithm and the
jagged, stair-like apperance of the necklace alg. Every time a 'step' is
reached there, the necklaces have changed enough to yield a new group
of tags - and after that, the necklaces are too similar again, so the number
of found tags is low until the next 'step'. Also important to note is the
logarithmic scale on both axis: although there seems to be only a slight
advantage for the random tag algorithm in the end, it actually yields about
2-3 times more tags than the necklaces.
Legend: The different lines are labelled using the system : "algorithm"-"tag
length"-"crit. Tmelt()".
Thus "random-10-20" means that this line shows the behaviour of the random
algorithm with a tag length of 10 bases and a critical Tmelt() of 20 degree
Celsius.
Continuing Work
There are several areas where we plan to direct our efforts.
First, we would like to refine our algorithm. The current implementation of the melting temperature function is extremely inefficient -- due to an extensive use of string operations. This did not hurt us in the beginning of the project, but as we increase the tag lengths and implement more complex algorithms, we need to optimize the Tmelt() function to gain as much speed as possible.
Then, we will be able to do a more through evaluation of the impact of different tag lengths on the performance of the two exisiting algorithms -- as well as for any new algorithm that we might create. Because the search- (and necklace-) space grows very rapidly as the tag length goes up, eventually, the necklace algorithm will no longer be able to cover it completely. However, a variation of this method might actually perform better there than the random method. are expected to change the relative
Early on, we abandoned the theoretical aspect of the project. It appears as though a theoretical proof of optimality or "near optimality" of our method will be difficult -- if, in fact, it is possible at all. Still, we plan to revisit the theoretical work and see if we can make any progress on the "road" to proof. Perhaps we will be able to at least hint in the direction of a proof of near optimality for one of our methods.
Also, we hope to obtain an implementation of the Ben-Dor algorithm (either from Ben-Dor himself, although so far we have not been able to correspond with him, or from the CPSC 536A group working on Universal DNA Tag Generation). With a working version of the Ben-Dor method, we could do some interesting comparisons of that algorithm with our own algorithms. For example, we would like to feed his results into our algorithms and see how many of his tags (if any) are disqualified by an implementation that uses the more accurate Nearest Neighbour Model of melting temperature. This would give us a better idea of the potential benefits of a more complete melting temperature model.
We also want to investigate the "quality" of our melting temperature estimate, because it is vital to all our algorithms - if this measure is erroneous, all of our results are. So we want to use actual melting temperature measurements as test data for our prediction algorithm, comparing the measured melting temperature with our estimated prediction. Unfortunately, the amount of published measurements we found so far is small - but we will continue to search.
Coming forth from the observations of the necklace algorithm and its apparent weaknesses, we have the idea of a 'tag mutation' algorithm, a 'genetic-like' algorithm that actually generates new tag candidates from old ones by 'mutation' of bases. These mutations could be random, but we mainly want to investigate 'guided' mutations, which are deliberately placed in areas where the tag matches well with other tags - hopefully destroying those matches and generating more distinct tags. This might yield a better method than pure random tries, especially if the number of tags required is large.
We have decided that future versions of our algorithms will not allow
the production of tag complements. In practice, there is a high probability
that tag complements in solution might bind to each other --
thus eliminating both tags. Therefore, tag complements are not really
"valid" tags. In an effort to produce the most effective set of tags
possible, we will disallow tag complements and prevent this problem.
References:
[1] The (Combinatorial) Object Server:
http://www.theory.csc.uvic.ca/~cos/inf/neck/NecklaceInfo.html
[2] F. Ruskey, C.D. Savage, and T. Wang, "Generating necklaces", J. Algorithms, 13 (1992) 414-430.[3] J. SantaLucia, Jr., H.T. Allawi, and P.A. Seneviratne. Improved Nearest-Neighbor Parameters for Predicting DNA Duplex Stability. (1996) Biochemistry (35) pp. 3555-3562.
Updated March 6. 2001