Comma-Free Design For Dna Words

In the January 2003 issue of Communications, a data-storing scheme that uses the DNA bases Adenine, Cytosine, Guanine, and Thymine (A, C, G, T) was demonstrated for several bacterial genomes [6]. To render this method practical and scalable, its data-coding procedure requires a more sophisticated design: first, sentinel sequences (DNA sequences marking the beginning and end of introduced sequences that are not innate to the host genome) must uniquely flank the information-bearing sequences. Thus, sentinels should not appear in the data region, irrespective of its contents. Multiple sentinels may be necessary as the amount of data increases because the DNA fragment length that can be safely embedded and maintained by current laboratory techniques is at most a few hundred base-pairs for each site. Second, artificial sequences for inscribing data require error-detecting mechanisms because DNA fragments retrieved from large genomes often contain errors like base substitutions and frame shifts. Since DNA sequences have no commas or spaces, frame synchronization, a property referred to as comma-freeness, is ideal for DNA code words [3].

Clearly, assigning all DNA-base triplets (4³=64 patterns from AAA to TTT) to some data characters (for example, ASCII), is not appropriate. For accurate information decoding, DNA words are designed as a comma-free, error-correcting code, thus they, their arbitrary concatenations and Watson-Crick complements (the opposite-directed strand in the DNA double helix) must be separated by a certain minimum distance.

While no systematic construction has been developed for a binary comma-free, error-correcting code of large minimum distance, for the quaternary DNA sequences there exists a simple construction, the template method [2]. In it, sequences are designed as a product of one binary template, denoted T, and any traditional error-correcting code of minimum distance d (d: positive integer). The template T is chosen so that it induces at least d mismatches between itself and subsequent patterns:

T^R TT^R T^RT TT T^RT^R

Here T^R denotes the reverse of T and is introduced to consider mismatches against complementary sequences. The resulting quaternary code inherits the properties of both binary words: any word pair induces at least d mismatches without block shift because of the error-correcting code, and with block shift or reversal because of the chosen template.

Consider the design of length-6 DNA code words. The template ‘110100’ induces at least two mismatches between itself and its concatenated patterns in all frames (see Figure 1). Its product with an error-correcting code of minimum distance 2 produces the DNA code words that keep this distance regardless of their concatenation (see Figure 2). A set of 32 length-6 words of minimum distance 2, constructed by adding one parity bit to all 5-bit patterns, produces 32 DNA code words—enough for the English alphabet.

The use of a fixed template yields additional biological advantages. A DNA molecule forms a double helix with its complementary strand with hydrogen bonds. Since the rate of this process (annealing) is governed by thermodynamics, the distribution of GC bases along the DNA strand determines the temperature at which annealing occurs (melting temperature) [5]. Therefore, when one template specifies the GC positions in all DNA code words, their melting temperatures become uniform, forcing uniform physical behavior on all code words in an experiment. The uniform GC arrangement also helps to distinguish information-bearing from genomic DNA.

Minimum distance and uniform melting temperature are not the only constraints that DNA code words should satisfy. Since annealing is essentially a physical, probabilistic process, unwanted annealing and subsequent DNA editing can occur at a stretch of completely matching DNA base-pairs. To avoid this, no subword of length-7 or 8 should appear in the designed words and their concatenations more than once. In the template method, this constraint is separately assigned to the template and the error-correcting code words. Although this constraint severely limits the word length (and consequently, the number of words we can design), we were able to select 112 words of length-12 that keep minimum distance 4 and never share any subword of length-7, no matter how they and their reverse complements are concatenated [1]. However, due to the limited DNA fragment-length that can be embedded in genomes, the amount of information we can insert with current laboratory techniques is restricted.

The design described here was introduced in the DNA-computation project under the auspices of the Japan Ministry of Education, Culture, Sports, Science and Technology and the Japan Science and Technology Corporation. In the field of DNA computation, different code design schemes have been demonstrated and evaluated in biological experiments [4]. Wong et al. [6] reconfirm the effectiveness of biomolecules for information processing, and the applicability of sequence design to broader areas such as DNA data storage and DNA steganography.

Figures

Figure 1. Base mismatches for Template T = 110100. Each frame contains at least two mismatches. The figure shows the mismatches between T^R and TT, but the two mismatches are guaranteed for any concatenation including its reverse pattern (T^R=001011).

Figure 2. DNA Sequence Design. The template specifies the GC/AT positions in DNA code words. DNA bases corresponding to bit OS in the code words are shown in capital letters.