Cracking the Code on DNA Storage

One of the remarkable ironies of digital technology is that every step forward creates new challenges for storing and managing data. In the analog world, a piece of paper or a photograph never becomes obsolete, but it deteriorates and eventually disintegrates. In the digital world, bits and bytes theoretically last forever, but the underlying media—floppy disks, tapes, and drive formats, as well as the codecs used to play audio and video files—become obsolete, usually within a few decades. Once the machine or media is outdated, it is difficult, if not impossible, to access, retrieve, or view the file.

“Digital obsolescence is a very real problem,” observes Yaniv Erlich, assistant professor of computer science at Columbia University and a core member of the New York Genome Center. “There is a constant need to migrate to new technologies that don’t always support the old technologies.”

The challenges don’t stop there. Current storage technologies—even state-of-the-art flash drives—require significant space and devour incredible amounts of energy to operate. Ultimately, there’s a need to “find a better way to store data” while addressing these and other issues, Erlich says.

As a result, researchers are looking for more efficient ways to store data from books, movies, and myriad other digital file formats. One of the most promising emerging candidates is DNA storage. The technology uses a “printing” or electrochemical assay process to write data into strings of synthetically produced DNA.

Unlike existing media and even other emerging technologies such as holographic and three-dimensional (3D) storage, DNA can withstand huge variations in temperature, along with some variation in moisture. This makes it theoretically possible for the data to last tens of thousands of years, and even to withstand a global disaster.

The scale of the technology also introduces remarkable possibilities. At present, a gram of DNA holds about 730 million megabytes of data. “You could potentially fit a datacenter in DNA material the size of a sugar cube,” states Georg Seelig, an associate professor in the department of electrical engineering at the University of Washington. Already, Microsoft Research and Technicolor are eyeing DNA storage, while researchers are inching closer to developing a commercially viable storage technology based on DNA.

Observes George Church, professor of genetics at Harvard Medical School and a pioneer in the field of DNA storage: “It’s a fast-moving space full of uncertainty, but there is a great deal of promise in the technology.”

Beyond Biology

Although the idea of using DNA to store data extends back to the mid-1960s, it wasn’t until 2012 that the concept began to take shape in any tangible way. Then, Church and a team at Harvard University figured out how to convert digital 1s and 0s into long strings of four different nucleotides, also referred to as bases: As, Gs, Cs, and Ts (for adenine, guanine, cytosine, and thymine). They ultimately encoded 70 billion copies of a 53,400-word book, as well as JPG images and a JavaScript program. The team made multiple copies of all the data to test the accuracy and capacity of the storage medium. In the end, they managed to store about six petabits, or roughly 750,000 gigabytes of data, within each cubic millimeter of DNA. The project demonstrated the validity and potential of the technology.
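To make the idea of turning 1s and 0s into bases concrete, here is a minimal sketch. The two-bits-per-base table and the function names `encode_bases` and `decode_bases` are assumptions chosen for illustration; this is not the scheme used by any of the teams described here, and real encodings add constraints and error correction on top.

```python
# Illustrative mapping only (an assumption for this sketch):
# each pair of bits becomes one of the four bases.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bases(data: bytes) -> str:
    """Turn bytes into a string of A/C/G/T, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bases(strand: str) -> bytes:
    """Invert encode_bases(): read two bits per base, regroup into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode_bases(b"Hi")
print(strand)                # CAGACGGC
print(decode_bases(strand))  # b'Hi'
```

Each byte becomes four bases, so a naive mapping like this one sits at two bits per nucleotide before any allowance for synthesis or sequencing errors.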

Over the last few years, various other researchers have worked to advance DNA storage and they have made important breakthroughs. For example, in 2016, a research team from Microsoft and the University of Washington reported that it had written approximately 200 megabytes of data, including War and Peace and 99 other literary classics, into DNA. Early this year, Erlich and a research team pushed the boundaries further by encoding six separate files into a single DNA file: a French film (the 49-second Arrival of a train at La Ciotat, from 1895, considered the first motion picture in modern history); a complete computer operating system called KolibriOS; a $50 Amazon gift card; a computer virus known as Zip Bomb; the contents of the plaque carried on the Pioneer spacecraft; and a research paper. They chose the mix of files because they were more prone to “highly sensitive errors,” he says.

What was remarkable about Erlich’s research was how close it came to the limit of the so-called Shannon information capacity, which determines how much information can fit into a unit of a system. Previously, researchers had been able to place about 0.9 bits per nucleotide; the Columbia University researchers pushed the figure to about 1.6. Altogether, the team achieved a storage density of about 215 petabytes per gram of DNA, while improving the speed to read data and reducing the cost by roughly 60%. “We were able to achieve far greater robustness and reliability within the storage mechanism,” he explains.
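For a sense of scale, the figures above can be compared with the simplest possible ceiling: a four-letter alphabet can carry at most log2(4) = 2 bits per nucleotide. The short calculation below is a sketch under that assumption only; the practical limit is lower once biochemical constraints are taken into account, and the helper name `fraction_of_bound` is hypothetical.

```python
import math

ALPHABET_SIZE = 4  # A, C, G, T
# Unconstrained upper bound in bits per nucleotide (an idealization).
upper_bound = math.log2(ALPHABET_SIZE)


def fraction_of_bound(bits_per_nt: float) -> float:
    """Share of the unconstrained 2-bit ceiling that a given rate reaches."""
    return bits_per_nt / upper_bound


for label, rate in [("earlier work", 0.9), ("Columbia team", 1.6)]:
    share = fraction_of_bound(rate)
    print(f"{label}: {rate} bits per nucleotide, {share:.0%} of the ceiling")
```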

One of the keys to success was that the team tapped an advanced algorithm that significantly improved error correction. A problem with DNA storage is that the enzymes used for copying data aren’t perfect; the DNA deteriorates during the copying process—and errors wind up in the code. “If you think about regular storage devices, such as a hard drive or a flash drive, this is completely unacceptable,” Erlich says. “You want to be able to read the file tens of thousands of times.” However, after copying the original sample file 25 times and then copying the copy nine more times (essentially reaching a factor of 10 to the power of 15 copies), the team found that while the results were noisier and more error-prone, they could use an error correction method to view the data accurately.

The group conducted one last experiment that further supported the viability of the technology. Researchers took the DNA molecules that contained the data and used a dilution method to damage the DNA sample. The result? The algorithm works like a Sudoku puzzle: a few filled-in cells provide hints about the content of the file without the file itself being sent. Using it, Erlich says, “We could still perfectly retrieve the information, and we showed that we can get to an information density of 215 petabytes per 1 gram of DNA. To the best of our knowledge, this is probably the densest human-made storage device ever created.”
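The Sudoku analogy matches the general idea behind fountain codes, the family named in the title of the Erlich and Zielinski paper listed under Further Reading. Each "droplet" is the XOR of a small random subset of file segments, and a decoder works backward from droplets that contain only one unknown segment. The sketch below is a toy version under loud assumptions: the function names, the integer segments, and the uniform choice of one to three segments per droplet are all illustrative and are not taken from the published system, which uses a more carefully designed scheme.

```python
import random


def droplet_indices(seed, n_segments):
    """Segments a droplet combines; reproducible from the seed alone."""
    rng = random.Random(seed)
    degree = rng.randint(1, min(3, n_segments))  # toy degree choice (assumption)
    return rng.sample(range(n_segments), degree)


def make_droplet(segments, seed):
    """XOR the chosen segments together; ship the seed with the result."""
    payload = 0
    for i in droplet_indices(seed, len(segments)):
        payload ^= segments[i]
    return seed, payload


def peel_decode(droplets, n_segments):
    """Repeatedly solve droplets that have exactly one unknown segment."""
    pending = [(droplet_indices(seed, n_segments), payload)
               for seed, payload in droplets]
    solved = {}
    progress = True
    while progress and len(solved) < n_segments:
        progress = False
        for indices, payload in pending:
            unknown = [i for i in indices if i not in solved]
            if len(unknown) == 1:
                # Cancel the already-known segments out of the XOR.
                for i in indices:
                    if i in solved:
                        payload ^= solved[i]
                solved[unknown[0]] = payload
                progress = True
    # Segments that could not be recovered come back as None.
    return [solved.get(i) for i in range(n_segments)]


segments = list(b"DNA data")           # eight one-byte segments
droplets = [make_droplet(segments, seed) for seed in range(600)]
# Simulate damage: every third droplet is lost before decoding.
received = [d for k, d in enumerate(droplets) if k % 3 != 0]
recovered = peel_decode(received, len(segments))
print(recovered == segments)
```

Because any sufficiently large set of droplets will do, it does not matter which particular ones are lost; that property is what makes this style of code a natural fit for a medium in which individual molecules can drop out.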

Forward Thinking

Although researchers are pushing DNA storage forward, numerous challenges remain. The biggest obstacle right now is the price tag, which works out to about $3,500 per megabyte of storage. Even with researchers at Columbia University reducing DNA storage costs by about 60%, the figure would probably have to reach about a 600% improvement to tip the scale to a commercially viable solution.

“Right now, cost is an enormous roadblock,” states Sri Kosuri, an assistant professor of chemistry and biochemistry at the University of California, Los Angeles (UCLA), and a member of Church’s 2012 team. Currently, Kosuri notes, most DNA research and funding are focused on genome and biomedical research, which has enormous and growing commercial viability. Yet, “There’s no reason the price couldn’t drop. It’s a question of scaling and advancing the technology to make it viable.”

The key technical challenge, which factors directly into the cost issue, is that not all DNA sequences are created equal, and the efficiency and accuracy of the encoding process can vary based on the sequencing pattern. For example, DNA sequences with high GC content or long homopolymer runs (such as AAAAAA… coding) are difficult to synthesize and more prone to sequencing errors. At present, this requires rewriting the DNA code in a different way. One way for researchers to potentially reduce latency and costs that might result from this problem is to produce DNA material faster, but at lower quality. Seelig describes this as “quantity over quality.”
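A screening step for the two problems named above, GC content and homopolymer runs, might look like the following sketch. The thresholds (`gc_range` and `max_run`) and the function names are hypothetical values chosen for illustration, not figures from the research described here.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (seq must be non-empty)."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (seq must be non-empty)."""
    return max(len(match.group(0)) for match in re.finditer(r"(.)\1*", seq))


def is_synthesis_friendly(seq: str, gc_range=(0.45, 0.55), max_run=3) -> bool:
    """Accept a sequence only if it passes both checks (thresholds assumed)."""
    low, high = gc_range
    return low <= gc_fraction(seq) <= high and longest_homopolymer(seq) <= max_run


print(is_synthesis_friendly("ACGTACGT"))  # True
print(is_synthesis_friendly("AAAAAAGC"))  # False
```

A sequence that fails such a check would have to be encoded differently, which is the rewriting step the paragraph above describes.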

Already, algorithms such as the one Erlich used can sidestep many potential problems through error correction. In the future, better DNA printing systems, along with improved error-correction algorithms, will almost certainly drive down the cost further. Ultimately, “The more information you can store in fewer copies of a DNA molecule without having to add much logic redundancy, the better your storage becomes,” Seelig points out. It could also mean trading density for cost, reducing the capacity of a gram of DNA from about 200 petabytes to, say, two petabytes. “At that point, it becomes an economically feasible technology,” Erlich says.

Larger questions also loom about how the technology might play out in a practical sense. Reinhard Heckel, a postdoctoral researcher at U.C. Berkeley, says current DNA storage appears to be limited to archival solutions. “Because it doesn’t look like it will write or read very fast, it’s more likely to serve as a form of archival storage,” he says. For the foreseeable future, “It will probably be best suited for information that must be stored for decades and for preserving the important information of the world, sort of like a global seed vault.” There are also questions about whether future computing frameworks—which could delve into organic or molecular models—would be fully compatible with DNA storage systems.

Still, the technology is marching forward. DNA storage could introduce new and intriguing possibilities over the next decade, including radical advances in cloud storage and far more versatile ways to store data, experts say. “Microsoft and other companies are interested in this space simply because the volume of data is growing exponentially. There’s a need to develop better ways to store massive amounts of data, particularly over long time periods,” Seelig says.

The use of DNA to store data might also unleash new and radically different possibilities that extend beyond conventional storage arrays. This includes loading data into food and medicine, and possibly combining DNA storage with molecular computing and synthetic biology. Because synthetic DNA molecules are extraordinarily small, some intriguing possibilities emerge through combining DNA storage with computational systems.

In fact, no one is exactly sure where the research in DNA storage technology will lead. Concludes Seelig: “The intersection of molecular science and computing models is remarkable because DNA molecules can be programmed to act like simple computers that can sense and respond to their environment. They also open up the possibility of quickly and efficiently searching for information stored in a vast DNA archive.”

Further Reading

Bornholt, J., Ceze, L., Carmean, D.M., Lopez, R., Seelig, G., and Strauss, K.
A DNA-Based Archival Storage System, ASPLOS 2016, April 1, 2016. https://www.microsoft.com/en-us/research/publication/dna-based-archival-storage-system/.

Blawat, M., Gaedke, K., Hütter, I., Chen, X.M., Turczyk, B., Inverso, S., Pruitt, B.W., and Church, G.M.
Forward Error Correction for DNA Data Storage, Procedia Computer Science, Volume 80, 2016, Pages 1011–1022.

Zhirnov, V., Zadegan, R.M., Sandhu, G.S., Church, G.M., and Hughes, W.L.
Nucleic acid memory, Nature Materials, Vol. 15, April 2016. www.nature.com/naturematerials.

Erlich, Y. and Zielinski, D.
DNA Fountain enables a robust and efficient storage architecture, Science, 03 Mar 2017: Vol. 355, Issue 6328, pp. 950–954. DOI: 10.1126/science.aaj2038. http://science.sciencemag.org/content/355/6328/950.

July 2017 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
