DNA punch cards pave way for low cost data storage

DNA punch cards store data

Date: 14th April 2020

The ability to store information in DNA-based data storage systems is becoming an increasingly attractive model as there has been an explosion in the amounts of data being generated by digital applications. Now scientists introduce DNA punch cards, a macromolecular storage mechanism in which data is written in the form of nicks at predetermined positions on the backbone of native double-stranded DNA.

The promise of ultrahigh storage density and long-term stability when utilising DNA in data storage is currently offset by high costs, read-write latency (time to produce) and inherent error rates. Current DNA-based data recording designs store user content in synthetic DNA oligos and retrieve the desired information via next-generation sequencing technologies, which have limitations. Synthesising new DNA for digital storage is still a slow and expensive process.

To address current limitations, therefore, and to pave the way towards future low-cost molecular storage solutions, a team of scientists led by Olgica Milenkovic from the University of Illinois at Urbana-Champaign, US, wanted to test the use of readily available native DNA rather than a synthetic equivalent (oligos).

Using native DNA, the team explored modifying the DNA topology to encode information, rather than encoding the data within the DNA sequence itself. Existing DNA molecules are marked with patterns of “nicks”, a method inspired by the punch cards used to store information for many early computers.

DNA register

So let’s start with the DNA register, as the team call it: the DNA punch card or template. In theory, virtually any DNA template could be used, but here the team used genomic DNA extracted from a culture of E. coli, as it is cheap and readily available. They determined that the optimal size for each register was 450 bp, giving a template that contained between five and ten nicking sites and was a useful length for subsequent sequencing. The team initially chose five candidate registers, which were amplified by PCR from E. coli genomic DNA. Each register contained n designated nicking positions and was therefore associated with a particular set of intra-nicking fragment lengths expected to arise once the nicking reactions were complete.
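
The link between nicking positions and fragment lengths can be sketched in a few lines of Python. This is only an illustration: the site coordinates below are made up, and the sketch assumes all nicks sit on the same strand, so that denaturing releases one single-stranded piece between each pair of neighbouring nicks.

```python
from typing import List


def fragment_lengths(register_length: int, nick_coords: List[int]) -> List[int]:
    """Lengths of the single-stranded pieces left after cutting at each coordinate.

    A coordinate c means the backbone is broken between base c and base c + 1
    (1-based), so valid coordinates lie strictly inside the register.
    """
    cuts = sorted(set(nick_coords))
    if any(c <= 0 or c >= register_length for c in cuts):
        raise ValueError("nick coordinates must fall strictly inside the register")
    boundaries = [0] + cuts + [register_length]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


# Hypothetical 450 bp register with five illustrative nicking sites
# (these coordinates are assumptions, not values from the paper).
SITES = [60, 130, 215, 290, 380]
print(fragment_lengths(450, SITES))  # [60, 70, 85, 75, 90, 70]
```

Choosing sites so that the resulting fragment lengths are all different is one plausible way to make each nick easy to recognise later, although the article does not spell out the team's selection criteria.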

To perform the actual encoding, user files were parsed into n-bit strings, which were converted into a positional code specifying which sites on the register were to be nicked and which were to be left in their ground state (not nicked). Under the team’s rules, a 1 corresponded to a nick and a 0 to the absence of a nick, so any binary message could be expressed as a set of nick positions. As an example, the string 0110000100 was converted into the positional code 238, indicating that nicking needed to be performed at the 2nd, 3rd and 8th positions.
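
The conversion described above is simple enough to sketch in Python. The function names are hypothetical, and the zero-padding of the final chunk is an assumption made for illustration rather than something the article specifies. Positions are returned as a list instead of a digit string such as 238, since a digit string becomes ambiguous once a register has ten or more sites.

```python
from typing import List


def bits_to_positions(bits: str) -> List[int]:
    """Return the 1-based positions of every '1' in the bit string (sites to nick)."""
    if set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 0s and 1s")
    return [i for i, b in enumerate(bits, start=1) if b == "1"]


def chunk_bits(data: bytes, n: int) -> List[str]:
    """Split a byte string into n-bit strings, zero-padding the final chunk.

    The padding rule is an assumption for this sketch.
    """
    stream = "".join(f"{byte:08b}" for byte in data)
    stream += "0" * (-len(stream) % n)
    return [stream[i:i + n] for i in range(0, len(stream), n)]


print(bits_to_positions("0110000100"))  # [2, 3, 8], the example from the text

# Each n-bit chunk of a file would be written to its own register.
for chunk in chunk_bits(b"Hi", 10):
    print(chunk, bits_to_positions(chunk))
```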

Punch tool

So how are these files punched into the registers? The team used the highly accurate artificial restriction enzyme Pyrococcus furiosus Argonaute (PfAgo) as their “punch” tool. Under appropriate reaction conditions, the enzyme could make single-stranded ‘nicks’ at target sites, each defined by a single 16 nt DNA guide (gDNA). Each register was nicked by mixing it with a combination of nicking enzymes carrying guides matched to the sites to be nicked, and hence to the information to be stored. The platform accommodated parallel nicking on orthogonal DNA fragments, which allowed fast and efficient data recording on a library of registers.

File extraction

With the data successfully stored, how is it then read? The nicked registers were first denatured, resulting in ssDNA (single-stranded) fragments of variable lengths dictated by the nicked positions. These were then converted into a dsDNA library, which was subsequently sequenced. The reads were then aligned to the known reference register sequence for deciphering.
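
A simplified view of the deciphering step can be sketched as follows. This is not the authors' pipeline: the site coordinates, the tolerance parameter and the function name are all assumptions, and a real analysis would also weigh read coverage and sequencing noise. The idea is only that wherever an aligned fragment starts or ends inside the register, a nick must have been present nearby.

```python
from typing import Iterable, List, Tuple


def decode_register(
    site_coords: List[int],
    fragments: Iterable[Tuple[int, int]],
    register_length: int,
    tolerance: int = 2,
) -> str:
    """Recover the bit string from aligned fragment intervals.

    Each fragment is (start, end) in 1-based inclusive reference coordinates.
    A fragment ending at base c, or starting at base c + 1, is evidence of a
    nick at site coordinate c. The register's own ends are not counted.
    """
    breakpoints = set()
    for start, end in fragments:
        if start > 1:
            breakpoints.add(start - 1)
        if end < register_length:
            breakpoints.add(end)
    return "".join(
        "1" if any(abs(bp - c) <= tolerance for bp in breakpoints) else "0"
        for c in site_coords
    )


# Hypothetical register: 450 bp with ten sites spaced 40 bp apart.
SITE_COORDS = [40 * k for k in range(1, 11)]
# Fragments expected if sites 2, 3 and 8 (coordinates 80, 120, 320) were nicked,
# plus a full-length read from the intact complementary strand.
READS = [(1, 80), (81, 120), (121, 320), (321, 450), (1, 450)]
print(decode_register(SITE_COORDS, READS, 450))  # 0110000100
```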

As a proof of concept, the team were able to store and retrieve 14.4kb of data, comprising a text file of Lincoln’s Gettysburg Address and an image of the Lincoln Memorial, using a combinatorial library of 1,024 registers.

Conclusion and future applications

Here the team have designed a DNA-based storage system that avoids the use of long synthetic DNA strands for storing user information and records the data in the DNA backbone rather than in its sequence content. The approach was highly efficient and significantly reduced error rates compared with synthetic DNA platforms.

However, the authors note that although many optimisations could allow the technique to scale cost-efficiently, its storage density currently falls short of existing systems. In all other performance categories the native DNA punch card outperformed them, and because cost rather than information density is the primary limiting factor, this latest platform should be a real step forward in DNA data storage.

With the likes of Microsoft having a vested interest in DNA storage devices, having developed the first full end-to-end automated DNA storage device with colleagues at the University of Washington, questions surrounding scale, cost and efficiency are starting to be answered. However, whilst the potential remains high, these platforms still appear to be some way off storing the vast amounts of data required to be a real answer to our data storage predicaments.

 

Tabatabaei, S. K., B. Wang, N. B. M. Athreya, B. Enghiad, A. G. Hernandez, C. J. Fields, J.-P. Leburton, D. Soloveichik, H. Zhao and O. Milenkovic (2020). “DNA punch cards for storing data on native DNA sequences via enzymatic nicking.” Nature Communications 11(1): 1742.

https://doi.org/10.1038/s41467-020-15588-z