The world’s digital data could be stored as DNA in a coffee cup

A new technique for data retrieval from DNA storage

Date: 15th June 2021

 

There are currently about 10 trillion gigabytes of digital data in the world, and we produce around another 2.5 million gigabytes every day. Exabyte-scale data centres (an exabyte is 1 billion gigabytes) are at the heart of web hosting, dedicated server hosting, email and many other services, and are where all of that data is stored, managed and disseminated. However, these centres are incredibly expensive to build and maintain (around $1 billion) and take up vast amounts of space. For decades, scientists have believed that DNA digital data storage may be the solution, as DNA’s high storage density gives it enormous potential as a storage medium. In practice, however, its use is currently limited by high DNA synthesis costs, slow read and write times, high energy demands, and data loss over time. Now, scientists have developed a new technique to label and retrieve DNA data files, allowing random access to archival files in a large-scale molecular dataset.

The concept of DNA digital data storage dates back to 1959, when the physicist Richard Feynman outlined the general prospects for writing the entire 24 volumes of the Encyclopaedia Britannica on the head of a pin. He was inspired by the biological example of information being written on a small scale in cells. It wasn’t until 1988 that a collaboration between an artist, Joe Davis, and researchers from Harvard saw an image stored in the DNA sequence of E. coli. Since those early days, major technical advances have been made in the field, and gigabytes of binary data have now been encoded into, and decoded from, synthesised strands of DNA. However, retrieving the wanted data from amongst the plethora of DNA remains a limiting step, and there is an unmet need for new approaches.
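To make the idea of encoding binary data in DNA concrete, here is a minimal sketch in Python. It assumes the simplest possible mapping, two bits per nucleotide, and the function names are invented for illustration. It is not the encoding used by any of the groups mentioned in this article; practical schemes add error correction, addressing and constraints on the sequence, so they store far fewer bits per base.

```python
# Illustrative mapping only (an assumption, not a published scheme):
# every pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a nucleotide string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hello")
print(strand)          # CAGACGCCCGTACGTACGTT
print(decode(strand))  # b'Hello'
```

Under this toy mapping a five-byte message needs 20 bases, and decoding simply reverses the lookup.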

Now, biological engineers at the Massachusetts Institute of Technology (MIT), US, led by Mark Bathe, have encapsulated data-encoding DNA file sequences within impervious silica capsules that are surface-labelled with single-stranded DNA barcodes, allowing storage, physical sorting and random access of data from DNA with minimal loss.

Currently, DNA files are retrieved using PCR (polymerase chain reaction), with each file including a sequence that binds to a specific PCR primer. However, crosstalk between primers and off-target sequences can lead to unwanted files being recovered. Furthermore, the process requires enzymes, which end up depleting most of the DNA in the pool.

To address these limitations, the MIT team developed a new retrieval technique. They encapsulated each data-encoding DNA file in an impervious silica capsule, which was labelled with single-stranded DNA barcodes corresponding to the contents of the file. Primers matching these barcodes were labelled with fluorescent or magnetic particles, so that the correct files could easily be matched and retrieved, leaving the remaining DNA intact to be returned to storage. The retrieval process also allowed Boolean logic statements such as ‘AND’, facilitating the selection of sets of files.
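The logic of barcode-based selection can be sketched in software. The Python model below is purely illustrative: the file names, barcode labels and the select function are all invented, and the physical sorting of capsules is reduced to set operations. Only ‘AND’ is named in this article; the OR and NOT criteria are included as assumptions about what a Boolean search could look like.

```python
# Hypothetical pool: each capsule is a file name plus the set of barcode
# labels attached to its surface. All names and labels are invented.
capsules = {
    "cat.png": {"animal", "feline", "domestic"},
    "tiger.png": {"animal", "feline", "wild"},
    "dog.png": {"animal", "canine", "domestic"},
    "plane.png": {"vehicle", "flying"},
}


def select(pool, all_of=(), any_of=(), none_of=()):
    """Return the sorted names of files whose barcodes match the query.

    all_of  -> AND: every listed barcode must be present
    any_of  -> OR:  at least one listed barcode must be present (if given)
    none_of -> NOT: no listed barcode may be present
    """
    hits = []
    for name, barcodes in pool.items():
        if not set(all_of) <= barcodes:
            continue
        if any_of and not (set(any_of) & barcodes):
            continue
        if set(none_of) & barcodes:
            continue
        hits.append(name)
    return sorted(hits)


print(select(capsules, all_of=["feline", "wild"]))            # ['tiger.png']
print(select(capsules, all_of=["animal"], none_of=["wild"]))  # ['cat.png', 'dog.png']
```

Files that do not match are left untouched in the pool, which mirrors the point that the unselected DNA stays intact and can be returned to storage.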

As a proof of concept, the team encoded 20 different images into pieces of DNA about 3,000 nucleotides long, equivalent to about 100 bytes. They also demonstrated that the capsules could fit files of up to 1 gigabyte. The system had a search rate of 1 kilobyte per second and could accurately pull out the individual images stored. As the labelling primers could be multiplexed, the approach could be scaled up to 10²⁰ files.
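A rough sketch of why multiplexing labels scales so quickly: if every file carries several barcodes drawn from a fixed library, the number of distinct label combinations grows as a power of the library size. The library size of 100,000 used below is an assumption that is not stated in this article, and the function name is invented; the count treats label choices as ordered, so it is an upper bound rather than an exact figure.

```python
# Assumption (not stated in the article): barcodes are drawn from a library
# of 100,000 distinct sequences and each file carries a fixed number of them.
LIBRARY_SIZE = 100_000


def addressable_files(labels_per_file: int, library_size: int = LIBRARY_SIZE) -> int:
    """Upper bound on distinct label combinations (ordered choices)."""
    return library_size ** labels_per_file


for labels in (1, 2, 4):
    print(f"{labels} label(s) per file: up to {addressable_files(labels):.0e} files")
```

Under these assumptions, four labels per file give 100,000 to the power of 4, which is 10 to the power of 20 combinations.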

Conclusions and future applications

The team have demonstrated a new strategy for random access of archival files within large-scale molecular datasets, offering a potential way to ‘google’ data stored within DNA.

George Church, a professor of genetics at Harvard Medical School and a pioneer in the field, described the technique as “a giant leap for knowledge management and search tech.”

With advances in informatics, artificial intelligence, the Internet of Things and medical testing (such as the recent collection of DNA and RNA samples from COVID-19 testing), along with ongoing efforts in human genome sequencing and genomics, there is already a need for low-cost, massive storage solutions. Whilst the team have addressed one aspect here, the other, cost, is still prohibitive at this time. However, this is likely to fall in the next few decades as technical advances are made.

With this in mind, Bathe envisages this kind of DNA encapsulation storage would be ideally placed for storing ‘cold’ data, information that requires archiving and is infrequently used.  Bathe is founder of a spin out startup company, CacheDNA, whose mission is to provide a low-cost platform to store nucleic acids that are mission-critical to a number of areas including viral detection, ecological conservation, forensic analysis, and massive DNA-based file systems for archival data storage.

Industry giants have also entered the DNA storage arena. Microsoft and colleagues at the University of Washington have developed the first full end-to-end automated DNA storage device, consisting of three core components for the write-store-read process: an encode/decode software module, a DNA synthesis module, and a DNA preparation and sequencing module. Although the initial data retrieval was slow, the system successfully stored and extracted the simple message “Hello” in 21 hours. Advances such as those described here should further improve retrieval efficiency and speed. Others are using DNA punch cards, a macromolecular storage mechanism in which data is written in the form of nicks at predetermined positions on the backbone of native double-stranded DNA.

To give a sense of the scale at which DNA storage could address the enormous and expanding quantity of digital data that needs to be maintained: in theory, a coffee mug full of DNA could store all of the world’s data!

 

For more information please see the press release from MIT

 

Banal, J.L., Shepherd, T.R., Berleant, J., Huang, H., Reyes, M., Ackerman, C.M., Blainey, P.C., and Bathe, M. (2021). Random access DNA memory using Boolean search in an archival file storage system. Nature Materials.

https://doi.org/10.1038/s41563-021-01021-3