“The global data sphere, meaning the whole of digital data created by humanity, could fit into a van.”
Faced with the exponential growth of the volume of data, scientists are exploring storage solutions that last longer and perform better than those offered by current technologies. Molecular media, and DNA in particular, are a serious avenue. Compact and durable, DNA could initially be used for archiving.
Deoxyribonucleic acid (DNA), the molecule that carries the genetic heritage of living organisms, contains all the information an organism needs to function. Since the work of American physicist Richard Feynman in the late 1950s, scientists have also considered using it as a medium for storing human knowledge.
In theory, the idea is quite simple: digital data is encoded as a sequence of the letters A (adenine), C (cytosine), G (guanine) and T (thymine), the four nucleotides that make up DNA.
However, in practice, although the proof of concept of storage on DNA has been established, several challenges need to be met before this approach becomes economically viable.
Phenomenal density and sustainability
The volume of digital data generated on a global scale is growing exponentially. What’s more, current electronic storage technologies are not very long-lasting, as they rely on physical media (magnetic tape, hard drives, compact discs, USB flash drives, etc.) that have a limited lifespan. The data saved must therefore be copied regularly – every five to ten years depending on the medium – in order to guarantee their integrity.
In comparison, DNA is extremely compact and durable. According to the National Academy of Technologies of France, the information density of DNA is ten million times greater than that of the best traditional systems. Taking into account various density-loss factors, the academy estimates that the “global data sphere”, meaning the whole of digital data created by humanity, could fit into … a van.
As for the lifespan of DNA, it is roughly 10,000 times greater than that of traditional media. Its conservation does not require cooling, as DNA is stable at ordinary temperatures, so this storage method consumes little energy.
Translating computer language into the language of living organisms
The process can be broken down into five steps: coding the data, writing it to synthetic DNA, storing the DNA, reading it, and decoding the information (for a detailed description of these steps and the technologies involved, see the October 2020 report published by the National Academy of Technologies of France).
The computer file – a text, image, sound or video file – is first converted into strings of 0s and 1s, then into strings of A, C, G and T nucleotides (coding). These nucleotides are then synthesized as DNA fragments (writing) either chemically or using enzymes.
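The coding step can be illustrated with a minimal Python sketch. It assumes the simplest possible mapping, two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T). That mapping and the function names are illustrative assumptions, not the scheme described in the academy's report; practical encodings typically add error correction and avoid sequences that are hard to synthesize.

```python
# Hypothetical mapping for illustration only: two bits per nucleotide.
# Real encodings usually add redundancy and sequence constraints.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bytes_to_bits(data: bytes) -> str:
    """Convert a file's bytes into a string of 0s and 1s."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_dna(bits: str) -> str:
    """Convert a bit string into a string of A, C, G and T."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def encode(data: bytes) -> str:
    """Coding step: bytes -> bits -> nucleotides."""
    return bits_to_dna(bytes_to_bits(data))


if __name__ == "__main__":
    print(encode(b"Hi"))  # prints CAGACGGC
```

With this mapping each byte becomes four nucleotides, so the two-byte text "Hi" becomes an eight-letter strand.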
The synthesized DNA can then be kept at room temperature thanks to physical or chemical storage systems that protect it from water, oxygen and light.
With chemical storage, the DNA is embedded in silica nanobeads with a conservation time of several decades. With physical storage, the DNA is stored in stainless steel capsules with an estimated lifespan of over 50,000 years. Some scientists also plan to store files on “in vivo” DNA within the genome of bacteria. In 2017, a team of researchers from the Harvard Medical School Department of Genetics encoded a short motion picture into E. coli bacteria using the CRISPR-Cas system.
Finally, in order to access the data contained in the DNA, it is first necessary to sequence it, that is, to determine the order of the nucleotides within the different fragments (reading), then to convert these sequences back into the bits of the original file (decoding).
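Decoding is the inverse of coding. The Python sketch below assumes a strand that has already been sequenced without errors and the same hypothetical two-bits-per-nucleotide mapping (A for 00, C for 01, G for 10, T for 11). Real pipelines must also reorder fragments and correct sequencing errors, which this sketch does not attempt.

```python
# Hypothetical inverse mapping for illustration: one nucleotide -> two bits.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def dna_to_bits(strand: str) -> str:
    """Convert a sequenced strand (A, C, G, T) into a bit string."""
    return "".join(BASE_TO_BITS[base] for base in strand.upper())


def bits_to_bytes(bits: str) -> bytes:
    """Group bits into bytes to rebuild the original file content."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def decode(strand: str) -> bytes:
    """Decoding step: nucleotides -> bits -> bytes."""
    return bits_to_bytes(dna_to_bits(strand))


if __name__ == "__main__":
    print(decode("CAGACGGC"))  # prints b'Hi'
```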
High costs and slow reading and writing speeds
Although it has huge potential, storage on DNA remains experimental. Several obstacles still need to be overcome before this technology can really rival conventional electronic systems and be marketable on a large scale. The two main limiting factors are its high costs and the slow speed of the data writing and reading process.
The National Academy of Technologies of France states that currently the cost and time required for storing 1 GB (10^9 B) of data on DNA are comparable to those necessary for storing 1 PB (10^15 B) of data on a computer. For example, current systems synthesize DNA one nucleotide at a time at a rate of 30 seconds per nucleotide.
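A back-of-the-envelope calculation shows why the synthesis rate matters. The Python sketch below takes the 30 seconds per nucleotide from the text and assumes, purely for illustration, two bits per nucleotide and a configurable number of strands written in parallel (the `parallel_strands` parameter is hypothetical). Under these assumptions, writing 1 GB strictly one nucleotide at a time would take on the order of 3,800 years, which is why practical systems depend on synthesizing very many strands at once.

```python
SECONDS_PER_NUCLEOTIDE = 30      # figure quoted in the text
BITS_PER_NUCLEOTIDE = 2          # assumption: simplest possible encoding
SECONDS_PER_YEAR = 365 * 24 * 3600


def write_seconds(num_bytes: int, parallel_strands: int = 1) -> float:
    """Estimated synthesis time in seconds for num_bytes of data.

    parallel_strands is a hypothetical count of strands written at once.
    """
    nucleotides = num_bytes * 8 / BITS_PER_NUCLEOTIDE
    return nucleotides * SECONDS_PER_NUCLEOTIDE / parallel_strands


if __name__ == "__main__":
    one_gb = 10**9
    for strands in (1, 10**6):
        years = write_seconds(one_gb, strands) / SECONDS_PER_YEAR
        print(f"{strands} strand(s) in parallel: {years:.4f} years")
```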
The error rate when reading data is another obstacle. As this rate increases with the number of nucleotides per DNA fragment, it limits the length of the fragments.
However, considerable progress has been made over the past few years in all of these areas. Several experts believe, for example, that thanks to the technological breakthroughs of the coming years, the cost of storage on DNA could be reduced enough by the mid-2020s for it to become viable on a large scale.
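One simple way to see why longer fragments are harder to handle is to assume that each nucleotide is read wrongly with a small, independent probability. Under that assumption, the chance that a whole fragment is error-free is (1 - p)^n for a fragment of n nucleotides. The Python sketch below uses a per-nucleotide error rate of 1%, a made-up value chosen only for illustration and not a figure from the report.

```python
def error_free_probability(length: int, per_base_error: float) -> float:
    """Probability that a fragment of `length` nucleotides has no error,
    assuming independent errors with probability `per_base_error` each."""
    return (1.0 - per_base_error) ** length


if __name__ == "__main__":
    hypothetical_error_rate = 0.01  # illustrative value, not from the source
    for length in (50, 100, 200, 400):
        p = error_free_probability(length, hypothetical_error_rate)
        print(f"{length} nucleotides: {p:.3f} chance of an error-free fragment")
```

The probability falls geometrically with length, which is one reason fragments are kept short and redundancy is added when encoding.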
The process developed by American startup Catalog DNA (see insert) already makes it possible to reduce the costs of DNA synthesis thanks to a library of pre-synthesized DNA fragments.
Storing “cold” data on DNA discs
Because of the slowness of the reading and writing processes, experts agree that the use of DNA will initially be confined to the long-term archiving of what is known as “cold” data, meaning data that is rarely accessed. According to Marc Antonini, a research professor at the French National Centre for Scientific Research (CNRS), “this stock [of cold data] grows by 60% each year, while the storage capacity of current systems only improves by 20% […]”.
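The gap between the two growth rates compounds quickly. The Python sketch below assumes that both percentages in the quotation are annual rates and that data and capacity start out equal. Both are simplifying assumptions made for illustration, not claims from the source.

```python
DATA_GROWTH = 0.60      # yearly growth of cold data, from the quotation
CAPACITY_GROWTH = 0.20  # assumed to be a yearly rate as well


def gap_ratio(years: int) -> float:
    """Ratio of cold data volume to storage capacity after `years`,
    assuming both start at the same level."""
    return ((1 + DATA_GROWTH) / (1 + CAPACITY_GROWTH)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years} year(s): data is {gap_ratio(years):.2f}x capacity")
```

Each year the ratio is multiplied by 1.6 / 1.2, or about 1.33, so under these assumptions the shortfall grows by roughly a third annually.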
The development of alternative archiving methods therefore meets the increasing needs of various organizations: cultural institutions (film libraries, museums, libraries, etc.), research laboratories (specialized in particle physics for example) or even administrations and banks.
“Storage on DNA will compete with or complement magnetic tape, the current preferred solution for long-term archiving”, stresses the National Academy of Technologies of France. For example, Marc Antonini and his team are working on OligoArchive. The aim of this project, which is financed by the European Commission and brings together several French and British bodies, “is to build a DNA disk: a fully functional end-to-end prototype demonstrating that DNA could one day replace current archival storage technologies on magnetic tape”.
Two promising DNA storage projects
In 2019, Microsoft and the University of Washington (UW) revealed the first fully automated system for storing data in DNA. To date, the research team has managed to store a record 1 GB of data.
The same year, Catalog DNA, an American startup that came out of MIT, encoded all of the English-language Wikipedia pages, that is, 16 GB of data. To accomplish this feat, the scientists designed a process that is faster and cheaper than existing ones. Instead of synthesizing the DNA one nucleotide at a time, they use fragments of DNA called “components” that have already been synthesized. These components, taken from a “catalog”, are then combined by a machine to form longer DNA molecules, a bit like a printing press would.
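The general idea of assembling data from pre-synthesized pieces can be sketched in Python. This is a toy illustration, not Catalog DNA's actual scheme: the 16-entry catalog, the component length of six nucleotides and the choice of one component per half-byte are all hypothetical. The point is only that writing becomes a matter of looking up and joining ready-made components instead of building each strand base by base.

```python
from itertools import product

COMPONENT_LENGTH = 6  # hypothetical length of each pre-made component

# Hypothetical catalog: one distinct pre-synthesized component per 4-bit value.
CATALOG = {
    value: "".join(bases)
    for value, bases in zip(range(16), product("ACGT", repeat=COMPONENT_LENGTH))
}
REVERSE_CATALOG = {sequence: value for value, sequence in CATALOG.items()}


def assemble(data: bytes) -> str:
    """Join two catalog components per byte (high half, then low half)."""
    parts = []
    for byte in data:
        parts.append(CATALOG[byte >> 4])
        parts.append(CATALOG[byte & 0x0F])
    return "".join(parts)


def read_back(strand: str) -> bytes:
    """Split the strand into components and map them back to bytes."""
    values = [
        REVERSE_CATALOG[strand[i:i + COMPONENT_LENGTH]]
        for i in range(0, len(strand), COMPONENT_LENGTH)
    ]
    return bytes((high << 4) | low for high, low in zip(values[0::2], values[1::2]))


if __name__ == "__main__":
    strand = assemble(b"Wiki")
    print(len(strand), read_back(strand))
```

In this toy version, each byte costs two assembly operations, whereas base-by-base synthesis of the same twelve nucleotides would take twelve steps.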
Unlike the Microsoft and UW prototype, the process developed by Catalog DNA is intended to become economically viable quickly.