Software Heritage: the software Library of Alexandria

The Software Heritage project aims to build a software Library of Alexandria: a perennial, universal source code archive to serve society, science, and industry.

In today’s information society, software is everywhere. It is at the heart of scientific research, technological developments, and ever more industrial processes. Software plays a pivotal role in the everyday life of our society. It gives us access to humanity’s knowledge and cultural heritage, of which it is also a part. However, software is fragile: it can be altered or made unusable.

Software Heritage is also a fantastic tool for examining software that is currently being developed, which should enable developers to build better programs.

“In order to preserve this heritage and to meet the technological and scientific challenges of tomorrow, it is essential to build a universal and perennial software archive today.” This is the ambitious objective of Software Heritage. Launched by the French National Institute for Research in Digital Science and Technology (Inria) in 2016, and carried out in partnership with UNESCO, this project aims to collect, preserve, and make accessible all software publicly available in source code format. Source code is text that details the instructions of a computer program in a programming language that is readable and usable by humans.

The concept of software heritage

A large portion of the information produced in the world is in digital format. Nowadays, in science, for example, scientific publication systems on the internet (electronic journals, open archive portals, blogs, etc.), side-by-side with paper format scientific publications and journals, play a crucial role in the dissemination of knowledge and the promotion of research.

In parallel, digital technologies are used more and more in the cultural sector for heritage preservation. Large-scale programs have been launched, aiming to bulk digitize the collections of museums, libraries, and archives, as well as historical monuments and sites.

Accessing, handling, and interpreting digital resources, no matter the physical medium they are stored on, requires software. In this sense, software is “the key mediator for accessing all Cultural Heritage”, the Rosetta Stone containing both the raw data and the means of converting the information.

Itself an expression of human creativity and intelligence, software is a piece of our heritage that should be protected and passed on to future generations. UNESCO has thus included it in the Charter on the Preservation of the Digital Heritage, which was adopted in 2003. In 2017, this UN agency signed an agreement with Inria relating to the preservation of and access to software source code, taking the Software Heritage project into a new dimension.

A journey back (and forward) in programming time

Anybody can use Software Heritage to find, study, improve, and reuse the code they need. “Our archive is an absolutely unique observatory of the development of the planet’s software”, states Roberto Di Cosmo, CEO of Software Heritage, in an interview given to “Usbek & Rica” magazine. Keeping track of thousands of programming languages, as well as any resources relating to program development (documentation, articles, comments left by programmers, etc.), is of huge interest to researchers specialized in the history of computing. The archive can be used as a basis for fundamental work, such as that carried out on the history of UNIX or the Apollo Guidance Computer source code.

Beyond being a “time machine” and an archive of past software, the facility makes it possible to examine software that is currently being developed. Developers can use it to build better programs.

A key building block of open science

Software Heritage wants to become the reference for software used in scientific research, thus reinforcing the free access and open data approach, which aims to make scientific data and publications freely available to all.

The reproducibility of an experiment’s results is a requirement of the scientific method. It enables scientific validation of a piece of research. To guarantee this reproducibility, it is necessary to have access to the articles presenting the results and the research data, but also to the software used during the experiment. Indeed, software is now used at every step of research and in all scientific fields. With Software Heritage, researchers can know exactly which version of a piece of software the research is based on.

Archiving, making accessible, preserving

A mammoth task, the archiving of all source code – Software Heritage does not make a selection – relies on several mechanisms. Programmers and different bodies who possess what are known as “software artifacts” (any element produced during the development process) can deposit and reference source code themselves, attaching several files (software description, people to be credited, project license, metadata file to help with source code indexing). Internet users can participate in the project by submitting a request to save any “contemporary” source code that is not yet integrated in the archive, or that is not up to date. This complements the automatic exploration (crawling) carried out on the major code hosting platforms to continually discover “software origins” (locations identified by URLs from where a coherent set of source code has been obtained). Internet users can also contribute to the retrieval and organization of historical source code by following the Software Heritage Acquisition Process (SWHAP), developed in partnership with the University of Pisa.

Work on organizing and indexing the collections, as well as providing research tools within the archive, should enable users to find their way through all this software. A unique identifier is attributed to each software artifact, and it is possible to perform a search from the software metadata collected and extracted by the project. Users can write programs for navigating within the archive, using the Software Heritage API.
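To give an idea of what such an identifier can look like, here is a minimal Python sketch. It rests on an assumption that the article itself does not spell out: that the identifier of a single source file takes the form "swh:1:cnt:" followed by the same SHA-1 hash that Git computes for a file's contents. The function names are purely illustrative and are not part of any Software Heritage tool.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 that Git assigns to a file's contents.

    Git hashes a short header ("blob <size>" followed by a NUL byte)
    together with the raw bytes of the file.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build an identifier for one file's contents.

    Assumption: the "swh:1:cnt:" prefix followed by the Git blob hash.
    """
    return "swh:1:cnt:" + git_blob_sha1(data)


if __name__ == "__main__":
    # Identical bytes always give the same identifier, wherever
    # the file was originally hosted.
    print(content_swhid(b"hello world\n"))
```

Because the identifier is derived from the content alone, anyone holding a copy of a file can recompute it and check that the copy matches what was archived.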

Finally, to guarantee the durability of the archives, several precautions have been taken. Today there are three copies of the archive: two are in Inria’s datacenters, and one is in the Microsoft cloud in another country. The project members are also working on a network of international “mirrors”, full copies of the archive managed by other bodies that are totally independent of Inria. In this way, Software Heritage’s precious pieces of software should not face the same tragic fate as the Library of Alexandria documents.

According to Inria, to date Software Heritage has gathered “over twenty million software projects, two and a half billion unique archived source files and their entire development history”, which would already make it “the richest source code archive on the planet”.
