Generative AI: a new approach to overcome data scarcity

Monday 21st of March 2022

Generative artificial intelligence (AI) is associated with the proliferation of hyper-realistic fake content created for malicious purposes, but the technology also promises significant progress, particularly in the medical field. It can help solve problems related to sampling bias and data scarcity, a critical bottleneck in machine learning.

Generative AI algorithms can generate original content that resembles existing content (images, audio files, text). There are several types of model. The most popular are generative adversarial networks (GANs), a class of unsupervised learning algorithms that pits two neural networks against each other. The “generator” produces artificial data that resembles the input data as realistically as possible. The “discriminator” then tries to distinguish between authentic and artificial data. After each round, depending on the result, the generator adjusts its parameters to create more convincing data, until the discriminator, which also improves with each iteration, can no longer tell real from fake.
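
The adversarial game described above can be made concrete with the loss functions commonly used to train a GAN. The sketch below is illustrative only: it assumes the standard binary cross-entropy formulation, and the function names and score values are hypothetical rather than taken from any project mentioned in this article.

```python
import math


def discriminator_loss(scores_real, scores_fake):
    """Binary cross-entropy for the discriminator.

    scores_real: discriminator outputs in (0, 1) for authentic samples.
    scores_fake: discriminator outputs in (0, 1) for generated samples.
    A low value means the discriminator separates the two sets well.
    """
    real_term = sum(-math.log(s) for s in scores_real) / len(scores_real)
    fake_term = sum(-math.log(1.0 - s) for s in scores_fake) / len(scores_fake)
    return real_term + fake_term


def generator_loss(scores_fake):
    """Non-saturating generator loss: low when fakes are scored as authentic."""
    return sum(-math.log(s) for s in scores_fake) / len(scores_fake)


# Hypothetical scores early in training: the discriminator is confident.
early_d = discriminator_loss([0.9, 0.95], [0.1, 0.05])
early_g = generator_loss([0.1, 0.05])

# Hypothetical scores at the end point: every sample is scored 0.5,
# meaning the discriminator can no longer tell real from fake.
final_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
final_g = generator_loss([0.5, 0.5])

print(f"early: D={early_d:.3f} G={early_g:.3f}")
print(f"final: D={final_d:.3f} G={final_g:.3f}")
```

When every score is 0.5, the discriminator loss equals 2 ln 2 and the generator loss equals ln 2. In a real system, both networks would be updated by gradient descent on these losses in alternating steps.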

Rather than a carbon copy of a painting, a GAN can thus create a credible new work in the style of the original. The “Meet the Ganimals” project, launched by the Massachusetts Institute of Technology (MIT), creates photorealistic images of hybrid animals, illustrating this ability to create new data from scratch, known as “synthetic data”. The performance of machine learning algorithms is generally correlated with the amount of data that constitutes their raw material. In situations where such data is scarce, synthetic data can be used to increase the amount of data in a training set (known as data augmentation) or to alter it.
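
As a rough illustration of data augmentation with synthetic samples, the sketch below tops up a small training set until it reaches a target size. The `augment` helper and the sample values are hypothetical, and `toy_generator` is only a stand-in: it jitters real samples with Gaussian noise, whereas a real pipeline would call a trained generative model at that point.

```python
import random


def augment(real_samples, generate, target_size):
    """Return the real samples plus enough synthetic ones to reach target_size."""
    missing = max(0, target_size - len(real_samples))
    synthetic = [generate() for _ in range(missing)]
    return list(real_samples) + synthetic


rng = random.Random(0)          # fixed seed so the sketch is reproducible
real = [4.8, 5.1, 5.3, 4.9]     # hypothetical scarce measurements


def toy_generator():
    # Placeholder for a trained generator: a real sample plus small noise.
    return rng.choice(real) + rng.gauss(0.0, 0.1)


training_set = augment(real, toy_generator, target_size=10)
print(len(real), "real samples,", len(training_set) - len(real), "synthetic")
```

The real samples are kept unchanged at the start of the list, so the synthetic share of the training set stays easy to track.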

Synthetic brain MRI

Medicine is one of those areas where data is not widely available, due to its rarity – medical images with abnormal findings are by definition infrequent – and the legal restrictions on the use and sharing of patient records.

In 2018, in the United States, researchers from Nvidia, the Mayo Clinic and the MGH & BWH Center for Clinical Data Science developed a model capable of producing synthetic brain MRIs showing tumours, which can be used to train a deep learning model. The research team believes that these synthetic images are both a complementary tool for data augmentation and an effective method of anonymization. They provide a low-cost source of diverse data, which has improved the performance of tumour segmentation (the process of distinguishing tumour tissue from normal brain tissue on an MRI scan) while allowing data sharing between different institutions.

Accelerated drug development

Pharmacology could also benefit from this approach. Designing a new drug is difficult, expensive and time-consuming: bringing one to market typically takes more than twelve years and costs an average of one billion euros. One of the reasons the cost is so high is that thousands of molecules must be synthesised before a pre-clinical study can start, in order to identify a single candidate. This process requires the use of multi-objective optimisation methods to explore a vast “chemical space” (a virtually infinite expanse containing all possible molecules and chemical compounds), as the AI system must evaluate and make decisions related to several key criteria such as the drug’s activity, its toxicity or the ease with which it can be synthesised. The optimisation methods in question require a large amount of training data, which can in part be provided by generative models.
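
One common way to handle several competing criteria at once is to keep only the candidates that no other candidate beats on every criterion (the Pareto front). The sketch below applies this idea to the three criteria named above. It is a minimal illustration, not the method of any particular platform: the molecule names and scores are invented, and the criteria are assumed to be normalised to the range 0 to 1.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    activity: float          # higher is better
    toxicity: float          # lower is better
    synth_difficulty: float  # lower is better


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = (a.activity >= b.activity
                and a.toxicity <= b.toxicity
                and a.synth_difficulty <= b.synth_difficulty)
    strictly_better = (a.activity > b.activity
                       or a.toxicity < b.toxicity
                       or a.synth_difficulty < b.synth_difficulty)
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical scores for four generated molecules.
candidates = [
    Candidate("mol_A", 0.90, 0.40, 0.70),
    Candidate("mol_B", 0.70, 0.10, 0.30),
    Candidate("mol_C", 0.60, 0.20, 0.50),  # worse than mol_B on all criteria
    Candidate("mol_D", 0.85, 0.45, 0.80),  # worse than mol_A on all criteria
]

front = pareto_front(candidates)
print([c.name for c in front])
```

Here mol_A and mol_B both survive because each represents a different trade-off: mol_A is more active, while mol_B is less toxic and easier to synthesise.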

Insilico Medicine has created the Chemistry42 platform, which combines generative algorithms and reinforcement learning to automatically find new molecular structures with the desired properties within a few days. This is called “de novo” molecular design. This platform has been used by the biotech company, in combination with other tools, in several therapeutic areas, such as pulmonary diseases. In 2021, Insilico announced that it had identified a new therapeutic target (the part of the body, such as a protein, on which the drug will act) and a new molecule for a drug against idiopathic pulmonary fibrosis (IPF). The discovery, presented as a world first, took less than 18 months with a budget that was 10% of the cost of a conventional study.

According to the consulting firm Gartner, more than 30% of new drugs and materials will be discovered using generative AI techniques by 2025.

Synthetic faces

Sampling bias is one of the main criticisms levelled at facial recognition technologies. Some of these tools identify darker-skinned people less often than lighter-skinned people, or women less often than men. These documented biases, often related to the under-representation of certain groups in training databases, can lead to discrimination against part of the population.

To avoid sampling bias, AI engineers need data sets that are representative of the population’s diversity. However, these sets are rare and the use of those that do exist is restricted due to the sensitive nature of biometric data.

Synthetic data can help to limit sampling bias. The use of real faces is still necessary at the beginning of the chain to train the generative model. It is then up to the designers to balance the dataset by granularly controlling the generation of synthetic data according to different attributes (gender, age, skin colour, etc.).
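
The balancing step can be sketched as a simple calculation: count how many real samples exist for each value of an attribute, then work out how many synthetic samples to request for each group. The `synthetic_quota` helper, the group labels and the counts below are hypothetical, and the sketch assumes the generative model can be conditioned on the chosen attribute.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Number of synthetic samples needed per group to reach the target size.

    By default, every group is brought up to the size of the largest one.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    goal = target if target is not None else max(counts.values())
    return {group: max(0, goal - n) for group, n in counts.items()}


# Hypothetical, deliberately unbalanced attribute labels for 1,000 real faces.
skin_tone_labels = ["lighter"] * 700 + ["medium"] * 200 + ["darker"] * 100

quota = synthetic_quota(skin_tone_labels)
print(quota)
```

With these counts, the largest group needs no synthetic faces, while the two under-represented groups need 500 and 600 respectively. The same calculation can be repeated for other attributes, or for combinations of attributes, by changing the labels.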

Another benefit of synthetic data is that it overcomes constraints linked to the confidentiality of sensitive data and reduces the risk of inference about real individuals. Generative models produce data that is realistic but artificial, unrelated to real people. Several studies have sought to show that synthetic data can be just as useful as authentic data while protecting individuals’ privacy (one such study used a public e-bike ride-share data feed).

Companies such as Datagen or Synthesis AI have specialized in the provision of synthetic faces. In Switzerland, the SAFER project, carried out by the Idiap Research Institute and involving the University of Zurich and SICPA, aims to create representative databases using synthetic faces that will be used to feed “ethical facial recognition” tools.