In 2020, the HUGO GCS (groupement de coopération sanitaire, or health coordination group), which brings together several university hospitals in the Grand-Ouest region of France, announced the launch of Europe’s first interregional hospital data platform. Four health data research projects were created around this platform. One of them, called HUGO-RD/TAXY, aims to limit diagnostic error in cases of rare genetic disease. To achieve this objective, a multidisciplinary team was set up, made up of engineers and physicians from academia (the University Hospital of Rennes and the University of Rennes 1) and industry (b<>com and Orange).
Decrypting Physicians’ Notes
When a disease has no discernible cause, physicians still face a number of pitfalls when carrying out genetic testing, given the current state of science and despite the existence of whole-genome sequencing. Identifying the causal gene variants is complex, especially since many variants are nonpathogenic. As a result, half of the rare genetic diseases identified to date still have no known genetic cause, which can lead patients and their families to experience the distress of diagnostic error.
One solution seeks to use information about patients’ phenotypes, i.e. the observable characteristics resulting from the expression of their genetic make-up, to narrow the analysis down to the variants most likely to be pathogenic. This is easier said than done. According to Orange Data Scientist Thomas Labbé and Orange Research Engineer Jean-Michel Sanner, who each contribute their natural language processing expertise to the TAXY project, “There is no systematic or standardized process by which all of a patient’s phenotypes could be recorded. The information we need is found in clinical notes, written in free form during genetic consultations. The idea is to identify groups of words in these reports that refer to standard phenotypic terms. But using string matching is not enough, because a lot of information in physicians’ writing is implied. The challenge therefore lies in making the implicit explicit, which involves interpreting abstract concepts that are inherent in language, or in other words, precisely understanding the semantics (meaning) of the text.”
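To make the idea concrete, here is a minimal sketch, entirely separate from the TAXY codebase, of how a pretrained embedding model can link a free-text phrase to a standard phenotype term it never states word for word. The model name and the example terms are illustrative assumptions, not details taken from the project.

```python
# Minimal sketch (not the TAXY pipeline): map a free-text clinical phrase
# to a standard phenotype term by semantic similarity rather than exact
# string matching. Model name and example terms are assumptions.
from sentence_transformers import SentenceTransformer, util

# A few standard phenotype labels, e.g. taken from a reference ontology.
phenotype_terms = ["Absent speech", "Seizure", "Short stature", "Muscular hypotonia"]

# A phrase from a clinical note that never uses the standard wording.
note_phrase = "the child has not spoken a single word at age four"

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic pretrained encoder
term_vectors = model.encode(phenotype_terms, convert_to_tensor=True)
phrase_vector = model.encode(note_phrase, convert_to_tensor=True)

# Cosine similarity between the phrase and every candidate term.
scores = util.cos_sim(phrase_vector, term_vectors)[0]
best = scores.argmax().item()
print(phenotype_terms[best], float(scores[best]))
# Exact string matching finds nothing here; the embedding model is
# expected to rank "Absent speech" as the closest phenotype term.
```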
Pretrained and Custom Models
To do this, the project teams harnessed highly sophisticated pretrained language models, which allow for semantic similarity calculations. These models follow the principle of ‘transfer learning,’ meaning they can be customized to fit many use cases (translation, text classification, etc.), ranging from the simplest to the most complex, as in the case of the TAXY project.
This customization takes place in two stages. The first, referred to as unsupervised, consists of feeding the model around 20,000 unannotated reports to adapt it to the target domain. Then comes a supervised learning stage. “For this second step, annotated reports will make up our training dataset. With this in mind, we have developed a dedicated annotation tool, called ACUITEE, which we published as open-source software so that it can be widely adopted by the geneticist community,” explains Majd Saleh, R&D Engineer at b<>com.
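The article does not detail how this first, unsupervised step is implemented. As a rough illustration only, the sketch below continues masked-language-model training on unannotated reports using the Hugging Face transformers library; the base model, file paths, data and hyperparameters are all assumptions rather than project specifics.

```python
# Illustrative sketch of the unsupervised step: continue masked-language-
# model training on unannotated clinical reports so the pretrained model
# adapts to the target domain. Everything named here is an assumption.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

reports = ["free-text genetic consultation report 1",
           "free-text genetic consultation report 2"]  # ~20,000 in practice

base = "bert-base-multilingual-cased"  # assumed generic base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": reports}).map(
    tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; learning to predict them adapts the
# model's language statistics to clinical-genetics text.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="checkpoints",
                         per_device_train_batch_size=8,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args, train_dataset=dataset,
                  data_collator=collator)
trainer.train()

trainer.save_model("domain-adapted-model")      # reused in the supervised step
tokenizer.save_pretrained("domain-adapted-model")
```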
As Paul Rollier, Physician at the University Hospital of Rennes, explains, “With this cutting-edge annotation software, it will be possible to build a database of annotated reports that is consistent enough to fine-tune the language model for its intended purpose.”
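The project’s own fine-tuning recipe is not described either. One common way to use annotated reports, sketched below under the same assumptions, is to fine-tune the domain-adapted model as a token classifier that tags phenotype mentions word by word; the label set and the toy example are purely illustrative.

```python
# Hedged sketch of the supervised step: fine-tune the domain-adapted
# model to tag phenotype mentions in annotated reports (token
# classification with BIO labels). Labels and data are toy assumptions.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-PHENOTYPE", "I-PHENOTYPE"]
tokenizer = AutoTokenizer.from_pretrained("domain-adapted-model")
model = AutoModelForTokenClassification.from_pretrained(
    "domain-adapted-model", num_labels=len(labels))

# One annotated sentence (toy example): words plus their gold BIO tags.
data = {"words": [["the", "child", "shows", "absent", "speech", "."]],
        "tags":  [[0, 0, 0, 1, 2, 0]]}

def encode(sample):
    enc = tokenizer(sample["words"], is_split_into_words=True, truncation=True)
    # Align word-level tags with sub-word tokens; special tokens get -100
    # so they are ignored by the loss.
    enc["labels"] = [-100 if i is None else sample["tags"][i]
                     for i in enc.word_ids()]
    return enc

train_set = Dataset.from_dict(data).map(encode,
                                        remove_columns=["words", "tags"])
collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
args = TrainingArguments(output_dir="phenotype-tagger", num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train_set,
        data_collator=collator).train()
```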
Highly Attentive Networks
Once trained, the AI can automatically extract the phenotypes described in clinical notes. This job is assigned to the ENLIGHTOR module, a solution based on transformers, sophisticated neural networks that yield high-performing statistical language models. The reliability of these models, which can capture the most subtle language features, is partly due to the ‘attention’ mechanism, which calculates a contextualized numerical representation for each word in a sentence. By taking into account the multiple potential contexts in which a given word can appear, the information is encoded in a very precise and differentiated manner, which is a prerequisite for extracting even implicit knowledge. The extracted content is then evaluated both automatically, via an ad hoc dataset, and manually by the physicians involved in the project. These evaluations serve to improve the AI iteratively.
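As a small illustration of what ‘contextualized’ means here, again with an assumed generic model rather than ENLIGHTOR itself, the snippet below shows that the same word receives a different numerical representation depending on the sentence around it.

```python
# Toy illustration of contextualized word representations: the same word
# gets a different vector in different sentences, because attention mixes
# in the surrounding context. Model choice is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual embedding of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (tokens, dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("the patient shows a marked speech delay", "delay")
v2 = word_vector("the train arrived after a two hour delay", "delay")
similarity = torch.cosine_similarity(v1, v2, dim=0).item()
print(f"same word, different contexts, cosine similarity = {similarity:.2f}")
# The two vectors differ: in the first sentence "delay" is encoded as
# part of a clinical sign, in the second as an everyday event.
```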
The TAXY project is now in the home stretch. Work will continue through the POLLEN project, which will keep exploring how language processing can contribute to genetic medicine.