Our brains can receive and interpret different data types at the same time, whether textual, visual, or auditory. From all these elements, an individual forms a nuanced idea of reality. In computing, multimodal AI offers a similar capacity to imitate the brain by processing different types of data from different sources, such as handwritten text, images, music, and video. For example, hospital admission data might include photos of prescriptions and dictated memos explaining the reasons for admission.
However, moving from unimodal AI to multimodal AI involves many challenges, starting with the fact that unimodal models are all trained differently. Large language models (LLMs) like ChatGPT are not built the same way as models that produce images: LLMs are trained on words, image models are trained on pixels, and audio models are trained not on pixels but on sound waves.
For multimodal models to be effective, you therefore need to build a shared representation space for the different data types, along with large volumes of aligned data to populate it. The goal in multimodal AI is to merge diverse data in a process called fusion: for example, the word “pancreas”, an MRI of a pancreas, and a recording of the spoken word “pancreas” must be merged into a single representation that the model can absorb.
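To make the fusion idea concrete, here is a minimal sketch in PyTorch. The encoder architectures, dimensions, and the simple concatenation step are illustrative assumptions, not a description of any particular production system: each modality gets its own encoder, every encoder projects into the same shared space, and the resulting vectors are merged into one joint representation.

```python
import torch
import torch.nn as nn

SHARED_DIM = 512  # size of the shared embedding space (arbitrary choice)

class TextEncoder(nn.Module):
    """Maps token IDs (words) to a vector in the shared space."""
    def __init__(self, vocab_size=30_000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.proj = nn.Linear(256, SHARED_DIM)

    def forward(self, token_ids):                 # (batch, seq_len)
        return self.proj(self.embed(token_ids).mean(dim=1))

class ImageEncoder(nn.Module):
    """Maps pixel tensors (e.g., an MRI slice) to the shared space."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, kernel_size=3, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(32, SHARED_DIM)

    def forward(self, pixels):                    # (batch, 3, H, W)
        return self.proj(self.pool(self.conv(pixels)).flatten(1))

class AudioEncoder(nn.Module):
    """Maps a raw waveform (sound) to the shared space."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 32, kernel_size=9, stride=4)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.proj = nn.Linear(32, SHARED_DIM)

    def forward(self, waveform):                  # (batch, 1, samples)
        return self.proj(self.pool(self.conv(waveform)).flatten(1))

class FusionModel(nn.Module):
    """Fuses the three per-modality vectors into a single representation."""
    def __init__(self):
        super().__init__()
        self.text, self.image, self.audio = TextEncoder(), ImageEncoder(), AudioEncoder()
        self.fuse = nn.Linear(3 * SHARED_DIM, SHARED_DIM)  # simple concatenation fusion

    def forward(self, token_ids, pixels, waveform):
        parts = [self.text(token_ids), self.image(pixels), self.audio(waveform)]
        return self.fuse(torch.cat(parts, dim=-1))

# The word "pancreas" (as tokens), an MRI slice, and a spoken recording
# all end up as one 512-dimensional joint vector (dummy inputs below).
model = FusionModel()
joint = model(
    torch.randint(0, 30_000, (1, 4)),   # fake token IDs
    torch.randn(1, 3, 224, 224),        # fake MRI pixels
    torch.randn(1, 1, 16_000),          # fake 1-second waveform
)
print(joint.shape)  # torch.Size([1, 512])
```

Real systems use far larger encoders and more sophisticated fusion strategies (cross-attention rather than plain concatenation, for instance), but the shape of the problem is the same: get every modality into a common representation the rest of the model can consume.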
Multimodal AIs can simultaneously process diverse data for a range of applications. For example, they can produce summaries of medical data from photographed documents.
Multimodal AIs also have a more holistic understanding of their environment. Unimodal LLMs like ChatGPT, trained only on text, have a worldview limited to one modality: they may have learned about bodily organs from written descriptions, but with no training on images, they can’t really visualize anatomy.
So in theory, multimodal AIs have two advantages: they can capture more complete knowledge of their environment, and they provide a platform for applications that combine modalities.