Our brains can receive and interpret different data types at the same time, whether textual, visual, or auditory. From all these elements, an individual forms a nuanced idea of reality. In computing, multimodal AI offers a similar capacity to imitate the brain by processing different types of data from different sources, such as handwritten text, images, music, and video. For example, hospital admission data might include photos of prescriptions and dictated memos explaining the reasons for admission.
However, moving from unimodal AI to multimodal AI involves many challenges, starting with the fact that unimodal models are all trained differently. Large language models (LLMs) like ChatGPT are not built the same way as models that produce images: LLMs are trained on words, image models are trained on pixels, and audio models are trained not on pixels but on sound waves.
For multimodal models to be effective, you therefore need to build a shared representation space for the different data types, along with large volumes of aligned data to populate it. The goal in multimodal AI is to merge diverse data in a process called fusion: for example, the word “pancreas”, an MRI of a pancreas, and a recording of the spoken word “pancreas” must be merged into a single representation that the model can absorb.
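To make the fusion idea concrete, here is a minimal sketch in PyTorch. The encoder architectures, dimensions, and the simple concatenation step are illustrative assumptions, not a description of any particular production system: each modality gets its own encoder, every encoder projects into the same shared space, and the resulting vectors are merged into one joint representation.

```python
import torch
import torch.nn as nn

SHARED_DIM = 512  # size of the shared embedding space (arbitrary choice)

class TextEncoder(nn.Module):
    """Maps token IDs (words) to a vector in the shared space."""
    def __init__(self, vocab_size=30_000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.proj = nn.Linear(256, SHARED_DIM)

    def forward(self, token_ids):                 # (batch, seq_len)
        return self.proj(self.embed(token_ids).mean(dim=1))

class ImageEncoder(nn.Module):
    """Maps pixel tensors (e.g., an MRI slice) to the shared space."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, kernel_size=3, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(32, SHARED_DIM)

    def forward(self, pixels):                    # (batch, 3, H, W)
        return self.proj(self.pool(self.conv(pixels)).flatten(1))

class AudioEncoder(nn.Module):
    """Maps a raw waveform (sound) to the shared space."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 32, kernel_size=9, stride=4)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.proj = nn.Linear(32, SHARED_DIM)

    def forward(self, waveform):                  # (batch, 1, samples)
        return self.proj(self.pool(self.conv(waveform)).flatten(1))

class FusionModel(nn.Module):
    """Fuses the three per-modality vectors into a single representation."""
    def __init__(self):
        super().__init__()
        self.text, self.image, self.audio = TextEncoder(), ImageEncoder(), AudioEncoder()
        self.fuse = nn.Linear(3 * SHARED_DIM, SHARED_DIM)  # simple concatenation fusion

    def forward(self, token_ids, pixels, waveform):
        parts = [self.text(token_ids), self.image(pixels), self.audio(waveform)]
        return self.fuse(torch.cat(parts, dim=-1))

# The word "pancreas" (as tokens), an MRI slice, and a spoken recording
# all end up as one 512-dimensional joint vector (dummy inputs below).
model = FusionModel()
joint = model(
    torch.randint(0, 30_000, (1, 4)),   # fake token IDs
    torch.randn(1, 3, 224, 224),        # fake MRI pixels
    torch.randn(1, 1, 16_000),          # fake 1-second waveform
)
print(joint.shape)  # torch.Size([1, 512])
```

Real systems use far larger encoders and more sophisticated fusion strategies (cross-attention rather than plain concatenation, for instance), but the shape of the problem is the same: get every modality into a common representation the rest of the model can consume.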
Multimodal AIs can simultaneously process diverse data for a range of applications. For example, they can produce summaries of medical data from photographed documents.
Multimodal AIs also have a more holistic understanding of their environment. Unimodal LLMs like ChatGPT, trained only on text, have a worldview limited to one modality: they may have learned about bodily organs from written descriptions, but with no training on images, they can’t really visualize anatomy.
So in theory, multimodal AIs have two advantages: they can capture more complete knowledge of their environment, and they provide a platform for applications that combine modalities.