• With the ability to understand customers’ gestures and body language as well as speech, omnimodal AI can reduce the workload on customer relations staff, giving them more time to focus on high added-value tasks.
• The developers of omnimodal AIs still face significant challenges, notably with regard to the integration of multiple modalities in a single model and the need to eradicate biases that can spread from one modality to another.
Will we soon see artificial intelligence tools with the capacity to better understand and functionally imitate humans? This is one promise of omnimodal AI, a major development focus in artificial intelligence that aims to provide smoother interaction between users and machines. Unlike multimodal AI, which processes different data types separately, omnimodal AI combines them coherently in a manner that mirrors human perception. For example, multimodal AI cannot understand and modify photos in a single step: first it needs to encode images as descriptions, which are then sent to a diffusion component for editing. Omnimodal AI, on the other hand, can accomplish these tasks in a unified process, using a single integrated model. As Orange VP for Research on Augmented Customers and Collaborators Thierry Nagellen explains, “Omnimodal AI has the capacity to process and generate multiple data types or ’modalities’ — text, images, audio and video — in a totally fluid and integrated way.”
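To make the architectural contrast concrete, here is a minimal Python sketch of the two approaches. All function names and stub models are hypothetical placeholders for illustration, not a real library API.

```python
# Illustrative stubs only: these functions stand in for real models;
# the names and signatures are hypothetical.

def describe_image(image: str) -> str:
    """Stub vision encoder: reduces an image to a text description."""
    return f"description of {image}"

def diffusion_edit(prompt: str) -> str:
    """Stub diffusion model: generates an image from a text prompt."""
    return f"image rendered from '{prompt}'"

def multimodal_edit(image: str, instruction: str) -> str:
    # Multimodal pipeline: modalities are bridged through text, so any
    # visual detail absent from the description is lost at the hand-off.
    caption = describe_image(image)
    return diffusion_edit(f"{caption}. Edit: {instruction}")

def omnimodal_edit(image: str, instruction: str) -> str:
    # Omnimodal model: a single model consumes image and text together
    # and produces the edited image in one integrated step (stubbed here).
    return f"edited {image} as per '{instruction}' in one pass"

print(multimodal_edit("photo.jpg", "make the sky sunset orange"))
print(omnimodal_edit("photo.jpg", "make the sky sunset orange"))
```

The hand-off is the point of the contrast: the pipeline forces everything through a lossy text bottleneck, while the unified model keeps all modalities in a single representation.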
Omnimodal AI will give customer representatives more time to focus on high added-value tasks
Natural interaction
Omnimodal AI will pave the way for new applications, notably in customer relations, which stands to benefit from more immersive and contextualized experiences for customers. “It is a fascinating area of development that will need to be carefully managed to ensure compliance with the AI Act, which has strict rules on emotion recognition that aim to prevent unethical manipulation. Much like human salespeople, AIs will need to learn sales techniques that are authorized.” The idea is that AIs will not only understand speech but will also be able to interpret customers’ gestures and body language, so as to respond quickly to their needs. “This will give customer representatives more time to focus on high added-value tasks like providing advice and building loyalty,” points out Thierry Nagellen. And unlike human employees, fully autonomous, native AI agents with the capacity to manage complex tasks can be made available around the clock to provide customer service.
Outstanding technical challenges
Current omnimodal AI systems are not yet perfect. “We still need to improve the quality of processing and recognition,” notes Thierry Nagellen. There are many challenges: integrating different modalities in a single model is a complex task requiring precise temporal and spatial synchronization of data (not unlike the synchronization of audio and video tracks). High training costs are another obstacle, given the extensive computational resources needed to process multiple data types. And when it comes to limiting algorithmic biases, the possibility that they can spread between communicating modalities is yet another headache for developers. Last but not least, efforts to explain the reasoning behind results generated by omnimodal AIs are often impeded by the highly complex interactions between different data types on which those results are based.
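As a rough illustration of the synchronization challenge mentioned above, the sketch below pairs video frames with the nearest audio frames by timestamp before any fusion step. The data, tolerance and alignment rule are simplified assumptions; real systems align learned feature streams, not raw tuples.

```python
from bisect import bisect_left

def align_streams(video_frames, audio_frames, tolerance=0.02):
    """Pair each video frame with the nearest-in-time audio frame.

    Both inputs are lists of (timestamp_seconds, feature) tuples sorted
    by time; pairs farther apart than `tolerance` seconds are dropped.
    """
    audio_times = [t for t, _ in audio_frames]
    pairs = []
    for t, v_feat in video_frames:
        i = bisect_left(audio_times, t)
        # Check the neighbours on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(audio_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(audio_times[k] - t))
        if abs(audio_times[j] - t) <= tolerance:
            pairs.append((v_feat, audio_frames[j][1]))
    return pairs

video = [(0.00, "v0"), (0.04, "v1"), (0.08, "v2")]  # 25 fps
audio = [(0.00, "a0"), (0.01, "a1"), (0.02, "a2"), (0.05, "a3"), (0.09, "a4")]
print(align_streams(video, audio))
# [('v0', 'a0'), ('v1', 'a3'), ('v2', 'a4')]
```

Even in this toy version, a frame with no close-enough counterpart is simply dropped; at the scale of an omnimodal model, such misalignments translate directly into degraded cross-modal understanding.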
Will we soon see omnimodal phones?
In May 2024, OpenAI released its first omnimodal model, GPT-4o, which demonstrated the technology's potential for versatility while also highlighting the significant issues with biases and hallucinations that still affect AIs. For Thierry Nagellen, questions remain about how the technology will be deployed: “The mass adoption of omnimodal AI will also depend on its capacity to provide a seamless user experience. Users of today’s smartphones are required to switch between applications to process different modalities, and some operators are already considering the possibility of smartphones with AI-based rather than application-based interfaces, but there is no guarantee that they will deliver an optimal user experience. There is also the question of whether omnimodal models will run in the cloud or on devices, which will have an impact on their compliance with data protection regulations.”
