Unlike multimodal AI, which processes text, images, audio and video in distinct silos, omnimodal AI fluidly integrates all these data types in a single model, making interactions with users smoother and more natural. Whereas classic AIs must first describe and analyse a photo before editing it, an omnimodal AI can understand and directly modify the image in a single step, a capability demonstrated by GPT-4o.
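To make the contrast concrete, here is a minimal Python sketch. Every name in it (CaptioningModel, ImageEditor, OmnimodalModel) is a hypothetical stand-in rather than a real library; the point is simply how a two-stage pipeline collapses into one call.

```python
# Sketch only: CaptioningModel, ImageEditor, and OmnimodalModel are
# hypothetical stand-ins, not a real library.
from dataclasses import dataclass

@dataclass
class Image:
    pixels: bytes  # raw image data

class CaptioningModel:
    def describe(self, image: Image) -> str:
        # Stage 1 of the classic pipeline: image -> text description.
        return "a street scene with a lamppost on the left"

class ImageEditor:
    def edit(self, image: Image, description: str, instruction: str) -> Image:
        # Stage 2: a second model edits the image, guided by the description.
        return Image(pixels=image.pixels)

class OmnimodalModel:
    def generate(self, image: Image, instruction: str) -> Image:
        # One model reasons over pixels and text jointly: no intermediate
        # text description, so nothing is lost between stages.
        return Image(pixels=image.pixels)

photo = Image(pixels=b"...")

# Classic multimodal pipeline: two models, two steps; errors in the
# intermediate caption propagate into the edit.
caption = CaptioningModel().describe(photo)
edited = ImageEditor().edit(photo, caption, "remove the lamppost")

# Omnimodal: the same request is a single call.
edited = OmnimodalModel().generate(photo, "remove the lamppost")
```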
When interacting with customers, omnimodal AIs can concurrently analyse tone of voice, the words used, and body language to understand precisely what is required and respond appropriately in real time. In customer relations, this could deepen comprehension of customers' needs and anticipate their requests, giving call-center staff more time to focus on high added-value tasks like providing advice and building loyalty.
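As a rough illustration of what this concurrent analysis could look like at the software level (the client object and its analyze method below are assumptions, not an existing SDK), a single request might carry audio, transcript and video together:

```python
# Sketch only: `omni_client` and its `analyze` method are assumptions,
# not a real SDK; real omnimodal APIs will differ.
from dataclasses import dataclass

@dataclass
class CustomerSignal:
    audio_chunk: bytes   # tone of voice
    transcript: str      # the words actually used
    video_frame: bytes   # posture and facial expression

def respond(signal: CustomerSignal, omni_client) -> str:
    # One request carries all three modalities, so the model can weigh
    # an irritated tone against polite wording, rather than merging the
    # outputs of three separate classifiers after the fact.
    result = omni_client.analyze(
        audio=signal.audio_chunk,
        text=signal.transcript,
        image=signal.video_frame,
        task="infer intent and emotional state, then draft a reply",
    )
    return result.reply

# Dummy client so the sketch runs end to end.
class DummyOmniClient:
    def analyze(self, **kwargs):
        return type("Result", (), {"reply": "I can see the duplicate charge; refunding it now."})

signal = CustomerSignal(b"", "My card was charged twice.", b"")
print(respond(signal, DummyOmniClient()))
```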
Ethical and safety concerns
However, the technology faces technical and ethical challenges. It will need to synchronize the different modalities fully – text, audio, and images – to avoid biases spreading from one modality to another, and to comply with strict regulations, notably rules on emotion recognition that aim to prevent manipulative uses.

Along with call centers, this technology could transform smartphones: in the future, instead of juggling between applications, we could interact with responsive AI assistants that understand requests naturally, whether spoken, written, or shown in pictures. One question remains: will these AIs run locally on the device or remotely in the cloud? The answer will have a major impact on the protection of personal data and on the features these innovations can offer.
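One way to picture that trade-off is as a routing decision, sketched below with placeholder functions: requests carrying voice or facial data stay on the device, while everything else can use a larger cloud model. The function names and the returned strings are purely illustrative.

```python
# Illustrative routing only: the functions below are placeholders, and
# real deployments would involve actual on-device and cloud model runtimes.
from dataclasses import dataclass

@dataclass
class Request:
    payload: bytes
    contains_biometric_data: bool  # voice, face, expressions, etc.

def run_on_device(payload: bytes) -> str:
    return "response from a small local model"   # placeholder

def run_in_cloud(payload: bytes) -> str:
    return "response from a large cloud model"   # placeholder

def route(request: Request) -> str:
    if request.contains_biometric_data:
        # Emotion- and identity-bearing signals stay on the device:
        # stronger privacy, and simpler compliance with rules on
        # emotion recognition.
        return run_on_device(request.payload)
    # Non-sensitive requests can use a larger cloud model for quality.
    return run_in_cloud(request.payload)

print(route(Request(payload=b"...", contains_biometric_data=True)))
```

The design choice mirrors the article's closing question: routing to the device favours privacy at the cost of model size, while routing to the cloud does the opposite.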