Must our voice assistants imitate the power mechanisms of our society in order to fit more smoothly into our habits, or do they, on the contrary, offer an opportunity to move beyond them?
According to a Capgemini study, 40% of consumers will prefer using a voice assistant to a website or app within three years. Half of those surveyed already use voice assistants (81% of them mainly via a smartphone), and a third of the 5,000 respondents, based in the United States, the United Kingdom, France, and Germany, said they would even be ready to replace the customer interface with a synthetic voice outright in physical shops. That appetite is set to transform the shopping experience, from sales logistics to consumer-facing services.
However, several obstacles still stand in the way of mass adoption of this technology, starting with the voice assistant's credibility: how far can we make users forget they are talking to a robot, and how can its synthetic voice sound friendly or empathetic enough to earn our trust?
Futuristic films such as Spike Jonze's Her offer a glimpse of the turmoil that artificial intelligences could cause if they were given a voice and a personality charming enough for us to fall in love with them. But between that cliché and reality, there is still a valley to cross.
The “uncanny valley”
As roboticists have known since Masahiro Mori's work in the 1970s, a humanoid technology must be extremely believable to be accepted. Any imperfect attempt to imitate a human, by contrast, is simply frightening, stirring up our fear of illness and death: this is the notorious "uncanny valley", of which visual examples abound.
And when it comes to sound, who has never felt uneasy at the robotic voice of a bad customer-service line? In a TED Talk, speech-language pathologist and linguist Joana Révis explains that the synthetic voice lacks realism because it "has no intention. It is neutral, in all circumstances". By contrast, it is the huge variety of emotions colouring the tone of a human voice that gives it its aesthetic quality… and its power to persuade.
But is this "valley" really so hard to cross? In reality, the smoothest synthetic voices may be only one well-identified and accessible technological leap away: the AI-generated voice. Engineers at the company Dessa, for example, managed to realistically recreate the voice of podcaster Joe Rogan by feeding all 1,300 hours of his show to an AI.
Not content with raising worries about potential identity fraud enabled by this technology, Dessa declared that over the next few years we will see it advance "to the point where only a few seconds of audio are needed to create a life-like replica of anyone's voice on the planet".
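To give a concrete sense of how accessible this kind of voice cloning has already become, here is a minimal sketch using the open-source Coqui TTS library and its XTTS model. This is not Dessa's system, only an illustration of the "few seconds of reference audio" idea with publicly available tools; the reference clip and output file names are hypothetical.

```python
# Minimal voice-cloning sketch with the open-source Coqui TTS package
# (pip install TTS). Illustration only: this is not Dessa's proprietary system.
from TTS.api import TTS

# Load a multilingual model that supports zero-shot voice cloning
# from a short reference recording.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# "reference.wav" is a hypothetical short clip of the voice to imitate.
tts.tts_to_file(
    text="Hello, this sentence is spoken in a cloned voice.",
    speaker_wav="reference.wav",
    language="en",
    file_path="cloned_output.wav",
)
```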
Social sciences to the rescue
Beyond the need to resemble a natural voice, biology and the social sciences can help guide innovators in choosing a "good" voice for our assistants. Various studies show that some voices are more effective than others depending on the situation, sometimes for hormonal reasons: a deep voice, for example, is an indicator of good health that we recognise instinctively and that unconsciously sways our choices… to the benefit of the species' evolution.
At other times, without our realising it, we are influenced by political and social factors: in a society dominated by men, women's voices are judged more credible the deeper they are, which no doubt explains why the mass arrival of women in many professions and positions of responsibility has gone hand in hand with a drop in the average pitch of female voices (from 229 Hz in 1945 to 206 Hz in 1993).
Must our voice assistants therefore imitate the power mechanisms of our society in order to fit more smoothly into our habits, or do they, on the contrary, offer an opportunity to move beyond them?
This is the question that motivated a group of Danish linguists, computer scientists, and sound designers to create Q, the first genderless synthetic voice for voice assistants. By pitching the voice in the 145–175 Hz range, which is shared by women and men alike and perceived as the most neutral, they developed a voice that doesn't reinforce gender stereotypes. And that might contribute to a more peaceful future?
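For the curious, here is a rough sketch of how one might measure the fundamental frequency (pitch) of a recorded voice and check whether it falls inside that 145–175 Hz band. It assumes the open-source librosa library is installed; the file name "voice_sample.wav" is hypothetical, and this is not the method the Q team used, only a simple way to reproduce the kind of measurement the figures above rely on.

```python
# Estimate the median fundamental frequency of a recording and check
# whether it sits in the 145-175 Hz band described for the Q voice.
import librosa
import numpy as np

# "voice_sample.wav" is a hypothetical recording of a speaking voice.
y, sr = librosa.load("voice_sample.wav", sr=None)

# pYIN pitch tracking over a typical speech range (65-300 Hz).
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=300, sr=sr)

# Median pitch over voiced frames only (unvoiced frames are NaN).
median_f0 = np.nanmedian(f0[voiced_flag]) if np.any(voiced_flag) else float("nan")
print(f"Median F0: {median_f0:.0f} Hz")
print("Within the 145-175 Hz 'neutral' band:", 145 <= median_f0 <= 175)
```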