Thanks to neural networks, the analysis of large amounts of spoken text, combined with machine learning, now makes it possible to model the human voice.
Artificial voices are being humanised and entering our everyday lives
What if an artificial voice as good as a human voice were no longer science fiction? For a long time, researchers have tried to “humanise” artificial voices, aware that personal assistants and other voice bots will only appeal to users if they swap their monotone, choppy synthetic voices for smooth-flowing human tones. It is a real technological feat, considering the complexity of a spoken message, which conveys not only content but also intonation, rhythm, language habits, breathing, hesitation, onomatopoeia, emotions, and more.
A combination of several technologies to humanise voices
Giving an artificial voice a human range requires voice synthesis software. In text-to-speech, this synthesis relies not only on language-processing techniques that turn written text into a clearly pronounced phonetic version, but also on signal-processing techniques that transform that phonetic version into digital sound. The software then reproduces the content of the text by concatenating parts of recorded words into the final sentence, inserting pauses to respect the natural rhythm of speech. The most elaborate methods compose phonetic texts from libraries of speech samples, from which small pieces of sound are picked and stitched together. To vary intonation, the technique is to increase the number of sound samples. The drawback of this method, however, is that the voice cannot be modified, nor the speaker changed, without starting the tedious recording and compilation work all over again.
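The concatenative approach described above can be sketched in a few lines. This is a deliberately minimal illustration, not a real TTS engine: the phoneme labels, the tiny sample lists, and the pause length are all hypothetical stand-ins for a library of recorded speech units.

```python
# Toy sketch of concatenative synthesis with a hypothetical unit library.
# Each "unit" is a short list of audio samples from a recorded voice.
SILENCE = [0.0] * 800  # ~50 ms pause at 16 kHz, inserted between words

unit_library = {            # hypothetical recorded snippets per phoneme
    "HH": [0.01, 0.03, 0.02],
    "AH": [0.20, 0.25, 0.22],
    "L":  [0.10, 0.12, 0.11],
    "OW": [0.30, 0.28, 0.26],
}

def synthesize(words):
    """Concatenate recorded units for each word, with pauses between words."""
    signal = []
    for word in words:                 # each word is a list of phoneme labels
        for phoneme in word:
            signal.extend(unit_library[phoneme])
        signal.extend(SILENCE)         # pause to respect a natural speech rhythm
    return signal

audio = synthesize([["HH", "AH"], ["L", "OW"]])
```

The sketch also shows why the method is rigid: changing the speaker means replacing every entry in `unit_library`, i.e. re-recording and recompiling the whole sample collection.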
In statistical parametric speech synthesis, the information required to generate voice data is stored in the model’s parameters; the content and characteristics of the speech are controlled by feeding in new data.
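The contrast with concatenation can be illustrated with a toy parametric generator. Here the entire "voice" lives in a couple of parameters (pitch and duration) and a simple sine source stands in for a real statistical model; nothing is played back from recordings, so changing the voice only means changing the parameters.

```python
import math

# Toy sketch of parametric synthesis: samples are generated from model
# parameters rather than from stored recordings. The sine source is an
# illustrative stand-in for a trained statistical model.
SAMPLE_RATE = 16000

def generate_tone(pitch_hz, duration_s):
    """Generate a waveform from parameters alone: change the parameters,
    change the voice, with no need to re-record anything."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE)
            for i in range(n)]

low_voice = generate_tone(pitch_hz=110, duration_s=0.5)    # deeper speaker
high_voice = generate_tone(pitch_hz=220, duration_s=0.5)   # higher speaker
```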
To humanise voice, Google uses neural networks which, by analysing large amounts of spoken text and learning from it, manage to model the voice. Speech is no longer cut up into syllables or even phonemes, but into tiny slices of sound: approximately 16,000 samples per second. Using statistics, the system predicts each new sample from the thousands that precede it, and thus produces a synthetic voice.
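The sample-by-sample prediction loop can be sketched as follows. This is a minimal illustration of the autoregressive idea, not Google’s actual network: the receptive field and weights are hypothetical placeholders, where a real model conditions each sample on thousands of predecessors through many neural layers.

```python
# Minimal sketch of autoregressive waveform generation: each new sample
# is predicted from the samples before it. The "model" is a hypothetical
# weighted sum standing in for a trained neural network.
RECEPTIVE_FIELD = 4              # real models look back thousands of samples
WEIGHTS = [0.1, 0.2, 0.3, 0.4]   # stand-in for learned parameters

def predict_next(history):
    """Predict the next sample from the last few samples."""
    context = history[-RECEPTIVE_FIELD:]
    return sum(w * s for w, s in zip(WEIGHTS, context))

def generate(seed, n_samples):
    """Autoregressively extend the waveform one sample at a time."""
    samples = list(seed)
    for _ in range(n_samples):
        samples.append(predict_next(samples))
    return samples

wave = generate([0.5, 0.4, 0.3, 0.2], 16)   # 16 new samples; ~16,000/s in practice
```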
By stacking layers of neural networks, the system halves the gap between synthetic and human voices. On a scale of 1 (basic synthetic voice) to 5 (human voice), its performance reaches 4.21, compared with 3.86 for concatenation and 3.67 for the parametric method. Because it models the raw sound wave, the software can also generate very humanlike breaths and mouth noises, and it can pronounce the same sentence with different voices and intonations.
Synthetic voice: a wide field of application
Today, voice synthesis has already found numerous applications. In the automotive sector, for example, a synthetic voice is used, via smartphones or onboard computers, to give directions or provide information on traffic, but also to report on some of the vehicle’s parameters. In the health sector, it gives back the ability to communicate verbally to patients suffering from amyotrophic lateral sclerosis (ALS), a neurodegenerative disease leading to the loss of voice. This is how, for years, the famous astrophysicist Stephen Hawking, who suffered from this disease, used voice-assistance software to relay what he wrote on a tablet, a technique that enabled him to converse fairly fluently with his counterparts. For its part, the Breton startup Voxygen has enabled three patients of the Saint-Brieuc hospital to be heard, thanks to software of the same name that can break speech down into phonemes (basic sounds) and reconstitute spoken text six times faster than real time.
For visually impaired people, voice synthesis is useful, for example, for reading emails and text messages, for placing orders online, or for managing tasks remotely: turning on the heating, drawing the curtains, getting reminders for medical appointments, etc. Experiments are currently being undertaken to delegate everyday tasks to voice assistants. Thus, at the Google I/O developer conference, which took place at the company’s headquarters in Mountain View from 8 to 10 May last year, Google presented Duplex, a system for making reservations and appointments via a voice assistant. Neither the hairdresser nor the restaurant manager who were called realised that the person speaking to them was a robot.
Finally, voice synthesis is also entering the area of teaching, for language learning.
However, depending on the purpose of the assistant or voice bot, its characteristics will differ. If it is not meant to be interactive but only to provide “simple” information (the weather, timetables, etc.), an informative assistant without emotions can suffice. But in a context where the relationship with the customer needs to be more personal and empathetic, the vocabulary, the tone, and even the accent used will matter, reinforcing closeness and the customer’s feeling of being on the same wavelength as the service provider.
Technology that is not without risks of abuse
Nevertheless, all of these humanlike voice assistants raise ethical questions, in the same way as artificial intelligence in general. Should a voice assistant not introduce itself before entering into any communication? Can it refuse to answer if one is not polite to it? How can we be sure that a voice assistant will not impersonate someone in order to extort information from its interlocutor, such as a credit card number? Combined with new video-faking techniques, the ability to reproduce famous voices is already being used in this way to spread fake news (for example, a video of Barack Obama delivering an entirely fabricated speech).
These are abuses feared in particular by Kay Firth-Butterfield, head of artificial intelligence and machine learning at the World Economic Forum and a speaker at the Google I/O developer conference, who confided: “It is an important development and signals an urgent need to figure out proper governance of machines that can fool people into thinking they are human. These machines could call on behalf of political parties and make ever more convincing recommendations for voting.”
Through daily contact with voice assistants, human-machine relationships are going to become more and more complex. For the time being, dialogue with these assistants remains highly constrained by prewritten, limited scripts, which leaves little doubt as to the artificial nature of the speaker (be it a bot or a voice assistant); new developments, however, could enable more intuitive dialogue and blur the line between humans and artificial intelligence.