"We instead have a system of questions and answers devised on the basis of predetermined scenarios with a great deal of human intelligence behind artificial speech"
The process of creating synthetic speech, or Text-To-Speech (TTS), begins with human speech recordings, an essential raw material. Teams at Orange are working to make artificial speech as natural as possible without humanising it.
Speech is one of the most important forms of human communication, and goes far beyond a mere juxtaposition of words. It incorporates several forms of expression, such as pitch, melody and even silence. These all contribute to how far a given message can reach, and the Orange teams currently working on artificial speech are taking inspiration from them to transcribe the spoken word.
“Working on this technology to make it sound natural is one of the most challenging aspects of our business”, says David Carvalho, Vision & Design Director at Orange.
To do so, the teams must first observe how the human phonatory system, the original producer of speech, works and how speech is perceived.
Voiced and unvoiced sounds, pitch, phonemes and logatomes
The timbre of the sounds produced by each individual is defined by how the vocal cords vibrate. When the cords are stretched, the air flow causes them to vibrate as the air is pushed through the various cavities. This produces what is known as a voiced sound. By contrast, when the vocal cords are relaxed, air passes freely through the larynx without vibrating them. This results in an unvoiced sound, such as the consonants “s” or “f”.
The movement of the vocal cords produces a sound intimately linked to the way in which it is perceived. This is known as the pitch, in other words the frequency that carries information about the intonation of the sentence and also about the speaker, including their emotional state. Variations in pitch help to distinguish an affirmation, a question or an order. These are all characteristics of human speech that are integrated into synthetic speech technology. In this approach to Text-To-Speech (TTS), human speech can even be considered the raw material, as synthetic speech is created by selecting units from recordings of a text read aloud.
At Orange, the voice of actress Catherine Nullans embodies the brand’s sound signature. She recorded dozens of hours of texts with a timbre and tone which had to remain neutral and uniform throughout the reading of the script. “The quality of the synthetic speech relates directly to how the speaker’s voice performs and in particular its consistency”, emphasises David Carvalho.
Once recorded, the text is then broken down into phonemes, which are themselves composed of logatomes. A phoneme represents the sounds that must be spoken. “For example, in the word ‘Papa’, there are two phonemes because the intonation of the first ‘pa’ is different from the second”, notes Pascal Taillard, a Sound Designer at Orange. The French language is formed of 36 phonemes, while English has 46.
Logatomes consist of syllables or groups of syllables with particular articulatory and acoustic characteristics. They have no precise meaning, but sometimes serve to aid pronunciation. There are more than 1,000 for French.
Composing speech as a series of notes
These sentences are then transcribed phonetically and organised one after the other. This is known as the concatenation of the phonetic chain. Algorithms then fix the phonemes in place by applying a prosody, i.e. the correct melody (changes in pitch), the correct rhythm (changes in duration) and the correct intensity (changes in energy). “You have to look at the composition of speech as a series of notes, which can be used to form sentences with the help of specific tools”, notes Pascal Taillard.
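The two steps described above, concatenation followed by prosody, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Unit` structure, the toy `INVENTORY` and the uniform scaling factors are hypothetical and do not describe the tools used at Orange. A real system would vary pitch, duration and energy unit by unit rather than scaling the whole chain at once.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Unit:
    """One recorded unit (hypothetical structure, for illustration only)."""
    phoneme: str
    pitch_hz: float     # melody
    duration_ms: float  # rhythm
    energy: float       # intensity, on an arbitrary 0..1 scale


def concatenate(phonemes, inventory):
    """Look up one recorded unit per phoneme and chain them in order."""
    return [inventory[p] for p in phonemes]


def apply_prosody(chain, pitch_scale, duration_scale, energy_scale):
    """Scale pitch, duration and energy uniformly across the chain."""
    return [
        replace(
            unit,
            pitch_hz=unit.pitch_hz * pitch_scale,
            duration_ms=unit.duration_ms * duration_scale,
            energy=unit.energy * energy_scale,
        )
        for unit in chain
    ]


# Toy inventory with made-up values; "p" is unvoiced, so it has no pitch.
INVENTORY = {
    "p": Unit("p", 0.0, 60.0, 0.4),
    "a": Unit("a", 200.0, 120.0, 0.8),
}

chain = concatenate(["p", "a", "p", "a"], INVENTORY)
# Raise the melody and halve every duration (faster speech).
spoken = apply_prosody(chain, pitch_scale=1.5, duration_scale=0.5, energy_scale=1.0)
total_ms = sum(unit.duration_ms for unit in spoken)
```

Keeping the recorded units immutable and producing a modified copy mirrors the idea that the recordings are raw material, while prosody is applied afterwards.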
At the same time, information architects build dialogue trees that structure the message path in the form of a tree-shaped diagram, with each branch representing a major field of skills, such as weather, time, etc. According to David Carvalho, “everything is programmed in a fairly traditional way using automated processes to recognise what is being said, before this is linked to the correct field of skills”.
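A dialogue tree of this kind can be sketched in a few lines of Python. The node names, the keywords and the keyword-matching rule below are hypothetical; the article does not say how the recognition step is actually implemented, so simple word overlap stands in for it here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One branch of the dialogue tree (hypothetical structure)."""
    name: str
    keywords: frozenset
    children: list = field(default_factory=list)


def tokenize(utterance):
    """Lower-case the utterance and return its words as a set."""
    return set(re.findall(r"[a-z']+", utterance.lower()))


def descend(node, words, path=()):
    """Follow the first child whose keywords match, and return the path taken."""
    path = path + (node.name,)
    for child in node.children:
        if words & child.keywords:
            return descend(child, words, path)
    return path


# Illustrative tree: each top-level branch is a field of skills.
TREE = Node("root", frozenset(), [
    Node("weather", frozenset({"weather", "rain", "sunny"}), [
        Node("today", frozenset({"today", "now"})),
        Node("tomorrow", frozenset({"tomorrow"})),
    ]),
    Node("time", frozenset({"time", "hour", "clock"})),
])

weather_path = descend(TREE, tokenize("Will it rain tomorrow?"))
time_path = descend(TREE, tokenize("What time is it?"))
unknown_path = descend(TREE, tokenize("Play some music"))
```

A path that stops at the root, as in the last call, signals that no field of skills was recognised.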
Once these steps are complete, linguists listen to how the words are pronounced, check whether the articulation is correct and whether the links are in place, and correct any errors. “This is also when we are able to identify if certain sounds do not correspond to natural speech”, explains Pascal Taillard, for example when a sound is voiced that should not be, or vice versa. Certain words may be removed from the text, whereas others may be added. Sentences may appear far too long, without time to breathe. Lastly, texts written in the affirmative may be converted to the interrogative in order to respect certain conventions, such as courtesy. “The process of concatenation does not allow for emotion to be transcribed. However, emotion is required for artificial speech to be as natural as possible”, observes Pascal Taillard.
Three existing systems to inject expression into artificial speech
Three existing systems are used to avoid a constantly neutral tone and to make speech more expressive:
The first involves inserting SSML (Speech Synthesis Markup Language) tags that vary the neutral speech base and allow for improved control of speech output. For example, the <break> tag can be used to add a longer pause between two sentences, and the <prosody> tag can adjust the rate, pitch or volume of speech, for instance to speed it up.
An alternative solution is to add interjections like “Hmmm” and “Um”, or even sound effects inspired by video games. This is one way to emphasise specific information at a given point in the speech.
The last solution is to use pre-recorded phrases. “The audio file gives us the intonation that Orange wanted”, says Pascal Taillard.
In every case, vocal output requires preliminary written work in order to consider several types of responses, taking into account the personality of the virtual agent. The scripts are analysed using algorithms and drawn from databases fed by teams of scriptwriters from the worlds of film, comic strips and television series, in other words people who are accustomed to creating character identities (e.g. through style, tone and personality). This work is also approved by the Brand Division teams in order to maintain a tone that is warm, down-to-earth, bold and positive.
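To make the first of these three approaches concrete, the short Python sketch below builds an SSML fragment with a pause between two sentences and a faster speaking rate on the second. The `<speak>`, `<s>`, `<break>` and `<prosody>` elements and the `time` and `rate` attributes come from the SSML standard; the function name `build_ssml`, the sample sentences and the chosen values are assumptions for illustration. The `<speak>` element is simplified here: a given speech engine may also require version and namespace attributes.

```python
import xml.etree.ElementTree as ET


def build_ssml(first, second, pause="800ms", rate="fast"):
    """Return an SSML string: first sentence, a pause, then a faster second sentence."""
    speak = ET.Element("speak")

    sentence_1 = ET.SubElement(speak, "s")
    sentence_1.text = first

    # <break> inserts a pause of the given length between the two sentences.
    ET.SubElement(speak, "break", {"time": pause})

    # <prosody rate="..."> changes the speaking rate of the enclosed text.
    sentence_2 = ET.SubElement(speak, "s")
    prosody = ET.SubElement(sentence_2, "prosody", {"rate": rate})
    prosody.text = second

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello.", "Here is the weather for today.")
print(ssml)
```

Building the markup with an XML library rather than by string concatenation keeps the output well-formed even when the spoken text contains characters such as `<` or `&`.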
Knowing how to understand questions
The algorithms identify the different contexts by characterising them and providing a response adapted to the environment. If the virtual agent has not recognised the area of expertise required by the question with a sufficient degree of confidence, it is able to ask another question to ensure it provides the right answer. “We plan to create a book of dialogue which will help in deciphering the meaning of the question, without actually arriving at a dialogue that takes into account the response of the speaker asking another question that references the first”, explains Pascal Taillard.
“We instead have a system of questions and answers devised on the basis of predetermined scenarios and, even now, with a great deal of human intelligence behind this artificial speech”, says David Carvalho.
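The confidence check described in this section, where the agent asks a follow-up question rather than guessing, can be sketched as follows in Python. The threshold value, the scores and the wording of the answers are all hypothetical; the article does not specify how confidence is computed or what threshold is used.

```python
CONFIDENCE_THRESHOLD = 0.6  # assumed value, chosen only for illustration

# Predetermined answers, one per field of skills (hypothetical content).
ANSWERS = {
    "weather": "Here is the weather forecast.",
    "time": "Here is the current time.",
}


def respond(skill_scores, answers=ANSWERS, threshold=CONFIDENCE_THRESHOLD):
    """Answer from the best-scoring skill, or ask for confirmation if unsure."""
    skill, confidence = max(skill_scores.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Not confident enough: ask another question instead of answering.
        return f"Is your question about {skill}?"
    return answers[skill]


confident_reply = respond({"weather": 0.9, "time": 0.1})
unsure_reply = respond({"weather": 0.4, "time": 0.35})
```

Both outcomes remain scripted: the clarifying question is itself a predetermined response, in keeping with the question-and-answer system described here.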
The Orange teams are monitoring how quickly these technologies are developing as artificial intelligence (AI) is increasingly introduced into them. In the future, virtual agents may be given the chance to imitate or even emulate the methods of human expression, starting with speech.