“Achieving fully neural processing, covering both the field of acoustic signals and that of transcription in words and text.”
At Orange, speech recognition has been the focus of many research projects over the past 20 years. These projects center on in-house solutions, including one originally designed as a platform for analyzing audiovisual streams and automatically indexing and extracting their content.
The Neural Processing Breakthrough
As Henri Sanson, Head of Decision and Knowledge Technology Research, and Benoit Besset, Speech Recognition Research Engineer, explain, “Projects have historically been divided into two technological areas. One relates to the transcription of content, while the other is focused on interactive voice servers. Today, a single piece of technology can meet a wide variety of needs from a common software base. In the mid-2010s, the arrival of neural processing was a major technological breakthrough. The use and development of deep learning methods and systems coincided with a significant qualitative leap, and marked the starting point for a new field of technology.”
Toward Fully Neural Solutions
The speech recognition systems that emerged from this breakthrough favor a hybrid architecture. Neural networks process the acoustic signal and transform vibrations into phonemes; more traditional software layers then take over, using graphs to match sounds with words.
Starting in 2019, an alternative strategy took shape, based on an end-to-end neural approach. “The goal is to achieve fully neural processing, covering both the field of acoustic signals and that of transcription in words and text,” says Valentin Vielzeuf, AI Researcher in speech recognition. “This single-block architecture would, for example, simplify the model’s training and optimize updates. The fully neural approach effectively simplifies training and removes the need for certain ‘manual’ steps necessary to train a hybrid model (alignment between audio and text, definition of a glossary, annotation of disfluencies). Doing away with these steps makes it easier to process a large amount of data and therefore enables progress toward a better generalization of the model, especially when dealing with certain accents and noises.”
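To make the second stage of the hybrid architecture more concrete, here is a minimal sketch of matching a phoneme sequence to words. It is an illustration under stated assumptions, not Orange's system: the lexicon, the phoneme symbols, and the function name `phonemes_to_words` are all hypothetical, and a real hybrid system searches weighted graphs for the most probable word sequence rather than returning the first segmentation that fits.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon (word -> phoneme sequence), for illustration only.
LEXICON: Dict[str, Tuple[str, ...]] = {
    "hello": ("HH", "AH", "L", "OW"),
    "low": ("L", "OW"),
    "world": ("W", "ER", "L", "D"),
}


def phonemes_to_words(
    phonemes: List[str], lexicon: Dict[str, Tuple[str, ...]] = LEXICON
) -> Optional[List[str]]:
    """Segment a phoneme sequence into lexicon words.

    Returns one valid word sequence, or None if the phonemes cannot be
    fully covered by words from the lexicon.
    """
    n = len(phonemes)
    # best[i] holds a word list that covers phonemes[:i], or None if
    # no such segmentation has been found.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue  # position i is not reachable
        for word, pron in lexicon.items():
            j = i + len(pron)
            if j <= n and tuple(phonemes[i:j]) == pron and best[j] is None:
                best[j] = best[i] + [word]
    return best[n]


if __name__ == "__main__":
    # Phonemes as they might come out of the neural acoustic stage.
    print(phonemes_to_words(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
```

In this sketch the lexicon has to be written by hand, which is one of the "manual" steps that an end-to-end model aims to remove.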
Ubiquitous Technologies
The transition to this new generation of systems will, however, take some time, as technical barriers remain to be overcome. Adopting a fully neural approach requires careful consideration of certain issues, such as a relative loss of control over what happens inside the neural network, which could, for example, invent its own words.
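One simple safeguard against invented words can be sketched as a check of the transcript against a known vocabulary. This is only an illustrative assumption about how such a check might look, not a method described by the researchers; the function name `out_of_vocabulary` and the sample vocabulary are hypothetical, and the tokenizer handles unaccented Latin letters only.

```python
import re
from typing import List, Set


def out_of_vocabulary(transcript: str, vocabulary: Set[str]) -> List[str]:
    """Return the words of a transcript that are absent from the vocabulary.

    Words are lowercased, reported once each, in order of first appearance.
    """
    # Simplistic tokenizer: runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", transcript.lower())
    flagged: List[str] = []
    for word in words:
        if word not in vocabulary and word not in flagged:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    # Hypothetical vocabulary, for illustration only.
    vocab = {"please", "restart", "the", "router"}
    print(out_of_vocabulary("Please restart the routerizer", vocab))
```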
That aside, the use of speech recognition continues to grow in the tech and digital fields. Popularized through interactive voice systems, it can also, from the point of view of a carrier such as Orange, be used to analyze customer conversations with call centers or to transcribe reports dictated by field engineers.
Its influence could spread further if combined with other technologies, such as lip reading, to enable joint audio and visual speech recognition with improved performance.