Imagine a meeting where you could instantly have access to information about the people, organisations and places mentioned by the other participants.
Imagine a meeting where the documents needed to resolve a problem could automatically be made available to you with only your smartphone and a connection to an online service. This is a real technological challenge that Orange researchers are looking into as part of their work on augmented meetings…
Real-time processing based on audio recordings
For meeting participants to benefit from the service, it must know who is speaking and what is being said. Participants’ smartphones offer an opportunity here, as they are outstanding pieces of professional equipment. We take them with us everywhere, and their audio and video recording capabilities and computing power make them a true contributor to ambient intelligence.
The initial idea was to harness the power of all the smartphones in a meeting and use it to benefit the meeting’s participants. Each smartphone is linked to one speaker. The smartphones are connected to an online service that transcribes the audio and saves the video when the participant activates this feature. These streams then serve to produce a multimedia report either in the form of a tree that shows the important points from the meeting, or in the form of a video composed of thumbnails of the speakers present.
However, activating the smartphone’s microphone poses a problem. Each speaker’s voice is captured by their own smartphone’s microphone, but it may also be picked up by the microphone of the person next to them in the meeting. To overcome this issue, the levels of the audio captured by each smartphone can be compared in order to identify who spoke each phrase. This process, patented in 2019, also allows the layout and set-up of the meeting to be reproduced without any intervention from the user, thanks to the comparative analysis of the audio levels. Since each audio stream is saved separately, it is now possible to reproduce a meeting with spatial audio.
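The comparison of audio levels described above can be illustrated with a minimal sketch. It is not the patented process, whose details the article does not give: it simply assumes that, for a given time window, the phone capturing the highest RMS level belongs to the person speaking. The function names, the phone identifiers and the `silence_threshold` value are all hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of one window of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def attribute_speaker(frames_by_phone, silence_threshold=0.01):
    """Return the id of the phone with the loudest window, or None for silence.

    frames_by_phone maps a phone id to the samples that phone captured
    during the same time window (floats in [-1.0, 1.0]).
    The threshold is an assumed value, not one taken from the article.
    """
    levels = {phone: rms(frame) for phone, frame in frames_by_phone.items()}
    if not levels:
        return None
    loudest = max(levels, key=levels.get)
    if levels[loudest] < silence_threshold:
        return None
    return loudest


# Alice speaks; Bob's phone picks up a fainter copy of her voice.
window = {
    "alice": [0.5, -0.4, 0.45, -0.5],
    "bob": [0.1, -0.08, 0.09, -0.1],
}
print(attribute_speaker(window))  # prints "alice"
```

A real system would also have to align the streams in time and cope with phones whose microphones have different sensitivities, which this sketch ignores.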
All of this requires a huge amount of processing and saving to be done simultaneously. To save storage space and reduce unnecessary processing, a new method using smartphone cameras has also been proposed. When placed on a stand and directed towards its owner, each smartphone can perform facial analysis—and analysis of mouth movements in particular—to identify whether that person is speaking or not. If not, there is no need for the captured audio to be transcribed, nor for the audio and video streams to be saved. This is perhaps not a solution for a meeting of ventriloquists…
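As a rough illustration of this gating idea, the sketch below assumes that a facial-analysis step (not shown) already yields one normalised mouth-opening value per video frame. The spread of those values over a short window is used as a crude sign of speech, and only the audio chunks from windows with mouth movement are passed on. The function names and the `movement_threshold` value are hypothetical.

```python
from statistics import pstdev


def is_speaking(mouth_openings, movement_threshold=0.02):
    """Guess whether the phone's owner is speaking.

    mouth_openings holds a normalised lip gap for consecutive video
    frames. Speech makes the gap fluctuate, so the spread of the values
    is tested rather than their mean. The threshold is an assumed value.
    """
    if len(mouth_openings) < 2:
        return False
    return pstdev(mouth_openings) > movement_threshold


def gate_stream(windows, movement_threshold=0.02):
    """Yield only the audio chunks whose video window shows mouth movement."""
    for audio_chunk, mouth_openings in windows:
        if is_speaking(mouth_openings, movement_threshold):
            yield audio_chunk


windows = [
    ("chunk-1", [0.05, 0.12, 0.03, 0.15]),   # lips moving
    ("chunk-2", [0.04, 0.04, 0.041, 0.04]),  # lips still
]
print(list(gate_stream(windows)))  # prints ['chunk-1']
```

Chunks that are filtered out never reach transcription or storage, which is where the saving comes from.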
A complex processing chain
The processing chain used in the prototype employs several artificial intelligence algorithms and calls on a range of modules: speech transcription; syntactic and semantic analysis of phrases to extract proper nouns, dates and places; and a module that detects particularly important moments (expressive structures such as questions, problems and actions) using classifiers trained on generic data sets.
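To make the idea of chained modules concrete, here is a toy sketch that starts from an already transcribed utterance. The two stages are deliberately naive stand-ins: capitalised words play the role of the syntactic and semantic analysis, and a few regular expressions play the role of the trained classifiers. None of the names or rules below come from the prototype; they are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker: str
    text: str
    entities: list = field(default_factory=list)
    labels: list = field(default_factory=list)


def extract_entities(utt):
    """Toy entity extraction: capitalised words after the first word."""
    tokens = [w.strip(".,?!") for w in utt.text.split()]
    utt.entities = [t for t in tokens[1:] if t[:1].isupper() and len(t) > 1]
    return utt


# Hypothetical cue patterns standing in for trained classifiers.
CUES = {
    "question": re.compile(r"\?\s*$"),
    "problem": re.compile(r"\b(problem|issue|blocked)\b", re.IGNORECASE),
    "action": re.compile(r"\b(will|must|need to)\b", re.IGNORECASE),
}


def tag_key_moments(utt):
    """Label the utterance with every cue whose pattern matches."""
    utt.labels = [label for label, pattern in CUES.items() if pattern.search(utt.text)]
    return utt


def run_chain(utt, stages=(extract_entities, tag_key_moments)):
    """Pass the utterance through each stage in order."""
    for stage in stages:
        utt = stage(utt)
    return utt


result = run_chain(Utterance("alice", "Can Marie send the Orange contract by Friday?"))
print(result.entities)  # prints ['Marie', 'Orange', 'Friday']
print(result.labels)    # prints ['question']
```

Because each stage consumes the output of the previous one, a mistake made early in the chain is carried through to every later stage.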
At the end of the processing chain, a semantic similarity module can detect whether phrases used in the meeting resemble other content that could help participants or move the conversation forward. This content is loaded into the system before the meeting. In the prototype demonstration given at the Orange Research Exhibition in 2019, the French Labour Code had been pre-loaded and articles from it were suggested as needed. The aim here is clear: to reduce the time spent looking up information, which currently represents one fifth of the overall meeting duration.
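The article does not say how the similarity is computed. As a minimal sketch, the example below swaps in a plain bag-of-words cosine similarity, which only captures shared vocabulary and is far cruder than a true semantic measure. The document titles, their texts and the `min_score` cut-off are invented placeholders, not real pre-loaded content.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case word counts for a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest(phrase, documents, min_score=0.2):
    """Return (title, score) pairs for pre-loaded documents, best first.

    documents maps a title to its text. Only documents scoring at least
    min_score (an assumed cut-off) are suggested.
    """
    query = bag_of_words(phrase)
    scored = [(title, cosine(query, bag_of_words(body))) for title, body in documents.items()]
    kept = [item for item in scored if item[1] >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Placeholder content loaded before the meeting.
preloaded = {
    "doc-a": "paid leave is counted in working days",
    "doc-b": "the canteen menu changes every week",
}
for title, score in suggest("how many days of paid leave", preloaded):
    print(title, round(score, 2))  # prints "doc-a 0.46"
```

Only the first document shares words with the spoken phrase, so it is the only one suggested.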
These different processing stages are now carried out in real time, which is the principle of Fast Data. The system starts from what is said: important events detected in speech are linked to larger volumes of data using Big Data semantic technologies to help participants resolve problems or generate ideas. Content suggestions are then offered directly on the participants’ smartphones.
However, each module in this complex chain can produce errors, which may then propagate through the rest of the chain and lead to bad suggestions. To err is human, but here the error may well come from the system. If errors are made, participants can correct them during or after the meeting, as important conversations are kept and can be consulted as needed. These corrections then feed the automatic classifiers as new training data, creating a virtuous circle whereby the system also learns from its mistakes.
At the end of the meeting, the visualisation module combines all the processed data and displays the information in a structured way within a single multimedia report. This report can take several forms depending on the users’ needs: either a tree or a heuristic map (mind map) that shows the items on the agenda and links the questions, problems and other key phrases from the meeting to the corresponding individuals, organisations, or places. When a particular node in the report is selected, the speaker’s audio and video appear, allowing specific moments to be replayed. In all cases, this data is made available in a controlled and reasoned manner with the participants’ consent (General Data Protection Regulation, GDPR).
The many other avenues to be explored for “augmented” meetings
There are plenty of ideas to enhance the system and make it more reliable.
Analysing the intonation or prosody of the phrases used could enable important meeting events to be better detected.
The expression recognition classifiers used could be adapted depending on the type or business context of the meeting, since this variability affects how well the meeting is understood. In any case, experimental checks will be required and will provide new training data.
The introduction of automatic change-of-subject detection techniques could also result in more accurate meeting reports, enable speech to be clearly differentiated, and allow meetings to be cross-referenced against each other or against other business documents, in a multimodal approach that facilitates the extraction of new knowledge.
Improving the flow of information exchanged between meetings and other work situations is another area of focus. Content suggestions or contextual assistance displayed on the participant’s smartphone during the meeting could also be personalised, by connecting the existing system to a professional assistant or buddy application with knowledge of the participant’s activities. Conversely, throughout the day, the buddy could give the participant suggestions based on information from the meetings they have been invited to.
The prototype was presented at the 2019 Orange Research Exhibition in the meeting room of the future. Smartphones were placed on stands aimed towards speakers and captured their conversations. The heuristic map produced by the system in real time was projected onto a big screen.