“Our tool can operate as a sort of semantic search engine.”
Online forums are highly popular among web users because they offer a way of obtaining relevant answers to the questions users raise. However, to achieve that relevance, the question first has to be understood, whichever words are used to formulate it. Yet most search engines rely mainly on the words used in the question rather than on its meaning. As a result, they will not see the equivalence between two phrasings that mean much the same thing but use different words. One example might be the related questions “How do I get someone to look after the children?” and “Where can I find a baby-sitter?”
From expert knowledge of data to learned knowledge
Delphine Charlet and Géraldine Damnati, research engineers at Orange Labs and language experts, are part of the Deskin team (the name means “to learn” in Breton) and are deeply interested in this subject area. After starting out analysing speech in order to recognize words or speakers, they now study the semantics of natural language. In 2017, they took first place in one of the tasks set at SemEval, an international semantics competition.
Delphine explains: “Semantics is about the meaning of texts. Historically, automatic natural language processing (NLP) has largely relied on expert knowledge developed by linguists and lexicographers. For example, they would list ‘automobile’ and ‘car’ as synonyms and identify ‘Ford Model T’ as a model of car. Today, datasets like these exist for many languages, but not for all languages and not in all application areas, because building them is a long and difficult process that requires human supervision.” Other technologies have since emerged, first with statistical analysis and, more recently, with Deep Learning, which make it possible to infer knowledge from massive text corpora without necessarily relying on data that has already been annotated by a human.
Fully understanding the meaning of a text in fine detail is the Holy Grail of artificial intelligence. In many cases, however, the real need is to understand an utterance roughly in order to process large volumes of data. This low-level processing is enough to help people find the right information, whereas in high-level processing there is a need to reliably understand the full meaning of any text extract.
Forums: a strategic research field
Online forums have proved to be highly valuable and rich in information. In them we can see a genuine human collective intelligence at work, where some people come along with problems and others with solutions. But forum content is as yet under-exploited. Using knowledge bases, it is possible to automatically answer questions of the “Who”, “What” or “How many” type, such as “How tall is the Eiffel Tower?” or “Who assassinated Abraham Lincoln?” By contrast, it is far more difficult to answer “Why” or “How” questions.
“The ‘question & answer’ paradigm is important in the field of artificial intelligence. It is even, in a sense, core to it: I ask a question and a smart machine gives me the answer. Our approach is different: when we ask a question, we try to identify all the similar questions that have already been asked and to flag up all the (human) answers that have already been given,” explains Géraldine.
Calculating semantic similarity
For the past decade, the annual SemEval international competitions have involved large numbers of teams from all over the world, working on a variety of semantic analysis tasks. During the SemEval 2017 campaign, a “Community Question Answering” task tackled precisely the problem of identifying similar questions in forums. For each question asked against a pre-defined corpus, Google returned its ten best results. The challenge was to improve on Google! The campaign test data came from an English-language forum for western expats in Qatar, dealing with all sorts of everyday life topics (where to find the best restaurant, how to hire a child minder, which is the best bank, etc.). “Our team won the competition with a robust solution able to calculate the semantic similarity between words, even in data that was ‘infected’ by spelling or grammar mistakes,” says Delphine.
The approach adopted by the Deskin team consisted of searching for similar words, not just identical ones, by setting parameters for the automatic processing model. Machine learning is used to process the entire history of the forum concerned in order to learn a representation of each word according to the contexts it appears in. “This technique of word embedding finds similar word meanings based on contextual comparisons,” says Géraldine. “One of the benefits is that the model is indifferent to local mistakes. The word ‘babys’ spelled with a ‘y’, for example, will be identified as ‘babies’ thanks to the other items on either side of it.”
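As a rough illustration of the word-embedding idea, the Python sketch below compares questions by averaging word vectors and taking the cosine of the angle between them, then ranks a small archive of past questions against a new one. It is not the Deskin team's system: the three-dimensional vectors in TOY_EMBEDDINGS are invented by hand, the averaging scheme is a common simplification chosen for brevity, and every function name is hypothetical. A real model would learn its vectors from the forum's history.

```python
import math
import re

# Hypothetical 3-dimensional embeddings, written by hand for illustration only.
# The axes loosely stand for (childcare, money, food). A real model would learn
# vectors with hundreds of dimensions from the forum's own history.
TOY_EMBEDDINGS = {
    "children":    (0.90, 0.00, 0.10),
    "babies":      (0.90, 0.00, 0.00),
    "babys":       (0.88, 0.02, 0.05),  # misspelling, assumed to occur in similar contexts
    "baby-sitter": (0.80, 0.20, 0.00),
    "bank":        (0.00, 1.00, 0.00),
    "restaurant":  (0.00, 0.20, 0.90),
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (internal hyphens allowed)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def question_vector(text):
    """Average the embeddings of the words we know; unknown words are skipped."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokenize(text) if t in TOY_EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(question_a, question_b):
    """Semantic similarity of two questions under the toy embeddings."""
    return cosine(question_vector(question_a), question_vector(question_b))


def rank_archive(new_question, archive):
    """Return archived questions sorted from most to least similar."""
    return sorted(archive, key=lambda q: similarity(new_question, q), reverse=True)


if __name__ == "__main__":
    archive = [
        "Which is the best bank?",
        "Where can I find a baby-sitter?",
        "Where is the best restaurant?",
    ]
    asked = "How do I get someone to look after the children?"
    for past_question in rank_archive(asked, archive):
        print(f"{similarity(asked, past_question):.2f}  {past_question}")
```

With these toy vectors, the baby-sitter question ranks first even though it shares no content word with the question asked, and “babys” sits almost on top of “babies”. In a deployed system, the human answers attached to the top-ranked past questions would then be shown to the user.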
Multiple potential applications
What kind of context will derive most benefit from this solution? A first natural application concerns the Orange forums, the customer care service, and self-troubleshooting. A prototype is being developed on the Orange forums assistance base by identifying the right parameters for this dedicated model. Looking ahead, the field is far bigger, because the tool can operate as a sort of semantic search engine. For example, it is possible to search for very precise information in technical documents, to carry out biomedical searches (on tests, reports, diagnostics, and on any patient or physician records), and to produce aids for data journalism, both for readers looking for information and for authors carrying out fact-checking.
Further ahead, another goal is the analysis of syntax, textual structure, predicates and arguments. “With our current approach, the sentence ‘Peter repaired Paul’s car’ will help us to understand that this is about auto repairs, and that two people are concerned, but without knowing exactly who helped whom. By identifying the semantic roles of the constituent parts of the sentence, our team is trying to improve detailed understanding of texts,” add the two researchers.