Large language models (LLMs) have revolutionized natural language processing (NLP). Although they have proven highly effective at many NLP tasks, they present two major problems: English is overrepresented to the detriment of other languages, and they generally remain “black boxes”.
These are the two problems that the BigScience international research project aims to solve with Bloom, a multilingual model developed and trained through an open, participatory approach to science.
The large language model revolution
Language models are statistical models trained on text corpora to estimate the probability of a sequence of words occurring in a natural-language sentence, taking context into account. Their applications range from machine translation and question answering to speech recognition, automatic summarization, and text generation.
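As a minimal sketch of what “estimating the probability of a sequence” means in practice, the snippet below scores a sentence with a small, publicly available checkpoint via the Hugging Face transformers library. The choice of bigscience/bloom-560m is an assumption made purely for illustration; any causal language model would do.

```python
# Minimal sketch: scoring a sentence with a language model.
# Assumes the `transformers` library and the small public
# bigscience/bloom-560m checkpoint (an illustrative choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

sentence = "The cat sat on the mat."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the
    # average negative log-likelihood of each token given its context.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Average per-token log-probability: {-loss.item():.2f}")
```

The higher (less negative) this score, the more plausible the model judges the word sequence to be.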
In recent years, the emergence of a new generation of language models has revolutionized the field of natural language processing. Based on artificial neural networks, these models are characterized by a large number of parameters and unsupervised learning on very large text corpora, hence the name “large language models”. As a result, they are better at capturing the lexical characteristics of words and sentences.
They are extremely powerful, but they have two major limitations. First, even when they have been trained on multilingual corpora, they essentially only work (very) well in English. Meta’s artificial intelligence laboratory is attempting to address this problem with its new language model, “No Language Left Behind”.
Second limitation: due to their growing complexity, these models are opaque, meaning it is difficult to explain how they work and how they arrive at their results. What’s more, the vast majority of large language models are developed by private firms and laboratories, which may make them publicly available, as Google did with BERT (Bidirectional Encoder Representations from Transformers), but communicate very little about how they were trained.
Yet, given the increasing use of artificial intelligence systems and their potential impact in all areas, it is essential to improve their transparency. This means being able to open up the black box of large language models so as to understand them better and improve them. This is the objective pursued by the BigScience project.
Enter Bloom
The project, launched in the spring of 2021 by Hugging Face, a Franco-American startup specializing in artificial intelligence, and supported by the CNRS (French National Center for Scientific Research), GENCI, and the French Ministry of Higher Education and Research, gave rise to the Bloom language model.
Bloom works in the same way as its predecessors: it predicts the probability of a word from an initial text and completes sentences, word by word. Like GPT-3, it is an autoregressive model (predictions are made step by step and the result of one prediction serves as the starting point of the following prediction) based on a Transformer architecture. It has been trained on very large text corpora and contains around the same number of parameters (176 billion). However, it sets itself apart from other models due to its “truly” multilingual and open-source nature.
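To make the autoregressive loop concrete, here is a short sketch in which each predicted token is appended to the input and fed back in as the starting point for the next prediction. The small bigscience/bloom-560m checkpoint is assumed purely to keep the example lightweight; the full 176-billion-parameter model decodes the same way.

```python
# Sketch of autoregressive decoding: the model predicts one token,
# which is appended to the input and used for the next prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

ids = tokenizer("The BigScience project is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits                     # scores for every vocabulary token
    next_id = logits[0, -1].argmax()                   # greedy choice: most probable next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # feed the result back in

print(tokenizer.decode(ids[0]))
```

In practice, sampling strategies such as top-k or nucleus sampling usually replace the greedy argmax to produce more varied text.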
The largest multilingual language model
Bloom is indeed capable of producing coherent text in forty-six languages, as well as code in thirteen programming languages. It covers many underrepresented languages, notably Sub-Saharan African languages such as Swahili and Yoruba. It was trained on all of these languages simultaneously, from a wide variety of sources including novels, scientific articles, and sports press releases, without the data being separated by language.
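As an illustration of this multilingual ability, the sketch below prompts the same (small, assumed) checkpoint in three of Bloom’s training languages; the prompts are hypothetical examples, not drawn from the BigScience corpus.

```python
# Sketch: prompting one multilingual checkpoint in several languages.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompts = [
    "The capital of France is",      # English
    "La capitale de la France est",  # French
    "Mji mkuu wa Ufaransa ni",       # Swahili
]
for prompt in prompts:
    print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```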
Thanks to this approach, Bloom is capable of accomplishing various natural language processing tasks and is expected to outperform monolingual models in certain languages. “Combining content in various languages makes it possible to train powerful and robust models for all of those languages, and often yields better results than monolingual models”, states the CNRS.
Bloom’s multilingual character required major engineering and research work, on the one hand to create good-quality training data, and on the other to train the model itself.
The data used to train large models are usually gathered automatically from the internet, a complex task when seeking to cover languages with little online presence. Here too, the data were scraped from the internet, in particular from Wikipedia, and prepared by Hugging Face, which also incorporated existing commercial text corpora.
As for training the model itself, this step benefited from the participation of a large community of researchers drawn to the BigScience adventure.
A fantastic research tool
This is BigScience’s second distinguishing feature. The open, participatory science project has brought together around one thousand researchers from over seventy countries, from academia as well as private research laboratories such as Orange Labs, working together to train a one-of-a-kind model in a completely transparent way.
Today Bloom is available free of charge under the “BigScience RAIL” license, which puts the emphasis on responsible usage. The model’s parameters are accessible for experimental purposes and the research results are shared with the whole scientific community.
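Because the weights themselves are published, anyone can fetch them for experimentation. As a minimal sketch, assuming the huggingface_hub client, the openly released checkpoints can be downloaded from the Hugging Face Hub; the small bigscience/bloom-560m variant is shown here, while the full model lives in the bigscience/bloom repository and weighs several hundred gigabytes.

```python
# Minimal sketch: downloading Bloom's openly released parameters.
from huggingface_hub import snapshot_download

# Fetches the weight and config files to a local cache directory.
local_dir = snapshot_download(repo_id="bigscience/bloom-560m")
print("Checkpoint files cached at:", local_dir)
```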
This makes Bloom a fantastic research tool for advancing work on large language models (LLMs) and artificial intelligence in general. It should enable scientists from all backgrounds to observe how LLMs are designed and how they operate, so as to better understand and improve them. According to the CNRS, projects will also be carried out to measure the carbon footprint of these models.
A large language model is a natural language processing program characterized by a large number of parameters and unsupervised learning on very large volumes of textual data. This makes it better at capturing the lexical characteristics of words and sentences.
Natural language processing, or linguistic engineering, combines linguistics, computer science, and artificial intelligence to produce programs capable of, for example, automating translation or question answering, or performing speech recognition.