P-C. Langlais (PLEAIS): “Our language models are trained on open corpora”

• Start-up Pleais has developed a range of language models for specific bureaucratic and administrative purposes, among them retrieval-augmented generation (RAG); these models are much smaller than giants like Mistral and ChatGPT.
• The company, which emphasizes transparency, ethics, and respect for copyright, aims to provide customers with a guarantee that its language models meet strict regulatory requirements.
• Pleais co-founder Pierre-Carl Langlais explains why a strategy focused on dedicated solutions is ideal at a time when Europe faces a shortage of GPU infrastructure.

How did you come to found PLEAIS? What are the problems encountered by users of large language models that you aim to resolve?

As a digital humanities researcher, I worked on the analysis of corpora, notably newspaper archives, to understand how different classes of text have been perceived over time. That is how I came to be interested in artificial intelligence tools, and I quickly realized the growing role that language models would play. In particular, it occurred to me that certain professions would need to know how these models are trained and to be certain that they do not infringe copyright law. Clearly, some models need to be audited to guarantee transparency and to ensure that they are legally compliant. The choice of sources we use to build AIs raises major ethical issues with important cultural and political implications. And this also relates to market competition, because all organizations, regardless of their size, need access to models adapted to their needs.

In concrete terms, what does your approach involve and what is your value proposition?

We were initially inspired by a Chinese project, Qwen, which developed a number of highly effective models ranging in size from 500 million to four billion parameters. Our goal is to provide a wide range of systems that can run on affordable GPUs, or even CPUs, and be hosted on local infrastructure for public services and the banking and healthcare sectors. I should add that we are not planning to charge for these models but to sell related products, such as integrated research tools. Our public service model, which we have christened Albert, is a generative AI flagship project for French public authorities. It enables users to perform tasks like summarizing reports and simplifying administrative language, while complying with the highest ethical standards.
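As a minimal sketch of what this kind of locally hosted, CPU-only inference can look like, the example below uses the Hugging Face transformers library; the model checkpoint and file name are illustrative placeholders, not Pleais's actual models or data:

```python
# Minimal sketch: summarizing an administrative report with a small
# open-weight model running entirely on CPU (no dedicated GPU needed).
# "facebook/bart-large-cnn" and the file name are illustrative
# placeholders, not Pleais's models or data.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",  # any small open-weight checkpoint works
    device=-1,                        # -1 selects the CPU
)

with open("administrative_report.txt", encoding="utf-8") as f:
    report = f.read()

result = summarizer(report, max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])
```

Because a model of this size fits in ordinary RAM, the same script can run on local infrastructure, which is the hosting constraint Langlais describes for public services, banking, and healthcare.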

We train our models on open corpora of out-of-copyright texts, notably because licensed corpora are an obstacle to competition in the field of AI.

Why build smaller models, and not a large language model like ChatGPT?

We aim to show that smaller language models can be trained using open data. It's an approach that notably takes into account the shortage of GPU infrastructure in Europe, but we are also convinced that smaller models can be highly effective in specialized contexts, for targeted uses like document analysis, administrative procedures, and purely bureaucratic tasks. And let's not forget that mainstream general-purpose models like Mistral and GPT require extensive configuration to efficiently process documents for specific sectors.
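As a rough illustration of the retrieval-augmented generation mentioned above, the sketch below retrieves the most relevant passage from a local document store and builds the prompt a small model would answer from. It uses scikit-learn's TF-IDF vectorizer for retrieval as a stand-in; the passages and helper are invented for the example, and a production system would typically use embedding-based retrieval instead:

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant passage from a local document store, then prepend it to the
# question so a small language model can answer from context.
# Illustrative only; not Pleais's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Form A38 must be renewed every two years at the local prefecture.",
    "Housing benefit claims are processed within thirty working days.",
    "Building permits require approval from the municipal planning office.",
]

vectorizer = TfidfVectorizer()
passage_vectors = vectorizer.fit_transform(passages)

def retrieve(question: str) -> str:
    """Return the stored passage most similar to the question (TF-IDF cosine)."""
    scores = cosine_similarity(vectorizer.transform([question]), passage_vectors)
    return passages[scores.argmax()]

question = "How often does form A38 need renewing?"
context = retrieve(question)
# The prompt a small local model would receive:
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Because the model only has to reason over the retrieved passage rather than memorize an entire corpus, a small specialized model can stay competitive on this kind of targeted document task.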

Why do you think regulated professions and public services will make greater use of your models?

The European AI Act introduced a principle of responsibility for generative AI, which lies with the creator of the model or with the person who deploys it. On paper, it is the person who deploys the model who is responsible for generated content. However, this creates significant tensions between the parties, because it is almost always impossible to verify how a model has been trained. This is an issue for the private sector, which includes regulated fields like finance and healthcare that need to comply with specific regulations. At the same time, the public sector also has to satisfy an obligation of transparency. That is why we train our models on open corpora of out-of-copyright texts, mainly PDFs, notably because licensed corpora are an obstacle to competition in the field of AI.
