P-C. Langlais (Pleias): “Our language models are trained on open corpora”

• Start-up Pleias has developed a range of language models for specific bureaucratic and administrative tasks, notably retrieval-augmented generation (RAG), that are much smaller than giants like Mistral and ChatGPT.
• The company, which emphasizes transparency, ethics, and respect for copyright, aims to provide customers with a guarantee that its language models meet strict regulatory requirements.
• Pleias cofounder Pierre-Carl Langlais explains why a strategy focused on dedicated solutions is ideal at a time when Europe faces a shortage of GPU infrastructure.

How did you come to found Pleias? What are the problems encountered by users of large language models that you aim to resolve?

As a digital humanities researcher, I worked on the analysis of corpora, notably newspaper archives, to understand how different classes of text have been perceived over time. That is how I became interested in artificial intelligence tools, and I quickly realized the growing role that language models would play. In particular, it occurred to me that certain professions would need to know how these models are trained, and to be certain that they do not infringe copyright law. Clearly, some models need to be audited to guarantee transparency and to ensure that they are legally compliant. The choice of sources used to build AIs raises major ethical issues with important cultural and political implications. It also relates to market competition, because all organizations, regardless of their size, need access to models adapted to their needs.

In concrete terms, what does your approach involve and what is your value proposition?

We were initially inspired by a Chinese project, Qwen, which developed a number of highly effective models ranging in size from 500 million to four billion parameters. Our goal is to provide a wide range of systems that can run on affordable GPUs, or even CPUs, and be hosted on local infrastructure for public services and the banking and healthcare sectors. I should add that we are not planning to charge for these models, but rather to sell related products like integrated research tools. Our public service model, which we have christened Albert, is a flagship generative AI project for French public authorities. It enables users to perform tasks like summarizing reports and simplifying administrative language, while complying with the highest ethical standards.
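To make this concrete, here is a minimal sketch of the kind of deployment described above: a small open-weights model summarizing an administrative document entirely on a CPU, using the Hugging Face transformers library. The model identifier and the input file are hypothetical placeholders for illustration, not actual Pleias releases.

```python
# Minimal sketch: CPU-only summarization with a small open-weights model.
# The model name below is a hypothetical placeholder, not a Pleias product.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="some-org/small-summarizer",  # hypothetical small open model
    device=-1,  # -1 forces CPU inference; no GPU required
)

# Load a local document (hypothetical file) and produce a short summary.
with open("administrative_report.txt", encoding="utf-8") as f:
    report = f.read()

summary = summarizer(report, max_length=150, min_length=40)[0]["summary_text"]
print(summary)
```

Models in the sub-billion-parameter range keep latency tolerable on commodity CPUs, which is what makes fully local hosting plausible for public services.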

We train our models on open corpora of out-of-copyright texts, notably because licensed corpora are an obstacle to competition in the field of AI.

Why build smaller models, and not a large language model like ChatGPT?

We aim to show that smaller language models can be trained using open data. It’s an approach that notably takes into account the shortage of GPU infrastructure in Europe, but we are also convinced that smaller models can be highly effective in specialised contexts, for targeted uses like document analysis, administrative procedures, and other purely bureaucratic tasks. And let’s not forget that mainstream general-purpose models like Mistral and GPT require extensive configuration to process sector-specific documents efficiently.
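Retrieval-augmented generation, mentioned earlier as one of these targeted uses, pairs a small generator with a document index, so the model answers from retrieved sources rather than from memorized training data. Below is a minimal, self-contained sketch assuming off-the-shelf Hugging Face components; the generator model name is a hypothetical placeholder and the in-memory “index” is just three example sentences, not a real corpus.

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity,
# then prompt a small local model with the retrieved context.
# Model names are placeholders for illustration, not Pleias products.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

documents = [
    "Form A-12 must be renewed every two years at the local prefecture.",
    "Housing benefit claims require proof of residence and income.",
    "Public records older than 75 years are freely consultable.",
]

# Embed the corpus once; retrieval then reduces to a dot product.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity on normalized vectors
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

generator = pipeline(
    "text-generation",
    model="some-org/small-instruct-model",  # hypothetical small model
    device=-1,  # CPU inference
)

question = "How often must form A-12 be renewed?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=80)[0]["generated_text"])
```

Because the heavy lifting is done by retrieval, the generator itself can stay small, which is the design bet behind dedicated models for bureaucratic tasks.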

Why do you think regulated professions and public services will make greater use of your models?

The European AI Act introduced a principle of responsibility for generative AI, which lies either with the creator of the model or with the party that deploys it. On paper, the party that deploys the model is responsible for generated content. In practice, however, this creates significant tension between the parties, because it is almost always impossible to verify how a model has been trained. This is an issue for the private sector, which includes regulated fields like finance and healthcare that must comply with specific regulations, while the public sector must also satisfy an obligation of transparency. That is why we train our models on open corpora of out-of-copyright texts, mainly PDFs, notably because licensed corpora are an obstacle to competition in the field of AI.
