● She explains her concern that the quantity of data on which these tools are trained is being prioritised over data quality, leading to legal disputes over copyright, notably in the field of AI-generated images.
● She also highlights the risk of algorithmic bias in content generation tools that have been trained on undisclosed data.
What is Hugging Face and can you tell us about your role as the company’s principal ethicist?
Giada Pistilli. We are a bit like GitHub for artificial intelligence: we provide a platform of tools that developers can use to build, train and deploy machine-learning models based on open-source technologies, so our platform brings together a large community of data scientists and researchers. Within the company, my work sits at the intersection of several aspects of artificial intelligence: research on ethics and its application in AI, as well as AI regulation and public policy. With this in mind, I have to ask questions about the social impact of artificial intelligence, and also to answer certain questions that have not yet been asked.
To what extent do open-source artificial intelligence models represent a danger?
Our role is to verify the integrity of, and the inherent risks in, certain models, because some language models can be used to create spam, conduct scams, and generate fake emails, fake reviews and even fake content. For example, someone who wants to improve the moderation of posts on their website with a toxic-language detection tool will need to build a model that is trained on toxic data (insults). The risk is that a tool of this kind could also be used for malicious purposes that subvert its intended goal, for example to create a bot that is specifically designed to generate malicious content. In cases of this kind, we ask the developers working on the model to ensure that it is kept private.
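To make the moderation scenario concrete, here is a minimal sketch using the Hugging Face transformers library. The model name "unitary/toxic-bert" is one publicly available toxicity classifier chosen purely for illustration; any comparable model hosted on the Hub could be substituted.

```python
# Minimal toxic-language detection sketch using the transformers pipeline.
# "unitary/toxic-bert" is one publicly available classifier, used here
# purely as an example of the moderation use case described above.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

comments = [
    "Thanks, that was a really helpful answer!",
    "You are an idiot and nobody wants you here.",
]

for comment in comments:
    result = classifier(comment)[0]  # e.g. {'label': 'toxic', 'score': 0.98}
    print(f"{result['label']} ({result['score']:.2f}): {comment}")
```

The same capability, inverted, is what enables the misuse she describes: a model that recognises toxic language can also steer a generator towards it, which is why such models are sometimes kept private.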
It is important to know which data will be used to train models and the type of content they will generate
Is there a structured market for generative AI?
You might say that we have entered a Wild West era in generative AI. Huge models are being trained with vast amounts of data, and there is a risk that quantity will take precedence over quality. At Hugging Face, our scientific team is more focused on understanding how to find the right data, that is to say duly authorized quality data which can be legally used and shared. Earlier this year, Stability AI, whose Stable Diffusion model enables users to generate digital images, was sued by Getty Images for copyright infringement. The outcome of this case will create a major legal precedent.
What questions should companies ask before building an artificial intelligence model?
It is important to know which data will be used to train models and the type of content they will generate. If they are to produce images, you need to ask questions about copyright and the consent of the people in the photos. With regard to Stable Diffusion, research has shown that the data used to train the model included pornographic material. The ethical questions that therefore need to be asked are: where does this data come from, to what extent can you make use of it, and for what purpose? Even users with the best of intentions can generate soft-porn content when interacting with these tools, so their use by minors is very problematic. The same applies to tools that can create deepfakes*, which can decontextualize images of well-known people so as to deceive buyers, create fake products, etc. So it is vitally important to clearly state that images of this kind have been created by artificial intelligence, and to seek to anticipate and mitigate these risks.
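One low-tech way to "clearly state that images of this kind have been created by artificial intelligence" is to embed a provenance note in the image file itself. The sketch below uses Pillow to write PNG metadata; the file names and tag values are invented for illustration, and real provenance standards such as C2PA are considerably more robust.

```python
# Illustrative only: labelling a generated image as AI-made via PNG
# metadata with Pillow. File names and tag values are hypothetical.
from PIL import Image
from PIL.PngImagePlugin import PngInfo

image = Image.open("generated.png")          # hypothetical input file

metadata = PngInfo()
metadata.add_text("ai_generated", "true")    # simple provenance flag
metadata.add_text("generator", "example-diffusion-model")  # assumed name

image.save("generated_labeled.png", pnginfo=metadata)
```

Metadata of this kind is trivial to strip, which is why labelling must be paired with anticipating and mitigating misuse rather than treated as a complete solution.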
What about written content?
This question is often raised with regard to images, because it is easier to identify an artist's style, but it is also an issue with language models: certain models may have been trained using books and newspaper articles that are protected by copyright. At the same time, it is also important that these models are trained with diversified data. If that is not the case, then because we are dealing with statistics, this will create a bias: the AI will respond with the same arguments on a given theme, or only speak about one person, which risks discriminating against everyone else.
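A simple way to see the statistical point she is making is to audit a training corpus for representation before fine-tuning. Everything in this sketch, the corpus and the notion of a "subject" field included, is invented for illustration.

```python
# Toy audit of subject representation in a hypothetical training corpus.
from collections import Counter

corpus = [
    {"text": "Profile of Person A ...", "subject": "Person A"},
    {"text": "Interview with Person A ...", "subject": "Person A"},
    {"text": "Essay on Person B ...", "subject": "Person B"},
]

counts = Counter(doc["subject"] for doc in corpus)
total = sum(counts.values())
for subject, n in counts.most_common():
    print(f"{subject}: {n / total:.0%} of training examples")
```

A heavily skewed distribution is a warning sign that the model will keep returning the over-represented subject, which is the discrimination risk she describes.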
For me, the idea of plugging a language model into a search engine is highly problematic, because there is nothing less accurate: data sources are not ranked in any hierarchy, and it is not acceptable to place heterogeneous sources on the same level, for example a scientific article alongside a blog post by a processed-food brand. The solution may be to design more closed-system chatbots, because, to date, this is the only way to keep control of content.
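Her point about hierarchies of sources can be illustrated with a toy re-ranking step: scale each retrieved document's relevance by a trust weight attached to its source type, so that a peer-reviewed article outranks a marketing blog post with a similar raw relevance score. All names and weights below are invented for illustration.

```python
# Hypothetical source-aware re-ranking: relevance scaled by source trust.
SOURCE_TRUST = {
    "peer_reviewed": 1.0,   # scientific articles
    "news": 0.7,
    "corporate_blog": 0.3,  # e.g. a processed-food brand's blog
}

def rerank(documents):
    """Sort documents by relevance weighted by the trust of their source."""
    return sorted(
        documents,
        key=lambda d: d["relevance"] * SOURCE_TRUST.get(d["source_type"], 0.1),
        reverse=True,
    )

docs = [
    {"title": "Our snacks and your health", "source_type": "corporate_blog", "relevance": 0.9},
    {"title": "Peer-reviewed study on nutrition", "source_type": "peer_reviewed", "relevance": 0.8},
]
print([d["title"] for d in rerank(docs)])
# -> the study ranks first despite the blog post's higher raw relevance
```

A closed-system chatbot, in her terms, goes further still: it restricts the answer space to a curated corpus rather than merely re-weighting an open one.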
* A deepfake is a piece of video or audio content generated through the use of an artificial intelligence model, usually with the intent to deceive listeners or viewers.