• Eric Slyman, a PhD student working on a project in partnership with Adobe, has developed a method to preserve fair representation in model training data.
• The new algorithm, christened FairDeDup, can be applied to a wide range of models and reduces AI training costs without sacrificing fairness.
You published a research article with Adobe on the role of deduplication in the creation of social bias. How did that come about?
Our research focuses on practical aspects of implementing fairer AI. Large AI models are typically trained on web-scale data with billions of data points, which is extremely expensive. One way companies try to make this process cheaper is deduplication: removing exact copies or very similar instances of the same data point to optimize storage, speed up systems, and so on.
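To make the idea concrete, here is a minimal sketch of the general technique described above: greedy near-duplicate removal over embedding vectors. This is not Slyman's code; the similarity threshold and the toy data are assumptions chosen purely for illustration.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal: keep a data point only if it is not
    too similar (cosine similarity >= threshold) to any point already kept."""
    # Normalise rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy example: three of the five "images" are near-copies of the first one.
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 64))
data = np.vstack([base[0], base[0] + 0.01, base[0] - 0.01, base[1], base[0] + 0.02])
print(deduplicate(data))  # the near-copies of the first point are dropped
```

In practice, web-scale pipelines do this approximately (for example with clustering or locality-sensitive hashing) rather than comparing every pair, but the pruning decision is the same: similar points are collapsed to a single representative.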
Instead of using 10,000 photos of doctors, we might only use 5,000 (…) Accuracy is largely unaffected, but social bias in the model can change quite a lot.
For example, if we are training a model to classify images, instead of using 10,000 photos of doctors in its dataset, we might only use half that number, i.e. 5,000. This makes it possible to train the model more quickly, and the difference in accuracy is extremely marginal: you go from 93% accuracy to 92.8% for half the price. However, our research shows there is another trade-off: while accuracy is largely unaffected, levels of social bias in the model can change quite a lot. In this particular case, reducing the number of photos could lead to an over-representation of white male doctors over the age of 40, and, at the same time, the model might become more likely to identify women of colour as nurses rather than doctors.
Does your method remove all biases?
The goal of FairDeDup, which is short for fair deduplication, is to remove data in a way that still gives us much greater speed with hardly any reduction in accuracy, but without sacrificing fairness in the process. The method, which prunes data upstream before models are trained, is designed to ensure that biases relating to professions, gender, race, culture and so on are not amplified. It does not root out all biases; rather, it identifies neighbourhoods in clusters of similar data and keeps one sample from each, which is then included in the training dataset. The algorithm uses another AI to steer the model towards maintaining a balanced representation across the samples. For instance, this AI can tell whether a population is over-represented and, in natural language, warn an engineer to make sure they include data containing women of colour.
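The following is a minimal sketch of that idea, not the published FairDeDup implementation. It assumes k-means clusters stand in for the "neighbourhoods" of similar data and that each sample carries a single labelled sensitive attribute; it keeps one sample per cluster while biasing the choice towards groups that are still under-represented among the samples kept so far.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def fairness_aware_prune(embeddings, attributes, n_clusters=4, seed=0):
    """Keep one sample per cluster of similar data, preferring members whose
    sensitive-attribute group has been kept least often so far."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit(embeddings).labels_
    kept, counts = [], Counter()
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Pick the member from the currently least-represented group.
        pick = min(members, key=lambda i: counts[attributes[i]])
        kept.append(int(pick))
        counts[attributes[pick]] += 1
    return kept

# Toy example with a hypothetical sensitive attribute per image embedding.
rng = np.random.default_rng(0)
emb = rng.normal(size=(40, 32))
attrs = ["group_a"] * 30 + ["group_b"] * 10
print(fairness_aware_prune(emb, attrs))
```

The contrast with plain deduplication is in the selection step: instead of keeping an arbitrary or centroid-nearest sample from each cluster, the choice is steered so that the pruned dataset does not amplify whatever imbalance already exists in the raw data.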
Does this only apply to diffusion models or can it be used for others too?
Because our approach intervenes at the dataset level, it is applicable to all models, including language models, vision models and foundation models. Typically, we are looking at foundation models like CLIP ["Contrastive Language–Image Pre-training"], which learns underlying associations between images and text. These models are then used to train diffusion models and to provide guidance in systems with a vision component, like ChatGPT 4.0. Thus, by intervening upstream, we can already contribute to fairness in very large models and multimodal AI.
You say that it reduces training costs…
Our method enables companies to speed up research and development. When they have multiple versions of a model and they are trying to decide which one to deploy next, they can’t train all of them on the full dataset every time. They have to use techniques to make the process faster, and our approach is one of them. It cuts the cost in half, so it’s a cost-effective approach, but it’s also ethically valuable.
What are your plans for further research?
We are moving on to another stage in this process. What I'm really curious about right now is how people are trying to scale up the evaluation of their AIs. We mentioned diffusion models earlier; one question is how to scale up our ability to use other AIs to judge the outputs of diffusion models. I'm investigating ways to do human-in-the-loop judgment and evaluation, and the fairness implications and outcomes of using AI to judge other AIs, which is cheap and effective most of the time, but I also think there are some pretty significant caveats to doing it without more oversight. Right now, a few researchers are studying this area, which they call meta-evaluation, but the topic is very new and there have only been a few published papers to date.