Self-supervised learning paves the way for a common-sense AI

Friday 2nd of July 2021

After having facilitated major advances in natural language processing, self-supervised learning is now being applied to computer vision. This method of machine learning, which does not require manual labelling of data, could help to close the gap between human intelligence and artificial intelligence.

In March 2021, the Facebook Artificial Intelligence Research (FAIR) laboratories unveiled a new computer vision model developed in cooperation with the French National Institute for Research in Digital Science and Technology (Inria).

What sets this model, named “SEER” (for Self-supERvised), apart is that it was pretrained on one billion random, unlabelled Instagram images using self-supervised learning (SSL).

According to Facebook, this method of machine learning makes it possible to tackle tasks that greatly exceed the current capacities of artificial intelligence (AI) and it is the beginning of a new era for computer vision.

In a blog post, Yann LeCun, Chief AI Scientist at Facebook, states that self-supervised learning is “one of the most promising ways to build … background knowledge and approximate a form of common sense in AI systems.” This common sense, “the dark matter of artificial intelligence”, helps humans to acquire new skills without an overly long learning period.

Automatic labelling

Self-supervised learning (SSL) is a learning method in which training data is labelled automatically.

Unlike unsupervised learning, SSL still relies on annotations and metadata, but the AI system generates them itself by exploiting the underlying structure of the data and the relationships within it.

The technique usually consists in taking an input dataset and obscuring part of it. The SSL algorithm must then analyse the data that has remained visible in order to predict the hidden data (or certain properties of the hidden data). In doing so, it creates the labels that will enable it to learn.
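The masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular model: the `[MASK]` placeholder, the `make_masked_example` helper and the 25% masking ratio are all assumptions chosen for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def make_masked_example(tokens, mask_ratio=0.25, seed=0):
    """Hide a random subset of a non-empty token list.

    Returns (visible, labels): `visible` is the input with some tokens
    replaced by MASK, and `labels` maps each hidden position to the
    original token. The targets come from the data itself, not from
    a human annotator.
    """
    rng = random.Random(seed)
    n_hidden = max(1, round(len(tokens) * mask_ratio))
    hidden_positions = sorted(rng.sample(range(len(tokens)), n_hidden))
    visible = list(tokens)
    labels = {}
    for pos in hidden_positions:
        labels[pos] = tokens[pos]
        visible[pos] = MASK
    return visible, labels


sentence = "the cat sat on the mat".split()
visible, labels = make_masked_example(sentence)
print(visible)  # the input the model is allowed to see
print(labels)   # the automatically generated training targets
```

A model trained on such pairs is scored on how well it recovers `labels` from `visible`, so any unlabelled corpus can serve as training data.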

Self-supervised learning has several benefits. The most obvious is that it removes the need for manual data labelling, a major bottleneck for supervised learning.

In order to be efficient, machine learning algorithms (deep ones in particular) require huge amounts of data that have been selected and annotated by humans beforehand.

This is an extremely long and costly process. In fields such as medicine, where annotation requires specialist expertise and data can be scarce, it is especially difficult.

SSL makes it possible to avoid this obstacle, as the model can be trained on a large amount of data with no curation or manual labelling.

As Facebook emphasises, this approach could also limit the biases that can be introduced at these stages, and sometimes improve labelling (in medical imaging, for example).

Broadly speaking, SSL enables the AI community to work with larger and more diverse datasets as well as to create and deploy models faster.

Spectacular breakthroughs

Self-supervised approaches have enabled major advances in Natural Language Processing (NLP), where pretraining artificial neural networks on very large text corpora has led to breakthroughs in several areas, such as machine translation and question-answering systems.

Word2Vec is a good example of the use of SSL. This family of word-embedding models, developed by researchers at Google, relies on two-layer artificial neural networks to represent words as vectors. The networks are trained either to predict a word from its context (Continuous Bag of Words, CBOW) or, vice versa, to predict the context from a word (Skip-gram model).
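To make the difference between the two objectives concrete, the sketch below builds the training pairs each would use from a raw sentence. It only shows how the labels are derived from the text, not the neural network itself; the helper names and the window size are assumptions made for the example.

```python
def context_windows(tokens, window=2):
    """Yield (context_words, center_word) for every position in tokens."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center


def cbow_examples(tokens, window=2):
    # CBOW: the surrounding words are the input, the center word is the target.
    return [(context, center) for context, center in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word is the input, each context word is a target.
    return [
        (center, neighbour)
        for context, center in context_windows(tokens, window)
        for neighbour in context
    ]


tokens = "self supervised learning needs no labels".split()
print(cbow_examples(tokens, window=1)[:2])
print(skipgram_examples(tokens, window=1)[:3])
```

In both cases the "label" is simply a word that was already present in the sentence, which is what makes the approach self-supervised.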

SSL has also made it possible to train a new generation of language models based on a Transformer architecture of which BERT, also developed by Google, is a good example.

Unlike Word2Vec, BERT is a contextual representation model. Thus, where Word2Vec generates a single vector for each word (the word “Orange” will have the same representation even though it can refer to a colour, a fruit, a city, or a company), BERT can generate several vectors according to the context in which the word is used.

First of all, the model is pretrained on huge unlabelled text datasets (all of the Wikipedia pages in English, for example).

Then it is fine-tuned, meaning it is retrained on a smaller quantity of data, for a specific task (such as sentiment analysis or question answering). This approach makes it much more accurate and quicker to train for a new task than its predecessors.

What’s more, it is capable of specialising in many tasks, with little data, and it outperforms existing specialised models in many cases. Thanks to the open version published by Google, BERT has given rise to several derivatives.

From natural language processing to computer vision

Self-supervised learning has enabled natural language processing to make progress, but the techniques used cannot easily be transposed to new areas, such as computer vision.

Yann LeCun writes that this can be explained mainly by the fact that it is much more difficult to efficiently represent uncertainty in image prediction than it is in word prediction.

“When the missing word cannot be predicted exactly […], the system can associate a score or a probability to all possible words in the vocabulary […].” This is not possible with computer vision. “We cannot list all possible video frames and associate a score to each of them, because there is an infinite number of them.”
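The word-prediction case can be illustrated with a softmax over a small vocabulary: because the list of candidates is finite, every candidate can receive a probability. The sentence, the candidate words and their scores below are made up for the example and are not taken from the blog post.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Hypothetical scores a model might assign to candidates for the blank
# in "She peeled the ___".
scores = {"orange": 2.1, "banana": 1.9, "potato": 1.2, "car": -4.0}
probabilities = softmax(scores)

for word, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"{word:<8}{p:.3f}")
```

The uncertainty between "orange" and "banana" is expressed as two similar probabilities. For an image or a video frame there is no comparable finite list of candidates to enumerate, which is the difficulty the quotation points to.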

The ingredients of the SEER recipe

To solve the problem, Facebook has developed the SEER model, which combines several innovations concocted within its laboratories.

First ingredient: SwAV, developed in cooperation with Inria, is an online clustering algorithm that builds on contrastive methods to group images sharing visual characteristics without explicitly comparing a multitude of image pairs.

Contrastive learning trains a model to recognise the similarities and differences between images, and thereby to learn the invariant features of an object, by comparing pairs of images that have been transformed or taken from different angles.

It is a very effective way to learn visual concepts without supervision, but comparing so many pairs demands a great deal of computation, hence the search for an alternative. With SwAV, Facebook claims to have achieved good performance while dividing the model’s training time by six.
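The pairwise comparison behind contrastive learning can be sketched with an InfoNCE-style loss, a common formulation of the idea. This is not SwAV, which is designed precisely to avoid these explicit comparisons; the toy two-dimensional embeddings and the temperature value are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views_a, views_b, temperature=0.1):
    """InfoNCE-style loss over two lists of embeddings.

    views_a[i] and views_b[i] embed two transformed versions of image i
    (a positive pair); every other views_b[j] acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        # One similarity per candidate: len(views_a) * len(views_b) in total.
        logits = [cosine(anchor, other) / temperature for other in views_b]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]  # negative log-softmax of the positive
    return total / len(views_a)


# Made-up embeddings: two views of each of three images.
views_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
views_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
mismatched_b = [views_b[1], views_b[2], views_b[0]]

print(contrastive_loss(views_a, views_b))       # small: positives agree
print(contrastive_loss(views_a, mismatched_b))  # larger: positives disagree
```

The inner loop compares every anchor with every candidate, so the number of similarity computations grows with the square of the batch size. That cost is what motivates clustering-based alternatives.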

However, training a big model on large datasets also requires an appropriate architecture. Facebook has turned to another recent innovation from its AI research laboratory FAIR: RegNets, a family of Convolutional Neural Networks (ConvNets) that can be scaled to billions of parameters and optimised to fit various runtime environments and memory limits.

The final ingredient that has made SEER possible: VISSL, a versatile open-source toolbox for self-supervised learning applied to images.

According to Facebook, the final version of the SEER model (1.3 billion parameters, 1 billion random images, 512 GPUs) has reached 84.2% top-1 accuracy on ImageNet, the reference benchmark on which research teams from all over the world assess the accuracy of their models. This figure is the proportion of correct predictions. “Top-1 accuracy” means that the model’s first answer, the one with the highest probability, is the expected answer (whereas “top-5 accuracy” counts a prediction as correct if the expected answer is among the model’s five most probable answers). This score places SEER among the highest-performing self-supervised models and not far behind the best supervised models, which can achieve around 90.5% top-1 accuracy.
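The difference between top-1 and top-5 accuracy is easy to see in code. The sketch below computes top-k accuracy for any k; the class names and scores are made up for the example and have nothing to do with SEER's actual predictions.

```python
def top_k_accuracy(scores, true_labels, k=1):
    """Fraction of samples whose true label is among the k best-scored classes."""
    hits = 0
    for class_scores, truth in zip(scores, true_labels):
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Made-up model outputs for three images over four classes.
scores = [
    {"cat": 0.70, "dog": 0.20, "fox": 0.08, "owl": 0.02},  # true label ranked 1st
    {"cat": 0.10, "dog": 0.30, "fox": 0.50, "owl": 0.10},  # true label ranked 2nd
    {"cat": 0.40, "dog": 0.30, "fox": 0.20, "owl": 0.10},  # true label ranked last
]
true_labels = ["cat", "dog", "owl"]

print(top_k_accuracy(scores, true_labels, k=1))  # only the first image counts
print(top_k_accuracy(scores, true_labels, k=2))  # the second image now counts too
```

Top-k accuracy can only stay the same or rise as k grows, which is why top-5 figures are always at least as high as top-1 figures for the same model.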