Thesis of Alexandre Devillers
Subject:
Start date: 01/10/2021
End date (estimated): 01/10/2024
Advisor: Mathieu Lefort
Summary:
Representation learning has become a central pillar of modern artificial intelligence, playing a key role in recent advances in fields such as computer vision and natural language processing. With the growing interest in self-supervised learning, where models learn from raw data without human supervision, representation learning has become even more critical: it makes sense of raw data by extracting relevant features. Moreover, this autonomous framework promotes the learning of more general representations, since the absence of task-specific labels keeps them agnostic to downstream tasks, while leveraging large quantities of raw data. The challenge, however, lies in identifying a supervisory signal that is accessible from the input data alone yet relevant enough to structure general representations that perform well on downstream tasks. Recent methods for self-supervised visual representation learning employ instance discrimination as a pretext task, demonstrating strong potential to produce rich, reusable, and transferable representations for a wide range of downstream tasks, sometimes even surpassing supervised approaches. Instance discrimination rests on the idea that similar inputs should be projected to nearby points in the representation space. In practice, this is typically achieved with a Siamese architecture, which processes two augmented views of the same input using identical networks. These views are generated in a self-supervised manner by applying transformations (also known as augmentations) to the same image, creating pairs that are semantically similar but visually distinct. The pretext task then aims to align the outputs of the two views, encouraging the network to build representations invariant to the augmentations and thereby emphasizing the visual patterns shared between the views.
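To make the alignment objective concrete, below is a minimal NumPy sketch of one common instantiation of instance discrimination: the NT-Xent contrastive loss used by SimCLR-like methods, which pulls together the representations of two augmented views of the same image while pushing apart all other pairs. The function name, shapes, and temperature value are illustrative assumptions, not code from the thesis itself.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (N, d) arrays of representations of two augmented views;
    row i of z1 and row i of z2 come from the same image (a positive pair).
    All other rows act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize rows
    sim = z @ z.T / temperature                        # scaled cosine similarities
    n = z1.shape[0]
    # Mask self-similarity so an example is never its own positive.
    np.fill_diagonal(sim, -np.inf)
    # The positive for row i is row i+n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy of each row's positive against all other examples.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

When the two views are perfectly aligned (z2 equal to z1) the loss is lower than for mismatched pairs, which is exactly the invariance pressure described above. Non-contrastive variants (e.g. BYOL-style methods) replace the negative terms with other mechanisms but keep the same view-alignment principle.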
This learning approach, grounded in a pretext task designed to capture invariance, differs from historical methods such as reconstruction-based approaches, which aim to rebuild an image from its representation. Instance discrimination instead pursues a structure-oriented objective, and the success of these approaches highlights the value of exploring the structural properties of learned representations, not merely as a practical tool for designing pretext tasks but as a direct means of improving their quality. This thesis adopts that perspective, exploring how the structure of representations (notably invariance, sensitivity, and equivariance) can be leveraged to improve generalization in visual representation learning. The question is addressed through specific sub-questions, each linked to a contribution of the thesis. These sub-questions examine structure from various angles, such as modifying the data distribution, incorporating variational aspects, exploiting equivariance, or analyzing correlations between performance and structural sub-properties. This work underscores that the structure of representations plays a significant role in generalization and demonstrates that it is an effective lever for improving performance.
Jury:
Jochen Triesch | Professor | Reviewer
Frédéric Jurie | Professor | Reviewer
Ishan Misra | Researcher | Examiner
Céline Hudelot | Professor | Examiner
Raphaëlle Chaine | Professor | Examiner
Mathieu Lefort | Associate Professor | Advisor