HDR defense of Carlos Crispim-Junior


Subject:
Multimodal Computer Vision: exploiting domain knowledge for visual data generation and understanding

Abstract:

The past ten years were marked by advances in artificial intelligence and computer vision, first with the emergence of deep convolutional neural networks, then generative adversarial networks, and, more recently, multimodal transformers. This HDR document summarizes the investigations I have conducted with my collaborators over the last decade, which have brought forth contributions along three research axes: dataset collection in multidisciplinary studies, generative models for computer vision with limited data, and multimodal architectures for computer vision and domain generalization. In dataset collection in multidisciplinary studies, we developed procedures to acquire data in real-world conditions for problems underrepresented in existing public repositories. We also pursued the parallel goal of collaborating with other disciplines to answer multidisciplinary research questions with the acquired data. For instance, we developed a framework to collect data about people's behaviors inside self-driving cars using a multidisciplinary approach. In the axis of generative models for computer vision, we explored how to generate plausible synthetic images of target objects for application problems where only limited data were available. For instance, we showed how to use empirical observations of the scene and knowledge about the sensor's optics to translate images between the RGB and hyperspectral domains. Finally, in multimodal architectures for computer vision with domain generalization, we investigated methods that learn, from multimodal data of one or more source domains, visual representations that are robust to out-of-distribution examples. Within these three research axes, we explored domain knowledge as complementary information to facilitate representation learning, multimodal data alignment, domain generalization, and information fusion.
We have investigated the proposed contributions in the context of various computer vision problems: human activity understanding in ambient assisted living scenarios and inside self-driving cars, aerial image classification for epidemiology studies, and hyperspectral imaging for plant disease classification, to name a few. The HDR document concludes by presenting my research project in multimodal computer vision for the next five to ten years, in which we will seek to develop methods and systems that can both see and interpret multimodal signals. We will investigate approaches that can learn shared and complementary regularities in multimodal signals and in data from different domains, with the goal of obtaining models that are potentially more accurate and that generalize better to out-of-distribution data. Moreover, we will also study and develop approaches that can combine features from learning-driven visual representations, domain knowledge about a problem, and contextual cues about a scene to address the target problems. The contributions envisaged by this project will be investigated with data from practical applications of computer vision, as we believe they provide a more accurate picture of real-life problems and the limitations of current methods, and they can also have a quicker impact on industry and society.


Defense date: Monday, 22 September 2025

Jury:
Mr. Thierry CHATEAU, Professor, Université Clermont Auvergne, Reviewer
Mr. Vincent FREMONT, Professor, Centrale Nantes, Reviewer
Mr. Nicolas THOME, Professor, Sorbonne University, Reviewer
Ms. Alice CAPLIER, Professor, Grenoble INP - Phelma, Examiner
Ms. Dima DAMEN, Professor, University of Bristol, Examiner
Mr. Francois BREMOND, Research Director, Centre Inria Université Côte d'Azur, Examiner
Mr. Liming CHEN, Professor, Centrale Lyon, Examiner
Ms. Laure TOUGNE RODET, Professor, Université Lumière Lyon 2, Examiner