Thesis of Jean-Baptiste Guimbaud


Subject:
Enhancing Environmental Risk Scores with Informed Machine Learning and Explained AI

Defense date: 11/10/2024

Advisor: Rémy Cazabet

Summary:

From conception onward, environmental factors such as air quality or dietary habits can significantly affect the risk of developing various chronic diseases. In the epidemiological literature, indicators known as Environmental Risk Scores (ERSs) are used not only to identify individuals at risk but also to study the relationships between environmental factors and health. A limitation of most ERSs is that they are expressed as linear combinations of a limited number of factors. This doctoral thesis aims to develop ERS indicators that can capture nonlinear relationships and interactions across a broad range of exposures while identifying actionable factors to guide preventive measures and interventions, in both adults and children.

To achieve this aim, we leverage the predictive abilities of non-parametric machine learning methods, combined with recent Explainable AI tools and existing domain knowledge. In the first part of this thesis, we compute machine learning-based environmental risk scores for mental, cardiometabolic, and respiratory general health in children. In addition to capturing nonlinear relationships and exposure-exposure interactions, we identified new predictors of disease in childhood. The scores explained a significant proportion of variance, and their performance was stable across different cohorts.

In the second part, we propose SEANN, a new approach integrating expert knowledge in the form of Pooled Effect Sizes (PESs) into the training of deep neural networks for the computation of informed environmental risk scores. SEANN aims to compute more robust ERSs, generalizable to a broader population, and able to capture exposure relationships that are closer to evidence known from the literature. We experimentally illustrate the approach's benefits using synthetic data, showing improved prediction generalizability in noisy contexts (i.e., observational settings) and improved reliability of interpretation using Explainable Artificial Intelligence (XAI) methods compared to an agnostic neural network.
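One way to picture how literature knowledge can enter training is as an extra term in the loss. The sketch below is a minimal illustration only, not SEANN's actual formulation: it assumes a PES can be read as the expected change in the outcome per unit increase of one exposure, and it penalizes the squared gap between that value and the effect implied by the model. The function `informed_loss`, its parameters (`exposure_idx`, `pes`, `delta`, `lam`), and the toy linear predictor are all hypothetical names introduced for this example.

```python
import numpy as np


def informed_loss(predict, X, y, exposure_idx, pes, delta=1.0, lam=1.0):
    """Data-fit term plus a penalty pulling one exposure's effect toward a PES.

    The penalty form and every name here are illustrative assumptions,
    not the formulation used in the thesis.
    """
    # Data-fit term: mean squared error on the observed outcome.
    data_term = np.mean((predict(X) - y) ** 2)

    # Model-implied effect of raising the chosen exposure by `delta` units,
    # averaged over individuals (a finite-difference estimate).
    X_shifted = X.copy()
    X_shifted[:, exposure_idx] += delta
    model_effect = np.mean(predict(X_shifted) - predict(X)) / delta

    # Knowledge term: squared gap between the model's effect and the PES.
    knowledge_term = (model_effect - pes) ** 2

    return data_term + lam * knowledge_term


# Toy usage with a linear predictor standing in for a neural network.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([0.5, -0.2, 0.0])
y = X @ true_w


def model(A):
    return A @ true_w


# PES that matches the model's effect for exposure 0: no penalty is added.
agree = informed_loss(model, X, y, exposure_idx=0, pes=0.5)
# PES with the opposite sign: the knowledge term dominates the loss.
disagree = informed_loss(model, X, y, exposure_idx=0, pes=-0.5)
```

In this toy setting the weight `lam` controls how strongly the literature value competes with the data; a real implementation would compute the same kind of term on a differentiable network so that gradients flow through both parts of the loss.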

In the last part of this thesis, we propose a concrete application of SEANN using data from a cohort of Spanish adults. Compared to an agnostic neural network-based ERS, the score obtained with SEANN captures relationships that are more in line with literature-based associations, without degrading predictive performance. Moreover, the associations estimated for exposures with poor literature coverage differ significantly from those obtained with the agnostic baseline method, with more plausible directions of association.

In conclusion, our risk scores demonstrate substantial potential for the data-driven discovery of unknown nonlinear environmental health relationships by leveraging existing knowledge about well-known relationships. Beyond their utility in epidemiological research, our risk indicators can capture holistic, individual-level, non-hereditary risk associations that can inform practitioners about actionable factors in high-risk individuals. As personalized preventive medicine in the post-genetic era will focus increasingly on modifiable factors, we believe that such approaches will be instrumental in shaping future healthcare paradigms.


Jury:
Ms. Sandra Bringay, Professor, Université Paul Valéry - Montpellier (Reviewer)
Ms. Sabina Tangaro, Associate Professor, Université de Bari Aldo Moro - Bari, Italy (Reviewer)
Mr. Mohand-Saïd Hacid, Professor, LIRIS - Université Claude Bernard Lyon 1 (Examiner)
Ms. Valérie Siroux, Research Director, INSERM - Université Grenoble Alpes (Examiner)
Mr. Rémy Cazabet, Associate Professor, LIRIS - Université Claude Bernard Lyon 1 (Thesis supervisor)
Ms. Léa Maître, Associate Professor, Université Pompeu Fabra - Barcelona, Spain (Thesis supervisor)
Mr. Marc Plantevit, Professor, EPITA Research Laboratory - Kremlin-Bicêtre (Invited member)