Thesis of Pierre Faure--Giovagnoli

Domain Knowledge and Functions in Data Science - Application to Hydroelectricity Production

Defense date: 24/11/2023

Advisor: Vasile-Marian Scuturici
Coadvisor: Jean-Marc Petit


In this dissertation, we investigate the link between domain knowledge in the form of functions and data science. Consider the following scenario. Let D(y, z1, ...., zn) be a dataset, Alice a data scientist, Bob a domain expert and y=f(z1, ..., zn) a function known to Bob from his background knowledge. In this dissertation, we are interested in the following simple yet crucial questions for Alice. How can we define the satisfaction of f in D? How can we measure that satisfaction efficiently? How does this satisfaction relate to the supervised learning task of learning f from D? It turns out that these problems are related to the study of counterexamples through the use of functional dependencies (FDs) and, in particular, FD measures used to quantify their satisfaction in a dataset such as the g3 indicator. More specifically, we consider the case where the equality is replaced by more flexible predicates, a relaxation that is now common in the literature.

First, we examine the complexity of computing g3. It is known that g3 can be computed in polynomial time when using equality, while it becomes NP-hard when using general predicates. Our goal is to refine this dichotomy by studying the impact of the following common properties: reflexivity, transitivity, symmetry, and antisymmetry. We show that symmetry and transitivity together are sufficient to guarantee that the g3-error can be computed in polynomial time. However, removing one of them makes the problem NP-hard. Second, we study the computation of g3 in the polynomial and NP-hard cases identified above. We propose different exact and approximate solutions for the computation of g3 in both cases. We compare these solutions in a detailed experimental study of time performance and approximation accuracy.  All the algorithms are also made available via fastg3, a fast open-source Python library with an underlying C++ implementation. Third, we link counterexamples and g3 to supervised learning with a web application called ADESIT. ADESIT is intended to be part of an iterative data refinement process right after data selection and just before the machine learning process itself. It provides a way to evaluate the ability of a dataset to perform well for a given supervised learning problem through statistics and visual exploration. Finally, we validate our approach by applying it to the industrial problem of air gap monitoring in compact hydro-generators, and develop a solution for automatically processing the recorded data. 

Mme. Amer-Yahia SihemDirecteur(trice) de rechercheUniversité Grenoble AlpesRapporteur(e)
M. Palpanas ThemisProfesseur(e)Université Paris CiteRapporteur(e)
M. Bozga MariusIngénieur(e) de rechercheUniversité Grenoble AlpesExaminateur​(trice)
Mme. Laforest FrédériqueProfesseur(e)INSA LyonExaminateur​(trice)
M. Senelart PierreProfesseur(e)École normale supérieureExaminateur​(trice)
M. Scuturici Vasile-MarianProfesseur(e)INSA LyonDirecteur(trice) de thèse
M. Petit Jean-MarcProfesseur(e)INSA LyonCo-directeur (trice)