Thèse de Pierre Faure--Giovagnoli

Thesis of Pierre Faure--Giovagnoli

Subject:

Domain Knowledge and Functions in Data Science - Application to Hydroelectricity Production

Start date: 01/09/2020
Defense date: 24/11/2023

Advisor: Vasile-Marian Scuturici
Coadvisor: Jean-Marc Petit

Summary:

In this dissertation, we investigate the link between domain knowledge in the form of functions and data science. Consider the following scenario. Let D(y, z1, ...., zn) be a dataset, Alice a data scientist, Bob a domain expert and y=f(z1, ..., zn) a function known to Bob from his background knowledge. In this dissertation, we are interested in the following simple yet crucial questions for Alice. How can we define the satisfaction of f in D? How can we measure that satisfaction efficiently? How does this satisfaction relate to the supervised learning task of learning f from D? It turns out that these problems are related to the study of counterexamples through the use of functional dependencies (FDs) and, in particular, FD measures used to quantify their satisfaction in a dataset such as the g₃ indicator. More specifically, we consider the case where the equality is replaced by more flexible predicates, a relaxation that is now common in the literature.

First, we examine the complexity of computing g₃. It is known that g₃ can be computed in polynomial time when using equality, while it becomes NP-hard when using general predicates. Our goal is to refine this dichotomy by studying the impact of the following common properties: reflexivity, transitivity, symmetry, and antisymmetry. We show that symmetry and transitivity together are sufficient to guarantee that the g₃-error can be computed in polynomial time. However, removing one of them makes the problem NP-hard. Second, we study the computation of g₃ in the polynomial and NP-hard cases identified above. We propose different exact and approximate solutions for the computation of g3 in both cases. We compare these solutions in a detailed experimental study of time performance and approximation accuracy. All the algorithms are also made available via fastg3, a fast open-source Python library with an underlying C++ implementation. Third, we link counterexamples and g₃ to supervised learning with a web application called ADESIT. ADESIT is intended to be part of an iterative data refinement process right after data selection and just before the machine learning process itself. It provides a way to evaluate the ability of a dataset to perform well for a given supervised learning problem through statistics and visual exploration. Finally, we validate our approach by applying it to the industrial problem of air gap monitoring in compact hydro-generators, and develop a solution for automatically processing the recorded data.

Jury:

Mme. Amer-Yahia Sihem	Directeur(trice) de recherche	Université Grenoble Alpes	Rapporteur(e)
M. Palpanas Themis	Professeur(e)	Université Paris Cite	Rapporteur(e)
M. Bozga Marius	Ingénieur(e) de recherche	Université Grenoble Alpes	Examinateur(trice)
Mme. Laforest Frédérique	Professeur(e)	INSA Lyon	Examinateur(trice)
M. Senelart Pierre	Professeur(e)	École normale supérieure	Examinateur(trice)
M. Scuturici Vasile-Marian	Professeur(e)	INSA Lyon	Directeur(trice) de thèse
M. Petit Jean-Marc	Professeur(e)	INSA Lyon	Co-directeur (trice)