Thesis of Loujain Liekah
Subject:
Defense date: 29/09/2023
Advisor: Mohand-Said Hacid
Coadvisor: Haytham Elghazel
Co-supervisor: Fabien De Marchi
Summary:
Data integration expands the information available for subsequent analysis by linking multiple sources. For example, linking patients' blood test results with their clinical assessments enriches the data and provides a more comprehensive input for further applications. The main task in data integration is to identify multiple representations of the same real-world object or subject, a task referred to as entity matching (EM). Most EM solutions rely on either training a binary classifier or clustering a similarity graph built from textual similarity measures. However, labelled data for EM is scarce, and real-world data is heterogeneous, containing not only textual attributes but also numerical, categorical, and boolean ones.
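The attribute-heterogeneity issue can be illustrated with a minimal, hypothetical sketch: a pairwise record similarity that handles textual, boolean, and numerical attributes and skips missing values. The function name and scoring rules below are illustrative assumptions, not the actual D-HAT method.

```python
def record_similarity(a, b):
    """Average per-attribute similarity between two records with
    mixed attribute types. Missing values (None) are skipped.
    Illustrative only -- not the scoring used by D-HAT."""
    sims = []
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # missing value: ignore this attribute
        if isinstance(x, (bool, str)):
            # boolean / categorical / textual: exact match
            sims.append(1.0 if x == y else 0.0)
        else:
            # numerical: similarity decays with relative distance
            denom = max(abs(x), abs(y), 1e-9)
            sims.append(1.0 - min(abs(x - y) / denom, 1.0))
    return sum(sims) / len(sims) if sims else 0.0

# Two representations of the same patient, differing slightly in one lab value
r1 = ("alice", 34, True, 5.1)
r2 = ("alice", 34, True, 5.0)
print(round(record_similarity(r1, r2), 3))  # → 0.995
```

Pairs whose similarity exceeds a threshold would then be declared matches, without any labelled training data.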
When the linked data refers to the same set of subjects at different points in time, it constitutes longitudinal data. Analysing longitudinal data allows for understanding how observations change over time for a set of subjects. Longitudinal data clustering can identify groups of subjects sharing similar characteristics over time. However, applying traditional clustering algorithms repeatedly to the data is inefficient and requires additional effort to interpret the significance of the results. Furthermore, most current longitudinal clustering algorithms are either univariate, i.e., they analyse only a single behavioural variable over time, or model-based, i.e., specific to certain distributions, limiting their generalizability across different data sets. While data stream methods offer potential solutions for longitudinal clustering, they are constrained by many user-defined parameters and lack focus on the subjects of interest.
Our study addresses two key challenges: i. performing entity matching to link data sources with heterogeneous attribute types without labelled data, and ii. clustering multivariate longitudinal data dynamically to identify patterns over different assessment times. Our solution is twofold. First, we develop an unsupervised framework called Deduplication over Heterogeneous Attribute Types (D-HAT), which effectively performs entity matching on data sets with high dimensionality, missing values, and diverse attribute types; D-HAT achieves state-of-the-art results on several benchmark and real-world data sets. Second, we design a dynamic algorithm for clustering multivariate longitudinal data. This approach leverages entity matching to find and link similar clusters across assessment times, enabling the identification of temporal patterns and trajectories. Our method provides transparency for medical applications such as patient subtyping and disease progression modelling. We validate our approach on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) real-world data set, demonstrating its effectiveness in identifying subtypes and detecting early signs of dementia.
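The cluster-linking step can be sketched as follows: clusters obtained at consecutive assessment times are matched by centroid distance, and chained matches form trajectories. This greedy nearest-centroid pairing is a simplified, hypothetical stand-in for the entity-matching-based linking used in the thesis.

```python
def euclidean(c1, c2):
    """Euclidean distance between two centroids of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def link_clusters(centroids_t, centroids_t1):
    """Greedily link each cluster at time t to its nearest unclaimed
    cluster at time t+1. Returns a dict {index_at_t: index_at_t+1}.
    Illustrative only -- a stand-in for the actual matching step."""
    links, taken = {}, set()
    for i, c in enumerate(centroids_t):
        best, best_d = None, float("inf")
        for j, c2 in enumerate(centroids_t1):
            if j in taken:
                continue  # each target cluster is linked at most once
            d = euclidean(c, c2)
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            links[i] = best
            taken.add(best)
    return links

# Cluster centroids at two assessment times (2-D feature space)
t0 = [(0.0, 0.0), (5.0, 5.0)]
t1 = [(4.8, 5.2), (0.3, -0.1)]
print(link_clusters(t0, t1))  # → {0: 1, 1: 0}
```

Following the links across all assessment times yields one trajectory per cluster, which is what makes the resulting subtypes and progression patterns interpretable.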
Keywords: data integration, entity matching, longitudinal data, unsupervised machine learning, multivariate clustering, patient subtyping, disease progression modelling, Alzheimer’s disease
Jury:
Mme ZEITOUNI Karine | Professor | Université de Versailles Saint-Quentin-en-Yvelines | Reviewer |
Mme AZZAG Hanene | Associate Professor | Université Paris 13 | Reviewer |
M. MEPHU NGUIFO Engelbert | Professor | Université Clermont Auvergne | Examiner |
Mme SEBA Hamida | Associate Professor | Université Lyon 1 | Examiner |
M. HACID Mohand-Saïd | Professor | Université Lyon 1 | Thesis Advisor |
M. ELGHAZEL Haytham | Associate Professor | Université Lyon 1 | Co-advisor |
M. DE MARCHI Fabien | Associate Professor | Université Lyon 1 | Co-supervisor |