Thesis of Loujain Liekah

Entity Matching and Clustering Multivariate Longitudinal Data - Application on Patient Subtyping

Defense date: 29/09/2023

Advisor: Mohand-Said Hacid
Co-advisor: Haytham Elghazel
Co-supervisor: Fabien De Marchi


Data integration expands the information available for subsequent analysis by linking multiple sources. For example, linking patients' blood test results with their clinical assessments enriches the data and provides a more comprehensive input for further applications. The main task in data integration is to identify multiple representations of the same real-world object or subject, which is referred to as entity matching (EM). Most EM solutions rely on either training a binary classifier or clustering a similarity graph generated using textual similarity measures. However, labeled data for EM is scarce, and real-world data is heterogeneous, containing not only textual attributes but also numerical, categorical, and boolean attributes.
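To make the notion of similarity over heterogeneous attributes concrete, the following is a minimal sketch of a mixed-type record similarity in the spirit of Gower's coefficient. It is an illustration only and not the method developed in the thesis; the function names, the `schema` layout, and the choice of `difflib` for text similarity are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b, kind, value_range=1.0):
    """Similarity in [0, 1] for one attribute, or None if a value is missing."""
    if a is None or b is None:
        return None
    if kind == "numeric":
        # Range-normalised absolute difference, clipped at 0.
        return max(0.0, 1.0 - abs(a - b) / value_range)
    if kind in ("categorical", "boolean"):
        # Exact match or nothing.
        return 1.0 if a == b else 0.0
    if kind == "text":
        # Case-insensitive character-level similarity.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    raise ValueError(f"unknown attribute kind: {kind}")


def record_similarity(r1, r2, schema):
    """Average the per-attribute similarities over attributes present in both records."""
    scores = []
    for name, (kind, value_range) in schema.items():
        s = attribute_similarity(r1.get(name), r2.get(name), kind, value_range)
        if s is not None:
            scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical schema: attribute name -> (kind, value range for numeric attributes).
schema = {
    "name": ("text", None),
    "age": ("numeric", 100.0),
    "sex": ("categorical", None),
    "smoker": ("boolean", None),
}
```

Missing values are skipped rather than penalised, so two records are compared only on the attributes they both report. Pairs whose score exceeds a chosen threshold could then become edges of a similarity graph.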

When the linked data refers to the same set of subjects at different points in time, it constitutes longitudinal data. Analysing longitudinal data allows for understanding the changes in observations over time for a set of subjects. Longitudinal data clustering can identify groups of subjects sharing similar characteristics over time. However, applying traditional clustering algorithms repeatedly on the data is inefficient and requires additional effort to interpret the significance of the results. Furthermore, most current longitudinal clustering algorithms are either univariate, i.e., they analyze only a single behavioral variable over time, or model-based, i.e., specific to certain distributions, which limits their generalizability across different data sets. While data stream methods offer potential solutions for longitudinal clustering, they are constrained by many user-defined parameters and lack focus on the subject of interest.

Our study addresses two key challenges: (i) performing entity matching to link data sources with heterogeneous attribute types without labelled data, and (ii) clustering multivariate longitudinal data dynamically to identify patterns over different assessment times. The solution to these challenges is twofold. First, we develop an unsupervised framework called Deduplication over Heterogeneous Attribute Types (D-HAT), which effectively performs entity matching on data sets with high dimensionality, missing values, and diverse attribute types. D-HAT produces state-of-the-art results over different benchmark and real-world data sets. Second, we design a dynamic algorithm for clustering multivariate longitudinal data. This approach leverages entity matching to find and link similar clusters across different assessment times, enabling the identification of temporal patterns and trajectories. Our proposed method provides transparency for medical applications such as patient subtyping and disease progression modelling. We validate our approach using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) real-world data set, demonstrating its effectiveness in identifying subtypes and detecting early signs of dementia.
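The idea of linking similar clusters across assessment times can be illustrated with a small sketch. This is not the algorithm proposed in the thesis: it assumes, purely for illustration, that clusters are compared by the Euclidean distance between their centroids and that a link is kept only below a user-chosen `max_distance`. All names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def link_clusters(clusters_t0, clusters_t1, max_distance):
    """Map each cluster id at time t0 to the closest cluster id at time t1.

    A cluster is linked to None when no t1 centroid lies within max_distance.
    """
    centroids_t1 = {cid: centroid(pts) for cid, pts in clusters_t1.items()}
    links = {}
    for cid0, pts in clusters_t0.items():
        c0 = centroid(pts)
        best_id, best_d = None, float("inf")
        for cid1, c1 in centroids_t1.items():
            d = math.dist(c0, c1)  # Euclidean distance (Python 3.8+)
            if d < best_d:
                best_id, best_d = cid1, d
        links[cid0] = best_id if best_d <= max_distance else None
    return links


# Toy clusters at two assessment times: cluster id -> member measurements.
visit_1 = {"A": [(0.0, 0.0), (1.0, 0.0)], "B": [(10.0, 10.0), (11.0, 10.0)]}
visit_2 = {"X": [(10.5, 10.5)], "Y": [(0.5, 0.5)]}
links = link_clusters(visit_1, visit_2, max_distance=2.0)
```

Chaining such links over successive visits yields a sequence of cluster labels per group of subjects, which is one simple way to read off a trajectory.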

Keywords: data integration, entity matching, longitudinal data, unsupervised machine learning, multivariate clustering, patient subtyping, disease progression modeling, Alzheimer’s disease

Ms. ZEITOUNI Karine, Professor, Université de Versailles Saint-Quentin-en-Yvelines, Reviewer
Ms. AZZAG Hanene, Associate Professor, Université Paris 13, Reviewer
Mr. MEPHU NGUIFO Engelbert, Professor, Université Clermont Auvergne, Examiner
Ms. SEBA Hamida, Associate Professor, Université Lyon 1, Examiner
Mr. HACID Mohand-Saïd, Professor, Université Lyon 1, Thesis advisor
Mr. ELGHAZEL Haytham, Associate Professor, Université Lyon 1, Co-advisor
Mr. DE MARCHI Fabien, Associate Professor, Université Lyon 1, Co-supervisor