Thesis of Thi Hoa Le


Subject:
Enhancing Data Quality and Fairness in Data for Federated Learning-Based Systems: A Human-Guided Approach to Distributed Data Curation

Start date: 10/03/2025
End date (estimated): 10/03/2028

Advisor: Angela Bonifati
Coadvisor: Andrea Mauri

Summary:

The growing adoption of federated learning (FL) in healthcare has brought significant opportunities for data-driven decision-making while preserving data privacy. However, the success of FL hinges on the quality and fairness of the underlying data. Heterogeneous, distributed datasets often suffer from poor data quality, imbalance, and biases, which can undermine the robustness and equity of healthcare solutions. Addressing these challenges is critical to ensuring that FL-based systems provide reliable and equitable outcomes.

The first component of this work is a comprehensive characterization of data quality, imbalance, and heterogeneity, since these issues can lead to significant inefficiencies and biased outcomes that disproportionately affect underrepresented or marginalized populations. We conduct an empirical study to evaluate distributed datasets from both project-specific use cases and publicly available sources. This study investigates how inconsistencies such as missing, dirty, or erroneous data exacerbate biases, particularly when such issues arise systematically or non-randomly. Sensitive attributes like race, gender, or socioeconomic status are analyzed alongside indirect indicators, such as zip codes, which may unintentionally propagate discrimination. Using established bias metrics, such as statistical parity difference and equal opportunity difference, the analysis provides a nuanced understanding of bias and data quality, establishing a baseline for subsequent tasks.
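
As an illustration of the two bias metrics named above, the following minimal sketch computes them for binary predictions and a binary sensitive attribute. It is not the thesis implementation: the function names, the encoding (1 = unprivileged group, 0 = privileged group), and the toy data are assumptions made for this example. Statistical parity difference is taken here as the difference in positive-prediction rates between the two groups, and equal opportunity difference as the difference in true positive rates; a value of 0 indicates parity under either metric.

```python
from typing import Sequence


def _rate(values: Sequence[int]) -> float:
    """Share of 1s in a non-empty sequence of 0/1 values."""
    if not values:
        raise ValueError("group is empty; the metric is undefined")
    return sum(values) / len(values)


def statistical_parity_difference(y_pred: Sequence[int],
                                  protected: Sequence[int]) -> float:
    """P(pred=1 | unprivileged) - P(pred=1 | privileged).

    Assumed encoding: protected == 1 marks the unprivileged group.
    """
    unpriv = [p for p, a in zip(y_pred, protected) if a == 1]
    priv = [p for p, a in zip(y_pred, protected) if a == 0]
    return _rate(unpriv) - _rate(priv)


def equal_opportunity_difference(y_true: Sequence[int],
                                 y_pred: Sequence[int],
                                 protected: Sequence[int]) -> float:
    """TPR(unprivileged) - TPR(privileged), using only rows with y_true == 1."""
    rows = list(zip(y_true, y_pred, protected))
    unpriv = [p for t, p, a in rows if a == 1 and t == 1]
    priv = [p for t, p, a in rows if a == 0 and t == 1]
    return _rate(unpriv) - _rate(priv)


# Toy data (hypothetical): first four rows unprivileged, last four privileged.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
protected = [1, 1, 1, 1, 0, 0, 0, 0]

spd = statistical_parity_difference(y_pred, protected)
eod = equal_opportunity_difference(y_true, y_pred, protected)
```

In a federated setting, each client could compute the group-wise counts locally and share only those aggregates, so that the metrics are obtained without moving raw records; that aggregation step is omitted here.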

Building on this foundation, the second component focuses on developing methods for data curation tailored to FL systems. Traditional curation approaches, while efficient, often fail to address complex errors that require domain knowledge or contextual understanding. In distributed settings, where data resides on multiple FL clients and is typically not independent and identically distributed (non-IID), the challenge becomes even more pronounced. This work introduces human-guided curation algorithms that integrate domain expertise into the data repair process. These algorithms adopt an active learning framework to balance the cost and availability of engaging domain experts with the need for accurate repairs. By combining human insight with machine intelligence, the algorithms dynamically adapt to different levels of expertise and repair requirements.
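
The trade-off between expert cost and repair accuracy can be sketched as a single round of budgeted uncertainty sampling, one common active learning strategy. This is an illustrative assumption, not the algorithm developed in the thesis: the names `RepairCandidate`, `curate`, `ask_expert`, `budget`, and `threshold` are hypothetical, and the step of retraining the repair model on expert answers is omitted. The least confident automatic repairs are routed to the expert until the budget is spent; all other repairs are applied automatically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RepairCandidate:
    cell_id: str       # identifier of the erroneous cell (hypothetical)
    auto_value: str    # repair proposed by an automatic method
    confidence: float  # the method's own confidence, in [0, 1]


def curate(candidates: List[RepairCandidate],
           ask_expert: Callable[[RepairCandidate], str],
           budget: int,
           threshold: float = 0.8) -> Dict[str, Tuple[str, str]]:
    """Return {cell_id: (repaired_value, source)} with source 'expert' or 'auto'."""
    # Rank low-confidence repairs from least to most confident.
    uncertain = sorted(
        (c for c in candidates if c.confidence < threshold),
        key=lambda c: c.confidence,
    )
    # Spend the expert budget on the most uncertain cells only.
    to_ask = {c.cell_id for c in uncertain[:budget]}

    repairs: Dict[str, Tuple[str, str]] = {}
    for c in candidates:
        if c.cell_id in to_ask:
            repairs[c.cell_id] = (ask_expert(c), "expert")
        else:
            repairs[c.cell_id] = (c.auto_value, "auto")
    return repairs


# Hypothetical usage: a stand-in "expert" that returns a fixed correction.
candidates = [
    RepairCandidate("a", "auto-a", 0.95),
    RepairCandidate("b", "auto-b", 0.30),
    RepairCandidate("c", "auto-c", 0.55),
    RepairCandidate("d", "auto-d", 0.10),
]
repairs = curate(candidates, ask_expert=lambda c: "fixed-" + c.cell_id, budget=2)
```

With a budget of two queries, only the two least confident cells reach the expert. Raising the budget or the threshold shifts the balance toward human review, which is one simple way to adapt to expert availability.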