Thesis of Farouk Damoun


Subject:
Enhancing Federated Learning for Financial Sector via Graph Learning and Language Models

Start date: 02/11/2021
End date (estimated): 02/11/2024

Advisor: Hamida Seba
Cotutelle: Radu State

Summary:

In the modern financial sector, robust machine learning models are increasingly critical, yet privacy regulations and competitive concerns often make centralized data inaccessible. To overcome these challenges, this dissertation proposes several novel Federated Learning (FL) methodologies that enable institutions to train models collaboratively. These methodologies address the trade-off between privacy and data utility by integrating privacy-preserving mechanisms designed to prevent input recovery with minimal loss of utility.

A key contribution of this research is a federated learning framework for privacy-preserving behavioral anomaly detection and fraud detection in financial transactions. By applying Graph Neural Networks (GNNs) to dynamic ego-centric graphs, the framework captures evolving transactional patterns and detects anomalies effectively while preserving privacy. A novel domain-specific negative sampling technique enables model training without labeled data from the federation participants, making the framework highly applicable in real-world scenarios. The results demonstrate that deep learning-based methods, particularly graph-level embedding, outperform traditional approaches in anomaly detection and improve fraud detection, and that anonymization and noise-based mechanisms preserve privacy even when the shared model gradients are exposed.
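To make the notion of a dynamic ego-centric graph concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes transactions arrive as (cardholder, merchant, timestamp, amount) tuples and that one snapshot is built per fixed-length time window; the function name ego_snapshots, the tuple layout, and the choice of summed amounts as edge weights are all illustrative assumptions.

```python
from collections import defaultdict


def ego_snapshots(transactions, ego, window):
    """Build one 1-hop ego graph per time window for a single cardholder.

    transactions: iterable of (cardholder, merchant, timestamp, amount)
                  tuples (hypothetical layout, assumed for this sketch).
    ego:          the cardholder at the center of the graph.
    window:       window length, in the same unit as the timestamps.

    Returns {window_index: {merchant: total_amount}}, ordered by window.
    """
    snapshots = defaultdict(lambda: defaultdict(float))
    for cardholder, merchant, timestamp, amount in transactions:
        if cardholder != ego:
            continue  # keep only edges incident to the ego node
        # Edge weight = total amount spent at this merchant in the window.
        snapshots[timestamp // window][merchant] += amount
    return {w: dict(edges) for w, edges in sorted(snapshots.items())}


if __name__ == "__main__":
    records = [
        ("c1", "m1", 0, 10.0),
        ("c1", "m2", 5, 2.5),
        ("c1", "m1", 12, 4.0),
        ("c2", "m1", 3, 99.0),
    ]
    print(ego_snapshots(records, ego="c1", window=10))
```

The sequence of snapshots is what makes the graph dynamic: a graph-level model can compare consecutive windows for the same cardholder, and each participant can compute these snapshots locally from its own records.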

Additionally, we propose G-HIN2Vec, a graph-level embedding technique for heterogeneous information networks, which models individuals, such as cardholders, using static and dynamic ego-centric graphs. This method serves as an anonymization mechanism that eliminates the need for personally identifiable information (PII) in federated models. By integrating Personalized Local Differential Privacy (PLDP), we provide an additional layer of protection, ensuring that even in the event of a model breach, sensitive data remains secure.
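As an illustration of how a local, per-participant privacy budget can be applied to an embedding before it leaves the device, here is a minimal sketch based on the standard Laplace mechanism. It is an assumption that this resembles the PLDP mechanism used in the dissertation; the function names, the clipping bound, and the example budgets are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_embedding(embedding, epsilon, clip=1.0, rng=None):
    """Clip an embedding and add Laplace noise calibrated to `epsilon`.

    Each coordinate is clipped to [-clip, clip], so any two clipped
    embeddings differ by at most 2 * clip * d in L1 norm. Noise with
    scale (2 * clip * d) / epsilon then gives epsilon-local DP for
    the released vector.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    clipped = [max(-clip, min(clip, x)) for x in embedding]
    scale = 2.0 * clip * len(clipped) / epsilon
    return [x + laplace_noise(scale, rng) for x in clipped]


if __name__ == "__main__":
    # "Personalized": every participant chooses its own budget.
    budgets = {"participant_a": 0.5, "participant_b": 4.0}  # hypothetical
    embedding = [0.3, -0.8, 1.7, 0.0]
    for name, eps in budgets.items():
        print(name, perturb_embedding(embedding, eps))
```

A smaller epsilon means a larger noise scale and stronger protection at the cost of utility, which is the trade-off the personalized budgets let each participant tune.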

Finally, the dissertation introduces the Federated Byte-Level Byte Pair Encoding (BPE) Tokenizer, a novel privacy-preserving tokenization approach designed for distributed textual datasets. This tokenizer outperforms existing tokenizers in vocabulary coverage and efficiency while maintaining rigorous data privacy. It not only competes with centrally trained tokenizers but also improves both text compression and privacy preservation, in general-purpose and domain-specific settings alike.
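One way byte-level BPE training could be distributed is for each participant to share only adjacent-pair counts, never raw text, while a coordinator selects each merge from the summed counts. The sketch below illustrates that idea under stated assumptions: it is not the dissertation's protocol, it adds no noise or secure aggregation to the counts, and all function names are hypothetical.

```python
from collections import Counter


def to_byte_tokens(text):
    # Byte-level start: every UTF-8 byte is its own token.
    return [bytes([b]) for b in text.encode("utf-8")]


def count_pairs(sequences):
    # Local statistic computed by one participant.
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def apply_merge(seq, pair):
    # Replace each left-to-right occurrence of `pair` with one token.
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def federated_bpe(client_corpora, num_merges):
    """Learn up to `num_merges` merges from per-client pair counts."""
    clients = [[to_byte_tokens(t) for t in corpus] for corpus in client_corpora]
    merges = []
    for _ in range(num_merges):
        total = Counter()
        for seqs in clients:
            total.update(count_pairs(seqs))  # only counts leave a client
        if not total:
            break  # nothing left to merge
        # Highest count wins; ties are broken by the smallest pair.
        best = min(total, key=lambda p: (-total[p], p))
        merges.append(best)
        # Each client applies the broadcast merge to its own data.
        clients = [[apply_merge(s, best) for s in seqs] for seqs in clients]
    return merges


if __name__ == "__main__":
    corpora = [["wire transfer", "wire fee"], ["transfer fee"]]
    print(federated_bpe(corpora, num_merges=5))
```

Because the coordinator sees the same summed counts a central trainer would compute over the pooled text, this simplified scheme learns the same merges as centralized BPE with the same tie-breaking rule; a privacy-preserving variant would additionally need to protect the counts themselves.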

The methodologies presented in this dissertation, validated on real-world transaction datasets and financial text datasets, highlight the potential of federated learning to enhance fraud detection and language model performance while preserving the privacy of individuals and institutions through anonymization and noise-based privacy mechanisms.