Thesis of Brandon Mosqueda
Subject:
Start date: 24/11/2023
End date (estimated): 24/11/2026
Supervisor: Omar Hasan
Co-supervision: Lionel Brunie
Abstract:
The increasing volume of data has driven the adoption of machine learning techniques, but concerns about data privacy have emerged. Privacy-preserving machine learning was proposed to address these challenges, yet current solutions face issues like reliance on a central party, data leakage risks, high computational costs, and reduced model utility.
Federated learning is a commonly used approach that enables collaborative model development without sharing raw data, but it relies on a central server, leading to communication bottlenecks and privacy risks. To overcome these limitations, fully decentralized learning emerged, allowing devices to communicate directly for model updates and aggregation. However, recent studies [1, 2, 3] have shown that raw data can still be reconstructed from shared model updates. Privacy-preserving machine learning now focuses on protecting model updates from malicious users.
This research proposal aims to develop a decentralized machine learning framework that strikes a better balance between privacy, efficiency, and model utility. It will explore the application of Secure Multiparty Computation (SMC) for privacy during model aggregation and efficient communication topologies.
State of the art
Federated learning is a framework that enables training a machine learning model directly on remote devices, without sharing data. It improves data privacy and allows training on large-scale datasets by distributing computational operations among devices. However, it relies on a central coordinator server that becomes a bottleneck because it manages both the training process and model aggregation [4]. To address this, decentralized learning was introduced, where devices communicate directly with neighboring devices, sharing model updates and distributing the aggregation task.
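The difference between the two aggregation styles can be illustrated with a minimal sketch. It assumes toy models represented as short lists of floats, a ring topology, and uniform averaging weights; the function names (`ring_neighbors`, `gossip_round`, `federated_average`) are hypothetical and are not taken from any of the cited works.

```python
def ring_neighbors(n):
    """Assumed ring topology: node i exchanges models with i-1 and i+1."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def gossip_round(models, neighbors):
    """Decentralized step: each node averages its model with its neighbors'."""
    new_models = []
    for i, own in enumerate(models):
        group = [own] + [models[j] for j in neighbors[i]]
        new_models.append(
            [sum(m[k] for m in group) / len(group) for k in range(len(own))]
        )
    return new_models


def federated_average(models):
    """Centralized baseline: a single coordinator averages every model."""
    n = len(models)
    return [sum(m[k] for m in models) / n for k in range(len(models[0]))]


# Toy local models held by six devices (illustrative values only).
models = [[float(i), float(2 * i)] for i in range(6)]
target = federated_average(models)   # what a central server would compute
neighbors = ring_neighbors(6)

# Repeated neighbor-only averaging drives every node toward the same mean.
for _ in range(50):
    models = gossip_round(models, neighbors)
```

In this sketch no node ever sees more than its two neighbors' models, yet all nodes end up close to the value the coordinator would have produced. Real decentralized training interleaves such averaging steps with local gradient updates, which the sketch omits.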
Decentralized learning offers better privacy guarantees [5] and higher efficiency than federated learning [6]. However, studies like [2] have shown that malicious devices can reconstruct training data from model updates. Initial proposals focused on improving efficiency and utility, while subsequent studies aimed to protect model parameters shared among users, leading to trade-offs between privacy, utility, and efficiency.
Increasing privacy comes at a cost in efficiency or even in utility. For this reason, the authors of [7], after extensive experiments with several novel attacks on modern decentralized frameworks, conclude that, contrary to what is claimed, current decentralized proposals do not offer any security advantage over federated learning. Strong privacy guarantees in decentralized learning would require more densely connected networks, which would lose any practical advantage over the federated setup at large scale and therefore defeat the objective of the decentralized approach.
In [8], for example, a secure aggregation protocol is proposed for decentralized learning. The protocol solves the privacy problem in the honest-but-curious scenario even with n − 2 malicious users, but it is impractical in large-scale scenarios because it effectively emulates the topology of federated learning and overloads one of the nodes.
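To make the idea of secure aggregation concrete, the following is a minimal sketch of additive secret sharing, a generic textbook SMC building block. It is not the protocol of [8]. The modulus, the fixed-point scale, and all function names are assumptions chosen for illustration, and the sketch only addresses the honest-but-curious setting.

```python
import secrets

MODULUS = 2**61 - 1   # assumed modulus for share arithmetic
SCALE = 10**6         # assumed fixed-point scale for encoding floats


def encode(x):
    """Map a float to a non-negative integer modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(v):
    """Inverse of encode; values above MODULUS // 2 represent negatives."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def share(value, n_parties):
    """Split value into n_parties random shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(updates):
    """Sum the parties' update vectors without exposing any single update."""
    n, dim = len(updates), len(updates[0])
    # inbox[j][k] accumulates the shares party j receives for coordinate k.
    inbox = [[0] * dim for _ in range(n)]
    for update in updates:
        for k, x in enumerate(update):
            for j, s in enumerate(share(encode(x), n)):
                inbox[j][k] = (inbox[j][k] + s) % MODULUS
    # Each party publishes only its partial sums; combining them gives the total.
    totals = [sum(inbox[j][k] for j in range(n)) % MODULUS for k in range(dim)]
    return [decode(t) for t in totals]


# Three parties with toy two-dimensional model updates (illustrative values).
updates = [[0.5, -1.25], [1.5, 2.0], [-0.25, 0.75]]
total = secure_sum(updates)
average = [t / len(updates) for t in total]
```

Each individual share is uniformly random, so a party that sees only some of the shares of an update learns nothing about it. The cost is that every party sends one share per coordinate to every other party, which is the kind of communication overhead that limits scalability.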
Scientific objectives
Current decentralized learning proposals neglect at least one of the three properties discussed above (privacy, utility, and efficiency), which makes federated learning more practical when privacy is a major concern in large-scale scenarios. A decentralized framework that maintains a good balance among them is nevertheless of great importance, because it would allow the advantages of full decentralization to be exploited: no reliance on a central trusted server, parallelization, fault tolerance, and scalability. This thesis will investigate how decentralized learning can be improved so that a better trade-off between privacy, utility, and efficiency is achieved. The scientific objectives of this work are:
Privacy: Explore the application of efficient Secure Multiparty Computation algorithms to ensure privacy during model aggregation.
Efficiency: Investigate the impact of topology on efficiency, model convergence and communication costs.
Utility: Integrate the privacy and efficiency solutions in a framework without degrading model utility, especially in the presence of non-independent and identically distributed (non-IID) data.
Fault tolerance: Study the fault tolerance of the framework, including scenarios such as node drop-out during training and high latency in communication.
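The efficiency objective can be illustrated with a small experiment sketch that compares two communication topologies on a plain averaging task. It assumes scalar node values, uniform averaging weights, and one message per directed edge per round; the helper names and the tolerance are hypothetical, and the sketch does not model training, latency, or node drop-out.

```python
def ring(n):
    """Sparse topology: every node has exactly two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def complete(n):
    """Dense topology: every node is connected to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def rounds_to_consensus(values, neighbors, tol=1e-6, max_rounds=10_000):
    """Return (rounds, total messages) until all nodes are within tol of the mean."""
    mean = sum(values) / len(values)
    # Assumption: one message per directed edge per round.
    messages_per_round = sum(len(peers) for peers in neighbors.values())
    for r in range(max_rounds + 1):
        if max(abs(v - mean) for v in values) < tol:
            return r, r * messages_per_round
        # Each node averages its own value with those of its neighbors.
        values = [
            (values[i] + sum(values[j] for j in neighbors[i]))
            / (1 + len(neighbors[i]))
            for i in range(len(values))
        ]
    raise RuntimeError("no consensus within max_rounds")


start = [float(i) for i in range(16)]  # toy initial values, one per node
ring_rounds, ring_msgs = rounds_to_consensus(start, ring(16))
full_rounds, full_msgs = rounds_to_consensus(start, complete(16))
```

Under these assumptions the complete graph reaches consensus in a single round, but each node must handle n − 1 messages per round, while the ring keeps the per-node load constant at the price of many more rounds. This is the tension between network density and scalability that the topology study is meant to quantify.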
Methodology
The research will begin with a comprehensive literature review to understand existing decentralized learning approaches and privacy-preserving techniques, while identifying gaps in the field. Based on these findings, a decentralized machine learning framework will be designed, emphasizing a trade-off between privacy, utility, and efficiency, while considering challenges in large-scale scenarios, non-IID data, and fault tolerance. The framework will incorporate Secure Multiparty Computation (SMC) for privacy-preserving model aggregation and may explore blockchain technology.
Experimental evaluations will be conducted using benchmark datasets and metrics for privacy preservation, model utility, convergence speed, and communication overhead. The results will be compared to existing approaches, and conclusions will be drawn, discussing the strengths, limitations, and potential future research directions, including advanced privacy-preserving techniques, scalability, and other open challenges in decentralized machine learning.
References
[1] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “Inverting gradients – how easy is it to break privacy in federated learning?” 2020. [Online]. Available: https://arxiv.org/abs/2003.14053
[2] F. Boenisch, A. Dziedzic, R. Schuster, A. S. Shamsabadi, I. Shumailov, and N. Papernot, “When the curious abandon honesty: Federated learning is not private,” 2021. [Online]. Available: https://arxiv.org/abs/2112.02918
[3] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, “Beyond inferring class representatives: User-level privacy leakage from federated learning,” 2018. [Online]. Available: https://arxiv.org/abs/1812.00535
[4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” 2016. [Online]. Available: https://arxiv.org/abs/1602.05629
[5] E. Cyffers and A. Bellet, “Privacy amplification by decentralization,” 2020. [Online]. Available: https://arxiv.org/abs/2012.05326
[6] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” 2017. [Online]. Available: https://arxiv.org/abs/1705.09056
[7] D. Pasquini, M. Raynal, and C. Troncoso, “On the privacy of decentralized machine learning,” 2022. [Online]. Available: https://arxiv.org/abs/2205.08443
[8] A.-T. Tran, T.-D. Luong, J. Karnjana, and V.-N. Huynh, “An efficient approach for privacy preserving decentralized deep learning models based on secure multi-party computation,” Neurocomputing, vol. 422, pp. 245–262, 2021. [Online]. Available: https://doi.org/10.1016/j.neucom.2020.10.014