PhD thesis of Fatima El Hattab

Subject:
Towards Effective Privacy Preserving Decentralized AI

Abstract:

The number of edge devices and the volume of data they produce have grown tremendously over the last ten years. While in 2009 mobile phones generated only 0.7% of worldwide data traffic, in 2018 this share exceeded 50% [4]. In addition, users’ privacy is an important issue, as privacy-related scandals (e.g., PRISM or Cambridge Analytica) continue to unfold [2][3] and new regulations come into force (e.g., the EU’s GDPR). In this context, big industrial players are now seeking to exploit the rising power of edge devices to provide AI-as-a-Service and Edge-AI, i.e., Artificial Intelligence services that reduce the demand on server/cloud infrastructures while protecting users’ privacy. This new computing paradigm is known as Federated Learning (FL) [5][6]. Roughly speaking, Federated Learning offloads the cloud storage and computation costs of AI and ML applications onto client devices by training a global model on decentralized data stored locally on the client side.
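
To make the idea of training a global model on decentralized data concrete, the following is a minimal sketch of one common aggregation scheme, federated averaging. It is an illustration under stated assumptions, not the protocol studied in this thesis: the one-dimensional linear model, the learning rate, the number of rounds, and the names local_update, federated_averaging and make_client_data are all hypothetical choices made for the example.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one client's private data.

    Toy model (assumption): y = w * x + b with mean squared error.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_averaging(client_datasets, rounds=100):
    """Server loop: only model parameters are exchanged, never raw data."""
    global_weights = (0.0, 0.0)
    total = sum(len(d) for d in client_datasets)
    for _ in range(rounds):
        # Each client starts from the current global model and trains locally.
        updates = [local_update(global_weights, d) for d in client_datasets]
        # The server averages the local models, weighted by local dataset size.
        w = sum(len(d) * u[0] for d, u in zip(client_datasets, updates)) / total
        b = sum(len(d) * u[1] for d, u in zip(client_datasets, updates)) / total
        global_weights = (w, b)
    return global_weights

def make_client_data(n, rng):
    """Synthetic local dataset drawn from y = 2x + 1 (illustrative only)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

rng = random.Random(0)
clients = [make_client_data(n, rng) for n in (20, 50, 80)]  # unequal sizes
w, b = federated_averaging(clients)
print(f"global model: w={w:.3f}, b={b:.3f}")
```

In this sketch the server only ever sees the pairs (w, b) returned by the clients; the local datasets stay inside local_update.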

Therefore, Federated Learning opens interesting perspectives in privacy-sensitive domains, such as healthcare or user mobility, that have so far been reluctant to adopt AI and machine learning techniques. Indeed, with such decentralized Federated Learning protocols, data is kept private on the client side instead of being sent to a remote service/cloud, as done in classical approaches. However, Federated Learning raises a brand new set of challenges. Recent studies show that Federated Learning is vulnerable to malicious users participating in the distributed protocol, who may perform data poisoning attacks in order to make the global model deviate from its correct behavior [2][7][11]. Such users do not rigorously follow the protocol, either unintentionally, due to human or system errors, or intentionally, due to adversarial behavior. Such behaviors may result, for instance, in disease data mislabelling in digital healthcare systems, wrong radiation information in radiation detection systems, or (un)intentionally biased data in open data systems.
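
As an illustration of what a data poisoning attack can look like on a single client, here is a minimal sketch of label flipping, one simple form of label poisoning. The function name flip_labels, the toy dataset, and the "sick"/"healthy" labels are hypothetical and chosen only to echo the healthcare mislabelling example above; real attacks studied in the literature can be considerably more elaborate.

```python
import random

def flip_labels(dataset, source_label, target_label, fraction, seed=0):
    """Return a copy of dataset where a fraction of the samples labelled
    source_label are relabelled as target_label. Features are untouched."""
    rng = random.Random(seed)
    candidates = [i for i, (_, y) in enumerate(dataset) if y == source_label]
    n_flip = round(fraction * len(candidates))
    flipped = set(rng.sample(candidates, n_flip))
    return [
        (x, target_label if i in flipped else y)
        for i, (x, y) in enumerate(dataset)
    ]

# Toy local dataset: (features, label) pairs, five "sick" and five "healthy".
clean = [([0.1 * i], "sick" if i % 2 else "healthy") for i in range(10)]

# A poisoning client relabels 60% of its "sick" samples as "healthy".
poisoned = flip_labels(clean, "sick", "healthy", fraction=0.6)

n_changed = sum(1 for (_, a), (_, b) in zip(clean, poisoned) if a != b)
print(f"{n_changed} of {len(clean)} labels were flipped")
```

A client that trains on poisoned instead of clean still follows the message format of the protocol, which is part of what makes this kind of deviation hard to spot from the server side.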

The state-of-the-art approaches to tackling malicious clients in classical distributed machine learning make assumptions that do not hold in decentralized Federated Learning systems, such as the assumption that clients’ data are independent and identically distributed [10]. However, data present on client devices are collected by the clients themselves, based on the clients’ own usage patterns and local environments. Both the size and the distribution of the data vary heavily from one client to another. Thus, there is a need for novel algorithms and techniques to efficiently detect data poisoning attacks and counter them in Federated Learning systems.

The research objective of this PhD project is to derive novel Federated Learning protocols that are resilient to data poisoning attacks. The key tasks of this project are:
(i) Exploring different types of data poisoning attacks in Federated Learning, under different use cases, such as disease data mislabelling in digital healthcare systems, or (un)intentionally biased data in open data systems.
(ii) Deriving various data poisoning attack implementations (e.g., data label poisoning, data feature poisoning) on real-world datasets, and proposing detection mechanisms based on techniques such as generative adversarial networks (GANs) [8], model output and gradient monitoring, etc.
(iii) Designing and experimenting with a wide range of defense approaches and hybrid protocols, such as software and hardware-based protocols combining decentralized protocols with trusted hardware execution environments such as Intel SGX [9] and ARM TrustZone [1].
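
The gradient monitoring mentioned in task (ii) can be sketched, under strong simplifying assumptions, as a server-side outlier check on client updates. The example below is not the detection mechanism proposed in this project: it is a generic robust-statistics heuristic (distance to the coordinate-wise median, scaled by the median absolute deviation), and the function name flag_suspicious_updates, the threshold value, and the toy update vectors are all hypothetical. Note also that when clients’ data are not identically distributed, honest updates may legitimately differ from one another, so a distance-based filter of this kind can flag honest clients.

```python
import math
from statistics import median

def flag_suspicious_updates(updates, threshold=3.0):
    """Return the indices of client updates that lie unusually far from
    the coordinate-wise median of all updates.

    updates: list of equal-length lists of floats (one vector per client).
    threshold: how many median absolute deviations above the median
               distance an update must be to get flagged (assumed value).
    """
    dim = len(updates[0])
    center = [median(u[k] for u in updates) for k in range(dim)]
    distances = [math.dist(u, center) for u in updates]
    med = median(distances)
    mad = median(abs(d - med) for d in distances)
    scale = mad if mad > 0 else 1e-12  # avoid division by zero
    return [i for i, d in enumerate(distances) if (d - med) / scale > threshold]

# Toy round: four similar updates and one that points elsewhere.
updates = [
    [0.10, -0.20, 0.05],
    [0.12, -0.18, 0.04],
    [0.09, -0.22, 0.06],
    [0.11, -0.19, 0.05],
    [-2.00, 3.00, -1.50],
]
suspects = flag_suspicious_updates(updates)
print("flagged client indices:", suspects)
```

On this toy input only the last client is flagged; a server could then exclude or down-weight that update before aggregation.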

Advisor: Sara Bouchenak
Co-advisor: Vlad Nitu