Thesis of Rémy Delanaux


Subject:
Privacy-Preserving Linked Data Integration

Defense date: 13/12/2019

Advisor: Angela Bonifati
Co-advisor: Romuald Thion

Summary:

Individual privacy is a major and largely unexplored concern when publishing new
datasets in the context of Linked Open Data (LOD). The LOD cloud forms a network
of interconnected and publicly accessible datasets in the form of graph databases
modeled using the RDF format and queried using the SPARQL language. This heavily
standardized context is mostly used by academics and public institutions, as industrial
and private actors may be discouraged by potential privacy issues.

To address this concern, we introduce and develop a declarative framework for privacy-preserving
Linked Data publishing in which privacy and utility constraints are specified
as policies, that is, as sets of SPARQL queries. Our approach is data-independent:
only the privacy and utility policies need to be inspected to determine the sequence
of anonymization operations applicable to any graph instance so that the policies are
satisfied. We prove the soundness of our algorithms and gauge their performance through
experimental analysis.
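
As a minimal sketch of what such policies and operations can look like (using a
hypothetical ex: vocabulary for illustration, not an excerpt from the thesis), a
privacy policy is a query whose answers must not be disclosed, a utility policy a
query whose answers must be preserved, and an anonymization operation a SPARQL
update derived from them:

    # Privacy policy P: a patient must not be linkable to a disease.
    PREFIX ex: <http://example.org/>
    SELECT ?p ?d WHERE {
      ?p a ex:Patient .
      ?p ex:hasDisease ?d .
    }

    # Utility policy U: which patient is treated at which hospital
    # must remain answerable after anonymization.
    PREFIX ex: <http://example.org/>
    SELECT ?p ?h WHERE {
      ?p ex:treatedAt ?h .
    }

    # A candidate anonymization operation: delete the sensitive link,
    # which empties P's answers while leaving U's answers untouched.
    PREFIX ex: <http://example.org/>
    DELETE { ?p ex:hasDisease ?d }
    WHERE  { ?p a ex:Patient . ?p ex:hasDisease ?d . }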

Another aspect to take into account is that a new dataset published to the LOD cloud
is exposed to privacy breaches through possible linkage to objects already existing
in other LOD datasets. In the second part of this thesis, we thus focus on the
problem of building safe anonymizations of an RDF graph, guaranteeing that linking
the anonymized graph with any external RDF graph will not cause privacy breaches.
Given a set of privacy queries as input, we study the data-independent safety problem
and the sequence of anonymization operations necessary to enforce it. We provide
sufficient conditions under which an anonymization instance is safe with respect to
a given set of privacy queries. Additionally, we show that our algorithms for RDF
data anonymization are robust in the presence of sameAs links, whether explicit or
inferred from additional knowledge.
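
As a sketch of the kind of operation at play here (same hypothetical vocabulary as
above), a safe anonymization may replace IRIs with fresh blank nodes rather than
merely deleting triples, so that joins with external graphs, including joins enabled
by owl:sameAs links, can no longer re-attach the sensitive value to an identified
entity:

    # Replace the patient IRI by a fresh blank node in each sensitive
    # triple. Blank nodes in an INSERT template are instantiated anew
    # for every match, so an external owl:sameAs link on ?p can no
    # longer be used to join back to the anonymized subject.
    PREFIX ex: <http://example.org/>
    DELETE { ?p ex:hasDisease ?d }
    INSERT { _:b ex:hasDisease ?d }
    WHERE  { ?p a ex:Patient . ?p ex:hasDisease ?d . }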

To conclude, we experimentally evaluate the impact of this safety-preserving solution
on given input graphs. We focus on the performance and the utility loss of this
anonymization framework on both real-world and synthetic RDF data. We first
discuss and select utility measures to compare the original graph to its anonymized
counterpart, then define a method to generate new privacy policies by applying
incremental modifications to a reference policy. We finally study the behavior of the
framework on four carefully selected RDF graphs of various sizes and structures. We
show that our anonymization technique is effective with reasonable runtime on fairly
large graphs (several million triples) and is progressive: the more specific the privacy
policy, the smaller its impact. Lastly, we evaluate structural graph-based measures and
analyze their relevance.
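
For instance, one simple query-based measure (a sketch of the general idea, not a
measure prescribed by the thesis) compares the number of answers a workload query
retains after anonymization:

    # Evaluate once on the original graph and once on its anonymized
    # version; the ratio of the two counts estimates the utility
    # preserved for this query (hypothetical vocabulary).
    PREFIX ex: <http://example.org/>
    SELECT (COUNT(*) AS ?answers) WHERE {
      ?p ex:treatedAt ?h .
    }
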
This new approach paves the way for many extensions and, in the long run, for further
work on privacy-preserving data publishing in the context of the Semantic Web and
of Linked Open Data, by providing a simple and efficient way to ensure privacy and
utility in realistic usages of RDF graphs.


Jury:
Nicolas Anciaux, Research Director, INRIA Saclay (Examiner)
Angela Bonifati, Professor, Université Lyon 1 (Co-advisor)
Hamamache Kheddouci, Professor, Université Lyon 1 (President)
Benjamin Nguyen, Professor, INSA Centre-Val de Loire (Reviewer)
Marie-Christine Rousset, Professor, Université Grenoble Alpes (Co-advisor)
Hala Skaf-Molli, Associate Professor, Université de Nantes (Reviewer)
Romuald Thion, Associate Professor, Université Lyon 1 (Co-supervisor)