Thesis of Jocelyn Bernard

Manage and analyze large data graphs of named entities

Defense date: 31/12/2019

Advisor: Hamamache Kheddouci


This PhD project focuses on the development of methods, algorithms and heuristics for the generation, management, exploration and analysis of large data graphs of named entities obtained from text data (hypothesis). This body of hypothesis is regularly supplied by new hypothesis. The hypothesis intersect between them and denote several types of interactions between the named entities. Consequently, large graphs that will be studied are dynamic and means to reach billion of nodes.
Our goal at first is to construct a graph grammar that can generate graphs of data from this set of hypotheses. These graphs can be (multi-)graphs, directed or weighted at nodes and / or arcs. They may contain an extremely large number of nodes (in the order of billion of nodes). Moreover, these graphs are dynamic and change with the arrival of new hypotheses in the corpus.
Secondly, we shall have to propose efficient algorithms for managing these graphs with the aim their scalability. In particular, we will focus on two important issues: (1) the graph partitioning and (2) the distribution of graphs on distributed architectures. The graph partitioning problem is NP-hard even for particular classes of graphs. Furthermore, methods of partitioning in existing platforms of large graphs are basic and independent of the considered applications. We will have in this thesis to develop efficient heuristics able to partition our graphs with specific constraints linked to types of queries that we want treat and to the constraints to the platform as load balancing, synchronization, fault tolerance, etc. The obtained partition will then be distributed on a architecture as Hadoop / MapReduce. A distribution policy is necessary for the exploration of the graph and its analysis do effectively and with a reasonable time.
Finally, the last step of the thesis is devoted to the analysis of these large graphs. he goal is to use advanced techniques in graph theory and algorithms to explore, analyze and reason about the large graphs of named entities. In particular, we will study the dynamics of these graphs to understand its evolution, detect possible change, its causes, impact, etc.