Thesis of Mary Cerullo

Subject:

Extraction, structuring, and integration of heterogeneous data using a human-machine approach

Start date: 04/05/2026
End date (estimated): 04/05/2029

Advisor: Angela Bonifati

Summary:

This research topic focuses on the exploitation of heterogeneous scientific and industrial data, drawn from both the scientific literature (articles, reports, PDF documents) and existing datasets. The primary objective is to design and evaluate robust methods for extracting, structuring, and integrating complex data by combining automated approaches based on artificial intelligence with expert human validation.

The first research challenge concerns the reliable extraction of information from unstructured or weakly structured sources (PDFs, graphs, tables). The work will develop extraction pipelines based on OCR, entity recognition, and relation extraction, and will use large language models (LLMs) and retrieval-augmented generation (RAG) to query documents dynamically, identify relevant passages, and assist in reformulating or completing information. Particular attention will be paid to traceability mechanisms and to alignment with the original sources, so that extracted data remain auditable. The scientific challenge lies in designing extraction systems that adapt to heterogeneous documents while integrating human oversight to control quality and limit automatic errors.
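
As an illustration of the traceability and human-oversight ideas above, the following is a minimal sketch, not the thesis's actual pipeline. It assumes that each extracted value carries a pointer to its source span and a confidence score, and that anything misaligned or below a threshold is routed to an expert. The names `SourceSpan`, `Extraction`, `is_aligned`, `triage`, the file name `report.pdf`, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Location of an extracted value in its source (hypothetical record)."""
    document: str
    page: int
    start: int  # character offset where the value starts in the page text
    end: int    # character offset just past the end of the value


@dataclass
class Extraction:
    field_name: str
    value: str
    span: SourceSpan
    confidence: float        # score reported by the extractor, in [0, 1]
    status: str = "pending"  # becomes "accepted" or "needs_review"


def is_aligned(extraction: Extraction, page_text: str) -> bool:
    """Return True if the value appears verbatim at the recorded span."""
    span = extraction.span
    return page_text[span.start:span.end] == extraction.value


def triage(extraction: Extraction, page_text: str, threshold: float = 0.9) -> Extraction:
    """Accept only aligned, high-confidence values; send the rest to an expert."""
    if is_aligned(extraction, page_text) and extraction.confidence >= threshold:
        extraction.status = "accepted"
    else:
        extraction.status = "needs_review"
    return extraction


# Assumed example data: one page of text and one candidate value.
page_text = "The melting point of compound A is 431 K."
start = page_text.find("431 K")
candidate = Extraction(
    field_name="melting_point",
    value="431 K",
    span=SourceSpan("report.pdf", page=3, start=start, end=start + len("431 K")),
    confidence=0.95,
)
print(triage(candidate, page_text).status)
```

Because the span is checked against the source text before acceptance, a value that the extractor altered or hallucinated would fail the alignment test and be queued for review, whatever its confidence score.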

The second challenge concerns the standardization and structuring of the extracted data in the form of a property graph. The work will aim to define methods for the systematic transformation of textual and tabular data into a formal schema, relying on the PG-Schema and PG-Keys formalisms to ensure the consistency, uniqueness, and integrity of entities and relationships. This includes the semantic alignment of entities, the normalization of attributes (types, units, formats), and the integration of numerical data from structured tables as attributes or explicit relationships within the graph.
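
To make the structuring step more concrete, here is a small sketch in plain Python. It does not use PG-Schema or PG-Keys syntax; it only imitates the effect of a key constraint (uniqueness of a node under a set of properties) and of unit normalization when a table row is loaded into a graph. The class `PropertyGraph`, the exception `KeyViolation`, the helper `to_kelvin`, the labels, and the sample row are all assumptions made for illustration.

```python
class KeyViolation(ValueError):
    """Raised when a node would break a key constraint."""


class PropertyGraph:
    def __init__(self, keys):
        self.keys = keys      # label -> tuple of property names forming the key
        self.nodes = {}       # node id -> (label, properties)
        self.edges = []       # (source id, relationship type, target id)
        self._key_index = {}  # (label, key values) -> node id

    def add_node(self, label, props):
        key_props = self.keys.get(label, ())
        missing = [p for p in key_props if p not in props]
        if missing:
            raise KeyViolation(f"{label} is missing key properties {missing}")
        key = (label, tuple(props[p] for p in key_props))
        if key_props and key in self._key_index:
            raise KeyViolation(f"duplicate {label} with key {key[1]}")
        node_id = len(self.nodes)
        self.nodes[node_id] = (label, dict(props))
        if key_props:
            self._key_index[key] = node_id
        return node_id

    def add_edge(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))


def to_kelvin(value, unit):
    """Normalize a temperature to kelvin; only two units are handled here."""
    if unit == "K":
        return float(value)
    if unit == "degC":
        return float(value) + 273.15
    raise ValueError(f"unsupported unit: {unit}")


# Assumed table row, as it might come out of an extracted table.
row = {"material": "Compound A", "melting_point": 158.0, "unit": "degC"}

# Material nodes are identified by their name (the assumed key).
graph = PropertyGraph(keys={"Material": ("name",)})
material = graph.add_node("Material", {"name": row["material"]})
measurement = graph.add_node(
    "Measurement",
    {"quantity": "melting_point",
     "value_kelvin": to_kelvin(row["melting_point"], row["unit"])},
)
graph.add_edge(material, "HAS_MEASUREMENT", measurement)
```

Here the numerical value becomes a separate `Measurement` node linked by an explicit relationship; storing it directly as a property of the `Material` node would be the other option mentioned above. Adding a second `Material` named "Compound A" raises `KeyViolation`.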

Finally, the third research focus concerns the integration and reconciliation of multiple datasets within a centralized property graph. The research will focus on identifying and merging entities that represent the same real-world object, as well as on selecting and implementing integration strategies such as Global-as-View or Local-as-View. An original focus of the work will be the extension of view maintenance and trigger mechanisms for property graphs, to ensure the graph's consistency as data and sources evolve.

This research topic thus contributes to the advancement of methods for managing complex data at the interface between artificial intelligence, human-computer interaction, and databases, with direct implications for the scientific and industrial exploitation of sensitive and heterogeneous data.
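
The entity merging and Global-as-View integration named in the third research focus can be sketched as follows. This is a toy illustration under strong assumptions, not the method the thesis will develop: two hypothetical sources with different schemas, one mapping function per source that expresses the global `Material` entity as a query over that source (the Global-as-View idea), and a naive match on normalized names to merge records describing the same object. All names and sample values are invented for the example.

```python
from itertools import chain


def normalize_name(name):
    """Naive matching key: lowercase, letters and digits only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def view_over_lab_records(rows):
    """GAV mapping for the first source: rename its fields to the global schema."""
    for row in rows:
        yield {"name": row["name"], "supplier": row["supplier"]}


def view_over_catalog(rows):
    """GAV mapping for the second source, which uses different field names."""
    for row in rows:
        yield {"name": row["product"], "density_g_cm3": row["density"]}


def global_materials(lab_records, catalog):
    """Union of the source views, with records merged on the matching key."""
    merged = {}
    records = chain(view_over_lab_records(lab_records), view_over_catalog(catalog))
    for record in records:
        entity = merged.setdefault(normalize_name(record["name"]), {})
        for prop, value in record.items():
            entity.setdefault(prop, value)  # on conflict, the first source wins
    return list(merged.values())


# Assumed sample sources.
lab_records = [{"name": "Compound A", "supplier": "Lab 1"}]
catalog = [
    {"product": "compound-a", "density": 1.2},
    {"product": "Compound B", "density": 0.9},
]
for entity in global_materials(lab_records, catalog):
    print(entity)
```

In this sketch "Compound A" and "compound-a" collapse into a single entity that carries properties from both sources. Exact matching on a normalized string is a deliberate simplification; realistic entity resolution would need similarity measures and, in the spirit of the human-machine approach, expert confirmation of uncertain matches.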