Thesis of Pierre-Yves Genest

Subject:

Unsupervised Open-World Information Extraction From Unstructured and Domain-Specific Document Collections

Start date: 01/12/2021
Defense date: 09/12/2024

Advisor: Előd Egyed-Zsigmond

Summary:

The exponential growth in data generation has rendered the effective analysis of unstructured textual document collections a critical challenge. This PhD thesis aims to address this challenge by focusing on Information Extraction (IE), which encompasses four essential tasks: Named Entity Recognition (NER), Coreference Resolution (CR), Entity Linking (EL), and Relation Extraction (RE). These tasks collectively enable extracting and structuring knowledge from unformatted documents, facilitating its integration into structured databases for further analytical processes.

Our first contribution is Linked-DocRED, the first large-scale, diverse, and manually annotated dataset for document-level IE. This dataset enriches the existing DocRED dataset with high-quality entity linking labels. Additionally, we propose a novel set of metrics for evaluating end-to-end IE models. Evaluating baseline models on Linked-DocRED highlights three challenges inherent to document-level IE: cascading errors, long-context handling, and information scarcity.

We then introduce PromptORE, an unsupervised and open-world RE model. Adapting the prompt-tuning paradigm, PromptORE embeds and clusters relations without requiring fine-tuning or hyperparameter tuning, a dependence that was a major weakness of previous baselines, and it significantly outperforms state-of-the-art models. This method demonstrates the feasibility of extracting semantically coherent relation types in an open-world context.
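
As a rough illustration of the embed-then-cluster idea, the sketch below builds a cloze-style prompt for an entity pair and groups relation embeddings with a plain k-means. The template wording, the `build_prompt` and `kmeans` helpers, and the toy 2-D vectors are all assumptions made for illustration; in practice the vectors would come from a pretrained language model, and this is not the implementation described in the thesis. The number of clusters is fixed by hand here only to keep the toy small.

```python
import math


def build_prompt(sentence, head, tail, mask_token="[MASK]"):
    """Append a cloze-style template (hypothetical wording) to the sentence."""
    return f"{sentence} {head} {mask_token} {tail}."


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    # Deterministic farthest-first initialisation (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# The prompt that would be fed to a masked language model.
prompt = build_prompt("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw")

# Toy stand-ins for relation embeddings (hypothetical values, not model output).
toy_embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(toy_embeddings, k=2)
```

In this toy setting, the first two vectors end up in one cluster and the last two in another, which stands in for grouping instances of the same relation type.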

Further extending our prompt-based approach, we develop CITRUN for unsupervised and open-world NER. By employing contrastive learning with off-domain labeled data, CITRUN improves entity type embeddings. It surpasses LLM-based unsupervised NER models and achieves performance competitive with zero-shot models that rely on more supervision.
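
To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style loss, a common generic contrastive objective. The abstract does not state which loss CITRUN uses, so the choice of InfoNCE, the `info_nce` helper, the temperature value, and the toy 2-D entity embeddings are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the anchor is closer to the positive
    than to every negative, large otherwise."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp for the softmax denominator.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical embeddings: two mentions of one entity type and one of another.
person_a = [1.0, 0.1]
person_b = [0.9, 0.2]
location = [0.1, 1.0]

# Pairing same-type mentions gives a lower loss than pairing different types.
well_paired = info_nce(person_a, person_b, [location])
badly_paired = info_nce(person_a, location, [person_b])
```

Minimizing such a loss over labeled pairs pulls embeddings of the same entity type together and pushes different types apart, which is the general effect the paragraph above attributes to contrastive learning.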

These advancements facilitate meaningful knowledge extraction from unstructured documents, addressing practical, real-world constraints and enhancing the applicability of IE models in industrial contexts.

Jury:

Ms. Gianini Gabriele	Professor	Università degli Studi di Milano-Bicocca	Reviewer
Mr. Granitzer Michael	Professor	Universität Passau	Reviewer
Ms. Calabretto Sylvie	Professor	LIRIS INSA Lyon	Examiner
Ms. Mothe Josiane	Professor	INSPÉ Toulouse Occitanie-Pyrénées	Examiner
Mr. Egyed-Zsigmond Előd	Associate Professor	LIRIS INSA Lyon	Thesis advisor