Thesis of Pierre-Yves Genest


Subject:
Unsupervised Open-World Information Extraction From Unstructured and Domain-Specific Document Collections

Start date: 01/12/2021
End date (estimated): 01/12/2024

Advisor: Elod Egyed-Zsigmond

Summary:

The exponential growth in data generation has rendered the effective analysis of unstructured textual document collections a critical challenge. This PhD thesis aims to address this challenge by focusing on Information Extraction (IE), which encompasses four essential tasks: Named Entity Recognition (NER), Coreference Resolution (CR), Entity Linking (EL), and Relation Extraction (RE). These tasks collectively enable extracting and structuring knowledge from unformatted documents, facilitating its integration into structured databases for further analytical processes.

Our contributions start with creating Linked-DocRED, the first large-scale, diverse, and manually annotated dataset for document-level IE. This dataset enriches the existing DocRED dataset with high-quality entity linking labels. Additionally, we propose a novel set of metrics for evaluating end-to-end IE models. The evaluation of baseline models on Linked-DocRED highlights the complexities and challenges inherent to document-level IE: cascading errors, long context handling, and information scarcity.

We then introduce PromptORE, an unsupervised and open-world RE model. Adapting the prompt-tuning paradigm, PromptORE achieves relation embedding and clustering without requiring fine-tuning or hyperparameter tuning (a major weakness of previous baselines) and significantly outperforms state-of-the-art models. This method demonstrates the feasibility of extracting semantically coherent relation types in an open-world context.

Further extending our prompt-based approach, we develop CITRUN for unsupervised and open-world NER. By employing contrastive learning with off-domain labeled data, CITRUN improves entity type embeddings, surpassing LLM-based unsupervised NERs, and achieving competitive performance against zero-shot models that are more supervised.

These advancements facilitate meaningful knowledge extraction from unstructured documents, addressing practical, real-world constraints and enhancing the applicability of IE models in industrial contexts.