Thesis of Farah Harrathi


Subject:
Automatic extraction of concepts and semantic relations from multilingual documents

Defense date: 01/10/2007

Advisor: Sylvie Calabretto

Summary:

The thesis deals with multilingual document indexing and retrieval and with the semantic representation of textual documents. Semantic indexing aims to extract the knowledge contained in a text by identifying its concepts and the relationships between them. Current semantic indexing approaches are not suited to large multilingual corpora: the existing methods for indexing multilingual documents are manual, which makes large-scale corpora difficult to process. The aim of this thesis is to propose an automatic indexing method that identifies, in multilingual corpora, the most relevant concepts representing a document's content and the relationships between those concepts. We have proposed an automatic method for extracting concepts from large multilingual corpora. It combines statistical and linguistic techniques and relies on mutual information, word frequency, and textual distance. This method has been validated experimentally on various multilingual corpora. We have also proposed a method for extracting relationships between concepts using semantic resources. We are currently validating this second method through experiments with the multilingual ontology EUROVOC.
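
To make the statistical ingredients named in the summary more concrete, here is a minimal sketch of how mutual information, word frequency, and textual distance can be combined to rank candidate multi-word terms. It is an illustration only and not the method developed in the thesis: the function name `candidate_terms`, the parameters `max_distance` and `min_count`, and the use of pointwise mutual information over word pairs are all assumptions made for this example.

```python
import math
from collections import Counter


def candidate_terms(tokens, max_distance=1, min_count=2):
    """Rank ordered word pairs by pointwise mutual information (PMI).

    Hypothetical illustration, not the thesis's algorithm.
    - max_distance: how many positions apart two words may be
      (a simple stand-in for "textual distance").
    - min_count: minimum pair frequency, to discard rare pairs
      whose PMI would be unreliable.
    """
    n = len(tokens)
    word_counts = Counter(tokens)

    # Count ordered pairs of words that occur within max_distance.
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + max_distance, n)):
            pair_counts[(left, tokens[j])] += 1

    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue
        p_ab = count / total_pairs
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    text = ("semantic indexing of documents uses semantic indexing "
            "and document retrieval uses semantic indexing")
    for pair, score in candidate_terms(text.split(), min_count=3):
        print(pair, round(score, 2))
```

Because the sketch uses only token counts and positions, it does not depend on any one language, which is the property that makes purely statistical scoring attractive for multilingual corpora. A real system would add the linguistic filtering mentioned in the summary, which is not attempted here.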