Thesis of Arnaud Renard

Subject:

Models of semantic information retrieval in structured Web documents

Abandoned thesis: 01/09/2012

Advisor: Sylvie Calabretto

Summary:

Today’s society increasingly relies on tools and practices related to information technologies, largely because of the evolution of communication infrastructures. As a result, the difficulty no longer lies in the availability of information but in giving each user access to the information that is relevant to them. To help with information management, the Web is evolving along two trends.
The first trend is the wider availability of structured data: large amounts of data that were formerly stored in flat text files are now frequently stored in (semi-)structured XML-based files. This is why we chose to work with this kind of document.
The second trend brings semantic-aware techniques that aim at a better machine-level understanding of these data. Semantics is one of the greatest challenges in the evolution of Information Retrieval (IR) systems. Using semantics in IR systems can be an efficient way to solve data heterogeneity problems, both in content and in structure (documents that follow neither the same DTD nor the same XML schema). Addressing this challenge usually requires an additional external semantic resource related to the document collection, together with semantic similarity measures to exploit that resource and compare concepts. Similarity measures assume that the concepts related to terms have been identified without ambiguity. Misspelled terms therefore interfere with the term-to-concept matching process.
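To illustrate what a semantic similarity measure computes, here is a minimal sketch of one well-known measure, Wu-Palmer similarity, applied to a toy is-a hierarchy. The choice of measure, the taxonomy, and all names (`PARENT`, `path_to_root`, `wu_palmer`) are assumptions made for illustration; the summary does not state which measures or semantic resources the thesis relies on.

```python
# Toy is-a taxonomy (hypothetical, for illustration only): child -> parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth of a concept, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer(c1, c2):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(c1) + depth(c2))."""
    ancestors_c1 = set(path_to_root(c1))
    # Walking upward from c2, the first concept that also subsumes c1
    # is the lowest common subsumer (lcs).
    lcs = next(a for a in path_to_root(c2) if a in ancestors_c1)
    return 2 * depth(lcs) / (depth(c1) + depth(c2))


if __name__ == "__main__":
    print(wu_palmer("dog", "cat"))  # siblings under "animal"
    print(wu_palmer("dog", "car"))  # only share the root
```

Note that the function takes concepts, not raw terms, as input: a misspelled term that cannot be mapped to a key of the taxonomy never reaches the similarity computation at all.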
Existing semantic-aware Information Retrieval systems rely on basic concept identification and do not account for uncertainty in term spelling. As a first step, our goal is to improve results by taking into account common mistakes in indexed documents, such as typos and misspelled words. This kind of problem affects many Web 2.0 applications as well as e-mails and forums.
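One simple way to tolerate misspellings during term-to-concept matching is to accept the closest concept label within a bounded edit distance. The sketch below is only an illustration of that idea, not the model developed in the thesis: the label table `CONCEPT_LABELS`, the concept identifiers, and the `max_distance` threshold are all hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


# Hypothetical label -> concept identifier table standing in for a
# real semantic resource.
CONCEPT_LABELS = {
    "retrieval": "C_RETRIEVAL",
    "semantic": "C_SEMANTIC",
    "document": "C_DOCUMENT",
}


def match_concept(term, max_distance=2):
    """Return (concept_id, distance) for the closest label, or None if
    no label is within `max_distance` edits of the term."""
    term = term.lower()
    label, distance = min(
        ((candidate, levenshtein(term, candidate)) for candidate in CONCEPT_LABELS),
        key=lambda pair: pair[1],
    )
    if distance > max_distance:
        return None
    return CONCEPT_LABELS[label], distance


if __name__ == "__main__":
    for term in ("retreival", "Semantic", "banana"):
        print(term, "->", match_concept(term))
```

Returning the distance alongside the concept lets a later scoring stage treat an approximate match as less certain than an exact one, rather than discarding the spelling uncertainty once a concept has been chosen.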
To measure the expected gains, we plan to evaluate our models on several datasets:
- INEX (Wikipedia XML)
- TREC (Confusion Track)