Thesis of Samaneh Chagheri

A document classification system based on knowledge

Defense date: 27/09/2012

Advisor: Sylvie Calabretto
Coadvisor: Catherine Roussey


This research takes place in an industrial context: the CONTINEW Company. This company ensures the storage and security of critical data and technical documentation. Consequently, these documents must be organized so that critical information can be retrieved quickly. Managing this growing volume of documents requires document classification, which is based on indexing techniques. The more relevant the indexing phase, the more relevant the classification will be.

Technical documentation is by nature strongly structured. For example, the logical structure describes the role and nature of the document elements (introduction, title, section, and so on) and the hierarchical (or logical) links between them. Such structure facilitates document presentation and can improve indexing precision.

Classical information retrieval systems use neither the logical structure nor the knowledge expressed in the textual content of documents. The goal of this thesis is to propose a new semantic indexing model that exploits the logical structure and the semantic content of documents. This proposal will be evaluated on the collection of technical documents provided by the CONTINEW Company.
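
As a purely illustrative sketch of what indexing by both logical structure and semantic content could look like, the snippet below maps terms to concepts through a toy ontology and weights each occurrence by the logical element it appears in. The element weights, the term-to-concept table, and the function name `index_document` are all assumptions made for this example; they are not the model proposed in the thesis.

```python
from collections import defaultdict

# Hypothetical weights per logical element; the values are illustrative only.
ELEMENT_WEIGHTS = {"title": 3.0, "introduction": 2.0, "section": 1.0}

# Hypothetical toy ontology: surface term -> concept label.
TERM_TO_CONCEPT = {
    "pump": "Pump",
    "pumps": "Pump",
    "valve": "Valve",
    "valves": "Valve",
}


def index_document(elements):
    """Return a concept -> weight mapping for (element_type, text) pairs."""
    scores = defaultdict(float)
    for element_type, text in elements:
        # Unknown element types fall back to a neutral weight of 1.0.
        weight = ELEMENT_WEIGHTS.get(element_type, 1.0)
        for token in text.lower().split():
            # Strip common punctuation before the ontology lookup.
            concept = TERM_TO_CONCEPT.get(token.strip(".,;:()"))
            if concept is not None:
                scores[concept] += weight
    return dict(scores)


doc = [
    ("title", "Pump maintenance manual"),
    ("section", "Check the valve before starting the pumps."),
]
index = index_document(doc)
```

In this sketch a concept mentioned in a title counts more than the same concept mentioned in a section body, which is one simple way the logical structure could influence the index passed on to a classifier.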

Keywords: Technical documentation, logical structure, semantic indexing, ontology, classification, document management system