Thesis of Vincent Malleron


Subject:
Enrichment and valorization of a social sciences corpus: electronic edition of Flaubert's dossiers de Bouvard et Pécuchet

Start date: 01/10/2007
End date (estimated): 01/10/2010

Advisor: Hubert Emptoz
Coadvisor: Véronique Eglin

Summary:

The main goal of this work is to develop image processing tools for the valorization and enrichment of a human sciences corpus: Flaubert's dossiers de Bouvard et Pécuchet. The corpus comprises about 3500 handwritten pages, gathered by Flaubert in order to write the second volume of Bouvard et Pécuchet (his posthumous encyclopedia) and dealing with various themes of the 19th century.
Our work aims to develop tools that simplify the use of the documentary base for the electronic edition (automatic region-of-interest (ROI) detection for indexing, layout analysis, alignment of text images with their transcriptions, etc.).
First, we will investigate an automatic and adaptive segmentation process for manuscript images, in order to define regions in each image that correspond to different characteristics (text areas, erasures, images, margins, etc.).
We will also extract recurring typographical features (printed text, multiple writers, handwriting features) and metadata for indexing.
Second, we will focus on structuring the data and on navigation within the corpus, in order to simplify access to its content and to support analysis of the circulation between text fragments.
This work, which lies at the border between the human sciences and the information sciences, will offer a new way of accessing a large documentary corpus representative of the 19th century.