Thesis of Louisa Kessi

Subject:

Automatic Modeling and Recognition of Heterogeneous Logical Structures from Digitized Business Documents.

Start date: 11/03/2013
Defense date: 10/03/2017

Advisor: Christophe Garcia
Co-advisor: Frank Lebourgeois

Summary:

Every day, millions of documents are processed by large enterprises, governments, and small and medium-sized enterprises, at an exorbitant cost when the processing is done manually. Automatic Document Recognition (ADR) is a software solution that automatically reads scanned documents and extracts useful information, so that it can feed information systems and the documents can be processed rapidly. Within this framework, we aim to develop a system that recognizes document structures by image analysis, in order to find the logical function of each block of text and the hierarchical organization of the information.
In the context of dematerializing a flow of documents with heterogeneous contents and forms, the recognition system should be generic enough to address all possible types of documents without any a priori knowledge of their contents. It must also overcome the challenge of scaling up, which arises from the heterogeneity of document structures, the high variability of document contents, and the great variability in the way information is organized. We also want to use learning methods able to identify the information that should be retained, so that the knowledge base generalizes better and does not specialize to specific cases. The research will focus mainly on automatic modeling, by learning specific models for recurring particular documents and generic models for any document. To achieve these objectives, the study will give priority to probabilistic models for representing spatial relations between graphical and textual information. The models will rely on the physical structure of documents and the analysis of the information organization.
The evaluation of the modeling will cover both the reduction in human intervention needed for parameterization and error correction, and the performance of the system on a flow of heterogeneous documents. This evaluation will assess at the same time the automaticity, the scalability, and the genericity of the model. A strong constraint will be the realization of a generic recognition system able to decode all the structures of all documents, which is a rarely studied research problem. This is explained by the difficulty of the subject and the lack of sufficiently large databases for a large-scale assessment. As a result, most research worldwide reports experiments on a very small scale, on homogeneous and regular documents.