Thesis of Clément Sage

Information Extraction in numerized, semi-structured and multilingual documents

Defense date: 1 October 2021

Advisor: Alexandre Aussem
Co-advisors: Véronique Eglin, Haytham Elghazel


This thesis addresses information extraction from business documents that are scanned or born-digital and possibly multilingual. Efficiently extracting information from documents issued by their partners is crucial for companies that face huge daily document flows. Yet automating this extraction is challenging because business documents are semi-structured: an instance of a given document class, such as an invoice or a purchase order, must contain a predefined set of information to retrieve, but the position and textual representation of that information are unconstrained.

Inspired by work in the Natural Language Processing (NLP) community, particularly on named entity recognition, this thesis proposes several approaches based on recurrent neural networks (RNNs) that iterate over the document words retrieved by an Optical Character Recognition (OCR) engine.

Jury:
Mr Doucet Antoine | Professor | Université La Rochelle | Reviewer
Ms Lemaitre Aurélie | Associate Professor | Université Rennes 2 | Reviewer
Ms Belaïd Yolande | Associate Professor (maître de conférences) | Université de Lorraine | Examiner
Ms Faci Noura | Associate Professor | Université Claude Bernard Lyon 1 | Examiner
Mr Paquet Thierry | Professor | Université de Rouen et Normandie | President
Mr Aussem Alexandre | Professor | Université Claude Bernard Lyon 1 | Thesis advisor
Ms Eglin Véronique | Professor | INSA Lyon | Co-advisor
Mr Elghazel Haytham | Associate Professor (maître de conférences) | Université Claude Bernard Lyon 1 | Co-advisor
Mr Bérard Jean-Jacques | Research Director | Esker | Guest
Mr Espinas Jérémy | Researcher | Industrial supervisor, Esker | Guest