Thesis of Moez Baccouche


Subject:
Automatic detection of deformable visual objects and automatic video indexing

Defense date:

Advisor: Atilla Baskurt
Co-supervision: Franck Mamalet

Summary:

Multimedia content indexing currently relies on global descriptors built from digital signatures intended to summarize the image content in terms of the distribution of light intensity, color or texture. These descriptive signatures, used as indexes, consist of low-level measures that are close to the image signal and particularly sensitive to noise. Although such descriptors are useful for comparing multimedia documents, they cannot semantically describe their content and are difficult for a user to handle when searching for a specific document. Search engines based on linguistic queries, however, require the detection of high-level indexes closer to the concept of visual objects, such as faces, human bodies or buildings, to name but a few examples. They also require a categorization of video segments, i.e. an automatic recognition of their content: news, commercials, football, etc.
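As a rough illustration of such low-level global descriptors, the sketch below builds a normalized color histogram as an image signature and compares two images with an L1 distance. The function names, bin count and use of NumPy are assumptions made for the example only; they are not taken from the thesis.

```python
# Minimal sketch of a global color-histogram signature (illustrative only).
import numpy as np

def color_signature(image, bins=8):
    """Global signature: normalized joint RGB histogram of the whole image.

    `image` is an (H, W, 3) uint8 array; the result is a flat vector whose
    entries sum to 1, summarizing the color distribution of the image.
    """
    hist, _ = np.histogramdd(
        image.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.flatten()
    return hist / hist.sum()

def signature_distance(sig_a, sig_b):
    """Compare two signatures with an L1 distance (smaller = more similar)."""
    return float(np.abs(sig_a - sig_b).sum())

# Usage: two random "images" stand in for real frames.
img_a = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
img_b = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
print(signature_distance(color_signature(img_a), color_signature(img_b)))
```

Such a signature is cheap to compute and compare, but, as noted above, it stays close to the image signal and carries no semantic information about the objects actually present in the scene.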

This PhD aims to semantically categorize video segments obtained from the automatic detection of shots and from a macro-segmentation based on inter-program detection. First, we will focus on developing new techniques for modeling and localizing objects of interest based only on their visual appearance, without a priori modeling or heuristic filtering, but through automatic learning from samples directly extracted from images. This work will follow previous activities conducted at France Telecom R&D based on neural models. We will focus on the detection and recognition of deformable objects by jointly considering texture and motion in a video. An example application is the detection and tracking of moving objects such as faces in TV news or players in sports videos.

Then, we will focus on the automatic recognition of a video segment's theme. To do this, we will build on previous research work aimed at categorizing collections of still images and extend it to video. In this case, each video frame will be processed globally to produce a signature combining color, texture and motion measures that summarize its content, as sketched below. Robust statistical and neural learning techniques will be implemented to categorize the content against an example database of the given concepts.
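The following sketch loosely illustrates the kind of per-frame signature and example-based categorization described above. It is not the pipeline of the thesis: the specific color, texture and motion measures, the nearest-neighbor rule and all names are assumptions chosen to keep the example short and self-contained.

```python
# Illustrative sketch: per-frame color/texture/motion signature and
# nearest-neighbor categorization against an example database.
import numpy as np

def frame_signature(frame, prev_frame):
    """Summarize one frame: mean color, a gradient-based texture measure,
    and a motion measure from the difference with the previous frame."""
    gray = frame.mean(axis=2)
    color = frame.reshape(-1, 3).mean(axis=0) / 255.0            # 3 color means
    gy, gx = np.gradient(gray)
    texture = np.hypot(gx, gy).mean() / 255.0                    # edge energy
    motion = np.abs(gray - prev_frame.mean(axis=2)).mean() / 255.0
    return np.concatenate([color, [texture, motion]])

def segment_signature(frames):
    """Average the per-frame signatures over a whole video segment."""
    sigs = [frame_signature(f, p) for p, f in zip(frames, frames[1:])]
    return np.mean(sigs, axis=0)

def categorize(segment, examples):
    """Assign the segment to the concept of its closest labeled example.
    `examples` maps a concept label to a list of segment signatures."""
    query = segment_signature(segment)
    label, _ = min(
        ((lbl, np.linalg.norm(query - sig))
         for lbl, sigs in examples.items() for sig in sigs),
        key=lambda pair: pair[1],
    )
    return label

# Usage with synthetic frames standing in for decoded video.
rng = np.random.default_rng(0)
seg = [rng.integers(0, 256, (72, 96, 3)).astype(np.uint8) for _ in range(10)]
examples = {"news": [segment_signature(seg)],
            "sports": [segment_signature(seg[::-1])]}
print(categorize(seg, examples))
```

In the thesis itself, this simple distance-based rule would be replaced by the robust statistical and neural learning techniques mentioned above, trained on an example database of the target concepts.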