Thesis of Devashish Lohani

Unsupervised deep learning of spatio-temporal representations for video


Identifying actions or series of events based on experience is a crucial part of human decision-making. Recent progress in deep learning allows us to perform automated video analysis, but most of these algorithms rely on huge amounts of labelled data (supervised learning). Using unlabelled videos, we want to learn a deep network that encodes video representations in an unsupervised way. The goal is to capture the spatio-temporal nature of videos in a single model, instead of addressing the spatial (image) and temporal dimensions independently, as in prior work. In the same way that the first layers of 2D convolutional networks encode local descriptors adapted to images, we want to learn spatio-temporal descriptors that allow us to model video events. Once learned, these descriptors may be used for supervised tasks, such as action recognition, or for unsupervised tasks, such as intrusion or anomaly detection.
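
To illustrate the analogy with 2D convolutional layers, the sketch below applies a single hand-crafted spatio-temporal (3D) filter to a toy video volume. This is a minimal, hypothetical example: the function name, the kernel, and the toy data are illustrative assumptions, not the thesis's actual model, where such descriptors would be learned rather than fixed.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution of a (T, H, W) video volume
    with a (t, h, w) spatio-temporal kernel."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# A temporal-difference kernel: responds only where the content
# changes between consecutive frames, i.e. where there is motion.
motion_kernel = np.zeros((2, 3, 3))
motion_kernel[0] = -1.0 / 9
motion_kernel[1] = 1.0 / 9

# Toy video: a bright patch that appears at frame 2 and then stays static.
video = np.zeros((5, 8, 8))
video[2:, 3:5, 3:5] = 1.0

response = conv3d_valid(video, motion_kernel)
print(response.shape)  # (4, 6, 6): valid convolution shrinks every axis
```

The filter fires only at the frame where the patch appears and stays silent on the static frames, which is the kind of joint spatial-temporal selectivity a learned 3D descriptor would capture, rather than treating frames as independent images.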

Advisor: Laure Tougne
Co-advisor: Carlos Crispim-Junior