Thesis of Devashish Lohani

Unsupervised deep learning for spatio-temporal representations of videos: application to video surveillance

Defense date: 03/04/2023

Advisor: Laure Tougne Rodet
Coadvisor: Carlos Crispim-Junior
Cotutelle: Lionel Robinault


In the last two decades, we have witnessed a massive increase in the number of surveillance cameras in our surroundings. One of the most important uses of these cameras is to detect suspicious or abnormal behaviors, e.g., a moving truck in a pedestrian zone or an intruder entering a prohibited site. These abnormal events occur very rarely, so it is extremely tedious and difficult for professionals to monitor the video constantly and attentively to find them. Therefore, an automatic video analysis system is essential. Traditional systems struggle to generalize across different types of anomalies, often rely on handcrafted rules, and cannot adapt to abnormal events that they have never seen before. In the past few years, we have seen tremendous progress in deep learning based video surveillance systems. These systems learn representative features from the data itself and generalize across different scenes and anomalies. That is why, in this thesis, we explore deep learning based methods. The majority of these methods in automatic video analysis are supervised, i.e., they require a large volume of labeled data. But since abnormal events depend on context and are rare, it is very difficult to obtain labeled anomalous data beforehand, and even if some annotated data for abnormal events exists, it will always be a small portion compared to normal data. Furthermore, one cannot annotate every possible event that might occur in the future. So, we require approaches that can work without labeled data. Since these events occur in videos, they can have both spatial and temporal dimensions. Therefore, the approach must be able to learn pertinent spatio-temporal representations to differentiate abnormal events from normal ones.

Thus, in this PhD, we aim to learn spatio-temporal representations from unlabeled videos to detect abnormal events. Precisely, we address the task of video anomaly detection and its sub-task, perimeter intrusion detection. We provide mathematical definitions of these tasks because they were not clearly defined in the literature. The definitions have a direct impact on the evaluation, and we therefore propose new, suitable evaluation schemes. Concerning spatio-temporal representation learning without annotations, we propose two approaches. In the first approach, we designed a strided 3D convolutional autoencoder network and used it for the perimeter intrusion detection task. The main idea is to learn a representation of normality from training data without intrusions (or anomalies) and to detect intrusions (or anomalies) as deviations from the learned normality. It worked well on short videos but struggled on long videos, which exhibit changes in scene dynamics such as weather and lighting. To address this problem, we introduced an adaptive thresholding approach using a moving z-score. Our extensive experiments showed the viability of our approach in comparison with other existing methods. To further improve the spatio-temporal comprehension of normality, we introduced our second approach. It consists of a framework that leverages unsupervised and self-supervised learning in an autoencoder. To be precise, we proposed multiple, carefully designed tasks (unsupervised and self-supervised) that are performed in a single autoencoder. This method is trained jointly and end-to-end on training data without anomalies or intrusions. For detecting anomalies (or intrusions), each task provides an anomaly score, and the combined score is used for the final detection. This approach is generic and was applied to both tasks. We obtained state-of-the-art results on all major public datasets for both the video anomaly detection and perimeter intrusion detection tasks.
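
The adaptive thresholding idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the thesis, so everything here is an assumption made for illustration: the function name `moving_zscore_flags`, the window length, the z-score cutoff, and the use of a per-frame anomaly score (for example, an autoencoder reconstruction error) are all hypothetical. The sketch compares each score with the mean and standard deviation of the preceding window, so a slow drift in the baseline (such as a gradual lighting change) does not trigger a detection, while a sudden jump does.

```python
import numpy as np


def moving_zscore_flags(scores, window=50, z_thresh=3.0, eps=1e-8):
    """Flag frames whose score deviates strongly from the recent past.

    Hypothetical helper, not the thesis implementation.
    scores   : 1D sequence of per-frame anomaly scores (assumed input).
    window   : number of preceding frames used as the local reference.
    z_thresh : z-score above which a frame is flagged.
    """
    scores = np.asarray(scores, dtype=float)
    flags = np.zeros(len(scores), dtype=bool)
    # The first `window` frames have no full history and are never flagged.
    for t in range(window, len(scores)):
        history = scores[t - window:t]  # preceding frames only
        mu = history.mean()
        sigma = history.std()
        z = (scores[t] - mu) / (sigma + eps)  # eps avoids division by zero
        flags[t] = z > z_thresh
    return flags


# Synthetic example: a slowly drifting baseline with one sudden jump.
frames = np.arange(200)
scores = np.linspace(1.0, 2.0, 200) + 0.01 * np.sin(frames)
scores[150] += 1.0  # simulated abnormal frame
flags = moving_zscore_flags(scores)
```

In this synthetic example a single fixed threshold would have to sit above the end of the drifting baseline, whereas the moving z-score judges every frame relative to its own recent history. The window length trades off adaptation speed against sensitivity: a short window follows scene changes quickly but can also absorb a long-lasting anomaly into the reference statistics.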

Keywords: deep learning, computer vision, unsupervised learning, self-supervised learning, video surveillance, video anomaly detection, perimeter intrusion detection

Mr. Nicolas THOME, Professor, CNAM Paris, Reviewer
Mr. Thierry CHATEAU, Professor, Université Clermont Ferrand II, Reviewer
Mr. François BRÉMOND, Research Director, Inria Sophia Antipolis, President
Ms. Jenny BENOIS-PINEAU, Professor, Université de Bordeaux, Examiner
Ms. Laure TOUGNE RODET, Professor, Université Lyon 2, Thesis advisor
Mr. Carlos CRISPIM-JUNIOR, Associate Professor, Université Lyon 2, Co-advisor
Mr. Lionel ROBINAULT, Doctor, Foxstream, Co-advisor