Thesis of Benoît Roussel
Subject:
Start date: 01/12/2023
End date (estimated): 01/12/2026
Advisor: Liming Chen
Summary:
Occlusions (i.e., when objects are partially or completely obscured by other objects) remain a significant barrier to high
performance in scene understanding tasks. This doctoral research project aims to make multi-object tracking (MOT) models
(e.g., for pedestrians and vehicles) robust to occlusions. Occlusions are challenging for two reasons:
(i) Public dataset annotations typically prioritize visible content, which is easier for humans to annotate. This annotation
bias leads to a scarcity of labeled data covering occluded objects.
(ii) Even when the non-visible parts of objects are fully annotated, models struggle to link hidden elements directly to visual
patterns; they must instead rely heavily on contextual cues from the element's spatio-temporal surroundings, which
typically requires significantly more training data. The same phenomenon arises in 3D detection/tracking, which likewise
requires reasoning beyond pixel-based visual patterns.
To address the aforementioned difficulties, a promising approach is to train on very large datasets with little or no human
supervision. One option is to exploit the implicit signals present in the spatio-temporal context of large collections of
unlabeled videos through self-supervised learning. Another is to use synthetic data generated by simulation engines, which
comes with perfect labels (thereby also benefiting the aforementioned 3D tasks). Both offer practically unlimited dataset
sizes: the first emphasizes the quality/realism of the data, the second the quality of the labels. By combining the two, the
goal is to obtain data and labels that are both large-scale and high-quality, thereby enhancing the overall training process
and ultimately improving the performance of scene understanding algorithms in difficult, dense scenarios.
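As a rough, purely illustrative sketch of the kind of label-free spatio-temporal signal mentioned above (this is not the thesis's actual method; all names and the temperature value are assumptions), the toy example below computes a forward-backward cycle-consistency loss on patch embeddings of two consecutive frames: a patch soft-matched from frame t to frame t+1 and back should land on itself, so the composed matching matrix should be close to the identity.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, stabilized by subtracting the row maximum.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cosine_sim(a, b):
    # Cosine similarity matrix between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def cycle_consistency_loss(feat_t, feat_t1, temperature=0.1):
    # feat_t, feat_t1: (num_patches, dim) patch embeddings of frames t, t+1.
    # Soft-match t -> t+1, then t+1 -> t; the composition should be identity.
    fwd = softmax(cosine_sim(feat_t, feat_t1) / temperature)
    bwd = softmax(cosine_sim(feat_t1, feat_t) / temperature)
    cycle = fwd @ bwd                       # t -> t+1 -> t transition matrix
    target = np.eye(len(feat_t))            # perfect cycle = identity mapping
    return float(np.mean((cycle - target) ** 2))
```

No labels appear anywhere: the supervisory signal comes entirely from temporal coherence between frames, which is the appeal for occlusion-heavy video where visible-only annotations are scarce.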