Thesis of Aurélien Cecille
Subject:
Start date: 01/02/2023
End date (estimated): 01/02/2026
Advisor: Stefan Duffner
Coadvisor: Franck Davoine
Summary:
This thesis studies self-supervised monocular depth estimation, with a particular focus on camera-based side mirrors (camera monitor systems) used on trucks and buses. In this setting, large amounts of unlabeled video are available, while depth annotations are costly to acquire. Self-supervised learning therefore provides an attractive training paradigm: it exploits geometric consistency between consecutive images instead of relying on ground-truth depth maps. However, despite strong progress in recent years, standard self-supervised monocular methods still face important limitations for practical deployment: they do not recover metric scale reliably, and they produce blurred depth discontinuities at object boundaries.
The first contribution of this thesis addresses scale ambiguity. The proposed self-supervised method injects ground information into the training process by using known camera parameters to derive an analytic depth prior under a flat-ground assumption. This prior is combined with a learned attention mechanism that identifies where it is trustworthy in the image. A dedicated loss formulation couples ground selection and scale recovery, allowing the network to learn metrically meaningful depth without requiring depth supervision or ground mask annotations. Experiments on KITTI show competitive metric depth performance among self-supervised approaches, while additional evaluations demonstrate stronger robustness to camera pose changes and improved zero-shot transfer to unseen cameras and datasets such as DDAD.
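The flat-ground prior can be made concrete with standard pinhole geometry. The sketch below is an illustration, not the thesis's actual formulation: it assumes a level camera at known height above the ground, with the image y-axis pointing down, so a pixel in row v back-projects to a ray with vertical slope (v - cy) / fy, and intersecting that ray with the ground plane yields the depth fy * h / (v - cy), defined only below the horizon.

```python
import numpy as np

def flat_ground_depth(fy, cy, cam_height, rows):
    """Analytic depth prior for pixels assumed to lie on flat ground.

    Illustrative sketch (level pinhole camera, y-axis down):
      - fy, cy: vertical focal length and principal point (pixels)
      - cam_height: camera height above the ground plane (meters)
      - rows: image row coordinates v of the pixels of interest

    A ray through row v has vertical slope (v - cy) / fy; intersecting it
    with the ground plane gives depth z = fy * cam_height / (v - cy).
    Rows at or above the horizon (v <= cy) get an infinite prior.
    """
    rows = np.asarray(rows, dtype=float)
    denom = rows - cy
    depth = np.full_like(rows, np.inf)
    below_horizon = denom > 0
    depth[below_horizon] = fy * cam_height / denom[below_horizon]
    return depth
```

In the actual method, such a prior is only valid on true ground pixels, which is why a learned attention mechanism is needed to select where it can be trusted.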
The second contribution addresses the loss of sharpness at object boundaries. We propose a depth formulation in which each pixel of the depth map is represented by a two-component Gaussian mixture rather than by a single scalar value. This representation explicitly models the coexistence of foreground and background hypotheses at occlusion edges. The predicted distributions are propagated through reprojection, interpolation, and photometric loss computation, making the full self-supervised pipeline compatible with multimodal depth prediction. The resulting model produces sharper discontinuities, reduces floating artifacts in reconstructed point clouds, and provides uncertainty estimates that are meaningful for downstream applications. We also introduce an edge entropy measure to quantify boundary sharpness more directly than conventional global depth metrics.
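A minimal sketch of the per-pixel mixture representation, with hypothetical names and a simple binary-entropy proxy for boundary ambiguity (the thesis's actual parametrisation and edge entropy measure may differ): each pixel carries two depth hypotheses and a mixing weight, the expected depth is their weighted sum, and the entropy of the weight is high exactly where foreground and background hypotheses are equally plausible.

```python
import numpy as np

def mixture_expected_depth(w, mu_fg, mu_bg):
    """Expected depth of a per-pixel two-component mixture.

    Hypothetical parametrisation: w is the foreground mixing weight in
    [0, 1], mu_fg / mu_bg are the foreground and background depth means.
    """
    return w * mu_fg + (1.0 - w) * mu_bg

def weight_entropy(w, eps=1e-8):
    """Binary entropy of the mixing weight, as a proxy for ambiguity.

    Near occlusion edges both hypotheses are plausible (w close to 0.5)
    and the entropy peaks at log(2); in unambiguous regions one component
    dominates and the entropy approaches zero.
    """
    w = np.clip(w, eps, 1.0 - eps)
    return -(w * np.log(w) + (1.0 - w) * np.log(1.0 - w))
```

Collapsing the mixture to a single scalar (e.g. its expectation) recovers the conventional depth map, while the weights retain the boundary ambiguity that a scalar prediction discards.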
Taken together, these contributions show that incorporating explicit structure into self-supervised learning improves monocular depth estimation along dimensions that standard photometric objectives capture poorly. Ground geometry provides a practical route to metric scale when such a prior is available, while mixture-based depth representations better reflect the ambiguity present at occlusion boundaries.