Thesis of Aurélien Cecille
Subject:
Start date: 01/02/2023
End date (estimated): 01/02/2026
Advisor: Stefan Duffner
Coadvisor: Franck Davoine
Summary:
This thesis studies self-supervised monocular depth estimation, with a particular focus on camera-based side mirrors (camera monitor systems) used on trucks and buses. In this setting, large amounts of unlabeled video are available, while depth annotations are costly to acquire. Self-supervised learning therefore provides an attractive training paradigm: it exploits geometric consistency between consecutive images instead of relying on ground-truth depth maps. However, despite strong progress in recent years, standard self-supervised monocular methods still face important limitations for practical deployment: they do not recover metric scale reliably, and they produce blurred depth discontinuities at object boundaries.
The first contribution of this thesis addresses scale ambiguity. The proposed self-supervised method injects ground information into the training process by using known camera parameters to derive an analytic depth prior under a flat-ground assumption. This prior is combined with a learned attention mechanism that identifies where it is trustworthy in the image. A dedicated loss formulation couples ground selection and scale recovery, allowing the network to learn metrically meaningful depth without requiring depth supervision or ground mask annotations. Experiments on KITTI show competitive metric depth performance among self-supervised approaches, while additional evaluations demonstrate stronger robustness to camera pose changes and improved zero-shot transfer to unseen cameras and datasets such as DDAD.
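The flat-ground prior can be made concrete with standard pinhole geometry. The sketch below is an illustration, not the thesis's actual formulation: it assumes a level camera at known height above the ground, with the image y-axis pointing down, so a pixel in row v back-projects to a ray with vertical slope (v - cy) / fy, and intersecting that ray with the ground plane yields the depth fy * h / (v - cy), defined only below the horizon.

```python
import numpy as np

def flat_ground_depth(fy, cy, cam_height, rows):
    """Analytic depth prior for pixels assumed to lie on flat ground.

    Illustrative sketch (level pinhole camera, y-axis down):
      - fy, cy: vertical focal length and principal point (pixels)
      - cam_height: camera height above the ground plane (meters)
      - rows: image row coordinates v of the pixels of interest

    A ray through row v has vertical slope (v - cy) / fy; intersecting it
    with the ground plane gives depth z = fy * cam_height / (v - cy).
    Rows at or above the horizon (v <= cy) get an infinite prior.
    """
    rows = np.asarray(rows, dtype=float)
    denom = rows - cy
    depth = np.full_like(rows, np.inf)
    below_horizon = denom > 0
    depth[below_horizon] = fy * cam_height / denom[below_horizon]
    return depth
```

In the actual method, such a prior is only valid on true ground pixels, which is why a learned attention mechanism is needed to select where it can be trusted.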
The second contribution addresses the loss of sharpness at object boundaries. We propose a depth formulation in which each pixel of the depth map is represented by a two-component Gaussian mixture rather than by a single scalar value. This representation explicitly models the coexistence of foreground and background hypotheses at occlusion edges. The predicted distributions are propagated through reprojection, interpolation, and photometric loss computation, making the full self-supervised pipeline compatible with multimodal depth prediction. The resulting model produces sharper discontinuities, reduces floating artifacts in reconstructed point clouds, and provides uncertainty estimates that are meaningful for downstream applications. We also introduce an edge entropy measure to quantify boundary sharpness more directly than conventional global depth metrics.
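A minimal sketch of the per-pixel mixture representation, with hypothetical names and a simple binary-entropy proxy for boundary ambiguity (the thesis's actual parametrisation and edge entropy measure may differ): each pixel carries two depth hypotheses and a mixing weight, the expected depth is their weighted sum, and the entropy of the weight is high exactly where foreground and background hypotheses are equally plausible.

```python
import numpy as np

def mixture_expected_depth(w, mu_fg, mu_bg):
    """Expected depth of a per-pixel two-component mixture.

    Hypothetical parametrisation: w is the foreground mixing weight in
    [0, 1], mu_fg / mu_bg are the foreground and background depth means.
    """
    return w * mu_fg + (1.0 - w) * mu_bg

def weight_entropy(w, eps=1e-8):
    """Binary entropy of the mixing weight, as a proxy for ambiguity.

    Near occlusion edges both hypotheses are plausible (w close to 0.5)
    and the entropy peaks at log(2); in unambiguous regions one component
    dominates and the entropy approaches zero.
    """
    w = np.clip(w, eps, 1.0 - eps)
    return -(w * np.log(w) + (1.0 - w) * np.log(1.0 - w))
```

Collapsing the mixture to a single scalar (e.g. its expectation) recovers the conventional depth map, while the weights retain the boundary ambiguity that a scalar prediction discards.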
Taken together, these contributions show that incorporating explicit structure into self-supervised learning improves monocular depth estimation along dimensions that standard photometric objectives capture poorly. Ground geometry provides a practical route to metric scale when such a prior is available, while mixture-based depth representations better reflect the ambiguity present at occlusion boundaries.