Seminar by Mubarak Shah - Fine-Grained Video Retrieval
On 19/09/2025, from 13:30 to 15:00. Amphi Gaston Berger.
Contact information: Franck Davoine, franck.davoine@cnrs.fr.
The goal of video retrieval is to learn robust representations such that a query's representation can effectively retrieve relevant items from a video gallery. While traditional methods typically return semantically related results, they often fail to ensure temporal alignment or to capture fine-grained temporal nuances. To address these limitations, I will begin by introducing Alignable Video Retrieval (AVR), a novel task that tackles the previously unexplored challenge of identifying temporally alignable videos within large datasets.

Next, I will present Composed Video Retrieval (CoVR), which focuses on retrieving a target video based on a query video and a modification text describing the desired change. Existing CoVR benchmarks largely focus on appearance variations or coarse-grained events, falling short in evaluating models' ability to handle subtle, fast-paced temporal changes and complex compositional reasoning. To bridge this gap, we introduce two new datasets, Dense-WebVid-CoVR and TF-CoVR, which capture fine-grained and compositional actions across diverse video segments, enabling more detailed and nuanced retrieval tasks.

I will conclude the talk with our recent work on ViLL-E: Video LLM Embeddings for Retrieval. ViLL-E extends VideoLLMs with a joint training framework that supports both generative tasks (e.g., VideoQA) and embedding-based tasks such as video retrieval. This dual capability enables VideoLLMs to generate embeddings for retrieval, a functionality lacking in current models, without sacrificing generative performance.