Seminar by Mubarak Shah - Fine-Grained Video Retrieval

The goal of video retrieval is to develop robust representations that allow efficient search of relevant items within large video datasets. Traditional methods often fail to capture fine-grained temporal nuances, which motivates two new tasks: Alignable Video Retrieval (AVR), designed to identify temporally alignable videos, and Composed Video Retrieval (CoVR), which retrieves a target video based on a query video and a modification text. To evaluate these approaches, two new datasets have been introduced, Dense-WebVid-CoVR and TF-CoVR, focusing on fine-grained and compositional actions. The talk will also present ViLL-E, a joint training framework that extends VideoLLMs to both generative tasks and embedding-based retrieval, unifying video generation and retrieval while maintaining high performance.

On 19/09/2025, from 13:30 to 15:00, Amphi Gaston Berger.
Contact information: Franck Davoine, franck.davoine@cnrs.fr.

The goal of video retrieval is to learn robust representations such that a query's representation can effectively retrieve relevant items from a video gallery. While traditional methods typically return semantically related results, they often fail to ensure temporal alignment or capture fine-grained temporal nuances. To address these limitations, I will begin by introducing Alignable Video Retrieval (AVR), a novel task that tackles the previously unexplored challenge of identifying temporally alignable videos within large datasets. Next, I will present Composed Video Retrieval (CoVR), which focuses on retrieving a target video based on a query video and a modification text describing the desired change. Existing CoVR benchmarks largely focus on appearance variations or coarse-grained events, falling short in evaluating models' ability to handle subtle, fast-paced temporal changes and complex compositional reasoning. To bridge this gap, we introduce two new datasets, Dense-WebVid-CoVR and TF-CoVR, which capture fine-grained and compositional actions across diverse video segments, enabling more detailed and nuanced retrieval tasks. I will conclude the talk with our recent work on ViLL-E: Video LLM Embeddings for Retrieval. ViLL-E extends VideoLLMs by introducing a joint training framework that supports both generative tasks (e.g., VideoQA) and embedding-based tasks such as video retrieval. This dual capability enables VideoLLMs to generate embeddings for retrieval, a functionality lacking in current models, without sacrificing generative performance.
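For readers unfamiliar with the CoVR setting, the following is a minimal, illustrative sketch of embedding-based composed retrieval: a query video and a modification text are each encoded, fused into a single query embedding, and the gallery is ranked by cosine similarity. The encoders, the simple additive fusion, and the function names here are placeholder assumptions for illustration only, not the models or methods presented in the talk.

```python
# Illustrative sketch of composed video retrieval (CoVR) as embedding-based ranking.
# All encoders below are hypothetical placeholders; in practice they would be
# pretrained video/text models (e.g. a VideoLLM producing embeddings).
import torch
import torch.nn.functional as F

def encode_video(frame_features: torch.Tensor) -> torch.Tensor:
    # Placeholder: average-pool per-frame features into one video embedding.
    return frame_features.mean(dim=0)

def encode_text(token_features: torch.Tensor) -> torch.Tensor:
    # Placeholder: average-pool per-token features into one text embedding.
    return token_features.mean(dim=0)

def covr_rank(query_video, modification_text, gallery_embeddings, top_k=5):
    # Fuse the query video and the modification text into one query embedding
    # (here a simple sum), then rank gallery videos by cosine similarity.
    query = encode_video(query_video) + encode_text(modification_text)
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery_embeddings, dim=-1)
    scores = gallery @ query                    # cosine similarity per gallery video
    return torch.topk(scores, k=top_k).indices  # indices of the best-matching videos

# Toy usage with random 128-dim features and a gallery of 1000 videos.
query_video = torch.randn(16, 128)   # 16 frames of precomputed frame features
modification = torch.randn(8, 128)   # 8 tokens of precomputed text features
gallery = torch.randn(1000, 128)     # precomputed gallery video embeddings
print(covr_rank(query_video, modification, gallery))
```

The point of the sketch is only the retrieval interface: whatever model produces the embeddings, the target video is the gallery item whose embedding is closest to the fused query, which is why fine-grained temporal information must survive in those embeddings for benchmarks like Dense-WebVid-CoVR and TF-CoVR.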