Funded Post-doc position in Lyon

Deep Learning and Deep-Reinforcement Learning for Human Centered Vision and Robotics


Lyon, France
INSA-Lyon and LIRIS Laboratory / CITI Laboratory, INRIA Chroma work group

When, duration

Start: Early 2018
Duration: 12 months


Salary, funding

2000€ per month (net, i.e. after taxes)
Funded by the French-Canadian ANR/NSERC project “DeepVision” (2016-2020)


Contacts

Christian Wolf
Julien Mille


Learning deep hierarchical representations (“deep learning”) is established as a powerful methodology in computer vision, capable of learning complex prediction models from large amounts of data. This post-doc position builds on previous work on deep learning for human motion understanding at the LIRIS laboratory in Lyon, France ([1-5] and others). The candidate will work on computer vision applications related to understanding humans, in particular the recognition of complex activities [3,4,5].

Human perception focuses selectively on parts of a scene to acquire information at specific places and times. In machine learning, this kind of process is referred to as an attention mechanism, and it has drawn increasing interest for language, images and other data. Integrating attention can potentially improve overall accuracy, as the system can focus on the parts of the data that are most relevant to the task. Mechanisms of visual attention currently play an important role in many vision tasks [3][6-10].
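To make the idea concrete, the core of a soft attention mechanism can be sketched in a few lines. The sketch below is a didactic toy, not code from the cited papers, and all names in it are ours: a task-dependent query scores each input region, a softmax turns the scores into weights, and the input is summarized as a weighted average that emphasizes the relevant regions.

```python
import numpy as np

def soft_attention(features, query):
    """Weight N feature vectors (e.g. image regions) by their relevance
    to a task-dependent query vector; return the summary and the weights."""
    scores = features @ query                        # one relevance score per region
    scores = scores - scores.max()                   # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over regions
    return weights @ features, weights               # attended summary, attention weights

# Three toy "regions"; only the second one points in the query direction.
feats = np.array([[1.0, 0.0], [0.0, 5.0], [0.5, 0.0]])
q = np.array([0.0, 1.0])
summary, w = soft_attention(feats, q)                # w concentrates on region 1
```

In trainable models the query (and often the scoring function) is itself learned, so the network learns where to look as a by-product of solving the task.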

The objective of this post-doc is to advance the state of the art in human-centered vision and robotics through visual attention mechanisms for human understanding. A particular focus will be put on two applications: mechanisms of visual attention for videos and still images (see Figure 1), and “physical” attention mechanisms, where the agent is not virtual but physical. The latter translates to tasks where mobile robots optimize their locations/navigation in order to solve complex visual tasks (see Figure 2).

In terms of methodological contributions, this research will focus on deep learning and deep reinforcement learning for agent control [11] and for vision [6,12].
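The flavor of reinforcement learning involved can be illustrated with a minimal tabular Q-learning loop on a toy chain world. This is a didactic sketch only, not the deep Q-network of [11], which replaces the value table with a convolutional network trained from replayed transitions; all parameter values here are arbitrary.

```python
import numpy as np

# Toy 5-state chain: the agent starts in state 0 and is rewarded only
# upon reaching state 4. Action 0 moves right, action 1 moves left.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1      # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 4:
        # Epsilon-greedy action selection.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, 4) if a == 0 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: bootstrap from the best next-state value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
# After training, moving toward the reward is preferred: Q[s, 0] > Q[s, 1].
```

In the “physical attention” setting, the state would encode what the robot currently sees and the actions would move the robot, so that learning where to go plays the role that learning where to look plays in visual attention.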

The post-doctoral candidate will participate in the ongoing collaborations between INSA-Lyon and the University of Guelph, Canada, on deep learning; with UPMC/LIP6, on deep learning; and with INRIA (CHROMA research group) on reinforcement learning and agent control.

Figure 1: A visual attention process selects parts of the video relevant to the task [3].

Figure 2: A fleet of robots jointly observing a complex visual scene and optimizing their positions with respect to the task.


[1] Natalia Neverova, Christian Wolf, Graham W. Taylor and Florian Nebout. ModDrop: adaptive multi-modal gesture recognition. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016.

[2] Natalia Neverova, Christian Wolf, Florian Nebout, Graham W. Taylor. Hand Pose Estimation through Weakly-Supervised Learning of a Rich Intermediate Representation. arXiv:1511.06728, 2015.

[3] Fabien Baradel, Christian Wolf, Julien Mille. Pose-conditioned Spatio-Temporal Attention for Human Action Recognition. arXiv:1703.10106, 2017.

[4] Christian Wolf, Eric Lombardi, Julien Mille, Oya Celiktutan, Mingyuan Jiu, Emre Dogan, Gonen Eren, Moez Baccouche, Emmanuel Dellandréa, Charles-Edmond Bichot, Christophe Garcia, Bülent Sankur. Evaluation of video activity localizations integrating quality and quantity measurements. In Computer Vision and Image Understanding (127):14-30, 2014.

[5] Moez Baccouche, Frank Mamalet, Christian Wolf, Christophe Garcia, Atilla Baskurt. Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification. In the Proceedings of the British Machine Vision Conference (BMVC), 2012.

[6] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

[7] Jason Kuen, Zhenhua Wang, and Gang Wang. Recurrent Attentional Networks for Saliency Detection. In CVPR, 2016.

[8] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action Recognition using Visual Attention. ICLR Workshop track, 2016.

[9] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. arXiv:1611.06067, 2016.

[10] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. In CVPR, 2016.

[11] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, pages 529–533, 2015.

[12] M. Gygli, M. Norouzi, and A. Angelova. Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs. arXiv pre-print, 2017.