Thesis of Quentin Gallouedec
Start date: 05/02/2021
End date: 05/02/2024
Advisor: Emmanuel Dellandréa
State-of-the-art reinforcement learning algorithms rely on the availability of a reward signal that is informative enough to converge to an optimal policy. These algorithms fail when rewards are deceptive or sparse, as is often the case in robotic tasks. To overcome this issue, the agent must adopt an efficient exploration strategy that does not rely on exposure to the environment's rewards.
Intrinsic motivation has been widely studied in the literature and has greatly improved the performance of these algorithms in many environments, especially hard-exploration environments. Nevertheless, algorithms based on intrinsic motivation may fail to solve hard-exploration tasks that are easily solved by humans (e.g., the game Montezuma's Revenge). This failure stems from two major weaknesses: (1) detachment: the agent loses track of unexplored areas because the intrinsic reward of intermediate areas is depleted, and (2) derailment: the agent is unable to return to previously visited states. A new paradigm has therefore emerged that consists of training an agent to first return to promising states and then explore from them (First Return Then Explore). It is the first to achieve superhuman performance on the games Montezuma's Revenge and Pitfall.
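The depletion behind detachment can be illustrated with a minimal sketch of a count-based intrinsic reward. This is not the specific bonus used in any particular cited work, only a common form (r = 1/sqrt(N(s)) over discretized states) chosen for illustration; the class name and state keys are hypothetical.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Illustrative count-based intrinsic reward: r_int(s) = 1 / sqrt(N(s)).

    Each visit to a state increments its count, so the bonus for that
    state shrinks over time ("depletion"): once intermediate areas are
    no longer rewarding, the agent can lose the incentive to pass
    through them toward unexplored areas (detachment).
    """

    def __init__(self):
        self.counts = defaultdict(int)  # N(s), keyed by a hashable state

    def reward(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

bonus = CountBasedBonus()
# Repeated visits to the same (hypothetical) state: the bonus decays
# monotonically: 1.0, 1/sqrt(2), 1/sqrt(3), 1/sqrt(4), ...
rewards = [bonus.reward("room_1") for _ in range(4)]
```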
Nevertheless, this approach relies on partitioning the observation space into cells, and goals are selected according to cell visitation counts. The way the space is partitioned is critical and strongly conditions the results. This partitioning even renders the algorithm unusable in Procedurally Generated Environments (PGE). We therefore argue that this method does not scale well to large state spaces, such as pixel-based observation spaces. On the other hand, work on intrinsic motivation has explored many ways to compute intrinsic rewards in large and continuous spaces.
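The cell-based mechanism described above can be sketched as follows. This is a simplified illustration, not the exact Go-Explore implementation: the quantization scheme, bin size, and class names are assumptions chosen to show how observations collapse into coarse cells and how a least-visited cell is picked as the next goal.

```python
from collections import defaultdict

def to_cell(observation, bin_size=16):
    """Map a small grayscale frame (rows of ints in 0-255) to a coarse,
    hashable cell key by quantizing every pixel into bin_size buckets.
    Many distinct observations collapse into the same cell, which is
    what makes counting feasible - and what makes the choice of
    partition so critical."""
    return tuple(tuple(px // bin_size for px in row) for row in observation)

class CellArchive:
    """Tracks visitation counts per cell and proposes the next goal."""

    def __init__(self):
        self.visits = defaultdict(int)

    def add(self, observation):
        self.visits[to_cell(observation)] += 1

    def select_goal(self):
        # Count-based goal selection: prefer the least-visited cell.
        return min(self.visits, key=self.visits.get)

archive = CellArchive()
archive.add([[0, 0], [0, 0]])          # dark frame, seen twice
archive.add([[0, 0], [0, 0]])
archive.add([[255, 255], [255, 255]])  # bright frame, seen once
goal = archive.select_goal()           # the rarer (bright) cell
```

In a procedurally generated environment, almost every observation falls into a fresh cell, so the counts carry no useful signal, which is the failure mode argued above.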
Therefore, in this PhD project, we propose to combine the best of these two approaches by designing an algorithm that follows the Go-Explore paradigm but whose goal-selection strategy is adapted to large and continuous state spaces.