In reinforcement learning, model-free algorithms learn values and/or policies directly from experience. In contrast, model-based algorithms learn a model of the environment. I will discuss commonalities between model-based algorithms and algorithms that instead replay past experiences to be consumed by an otherwise model-free update rule. I will argue that model-based algorithms can be useful, but are perhaps not used optimally in their most common applications, and I will present several alternatives. Several of these insights relate to over-confidence, a theme that emerges more generally in reinforcement learning. I will draw parallels to model-free algorithms that can be similarly over-confident in some cases, and discuss solutions and future directions.
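As a small illustrative sketch (not from the talk itself), one well-known source of over-confidence in model-free algorithms is taking a maximum over noisy value estimates, as Q-learning's max-operator does: even when all true action values are equal, the max of the noisy estimates is biased upward.

```python
# Sketch: over-estimation from maxing over noisy value estimates.
# All true action values are 0, but estimates carry zero-mean noise,
# so the expected maximum of the estimates is strictly positive.
import random

random.seed(0)
n_actions, n_trials, noise = 10, 10_000, 1.0

bias = 0.0
for _ in range(n_trials):
    # Noisy estimates of action values whose true value is 0.
    estimates = [random.gauss(0.0, noise) for _ in range(n_actions)]
    bias += max(estimates)  # E[max(estimates)] > max of true values (= 0)
bias /= n_trials

print(f"average max of noisy estimates: {bias:.2f}")  # clearly positive
```

The action count, noise level, and trial count here are arbitrary choices for illustration; the upward bias persists for any number of actions greater than one and any non-degenerate noise.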
Hado van Hasselt received his Ph.D. in AI from Utrecht University in the Netherlands. After a post-doc with Rich Sutton at the University of Alberta, he moved back to Europe to join DeepMind in 2014. Hado has been doing research in artificial intelligence and reinforcement learning, including "deep" reinforcement learning, for more than a decade. His goal is to create better general algorithms that learn to solve sequential decision problems without relying on potentially brittle, domain-specific, hand-crafted solutions.