Data-efficient Hindsight Off-policy Option Learning

2021 
Hierarchical approaches to reinforcement learning aim to improve data efficiency and accelerate learning by incorporating different abstractions. We introduce Hindsight Off-policy Options (HO2), an efficient off-policy option learning algorithm, and isolate the impact of action and temporal abstraction in the option framework by comparing flat policies, mixture policies without temporal abstraction, and finally option policies, all with comparable policy optimization. When aiming for data efficiency, we demonstrate the importance of off-policy optimization: even flat policies trained off-policy can outperform on-policy option methods. In addition, off-policy training combined with backpropagation through a dynamic programming inference procedure (through time and through the policy components at every time step) enables us to train all components' parameters independently of the data-generating behavior policy. We further illustrate challenges in off-policy option learning and the related importance of trust-region constraints. Experimentally, we demonstrate that HO2 outperforms existing option learning methods and that both action and temporal abstraction provide strong benefits, particularly in more demanding simulated robot manipulation tasks from raw pixel inputs. Finally, we develop an intuitive extension to encourage temporal abstraction and investigate differences in its impact when learning from scratch versus using pre-trained options.
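
To illustrate the dynamic-programming inference mentioned above, the following is a minimal sketch of a differentiable forward pass that marginalizes over option choices at every time step. It is not the authors' implementation; the function name, tensor shapes, and the use of PyTorch are illustrative assumptions. Because the result is a single differentiable scalar, gradients can be backpropagated through time and through the high-level, low-level, and termination components at every step.

```python
import torch

def option_log_likelihood(log_pi_o, log_pi_a, log_beta, log_1m_beta):
    """Marginal log-likelihood of an action sequence under an option policy.

    Shapes (illustrative): T time steps, K options.
      log_pi_o:    [T, K] high-level (option-selection) log-probabilities
      log_pi_a:    [T, K] log-likelihood of the executed action under each option
      log_beta:    [T, K] log-probability that the previously active option terminates
      log_1m_beta: [T, K] log-probability that the previously active option continues
    """
    T, _ = log_pi_o.shape
    # alpha[k] = log p(actions so far, option k active at the current step)
    alpha = log_pi_o[0] + log_pi_a[0]
    for t in range(1, T):
        # The previous option either continues ...
        stay = alpha + log_1m_beta[t]
        # ... or terminates, after which the high-level policy selects an option anew.
        switch = torch.logsumexp(alpha + log_beta[t], dim=0) + log_pi_o[t]
        alpha = torch.logaddexp(stay, switch) + log_pi_a[t]
    return torch.logsumexp(alpha, dim=0)


# Toy usage with random inputs: T=6 steps, K=4 options.
T, K = 6, 4
logits = torch.randn(3, T, K, requires_grad=True)
log_pi_o = torch.log_softmax(logits[0], dim=-1)
log_pi_a = logits[1]                      # stand-in for per-option action log-likelihoods
term = torch.sigmoid(logits[2])           # per-option termination probabilities
log_lik = option_log_likelihood(log_pi_o, log_pi_a, torch.log(term), torch.log(1 - term))
log_lik.backward()                        # gradients flow through time and all components
```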