Self-Supervised Continuous Control without Policy Gradient

2021 
Despite the remarkable progress made by policy gradient algorithms in reinforcement learning (RL), sub-optimal policies often result because the policy gradient update explores only locally. In this work, we propose a method called Zeroth-Order Supervised Policy Improvement (ZOSPI) that exploits the estimated value function Q globally while preserving the local exploitation of policy gradient methods. Experiments show that ZOSPI achieves competitive results on the MuJoCo benchmarks with remarkable sample efficiency. Moreover, unlike conventional policy gradient methods, the policy learning of ZOSPI is conducted in a self-supervised manner. We show that this self-supervised learning paradigm is flexible enough to incorporate optimistic exploration as well as to adopt a non-parametric policy.
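The abstract does not spell out the update rule, but the core idea it describes, exploiting the learned Q function globally while keeping local exploitation, and training the policy by supervised regression rather than a policy gradient, can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `zospi_policy_update`, the critic interface `q_fn(obs, act)`, the sample counts, the noise scale, and the assumed [-1, 1] action bounds are all assumptions made for the example.

```python
import torch
import torch.nn as nn

def zospi_policy_update(policy, q_fn, optimizer, obs_batch, act_dim,
                        n_global=50, n_local=10, noise_std=0.1):
    """One self-supervised policy-improvement step (illustrative sketch).

    Candidate actions are drawn both globally (uniform over the action box)
    and locally (perturbations of the current policy output), scored with
    the learned critic q_fn, and the policy is regressed toward the
    best-scoring candidate by plain supervised learning -- no gradient of Q
    with respect to the policy parameters is required.
    """
    batch = obs_batch.shape[0]
    with torch.no_grad():
        # Global samples: uniform over the (assumed) [-1, 1]^act_dim action space.
        global_acts = torch.rand(batch, n_global, act_dim) * 2.0 - 1.0
        # Local samples: Gaussian perturbations of the current policy's action.
        local_acts = (policy(obs_batch).unsqueeze(1)
                      + noise_std * torch.randn(batch, n_local, act_dim)).clamp(-1.0, 1.0)
        candidates = torch.cat([global_acts, local_acts], dim=1)      # (B, K, A)
        n_cand = candidates.shape[1]
        obs_rep = obs_batch.unsqueeze(1).expand(-1, n_cand, -1)
        q_values = q_fn(obs_rep.reshape(-1, obs_rep.shape[-1]),
                        candidates.reshape(-1, act_dim)).reshape(batch, n_cand)
        # Self-supervised regression targets: the argmax-Q candidate per state.
        targets = candidates[torch.arange(batch), q_values.argmax(dim=1)]

    # Supervised (zeroth-order) policy improvement: fit the policy to the targets.
    loss = nn.functional.mse_loss(policy(obs_batch), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the targets are obtained by evaluating Q at sampled actions rather than by differentiating Q, the update can use candidates from anywhere in the action space (the "global" part) while the perturbation samples keep refinement near the current policy (the "local" part).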