Greedy Multi-Step Off-Policy Reinforcement Learning

2021 
This paper presents a novel multi-step reinforcement learning algorithm, named Greedy Multi-Step Value Iteration (GM-VI), for the off-policy setting. GM-VI iteratively approximates the optimal value function of a given environment using a newly proposed multi-step bootstrapping technique, in which the step size is adaptively adjusted along each trajectory according to a greedy principle. With the improved multi-step information propagation mechanism, we show that the resulting value iteration process can safely learn from an arbitrary behavior policy without additional off-policy correction. We further analyze the theoretical properties of the corresponding operator, showing that it converges to the globally optimal value function at a rate faster than the traditional Bellman optimality operator. Experiments reveal that the proposed method is reliable, easy to implement, and achieves state-of-the-art performance on a series of standard benchmarks.
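The abstract gives no pseudocode, so the following is a minimal sketch of how the greedy multi-step bootstrap target it describes might be computed; all names here (`greedy_multistep_targets`, `rewards`, `next_values`, `gamma`) are illustrative assumptions, not the authors' implementation. For each position t along a trajectory, the target is taken greedily as the maximum, over candidate step sizes n, of the n-step bootstrapped return.

```python
# Hypothetical sketch (not the paper's code) of a greedy multi-step
# bootstrap target: for each time step, take the max over step sizes n
# of the n-step return bootstrapped with the current value estimate.

import numpy as np

def greedy_multistep_targets(rewards, next_values, gamma=0.99):
    """Compute greedy multi-step targets for one trajectory.

    rewards:     r_0, ..., r_{T-1}, reward after each transition.
    next_values: V(s_1), ..., V(s_T), current value estimates of successors.
    Returns one target per time step t:
        max over n in {1, ..., T - t} of
        r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n}).
    """
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        best = -np.inf
        ret = 0.0        # running discounted reward sum
        discount = 1.0   # gamma^(n-1) while accumulating, gamma^n after
        for n in range(1, T - t + 1):             # candidate step sizes
            ret += discount * rewards[t + n - 1]  # add n-th reward
            discount *= gamma
            # n-step return bootstrapped with V(s_{t+n})
            best = max(best, ret + discount * next_values[t + n - 1])
        targets[t] = best
    return targets
```

Intuitively, because the maximum is taken over step sizes, the target is never worse than the one-step Bellman backup, which is one plausible reading of why the abstract says the method can learn from an arbitrary behavior policy without importance-sampling corrections.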