Greedy Multi-Step Off-Policy Reinforcement Learning

2021 
This paper presents a novel multi-step reinforcement learning algorithm, named Greedy Multi-Step Value Iteration (GM-VI), for the off-policy setting. GM-VI iteratively approximates the optimal value function of a given environment using a newly proposed multi-step bootstrapping technique, in which the step size is adaptively adjusted along each trajectory according to a greedy principle. With the improved multi-step information propagation mechanism, we show that the resulting value iteration process can safely learn from an arbitrary behavior policy without additional off-policy correction. We further analyze the theoretical properties of the corresponding operator, showing that it converges to the globally optimal value function at a rate faster than the traditional Bellman optimality operator. Experiments reveal that the proposed method is reliable, easy to implement, and achieves state-of-the-art performance on a series of standard benchmarks.
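The abstract gives no pseudocode, so the following is a minimal sketch of how the greedy multi-step bootstrap target it describes might be computed; all names here (`greedy_multistep_targets`, `rewards`, `next_values`, `gamma`) are illustrative assumptions, not the authors' implementation. For each position t along a trajectory, the target is taken greedily as the maximum, over candidate step sizes n, of the n-step bootstrapped return.

```python
# Hypothetical sketch (not the paper's code) of a greedy multi-step
# bootstrap target: for each time step, take the max over step sizes n
# of the n-step return bootstrapped with the current value estimate.

import numpy as np

def greedy_multistep_targets(rewards, next_values, gamma=0.99):
    """Compute greedy multi-step targets for one trajectory.

    rewards:     r_0, ..., r_{T-1}, reward after each transition.
    next_values: V(s_1), ..., V(s_T), current value estimates of successors.
    Returns one target per time step t:
        max over n in {1, ..., T - t} of
        r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n}).
    """
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        best = -np.inf
        ret = 0.0        # running discounted reward sum
        discount = 1.0   # gamma^(n-1) while accumulating, gamma^n after
        for n in range(1, T - t + 1):             # candidate step sizes
            ret += discount * rewards[t + n - 1]  # add n-th reward
            discount *= gamma
            # n-step return bootstrapped with V(s_{t+n})
            best = max(best, ret + discount * next_values[t + n - 1])
        targets[t] = best
    return targets
```

Intuitively, because the maximum is taken over step sizes, the target is never worse than the one-step Bellman backup, which is one plausible reading of why the abstract says the method can learn from an arbitrary behavior policy without importance-sampling corrections.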