Variance Aware Reward Smoothing for Deep Reinforcement Learning

2021 
Abstract: A Reinforcement Learning (RL) agent interacts with the environment to learn a policy with high accumulated rewards through trial and error. However, this trial-and-error nature makes the learning process unstable. In this paper, we investigate a common phenomenon called rewards drop, in which the reward trajectory oscillates dramatically during the late stage of RL training. To address this problem, we propose a novel reward shaping technique named Variance Aware Rewards Smoothing (VAR). We show that the proposed method reduces the variance of rewards and mitigates the rewards drop problem without changing the formulation of the value function. Furthermore, a theoretical convergence analysis of VAR is provided, derived from the γ-contraction operator and the fixed-point property of the value function. Finally, the theoretical results are supported by extensive experiments on various benchmarks with advanced algorithms across different random seeds, demonstrating the effectiveness and compatibility of VAR.
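The abstract does not spell out VAR's exact update rule, so the following is only a minimal sketch of the general idea of variance-aware reward smoothing: each raw reward is blended with a running mean, leaning more on the smoothed estimate when recent rewards are noisy. The class name, the window size, and the beta sensitivity knob are illustrative assumptions, not the paper's formulation.

```python
# Illustrative sketch only: assumes a simple variance-aware blending of the raw
# reward with a running mean, where the blend weight grows with recent variance.
from collections import deque

import numpy as np


class VarianceAwareRewardSmoother:
    """Hypothetical reward filter: blends the raw reward with a running mean,
    relying more on the smoothed estimate when recent rewards are noisy."""

    def __init__(self, window: int = 100, beta: float = 1.0):
        self.history = deque(maxlen=window)  # recent raw rewards
        self.beta = beta                     # variance sensitivity (assumed knob)

    def smooth(self, reward: float) -> float:
        self.history.append(reward)
        if len(self.history) < 2:
            return reward
        mean = float(np.mean(self.history))
        var = float(np.var(self.history))
        # Weight in [0, 1): higher recent variance -> rely more on the running mean.
        w = var / (var + self.beta)
        return (1.0 - w) * reward + w * mean


# Usage: wrap the environment reward before storing the transition.
smoother = VarianceAwareRewardSmoother(window=50)
noisy_rewards = 1.0 + 0.5 * np.random.randn(200)
smoothed = [smoother.smooth(r) for r in noisy_rewards]
print(np.std(noisy_rewards), np.std(smoothed))  # smoothed stream has lower variance
```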