LIDAR: Learning from Imperfect Demonstrations with Advantage Rectification

2022 
In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration early in training and to produce overestimated values and suboptimal policies. In this paper, we address this problem by performing advantage rectification with imperfect demonstrations, thereby reducing function estimation errors. Pretraining with expert demonstrations is widely used to accelerate deep RL when simulations are expensive to obtain. However, existing methods such as behavior cloning often assume that the demonstrations carry additional information or performance labels, for example that they are optimal, an assumption that is usually incorrect and unhelpful in the real world. We explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method, learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function that learns only from selected demonstrations, derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and, in turn, reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3, and SAC. Experiments show that our method observably reduces function estimation errors, effectively leverages demonstrations far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
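A minimal sketch of what such a rectified imitation loss could look like, assuming "advantage rectification" means imitating a demonstration action only when the critic estimates it to have a positive advantage over the current policy's action; the function and module names here are illustrative, not the authors' exact implementation.

```python
# Hedged sketch: imitation loss restricted to demonstration transitions whose
# critic-estimated advantage over the current policy action is positive.
import torch
import torch.nn as nn


def rectified_bc_loss(actor: nn.Module,
                      critic: nn.Module,
                      demo_states: torch.Tensor,
                      demo_actions: torch.Tensor) -> torch.Tensor:
    """Behavior-cloning term applied only where the demonstration appears
    better than the current policy according to the critic (assumed form)."""
    with torch.no_grad():
        policy_actions = actor(demo_states)
        # Advantage proxy: Q(s, a_demo) - Q(s, pi(s)).
        adv = critic(torch.cat([demo_states, demo_actions], dim=-1)) \
            - critic(torch.cat([demo_states, policy_actions], dim=-1))
        # Rectify: keep only transitions with positive estimated advantage.
        mask = (adv > 0).float()
    # Imitate the selected demonstration actions with a squared-error term.
    bc_error = ((actor(demo_states) - demo_actions) ** 2).sum(dim=-1, keepdim=True)
    return (mask * bc_error).sum() / mask.sum().clamp(min=1.0)


# Usage example with toy actor/critic networks (state dim 4, action dim 2).
if __name__ == "__main__":
    actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Tanh())
    critic = nn.Sequential(nn.Linear(4 + 2, 32), nn.ReLU(), nn.Linear(32, 1))
    states = torch.randn(16, 4)
    actions = torch.rand(16, 2) * 2 - 1  # imperfect demonstration actions
    loss = rectified_bc_loss(actor, critic, states, actions)
    loss.backward()  # gradients flow only through the masked imitation term
```

In an actor-critic setup this term would typically be added to the usual policy-gradient loss, so the agent ignores demonstration actions that the critic already judges worse than its own.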