Prudent Policy Gradient with Auxiliary Actor in Multi-degree-of-freedom Robotic Tasks

2021 
Overestimation bias caused by function approximation error is a common problem in value-based reinforcement learning algorithms. The Twin Delayed Deep Deterministic policy gradient (TD3) algorithm adopts clipped Double Q-learning and delayed policy updates to reduce the impact of this problem. Although TD3 mitigates the bias to some extent, the problem is still not solved ideally. Thus, a novel algorithm based on TD3, named Prudent Policy Gradient (PPG), is proposed, in which an auxiliary actor prevents the main actor from selecting excessive actions, making the agent's behavior more prudent. This allows the proposed PPG to find a more efficient and stable policy. Experimental results illustrate that the proposed PPG outperforms TD3 on robotic tasks from several MuJoCo benchmarks and on path-exploration tasks.
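The clipped Double Q-learning mechanism that TD3 (and therefore PPG) builds on can be illustrated with a minimal sketch. The abstract does not give the details of PPG's auxiliary actor, so the code below shows only the standard TD3 target computation it extends: the bootstrap target takes the minimum of two target critics' estimates to counteract overestimation bias. The function names and signatures here are illustrative, not from the paper.

```python
def clipped_double_q_target(reward, done, gamma, q1_next, q2_next):
    """TD3-style clipped Double Q-learning target.

    Uses the minimum of the two target critics' value estimates for the
    next state-action pair, so a critic that overestimates cannot inflate
    the bootstrap target. `done` is 1.0 at terminal transitions, else 0.0.
    (Illustrative sketch; not the paper's PPG auxiliary-actor mechanism.)
    """
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


# Example: the smaller critic estimate (4.0) is used, not the larger (5.0).
target = clipped_double_q_target(reward=1.0, done=0.0, gamma=0.99,
                                 q1_next=5.0, q2_next=4.0)
print(target)  # 1.0 + 0.99 * 4.0 = 4.96
```

Delayed policy updates, the other TD3 ingredient mentioned above, simply mean the actor (and target networks) are updated once every few critic updates.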