Decoupling Value and Policy for Generalization in Reinforcement Learning.

2021 
Standard deep reinforcement learning algorithms use a shared representation for the policy and value function. However, we argue that more information is needed to accurately estimate the value function than to learn the optimal policy. Consequently, the use of a shared representation for the policy and value function can lead to overfitting. To alleviate this problem, we propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic. First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment. IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors. Moreover, IDAAC learns representations, value predictions, and policies that are more robust to aesthetic changes in the observations that do not change the underlying state of the environment.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    62
    References
    4
    Citations
    NaN
    KQI
    []