Decentralized TD(0) with Gradient Tracking

2021 
In this letter, we consider the policy evaluation problem with linear function approximation in the context of decentralized multi-agent reinforcement learning (MARL), where agents with a fixed joint policy cooperate to estimate the global expected cumulative reward through a decentralized communication network. In existing algorithms, every agent updates its local parameter by combining its neighbors' local parameters and then taking a local stochastic temporal-difference(0) (TD(0)) gradient step. However, because the reward functions differ across agents, the local stochastic TD(0) gradients can vary widely, which hinders the agents from reaching a consensual, optimal parameter. Motivated by the gradient tracking strategy in decentralized optimization, we combine gradient tracking with decentralized TD(0) to accelerate the process of reaching consensus. We also propose two other acceleration strategies: one based on gradient consensus, and another that jointly uses gradient tracking and gradient consensus. Numerical experiments demonstrate that the proposed algorithms converge faster than the popular decentralized TD(0) method.
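To make the gradient-tracking idea concrete, below is a minimal sketch in Python/NumPy of decentralized TD(0) with gradient tracking on a toy multi-agent Markov reward process. The environment, the feature map `phi`, the mixing matrix `W`, the step size, and the exact update ordering are illustrative assumptions, not the paper's exact specification.

```python
# Sketch (under assumed problem setup): decentralized TD(0) with gradient tracking.
# Each agent i mixes neighbors' parameters and steps along a tracker y_i that
# follows the network-average of the local stochastic TD(0) semi-gradients.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: shared Markov chain, agent-specific rewards (assumed setup)
n_agents, n_states, dim = 4, 10, 5
gamma, alpha = 0.9, 0.05

P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)              # shared transition matrix
rewards = rng.random((n_agents, n_states))     # diverse local reward functions
phi = rng.standard_normal((n_states, dim))     # fixed linear features

# Doubly stochastic mixing matrix for a ring communication graph (assumed)
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

def local_td0_grad(theta_i, r_i, s, s_next):
    """Stochastic TD(0) semi-gradient of agent i at its current parameter."""
    delta = r_i[s] + gamma * phi[s_next] @ theta_i - phi[s] @ theta_i
    return delta * phi[s]

theta = np.zeros((n_agents, dim))
s = rng.integers(n_states)
s_next = rng.choice(n_states, p=P[s])
g_prev = np.stack([local_td0_grad(theta[i], rewards[i], s, s_next)
                   for i in range(n_agents)])
y = g_prev.copy()                              # tracker initialized to local gradients

for t in range(5000):
    # Parameter update: mix neighbors' parameters, step along the tracker y
    theta = W @ theta + alpha * y

    # Sample the next transition of the shared chain
    s = s_next
    s_next = rng.choice(n_states, p=P[s])

    # Tracker update: mix neighbors' trackers, add the change in local gradients
    g_new = np.stack([local_td0_grad(theta[i], rewards[i], s, s_next)
                      for i in range(n_agents)])
    y = W @ y + (g_new - g_prev)
    g_prev = g_new

consensus_err = np.linalg.norm(theta - theta.mean(axis=0), axis=1).max()
print("max consensus error:", consensus_err)
```

In this sketch, the tracker update `y = W @ y + (g_new - g_prev)` keeps each agent's search direction close to the network-average TD(0) semi-gradient, which is what mitigates the gradient diversity caused by heterogeneous rewards; plain decentralized TD(0) would instead step along each agent's own local gradient.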