Customised Pearlmutter propagation: A hardware architecture for trust region policy optimisation

2017 
Reinforcement Learning (RL) is an area of machine learning in which an agent makes sequential decisions by interacting with an environment. The agent receives rewards from the environment and seeks an optimal policy that maximises the cumulative reward. Trust Region Policy Optimisation (TRPO) is a recent policy optimisation algorithm that achieves superior results on various RL benchmarks but is computationally expensive. This paper proposes Customised Pearlmutter Propagation (CPP), a novel hardware architecture that accelerates TRPO on FPGA. We use the Pearlmutter algorithm to address the key computational bottleneck of TRPO in a hardware-efficient manner, avoiding symbolic differentiation with change of variables. Experimental evaluation on robotic locomotion benchmarks demonstrates that the proposed CPP architecture, implemented on a Stratix-V FPGA, achieves up to a 20-times speed-up over the 6-threaded Keras deep learning library with a Theano backend running on a Core i7-5930K CPU.
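For context on the bottleneck the abstract refers to: TRPO's trust-region step is typically solved with conjugate gradient, which requires repeated curvature-vector (Fisher-vector) products, and Pearlmutter's algorithm computes such products without ever materialising the matrix, since Hv is the directional derivative of the gradient along v. The snippet below is a minimal sketch of that numerical trick in JAX, assuming a toy scalar objective; `loss`, `theta`, and `v` are hypothetical stand-ins, and this illustrates the Pearlmutter idea generically, not the paper's FPGA datapath.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Hypothetical scalar objective standing in for a TRPO surrogate/KL term.
    return jnp.sum(jnp.sin(theta) ** 2) + 0.1 * jnp.sum(theta ** 4)

def hvp(f, theta, v):
    # Pearlmutter's trick: the Hessian-vector product H v equals the
    # directional derivative of the gradient along v. Here it is computed
    # as a forward-mode JVP through the reverse-mode gradient, so the full
    # Hessian is never formed and no symbolic differentiation is needed.
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

theta = jnp.array([0.3, -1.2, 0.7])
v = jnp.array([1.0, 0.0, -1.0])

print(hvp(loss, theta, v))
# Sanity check against the explicit Hessian (feasible only for tiny models):
print(jax.hessian(loss)(theta) @ v)
```

Because the trick reduces each curvature-vector product to a fixed pipeline of gradient-like passes, it maps naturally onto a streaming hardware architecture, which is presumably what motivates its use in CPP.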