Reinforcement Learning endowed with safe veto policies to learn the control of Linked-Multicomponent Robotic Systems

2015 
Reinforcement learning-based control of systems whose state space contains many Undesired Terminal States (UTS) suffers from severe convergence problems. We define UTS as terminal states that carry no positive reward information. They appear in the training of over-constrained systems, where breaking a constraint means that all the effort invested during a learning episode is lost without gathering any constructive information about how to achieve the target task. The random exploration performed by RL algorithms is unfruitful until the system reaches a final state bearing some reward that can be used to update the state-action value functions; hence UTS seriously impede the convergence of the learning process. The most efficient learning strategies avoid reaching any UTS, ensuring that each learning episode provides useful reward information. Safe Modular State-Action Veto (Safe-MSAV) policies learn specifically how to avoid state transitions leading to an UTS, which makes state-space exploration much more efficient; the larger the ratio of UTS to the total number of states, the greater the improvement. Safe-MSAV uses independent concurrent modules, each dealing with a separate kind of UTS. We report experiments on the control of Linked Multicomponent Robotic Systems (L-MCRS) showing a dramatic decrease in the computational resources required, yielding faster as well as more accurate results than conventional exploration strategies that do not implement explicit mechanisms to avoid falling into an UTS.
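The core idea described in the abstract — concurrent veto modules, each recording state-action pairs that led to one kind of UTS, masking those actions out of a standard value-based learner's action selection — can be illustrated with a minimal sketch. The sketch below assumes a tabular Q-learning base learner; the class and method names (`VetoModule`, `SafeMSAVAgent`, `record_failure`) are illustrative only and do not reproduce the paper's implementation.

```python
import random
from collections import defaultdict


class VetoModule:
    """Learns to veto state-action pairs observed to cause one kind of UTS.
    (Hypothetical module; the paper's internals may differ.)"""

    def __init__(self):
        self.vetoed = set()  # (state, action) pairs that led to this UTS

    def record_failure(self, state, action):
        self.vetoed.add((state, action))

    def allows(self, state, action):
        return (state, action) not in self.vetoed


class SafeMSAVAgent:
    """Tabular Q-learning wrapped with independent concurrent veto modules."""

    def __init__(self, actions, modules, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)
        self.actions = list(actions)
        self.modules = list(modules)  # one VetoModule per kind of UTS
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def safe_actions(self, state):
        # Keep only actions no module has vetoed for this state.
        safe = [a for a in self.actions
                if all(m.allows(state, a) for m in self.modules)]
        return safe or self.actions  # fall back if every action is vetoed

    def act(self, state):
        candidates = self.safe_actions(state)
        if random.random() < self.epsilon:
            return random.choice(candidates)
        return max(candidates, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, uts_kind=None):
        if uts_kind is not None:
            # Episode ended in an UTS: no future value, and the matching
            # module learns to veto this transition in later episodes.
            self.modules[uts_kind].record_failure(state, action)
            target = reward
        else:
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

Keeping each veto module independent mirrors the modularity claim in the abstract: a module only needs to recognize transitions leading to its own kind of UTS, so modules can be trained and reused separately from the task-level value function.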