Dynamic Programming Principles for Mean-Field Controls with Learning.

2019 
Dynamic programming principle (DPP), or the time consistency property, is fundamental for Markov decision problems (MDPs), for reinforcement learning (RL), and more recently for mean-field controls (MFCs). However, in the learning framework of MFCs, the DPP has not been rigorously established, despite its potential for algorithm design. In this paper, we first present a simple example in which the DPP fails with a mis-specified Q function, and then propose the correct Q function for MFCs with learning. This Q function differs from the classical one: it integrates the Q function for single-agent RL over the state-action distribution, and is hence called the IQ function. In other words, MFCs with learning can be viewed as lifting classical RL by replacing the state-action space with its probability distribution space. This specification of the IQ function enables us to establish precisely the DPP in the learning framework of MFCs. Finally, we illustrate through numerical experiments the time consistency of the IQ function.
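As a rough illustration of the lifting described in the abstract (this is a sketch, not the paper's exact formulation; the symbols Q, \nu, \mu_t, r, \gamma, and \Phi below are hypothetical notation): if Q(s,a) denotes a single-agent Q function and \nu a probability distribution on the state-action space S \times A, the integrated Q function can be written schematically as

\[ \mathrm{IQ}(\nu) := \int_{S \times A} Q(s,a)\, \nu(ds,da), \]

so the control problem is lifted from S \times A to the space of probability distributions on S \times A. Under such a lifting, a DPP of the schematic form

\[ V(\mu_t) = \sup_{\nu \,:\, \nu(\cdot \times A) = \mu_t} \Big\{ \int_{S \times A} r(s,a)\, \nu(ds,da) + \gamma\, V\big(\Phi(\nu)\big) \Big\} \]

would express time consistency at the distribution level, where \mu_t is the state distribution at time t, \Phi(\nu) the next state distribution induced by \nu, and \gamma a discount factor.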