Dynamic Programming Principles for Mean-Field Controls with Learning.
2019
The dynamic programming principle (DPP), also known as the time consistency property, is fundamental for Markov decision processes (MDPs), for reinforcement learning (RL), and more recently for mean-field controls (MFCs). However, in the learning framework of MFCs, the DPP has not been rigorously established, despite its potential for algorithm design. In this paper, we first present a simple example in which the DPP fails under a mis-specified Q function, and then propose the correct Q function for MFCs with learning. This Q function differs from the classical one: it integrates the single-agent Q function over the state-action distribution, and is hence called the IQ function. In other words, MFCs with learning can be viewed as lifting classical RL by replacing the state-action space with its probability distribution space. This specification of the IQ function enables us to establish the DPP precisely in the learning framework of MFCs. Finally, we illustrate the time consistency of the IQ function through numerical experiments.
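The central idea of the abstract, integrating a single-agent Q function over a state-action distribution, can be sketched for a finite state-action space. This is an illustrative toy only: the Q table and distribution below are made up, not taken from the paper, and the paper's IQ function is defined on the full distribution space rather than a fixed table.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 3
Q = rng.normal(size=(n_states, n_actions))   # hypothetical single-agent Q table

mu = rng.random(size=(n_states, n_actions))  # a state-action distribution mu
mu /= mu.sum()                               # normalize so the entries sum to 1

def integrated_q(Q, mu):
    """IQ(mu) = sum_{s,a} Q(s,a) * mu(s,a), i.e. E_{(s,a)~mu}[Q(s,a)]."""
    return float((Q * mu).sum())

print(integrated_q(Q, mu))
```

The point of this lifting is that the IQ function takes the distribution `mu` itself as its argument, so that dynamic programming is carried out over the space of probability distributions rather than over individual state-action pairs.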