Learning with Policy Prediction in Continuous State-Action Multi-Agent Decision Processes

2020 
With the growing attention to multi-agent reinforcement learning (MARL), efforts to provide efficient methods in this field are increasing. However, several issues make the field challenging. An agent's decision making depends on the other agents' behavior, while sharing information is not always possible. Moreover, predicting other agents' policies while they are themselves learning is difficult, and some agents in a multi-agent environment may not behave rationally. In such cases, reaching a Nash equilibrium, the target in a system with ideal behavior, is not possible, and the best an agent can do is play a best response to the other agents' policies. In addition, many real-world multi-agent problems have continuous state and action spaces, which further complicates MARL. To overcome these challenges, we propose a new multi-agent learning method based on fuzzy least-squares policy iteration. The proposed method consists of two parts: an Inner Model that approximates the other agents' policies, and a multi-agent method that learns a near-optimal policy based on those approximated policies. Both algorithms are applicable to problems with continuous state and action spaces. They can be used independently or in combination; they are designed to fit together, so the outputs of the Inner Model are directly usable as inputs to the multi-agent method. In problems where explicit communication is impossible, combining the two methods is recommended. Theoretical analysis proves the near-optimality of the policies learned by these methods. We evaluate the learning methods on problems with continuous state-action spaces: the well-known predator–prey problem and the unit commitment problem in the smart power grid. The results show acceptable performance of our methods.
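The abstract names the two components without giving implementation details. Purely as an illustrative sketch, and not the authors' algorithm, the Python below shows one way a fuzzy-basis least-squares policy iteration step could be coupled with a least-squares predictor of the other agent's action. All names (`InnerModel`, `FuzzyLSPI`, `gaussian_memberships`), the regularization constants, and the discretized candidate-action maximization are assumptions introduced here for illustration only.

```python
import numpy as np

# Illustrative sketch only; not the method described in the paper.

def gaussian_memberships(x, centers, widths):
    """Normalized Gaussian (fuzzy) membership values of input x."""
    d = np.sum(((x - centers) / widths) ** 2, axis=1)
    phi = np.exp(-0.5 * d)
    return phi / (phi.sum() + 1e-12)

class InnerModel:
    """Predicts the other agent's continuous action from the state
    by least-squares regression on fuzzy basis features."""
    def __init__(self, centers, widths, action_dim):
        self.centers, self.widths = centers, widths
        self.W = np.zeros((len(centers), action_dim))

    def fit(self, states, other_actions, reg=1e-3):
        Phi = np.array([gaussian_memberships(s, self.centers, self.widths)
                        for s in states])
        A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
        self.W = np.linalg.solve(A, Phi.T @ np.array(other_actions))

    def predict(self, state):
        return gaussian_memberships(state, self.centers, self.widths) @ self.W

class FuzzyLSPI:
    """Least-squares policy iteration with fuzzy features over the joint
    (state, own action, predicted other action) vector."""
    def __init__(self, centers, widths, gamma=0.95):
        self.centers, self.widths, self.gamma = centers, widths, gamma
        self.w = np.zeros(len(centers))

    def features(self, s, a, a_other):
        x = np.concatenate([s, np.atleast_1d(a), np.atleast_1d(a_other)])
        return gaussian_memberships(x, self.centers, self.widths)

    def greedy_action(self, s, a_other, candidates):
        # Continuous action chosen by maximizing Q over a candidate set
        # (a simplification assumed for this sketch).
        q = [self.features(s, a, a_other) @ self.w for a in candidates]
        return candidates[int(np.argmax(q))]

    def iterate(self, transitions, inner_model, candidates, reg=1e-3):
        """One LSTD-Q evaluation/improvement step on a batch of
        (s, a, a_other, r, s_next) transitions."""
        k = len(self.centers)
        A, b = reg * np.eye(k), np.zeros(k)
        for s, a, a_other, r, s_next in transitions:
            phi = self.features(s, a, a_other)
            a_other_next = inner_model.predict(s_next)       # policy prediction
            a_next = self.greedy_action(s_next, a_other_next, candidates)
            phi_next = self.features(s_next, a_next, a_other_next)
            A += np.outer(phi, phi - self.gamma * phi_next)
            b += r * phi
        self.w = np.linalg.solve(A, b)
```

In this sketch the predictor's output feeds directly into the feature vector of the policy-iteration step, mirroring the abstract's statement that the Inner Model's outputs are consistent with the inputs of the multi-agent method; the candidate-set maximization over continuous actions is a placeholder assumption, not the paper's actual action-selection scheme.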