2021/ Decentralized multi-agent reinforcement learning with networked agents: recent advances
My Takeaway: The RL structure studied in this paper is exactly what I am looking for. The paper combines purely theoretical analysis with a review of related work; both inspired me.
This article is a theoretical analysis of one kind of RL structure: decentralized agents connected over a network. The motivation is that more and more research uses this kind of structure, yet there is little theoretical work that properly models its structure and goals.
Details of this kind of structure: networked means all agents are connected by a communication network; decentralized/distributed means there is no central controller, and each agent can only communicate with its neighbors. As the paper puts it, each agent's "objective and system evolution are both affected by the joint decision made by all agents".
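In my own notation (a sketch, not copied from the paper), this setting can be written as a networked Markov game

$$
\big(\mathcal{S}, \{\mathcal{A}^i\}_{i \in \mathcal{N}}, P, \{R^i\}_{i \in \mathcal{N}}, \mathcal{G}\big),
$$

where $\mathcal{N} = \{1, \dots, N\}$ is the set of agents, the transition kernel $P(s' \mid s, a)$ depends on the joint action $a = (a^1, \dots, a^N)$, $R^i$ is agent $i$'s private reward function, and $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ is the communication graph: agent $i$ may only exchange messages with its neighbors $\{j : (i, j) \in \mathcal{E}\}$.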
The authors point out four major challenges in MARL, "especially in the associated theoretical analyses": 1. learning goals are not one-dimensional; 2. the environment faced by each agent is non-stationary; 3. the joint action space grows exponentially in the number of agents (motivating function approximation, e.g., with DNNs); 4. the decentralized information structure.
Analysis of single-agent RL:
- Dynamic programming / backward induction requires full knowledge of the model, whereas RL is devised to find the optimal policy without knowing the model.
- value-based methods estimate a value function (e.g., the Q-function) and take the greedy action with respect to the estimate, whereas policy-based methods directly search for the optimal policy over the policy space (a toy sketch of the contrast follows this list).
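A toy sketch of the contrast in Python (my own code, not from the paper): the value-based agent acts greedily with respect to a Q-table it estimates, while the policy-based agent parameterizes a stochastic policy and searches over the parameters directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Value-based: estimate a Q-table, then act greedily w.r.t. the estimate.
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Tabular Q-learning TD update.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def greedy_action(s):
    return int(np.argmax(Q[s]))

# Policy-based: parameterize a stochastic policy (softmax over logits)
# and search the parameter space directly, e.g. with policy gradients.
theta = np.zeros((n_states, n_actions))

def policy(s):
    logits = theta[s] - theta[s].max()  # stabilized softmax
    p = np.exp(logits)
    return p / p.sum()

def sampled_action(s):
    return int(rng.choice(n_actions, p=policy(s)))
```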
Analysis of multi-agent RL:
- a classical framework for MARL: Markov/stochastic games (MGs).
- since most MARL algorithms aim to converge to a Nash equilibrium, MGs have become the most standard framework.
- the model where agents are fully cooperative and share a common reward function: multi-agent MDPs (MMDPs) or Markov teams (which therefore enables the use of single-agent RL algorithms).
- agents that are not fully cooperative correspond to MGs with either zero-sum or general-sum reward functions.
- networked MMDPs are a generalization of the common-reward cooperative model above: agents are allowed to have different reward functions/preferences, which necessitates more efficient communication protocols (the shared objective is sketched right after this list).
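My reading of the networked objective (notation mine): each agent $i$ keeps a private reward $r^i_t$, but the shared goal is the network-wide average return,

$$
\max_{\theta} \; J(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} \frac{1}{N} \sum_{i=1}^{N} r^i_t\right],
$$

which no single agent can evaluate from its own rewards alone; this is exactly why communication/consensus protocols are needed.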
Review of MARL algorithms
The goals of the algorithms are categorized into learning, policy evaluation, and others.
Learning in the fully cooperative setting
- studies of decentralized/distributed algorithms with this information structure build on average consensus and distributed/consensus optimization.
- the idea of decentralized MARL over networked agents - 2009, average-consensus based, policy-based. No theoretical analysis.
- the first MARL algorithm with provable convergence guarantees - 2013, consensus + innovation, QD-learning (a sketch of the update follows this list).
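A minimal sketch of the QD-learning update as I understand it (consensus + innovation); the ring graph, step sizes, and all names are my own illustration, not the paper's code:

```python
import numpy as np

n_agents, n_states, n_actions = 4, 5, 3
gamma = 0.95

# Assumed ring communication graph: each agent talks to two neighbors.
neighbors = {i: [(i - 1) % n_agents, (i + 1) % n_agents]
             for i in range(n_agents)}

# Each agent maintains its own Q-table.
Q = np.zeros((n_agents, n_states, n_actions))

def qd_step(i, s, a, r_i, s_next, alpha, beta):
    """One QD-learning update for agent i on transition (s, a, r_i, s_next)."""
    # Innovation: agent i's local TD error, using only its private reward.
    innovation = r_i + gamma * np.max(Q[i, s_next]) - Q[i, s, a]
    # Consensus: pull agent i's estimate toward its neighbors' estimates.
    disagreement = sum(Q[i, s, a] - Q[j, s, a] for j in neighbors[i])
    Q[i, s, a] += alpha * innovation - beta * disagreement
```

If I read the summary correctly, the 2013 result shows that with suitably decaying step sizes (the consensus weight decaying more slowly than the innovation weight), all agents' Q-tables asymptotically reach consensus and converge to the optimal Q-function.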
The following parts are a bit hard for me because I haven't written much RL code, especially since I haven't read or implemented any concrete policy-optimization algorithm. Therefore my reading ends here.