TY - JOUR
T1 - Applying the policy gradient method to behavior learning in multiagent systems
T2 - The pursuit problem
AU - Ishihara, Seiji
AU - Igarashi, Harukazu
PY - 2006/9/1
Y1 - 2006/9/1
N2 - In the field of multiagent systems, some methods apply the policy gradient method to behavior learning. These methods reduce the learning problem in a multiagent system to an independent learning problem for each agent by adopting an autonomous, distributed behavior determination scheme. That is, each agent uses a parameterized probabilistic policy, and the parameters are updated along the gradient that maximizes the expected reward. In this paper, we first regard the action determination problem at each time step as a minimization problem for an objective function, and adopt as the probabilistic policy the Boltzmann distribution whose energy function is this objective function. Next, we show that the objective function can be expressed in terms of the state value, state-action rules, and a potential. Finally, in an experiment applying this method to a pursuit problem, a good policy was obtained, and the method proved flexible enough to accommodate heuristics and modifications of the behavioral constraints and objectives in the policy.
KW - Multiagent system
KW - Policy gradient method
KW - Pursuit problem
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=33747465457&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33747465457&partnerID=8YFLogxK
U2 - 10.1002/scj.20248
DO - 10.1002/scj.20248
M3 - Article
AN - SCOPUS:33747465457
SN - 0882-1666
VL - 37
SP - 101
EP - 109
JO - Systems and Computers in Japan
JF - Systems and Computers in Japan
IS - 10
ER -