TY - GEN
T1 - Policy gradient reinforcement learning with environmental dynamics and action-values in policies
AU - Ishihara, Seiji
AU - Igarashi, Harukazu
N1 - Publisher Copyright:
© 2011, Springer-Verlag Berlin Heidelberg.
PY - 2011
Y1 - 2011
N2 - The knowledge underlying an agent's policy consists of two types: the environmental dynamics that define state transitions around the agent, and the behavior knowledge needed to solve a given task. However, these two types of information, which are usually combined into state-value or action-value functions, are learned together in conventional reinforcement learning. If they were separated and learned independently, either could be reused in other tasks or environments. In our previous work, to separate the learning of each type, we presented policy-gradient learning rules for an objective function consisting of two sets of parameters that represent environmental dynamics and behavior knowledge. In that framework, state-values were used as an example of the parameters corresponding to behavior knowledge. Simulation results on a pursuit problem showed that our method properly learned hunter-agent policies and that either type of knowledge could be reused. In this paper, we adopt action-values instead of state-values as the parameters in the objective function and present the corresponding learning rules. Simulation results on the same pursuit problem as in our previous work show that these parameters and learning rules are also useful.
AB - The knowledge underlying an agent's policy consists of two types: the environmental dynamics that define state transitions around the agent, and the behavior knowledge needed to solve a given task. However, these two types of information, which are usually combined into state-value or action-value functions, are learned together in conventional reinforcement learning. If they were separated and learned independently, either could be reused in other tasks or environments. In our previous work, to separate the learning of each type, we presented policy-gradient learning rules for an objective function consisting of two sets of parameters that represent environmental dynamics and behavior knowledge. In that framework, state-values were used as an example of the parameters corresponding to behavior knowledge. Simulation results on a pursuit problem showed that our method properly learned hunter-agent policies and that either type of knowledge could be reused. In this paper, we adopt action-values instead of state-values as the parameters in the objective function and present the corresponding learning rules. Simulation results on the same pursuit problem as in our previous work show that these parameters and learning rules are also useful.
UR - http://www.scopus.com/inward/record.url?scp=80053154140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053154140&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23851-2_13
DO - 10.1007/978-3-642-23851-2_13
M3 - Conference contribution
AN - SCOPUS:80053154140
SN - 9783642238505
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 120
EP - 130
BT - Knowledge-Based and Intelligent Information and Engineering Systems - 15th International Conference, KES 2011, Proceedings
T2 - 15th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES 2011
Y2 - 12 September 2011 through 14 September 2011
ER -