TY - JOUR
T1 - Policy gradient reinforcement learning with separated knowledge
T2 - Environmental dynamics and action-values in policies
AU - Ishihara, Seiji
AU - Igarashi, Harukazu
N1 - Publisher Copyright:
© 2016 The Institute of Electrical Engineers of Japan.
PY - 2016
Y1 - 2016
N2 - The knowledge underlying an agent's policy consists of two types: environmental dynamics, which define the state transitions around the agent, and behavior knowledge, which is used to solve a given task. In conventional reinforcement learning, however, these two types of information are usually combined into state-value or action-value functions and learned together. If they were separated and learned individually, the behavior knowledge could be transferred to other environments and reused or modified there. In our previous work, we derived learning rules by the policy gradient method for an objective function containing two types of parameters, one representing the environmental dynamics and the other the behavior knowledge, so that each type is learned separately. In that framework, state values served as the reusable parameters corresponding to the behavior knowledge. This paper instead adopts action values as the parameters in the objective function of a policy and derives policy gradient learning rules for each type of separated knowledge. Simulation results on a pursuit problem show that these parameters can also be transferred and reused more effectively than unseparated knowledge.
AB - The knowledge underlying an agent's policy consists of two types: environmental dynamics, which define the state transitions around the agent, and behavior knowledge, which is used to solve a given task. In conventional reinforcement learning, however, these two types of information are usually combined into state-value or action-value functions and learned together. If they were separated and learned individually, the behavior knowledge could be transferred to other environments and reused or modified there. In our previous work, we derived learning rules by the policy gradient method for an objective function containing two types of parameters, one representing the environmental dynamics and the other the behavior knowledge, so that each type is learned separately. In that framework, state values served as the reusable parameters corresponding to the behavior knowledge. This paper instead adopts action values as the parameters in the objective function of a policy and derives policy gradient learning rules for each type of separated knowledge. Simulation results on a pursuit problem show that these parameters can also be transferred and reused more effectively than unseparated knowledge.
KW - Action-value
KW - Environmental dynamics
KW - Policy gradient method
KW - Pursuit problem
KW - Reinforcement learning
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=84960455941&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84960455941&partnerID=8YFLogxK
U2 - 10.1541/ieejeiss.136.282
DO - 10.1541/ieejeiss.136.282
M3 - Article
AN - SCOPUS:84960455941
SN - 0385-4221
VL - 136
SP - 282
EP - 289
JO - IEEJ Transactions on Electronics, Information and Systems
JF - IEEJ Transactions on Electronics, Information and Systems
IS - 3
ER -