N-step Bootstrapping For Advantage Actor-Critic
by Xiaoxiao Wen, Yijie Zhang, Zhenyu Gao, Weitao Luo
1 Introduction and Motivation
In this project, we study n-step bootstrapping in actor-critic methods [1]. More specifically, we investigate how, when n-step bootstrapping is used for advantage actor-critic (A2C), different values of $n$ affect model performance as measured by metrics such as convergence speed and stability (variance).
1.1 N-step bootstrapping
N-step bootstrapping [1], or n-step TD, is an important technique in Reinforcement Learning that performs updates based on an intermediate number of rewards. In this view, n-step bootstrapping unifies and generalizes Monte Carlo (MC) methods and Temporal Difference (TD) methods. At one extreme, when $N = 1$, it is equivalent to one-step TD; at the other extreme, when $N = \infty$, i.e., taking as many steps as possible until the end of the episode, it becomes MC. As a result, n-step bootstrapping combines the advantages of Monte Carlo and one-step TD. Compared to one-step TD, n-step bootstrapping tends to converge faster because it bootstraps with more real reward information and is freed from the "tyranny of the time step". Compared to MC, its updates do not have to wait until the end of the episode, and it is more efficient with lower variance. In general, choosing a suitable $N$ for the problem at hand often yields faster and more stable learning.
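To make the spectrum between one-step TD and MC concrete, here is a minimal sketch (not code from this project; all names are illustrative) that computes the n-step target $G_{t:t+N} = \sum_{k=0}^{N-1} \gamma^k r_{t+k} + \gamma^N V(x_{t+N})$ from a recorded episode:

```python
def n_step_target(rewards, values, t, n, gamma=0.99):
    """n-step bootstrapped target for the state visited at time t.

    rewards[k] is the reward received after the action at time k,
    values[k] is the current estimate V(x_k); the episode terminates
    after len(rewards) steps. Purely illustrative helper.
    """
    T = len(rewards)
    horizon = min(t + n, T)            # cannot look past the terminal step
    target = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                    # bootstrap only if the episode did not end first
        target += gamma ** n * values[horizon]
    return target

# n = 1 recovers the one-step TD target r_t + gamma * V(x_{t+1});
# n >= T - t recovers the Monte Carlo return (no bootstrap term).
```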
1.2 Advantage Actor Critic (A2C)
Actor-Critic algorithms are a powerful family of learning algorithms within the policy-based framework of Reinforcement Learning. They consist of an actor, the policy that makes decisions, and a critic, the value function that evaluates whether those decisions are good. With the assistance of the critic, the actor can usually achieve better performance, for example through reduced gradient variance compared with vanilla policy gradients. The GAE paper by John Schulman et al. [2] unifies the framework for advantage estimation; among the variants it covers, we picked A2C, considering the impressive performance of A3C and the fact that A2C is a simplified, synchronous version of A3C with comparable performance.
In the following sections, we first explain the method of n-step bootstrapping for A2C, which covers the 1-step and Monte Carlo cases mentioned above, and then briefly introduce the neural network architecture. Subsequently, we describe the conducted experiments and their corresponding settings, and finally we discuss the results and draw conclusions.
2 Methods
n-step Bootstrapping for A2C
n-step A2C is an online algorithm that uses roll-outs of size n + 1 of the current policy to perform a policy improvement step. In order to train the policy-head, an approximation of the policy-gradient is computed for each state of the roll-out
$\left(x_{t+i},\, a_{t+i} \sim \pi(\cdot \mid x_{t+i}; \theta_\pi),\, r_{t+i}\right)_{i=0}^{n}$, expressed as
$$\nabla_{\theta_\pi} \log\big(\pi(a_{t+i} \mid x_{t+i}; \theta_\pi)\big)\,\big[\hat{Q}_i - V(x_{t+i}; \theta_V)\big]$$
where $\hat{Q}_i$ is an estimate of the return
$$\hat{Q}_i = \sum_{j=i}^{n-1} \gamma^{j-i}\, r_{t+j} + \gamma^{n-i}\, V(x_{t+n}; \theta_V).$$
The gradients for the individual steps are then added to obtain the cumulative gradient of the roll-out as
$$\sum_{i=0}^{n} \nabla_{\theta_\pi} \log\big(\pi(a_{t+i} \mid x_{t+i}; \theta_\pi)\big)\,\big[\hat{Q}_i - V(x_{t+i}; \theta_V)\big]$$
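One practical remark (our observation, not stated explicitly above): the targets $\hat{Q}_i$ satisfy a simple backward recursion, which is how they are typically computed in an implementation,
$$\hat{Q}_n = V(x_{t+n}; \theta_V), \qquad \hat{Q}_i = r_{t+i} + \gamma\, \hat{Q}_{i+1} \quad \text{for } i = n-1, \dots, 0,$$
so a single backward pass over the roll-out yields all targets. Note also that the $i = n$ term of the sums contributes nothing, since $\hat{Q}_n - V(x_{t+n}; \theta_V) = 0$.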
A2C trains the value-head by minimising the squared error between the estimated return and the value as
$$\sum_{i=0}^{n}\big(\hat{Q}_i - V(x_{t+i}; \theta_V)\big)^2$$
Therefore, the network parameters $(\theta_\pi, \theta_V)$ are updated after each roll-out as follows:
$$\begin{array}{l}
\theta_\pi \leftarrow \theta_\pi + \alpha_\pi \sum_{i=0}^{n} \nabla_{\theta_\pi} \log\big(\pi(a_{t+i} \mid x_{t+i}; \theta_\pi)\big)\big[\hat{Q}_i - V(x_{t+i}; \theta_V)\big] \\[4pt]
\theta_V \leftarrow \theta_V - \alpha_V \sum_{i=0}^{n} \nabla_{\theta_V}\big[\hat{Q}_i - V(x_{t+i}; \theta_V)\big]^2
\end{array}$$
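To tie the formulas together, the following PyTorch-style sketch performs one such update on a roll-out. It is an illustrative sketch under our own naming (`ActorCritic`, `a2c_update`, etc.), not the exact architecture or training code used in this project, and a single optimiser with a value-loss weight stands in for the two separate step sizes $\alpha_\pi$ and $\alpha_V$.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Generic two-head network: shared body, policy head (actor), value head (critic)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # logits of pi(a | x; theta_pi)
        self.value_head = nn.Linear(hidden, 1)            # V(x; theta_V)

    def forward(self, x):
        h = self.body(x)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        return dist, self.value_head(h).squeeze(-1)

def a2c_update(model, optimizer, states, actions, rewards, next_state, done,
               gamma=0.99, value_coef=0.5):
    """One n-step A2C update from a roll-out of n transitions plus the bootstrap state.

    states:  tensor (n, obs_dim) with x_t, ..., x_{t+n-1}
    actions: tensor (n,) with a_t, ..., a_{t+n-1}
    rewards: list of n floats r_t, ..., r_{t+n-1}
    """
    dist, values = model(states)                        # pi(. | x_{t+i}) and V(x_{t+i})
    log_probs = dist.log_prob(actions)

    with torch.no_grad():                               # bootstrap value V(x_{t+n}; theta_V)
        bootstrap = 0.0 if done else model(next_state.unsqueeze(0))[1].item()

    # Q_hat_i via the backward recursion Q_hat_i = r_{t+i} + gamma * Q_hat_{i+1}
    returns, q_hat = [], bootstrap
    for r in reversed(rewards):
        q_hat = r + gamma * q_hat
        returns.append(q_hat)
    returns = torch.tensor(list(reversed(returns)), dtype=values.dtype)

    advantages = returns - values.detach()              # Q_hat_i - V(x_{t+i}), no gradient into the critic
    policy_loss = -(log_probs * advantages).sum()       # minimising this ascends the policy objective
    value_loss = ((returns - values) ** 2).sum()        # squared error of the value head

    # One optimiser with a value-loss weight plays the role of the separate
    # step sizes alpha_pi and alpha_V in the update equations above.
    optimizer.zero_grad()
    (policy_loss + value_coef * value_loss).backward()
    optimizer.step()
```

A usage sketch would be `model = ActorCritic(obs_dim, n_actions)`, `optimizer = torch.optim.Adam(model.parameters(), lr=7e-4)` (learning rate chosen only for illustration), with `a2c_update` called after every $n$ environment steps of the current policy. The $i = n$ term is dropped here because its advantage is zero, so the roll-out is stored as $n$ transitions plus the bootstrap state $x_{t+n}$.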