[Chapter 6] Reinforcement Learning Policy Search
Posted 超级超级小天才
In the previous sections, we tried to learn the utility function or, more commonly, the action-value function (Q-function), and then greedily selected the action with the highest Q-value:
$$\pi(s) = \arg\max_a Q(s,a)$$
This means that once the Q-function has been learned well, we obtain an optimal policy. All of the methods so far have therefore learned the Q-function, directly or indirectly. Policy search, by contrast, updates the policy function directly.
Policy Search
Using function approximation, we can write the policy function as:
$$\pi(s) = \arg\max_a \hat{Q}(s,a)$$
As a mapping from states to actions, the policy is itself a function with parameters $\theta$ to learn. Policy search adjusts $\theta$ to improve the policy directly, without approximating the Q-values or utilities.
However, the formula above raises two problems that must be solved first:
- The argmax operation is not differentiable, which makes gradient-based search difficult.
- In environments with discrete actions, the output of the policy function is discrete, so it does not vary smoothly with the parameters.
One method handles both problems: treat action selection as a classification problem. When the agent selects an action, it picks the one with the highest Q-value for the current state; in classification, a model predicts a probability for each class and outputs the class with the highest probability. The two are essentially the same task. Classification models typically use the softmax function, and we can use it here as well:
$$\pi_\theta(s,a) = \frac{e^{\hat{Q}_\theta(s,a)}}{\sum_{a'} e^{\hat{Q}_\theta(s,a')}}$$
Given a state $s$, the model assigns a probability to each action, and the action with the highest estimated Q-value receives the highest probability, just as a classifier favors the most likely class.
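To make the softmax policy concrete, here is a minimal sketch in plain Python. The function names (`softmax_policy`, `sample_action`) and the Q-value list are hypothetical, chosen only for illustration; in practice the Q-values would come from the approximator $\hat{Q}_\theta$.

```python
import math
import random

def softmax_policy(q_values):
    """Turn a list of estimated Q-values into action probabilities."""
    # Subtracting the maximum before exponentiating avoids overflow
    # and does not change the resulting probabilities.
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Sample an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

q_hat = [1.0, 2.0, 0.5]          # hypothetical Q-hat values for three actions
probs = softmax_policy(q_hat)    # differentiable in the Q-values, unlike argmax
action = sample_action(probs)
```

Unlike argmax, the probabilities change smoothly when the Q-values change, which is what makes gradient-based search possible.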
Using a gradient-based update, we obtain the following parameter update rule, where $\alpha$ is the learning rate and $G_j$ is the observed return:
$$\theta_{i+1} = \theta_i + \alpha G_j \frac{\nabla_\theta \pi_\theta(s,a_i)}{\pi_\theta(s,a_i)}$$
Because $\nabla_\theta \ln \pi_\theta(s,a) = \nabla_\theta \pi_\theta(s,a) / \pi_\theta(s,a)$, the same update can be written equivalently in logarithmic form:
$$\theta_{i+1} = \theta_i + \alpha G_j \nabla_\theta \ln \pi_\theta(s,a_i)$$
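The logarithmic form of the update can be sketched for one simple case. The code below assumes a linear approximator with one weight vector per action, $\hat{Q}_\theta(s,a) = \theta_a \cdot \phi(s)$; that parameterization, the feature vector, and the function names are assumptions for illustration, not something the text prescribes. For a softmax over linear scores, the gradient of $\ln \pi_\theta(s,a)$ with respect to $\theta_b$ is $(\mathbb{1}[a=b] - \pi_\theta(s,b))\,\phi(s)$.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def action_probs(theta, features):
    # Assumption: Q-hat(s, a) = dot(theta[a], features), one weight row per action.
    scores = [sum(w * f for w, f in zip(row, features)) for row in theta]
    return softmax(scores)

def policy_gradient_step(theta, features, action, ret, alpha):
    """Return theta + alpha * G * grad_theta ln pi_theta(s, a)."""
    probs = action_probs(theta, features)
    new_theta = []
    for b, row in enumerate(theta):
        # d ln pi(s, a) / d theta[b] = (1[a == b] - pi(s, b)) * features
        coeff = (1.0 if b == action else 0.0) - probs[b]
        new_theta.append([w + alpha * ret * coeff * f
                          for w, f in zip(row, features)])
    return new_theta

theta = [[0.0, 0.0], [0.0, 0.0]]   # two actions, two features
features = [1.0, 0.5]              # hypothetical feature vector for state s
before = action_probs(theta, features)[0]
theta = policy_gradient_step(theta, features, action=0, ret=2.0, alpha=0.1)
after = action_probs(theta, features)[0]
```

With a positive return, the probability of the chosen action rises (`after > before`); a negative return would lower it.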
Variance Reduction using a Baseline
Another technique is to use a baseline to reduce the variance of the gradient estimate. The return $G_j$ in the update above is a sample estimate of $Q_{\pi_\theta}(s,a)$; the idea is to replace $Q_{\pi_\theta}(s,a)$ with $Q_{\pi_\theta}(s,a) - B(s)$. A natural choice for the baseline is $V_{\pi_\theta}(s)$, which gives the advantage function:
$$A_{\pi_\theta}(s,a) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s)$$
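As a small numeric illustration of the advantage function, the sketch below uses the identity $V_{\pi_\theta}(s) = \sum_a \pi_\theta(s,a)\,Q_{\pi_\theta}(s,a)$ to compute the baseline. The probabilities and Q-values are made-up numbers, and the function names are hypothetical.

```python
def state_value(probs, q_values):
    """V(s) as the policy-weighted average of Q(s, a)."""
    return sum(p * q for p, q in zip(probs, q_values))

def advantages(probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(probs, q_values)
    return [q - v for q in q_values]

probs = [0.2, 0.5, 0.3]          # hypothetical pi_theta(s, a)
q_values = [10.0, 12.0, 9.0]     # hypothetical Q(s, a)
adv = advantages(probs, q_values)

# The advantages average to zero under the policy, so the weights in the
# update are centered around zero instead of all being large and positive.
expected_adv = sum(p * a for p, a in zip(probs, adv))
```

Here all three Q-values are large and positive, but only the second action has a positive advantage, so only its probability would be pushed up.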
Actor Critic
The actor-critic algorithm combines Q-function-based learning with policy search. It has two outputs: the actor learns a policy that selects actions, while the critic learns a value function or Q-function that is used only for evaluation. Policy evaluation and policy improvement are thus split into two parts that are executed alternately.
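One common way to realize this alternation is a one-step, tabular actor-critic update in which the critic's TD error stands in for the advantage. This is only one possible variant, assumed here for illustration; the table layout, the step sizes `alpha` and `beta`, the discount `gamma`, and the function name are all hypothetical choices.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(prefs, values, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=0.9):
    """Apply one critic update and one actor update in place."""
    # Critic: evaluate the transition with a one-step TD error.
    target = r if done else r + gamma * values[s_next]
    td_error = target - values[s]
    values[s] += beta * td_error

    # Actor: move the softmax preferences along grad ln pi(s, a),
    # weighted by the critic's TD error.
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += alpha * td_error * grad_log
    return td_error

prefs = [[0.0, 0.0], [0.0, 0.0]]   # actor: action preferences per state
values = [0.0, 0.0]                # critic: state-value estimates
delta = actor_critic_step(prefs, values, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this step the critic's estimate for state 0 has moved toward the observed reward, and the actor prefers action 1 in state 0 slightly more than before.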
In deep reinforcement learning (DRL), to save memory and training time, the two parts usually share the bottom layers used for feature extraction, and the network splits into separate branches at a higher layer.
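The shared-layer layout can be sketched as a tiny forward pass in plain Python. The class name, layer sizes, tanh activation, and random initialization below are all assumptions made for illustration; a real DRL model would use a deep learning library and train both heads.

```python
import math
import random

def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

class ActorCritic:
    """One shared hidden layer feeding an actor head and a critic head."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_shared, self.b_shared = init(n_hidden, n_inputs), [0.0] * n_hidden
        self.w_actor, self.b_actor = init(n_actions, n_hidden), [0.0] * n_actions
        self.w_critic, self.b_critic = init(1, n_hidden), [0.0]

    def forward(self, state):
        # Shared feature extraction, computed once for both heads.
        hidden = [math.tanh(h)
                  for h in linear(self.w_shared, self.b_shared, state)]
        # Actor head: action probabilities.
        action_probs = softmax(linear(self.w_actor, self.b_actor, hidden))
        # Critic head: a single scalar value estimate.
        value = linear(self.w_critic, self.b_critic, hidden)[0]
        return action_probs, value

model = ActorCritic(n_inputs=4, n_hidden=8, n_actions=2)
probs, value = model.forward([0.1, -0.2, 0.0, 0.3])
```

Because `hidden` is computed once and reused, the feature-extraction parameters are stored and evaluated only once for both outputs.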