强化学习理论入门(Trust Region Policy Optimization介绍)

Posted Jie Qiao

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了强化学习理论入门(Trust Region Policy Optimization介绍)相关的知识,希望对你有一定的参考价值。

介绍

本文主要介绍Trust Region Policy Optimization这篇文章,这篇文章主要回答了如下2个问题:

  1. 两个不同策略的value function,他们的差异是多少?
  2. 有什么办法可以保证,一个策略相比于另外一个策略一定能够提升呢?

针对这两个问题,我们先定义一些基本的概念,

基本定义

下图是一个较为一般的强化学习MDP框架下的概率图模型

注意,这个图并不一定通用,特别是reward(比如 s t + 1 s_t+1 st+1可以不指向 r t + 1 r_t+1 rt+1),可能是需要考虑具体场景,但这是一个较为一般化结构。基于此,我们有:

马尔科夫性质

P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ s t , a t , r t , … , r 1 , s 0 , a 0 ) = P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ s t , a t ) P\\left( S_t+1 =s^\\prime ,R_t+1 =r^\\prime \\mid s_t ,a_t ,r_t ,\\dotsc ,r_1 ,s_0 ,a_0\\right) =P\\left( S_t+1 =s^\\prime ,R_t+1 =r^\\prime \\mid s_t ,a_t\\right) P(St+1=s,Rt+1=rst,at,rt,,r1,s0,a0)=P(St+1=s,Rt+1=rst,at)

状态转移矩阵:

P s s ′ a = ∑ r ′ P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ S t = s , A t = a ) \\mathcalP_ss'^a =\\sum _r' P\\left( S_t+1 =s^\\prime ,R_t+1 =r^\\prime \\mid S_t =s,A_t =a\\right) Pssa=rP(St+1=s,Rt+1=rSt=s,At=a)

下一时刻的reward:

R s s ′ a = E [ R t + 1 ∣ S t = s , A t = a , S t + 1 = s ′ ] = 1 P s s ′ a ∑ r ′ r ′ P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ S t = s , A t = a ) \\mathcalR_ss'^a =E[ R_t+1 \\mid S_t =s,A_t =a,S_t+1 =s'] =\\frac1\\mathcalP_ss'^a\\sum _r' r'P\\left( S_t+1 =s^\\prime ,R_t+1 =r^\\prime \\mid S_t =s,A_t =a\\right) Rssa=E[Rt+1St=s,At=a,St+1=s]=Pssa1rrP(St+1=s,Rt+1=rSt=s,At=a)

轨迹序列联合概率:

p ( τ ∣ π ) = p ( s 0 ) ∏ t = 0 T − 1 p ( s t + 1 , r t + 1 ∣ s t , a t )   π ( a t ∣ s t ) p(\\tau \\mid \\pi )=p( s_0)\\prod _t=0^T-1 p( s_t+1 ,r_t+1 \\mid s_t ,a_t) \\ \\pi ( a_t \\mid s_t) p(τπ)=p(s0)t=0T1p(st+1,rt+1st,at) π(atst)

State Value Function:

V π ( s ) =   E τ ∼ π [ ∑ l = 0 ∞ γ l r t + l + 1 ∣ S t = s ] = ∑ a t , r t + 1 , s t + 1 . . . . p ( s t ) ∏ t ∞ p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) ∑ l = 0 ∞ γ l r t + l + 1 = ∑ a t , r t + 1 , s t + 1

以上是关于强化学习理论入门(Trust Region Policy Optimization介绍)的主要内容,如果未能解决你的问题,请参考以下文章

图灵奖得主John Hopcroft推荐的这本强化学习入门书

[TRPO] Trust Region Policy Optimization

[TRPO] Trust Region Policy Optimization

[TRPO] Trust Region Policy Optimization

[TRPO] Trust Region Policy Optimization

强化学习教程 - 先了解几个公式