Soft Actor-Critic Paper Notes
Posted by 白水baishui
1. Maximum Entropy Reinforcement Learning Model
The optimization objective of maximum entropy reinforcement learning is (Equation (1) in the paper):

$$J(\pi)=Q(s_t,a_t)=\sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\left[r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))\right]$$

where $(s_t,a_t)\sim \rho_\pi$ means that $s_t$ and $a_t$ follow the distribution $\rho_\pi$; $H(\pi(\cdot|s_t))$ is the entropy term, which strengthens exploration; and $\alpha$ is the weight of the entropy term, controlling the stochasticity of the optimal policy.
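To make the objective concrete, here is a minimal NumPy sketch of a Monte-Carlo estimate of $J(\pi)$ for a discrete-action policy; the names `rewards`, `action_probs`, and the value of `alpha` are illustrative choices, not taken from the paper.

```python
import numpy as np

def max_entropy_objective(rewards, action_probs, alpha=0.2):
    """Monte-Carlo estimate of J(pi) = sum_t E[ r(s_t,a_t) + alpha * H(pi(.|s_t)) ]
    along a single sampled trajectory.

    rewards:      shape (T,)      -- r(s_t, a_t) at each step
    action_probs: shape (T, |A|)  -- pi(.|s_t) at each visited state
    alpha:        entropy weight, controls the stochasticity of the policy
    """
    # Shannon entropy H(pi(.|s_t)) = -sum_a pi(a|s_t) * log pi(a|s_t)
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=1)
    return float(np.sum(rewards + alpha * entropy))

# Example: a 3-step trajectory with 2 actions
rewards = np.array([1.0, 0.5, 2.0])
action_probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.7, 0.3]])
print(max_entropy_objective(rewards, action_probs))  # reward-plus-entropy return
```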
From the standard reinforcement learning optimization objective:

$$\begin{aligned} Q(s_t,a_t) &= r(s_t,a_t)+\gamma\sum_{t=0}^{T}\mathbb{E}_{s_{t+1}\sim p}[V(s_{t+1})]\\ &=r(s_t,a_t)+\gamma\sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[Q(s_{t+1},a_{t+1})] \end{aligned}$$
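The two lines above are the same tabular backup written in two equivalent ways; the sketch below shows this under the assumption of a finite state/action space (the array names `p_next`, `pi_next`, `Q_next` are illustrative):

```python
import numpy as np

def standard_backup_with_V(r, gamma, p_next, V):
    """Q(s_t,a_t) = r(s_t,a_t) + gamma * E_{s_{t+1} ~ p}[V(s_{t+1})]."""
    return r + gamma * np.dot(p_next, V)          # p_next: p(s_{t+1}|s_t,a_t), shape (S,)

def standard_backup_with_Q(r, gamma, p_next, pi_next, Q_next):
    """Same backup with V(s_{t+1}) = E_{a_{t+1} ~ pi}[Q(s_{t+1},a_{t+1})] substituted in."""
    V = np.sum(pi_next * Q_next, axis=1)          # pi_next, Q_next: shape (S, |A|)
    return r + gamma * np.dot(p_next, V)
```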
we can see that Equation (1) in the paper is written in shorthand; fully expanded it should read:

$$\begin{aligned} Q(s_t,a_t) &= \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))] \\ &= \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[r(s_t,a_t)]+\alpha \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[H(\pi(\cdot|s_{t+1}))] \\ &= r(s_t,a_t) + \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[Q(s_{t+1},a_{t+1})]+\alpha \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[H(\pi(\cdot|s_{t+1}))] \\ &= r(s_t,a_t) + \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[Q(s_{t+1},a_{t+1}) + \alpha H(\pi(\cdot|s_{t+1}))] \qquad \text{(Derivation 1)} \end{aligned}$$
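In practice the entropy term in Derivation 1 is estimated with $H(\pi(\cdot|s_{t+1})) = -\mathbb{E}_{a_{t+1}\sim\pi}[\log\pi(a_{t+1}|s_{t+1})]$, which yields the sample-based critic target used in common SAC implementations. The sketch below assumes such an implementation (with an explicit discount `gamma` and terminal mask `done`); it is not the paper's exact pseudocode.

```python
def soft_q_target(reward, done, gamma, alpha, next_q, next_log_prob):
    """Sample-based form of Derivation 1:

        y = r(s_t,a_t) + gamma * (1 - done) * ( Q(s_{t+1},a_{t+1}) - alpha * log pi(a_{t+1}|s_{t+1}) )

    where -log pi(a_{t+1}|s_{t+1}) is a single-sample estimate of H(pi(.|s_{t+1})).
    Works with plain floats or NumPy/PyTorch arrays of matching shape.
    """
    return reward + gamma * (1.0 - done) * (next_q - alpha * next_log_prob)

# Example with scalar inputs (illustrative numbers only)
print(soft_q_target(reward=1.0, done=0.0, gamma=0.99, alpha=0.2,
                    next_q=5.0, next_log_prob=-1.2))
```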