Soft Actor-Critic Paper Explained
Posted by 白水baishui
This post's interpretation of the SAC paper contains a number of errors; please see my other post instead, https://baishui.blog.csdn.net/article/details/121538413 , which reflects my latest thinking on and reading of the SAC paper.
1. The Maximum Entropy Reinforcement Learning Model
The optimization objective of maximum entropy reinforcement learning is (Eq. (1) in the paper):

$$J(\pi)=Q(s_t,a_t)=\sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))\big]$$

where $(s_t,a_t)\sim \rho_\pi$ means that $s_t$ and $a_t$ follow the distribution $\rho_\pi$; $H(\pi(\cdot|s_t))$ is the entropy term, which strengthens exploration; and $\alpha$ is the weight of the entropy term, which controls the stochasticity of the optimal policy.
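To make this objective concrete, here is a minimal sketch (my own illustration, not code from the paper) that estimates $J(\pi)$ by Monte Carlo for a toy discrete policy. The `env` interface and the tabular policy `pi` are assumptions made for the example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p + 1e-12))

def estimate_objective(env, pi, alpha=0.2, horizon=100, episodes=32):
    """Monte Carlo estimate of sum_t [ r(s_t,a_t) + alpha * H(pi(.|s_t)) ].

    `env` is a hypothetical environment with reset() -> s and
    step(a) -> (s_next, r, done); `pi[s]` gives action probabilities.
    """
    total = 0.0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(horizon):
            probs = pi[s]
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            total += r + alpha * entropy(probs)  # reward plus entropy bonus
            if done:
                break
            s = s_next
    return total / episodes
```

Setting `alpha=0.0` recovers a plain return estimate, which is the easiest way to see what the entropy bonus adds.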
Starting from the standard reinforcement-learning objective:

$$\begin{aligned} Q(s_t,a_t) &= r(s_t,a_t)+\gamma\sum_{t=0}^{T}\mathbb{E}_{s_t\sim p}\big[V(s_{t+1})\big] \\ &= r(s_t,a_t)+\gamma\sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[Q(s_{t+1},a_{t+1})\big] \end{aligned}$$
we can see that Eq. (1) of the paper above is a shorthand; fully expanded it should read:

$$\begin{aligned} Q(s_t,a_t) &= \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))\big] \\ &= \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[r(s_t,a_t)\big]+\alpha \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[H(\pi(\cdot|s_{t+1}))\big] \\ &= r(s_t,a_t) + \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[Q(s_{t+1},a_{t+1})\big]+\alpha \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[H(\pi(\cdot|s_{t+1}))\big] \\ &= r(s_t,a_t) + \sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[Q(s_{t+1},a_{t+1}) + \alpha H(\pi(\cdot|s_{t+1}))\big] \qquad \text{(Derivation 1)} \end{aligned}$$
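Read as a one-step backup, the last line of Derivation 1 has the form $r + \mathbb{E}[\,Q(s_{t+1},a_{t+1}) + \alpha H(\pi(\cdot|s_{t+1}))\,]$. Below is a small sketch of that target for a discrete action space; it is my own illustration (not the paper's code or any SAC implementation), and it reintroduces the discount factor $\gamma$ from the standard objective above.

```python
import numpy as np

def soft_q_target(r, q_next, pi_next, a_next, alpha=0.2, gamma=0.99, done=False):
    """One-step entropy-regularized Bellman target (illustrative sketch).

    r        : reward r(s_t, a_t)
    q_next   : array of Q(s_{t+1}, .) values, one per action
    pi_next  : action probabilities pi(.|s_{t+1})
    a_next   : sampled next action a_{t+1} ~ pi(.|s_{t+1})
    """
    pi_next = np.asarray(pi_next, dtype=np.float64)
    h_next = -np.sum(pi_next * np.log(pi_next + 1e-12))  # H(pi(.|s_{t+1}))
    if done:
        return r
    return r + gamma * (q_next[a_next] + alpha * h_next)

# e.g. soft_q_target(r=1.0, q_next=np.array([0.5, 0.2]), pi_next=[0.7, 0.3], a_next=0)
```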