[Chapter 1] Markov Decision Process and Value Function
Posted by 超级超级小天才
Markov Decision Process
One of the most important problems in decision making is making sequential decisions, on which the agent's utility depends. At each time step, the agent selects an action to interact with the environment, causing it to transition to a new state; at the same time, the environment returns a reward to the agent. This process can be illustrated by the following figure.
We can take the following simple game as an example, in which the agent starts at the lower-left corner and can move to the other blocks.
In this game, suppose that we can observe the location of the agent, which can be represented by a coordinate. We call the location of the agent the state, and the set of all possible states the state space. In this case, the agent always knows where it is, so we call this a fully observable environment. At each time step, the agent can choose to move in one of four directions: [UP, DOWN, LEFT, RIGHT]. These are the actions, and their set is called the action space, which represents all the possible actions the agent can perform. In this example, each state has a corresponding reward, so the reward is a function of the state, $R(s)$, or sometimes more generally $R(s, a, s')$.

As the agent performs actions to interact with the environment, the state transitions from one to another. As you can see here, the probability that the agent moves in the intended direction is 0.8, and it slips to the left or to the right of that direction with probability 0.1 each. The outcomes of actions are stochastic, so we define $P(s' \mid s, a)$ as the probability of reaching state $s'$ if action $a$ is taken in state $s$; this is called the transition function. If for every state $s$ and action $a$ there is a single successor state $s'$ with $P(s' \mid s, a) = 1$, we call the environment deterministic.
Given this example, we can give the definition of a Markov Decision Process (MDP): a sequential decision problem model for a fully observable, stochastic environment with a Markovian transition model and additive rewards.
An MDP model can be defined by a tuple $(S, A, T, R)$, as sketched in the code example after this list, where:
- $S$: the state space
- $A$: the action space
- $T$: the transition model (transition function), defined by $P(s' \mid s, a)$
- $R$: the reward function, defined by $R(s, a, s')$
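To make the definition concrete, here is a minimal Python sketch of the tuple $(S, A, T, R)$ for the grid-world example, assuming the 0.8/0.1/0.1 movement probabilities described above. The grid size, the reward value, and names such as `GridWorld`-style helpers are placeholders introduced for illustration, not taken from the post.

```python
import random

# A minimal sketch of an MDP (S, A, T, R) for the grid-world example.
# Grid size and reward values are placeholders.

ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]            # action space A
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# For each intended action, the two directions it can slip into with prob. 0.1 each.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

WIDTH, HEIGHT = 4, 3                                  # placeholder grid size
STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]  # state space S

def R(s):
    """Reward function R(s); a small constant cost as a placeholder."""
    return -0.04

def move(s, a):
    """Deterministic effect of action a in state s (stay put if blocked by the border)."""
    dx, dy = MOVES[a]
    nx, ny = s[0] + dx, s[1] + dy
    return (nx, ny) if (nx, ny) in STATES else s

def T(s, a):
    """Transition model P(s' | s, a), returned as (probability, next state) pairs."""
    outcomes = [(0.8, move(s, a))]
    for slip in SLIPS[a]:
        outcomes.append((0.1, move(s, slip)))
    return outcomes

def sample_next_state(s, a):
    """Sample s' from P(. | s, a)."""
    probs, next_states = zip(*T(s, a))
    return random.choices(next_states, weights=probs, k=1)[0]
```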
Policy and Value
How does an agent make a sequence of decisions? This can be intuitive: we use a policy function to represent the agent's decision in each possible state. That is, a policy is a function from states to actions, written $a = \pi(s)$: for every state $s$, it outputs an appropriate action $a$.
Then you may ask: how do we evaluate a policy? Since the key of our problem is to make a sequence of actions, the quality of a policy should be measured by its expected return: the expected utility of the state sequences generated by the policy. In other words, the value of a policy $\pi$ at a state $s$, written $U^\pi(s)$ (the notation $V^\pi(s)$ is also common), is the expected cumulative utility of executing $\pi$ starting from $s$.
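As a rough illustration, continuing the hypothetical grid-world sketch above, a policy can be stored simply as a mapping from states to actions; the "always move RIGHT" rule here is only a placeholder.

```python
# A policy pi: S -> A, stored as a dictionary over the (hypothetical) grid states.
policy = {s: "RIGHT" for s in STATES}   # placeholder decision rule

def pi(s):
    """a = pi(s): the action the agent takes in state s."""
    return policy[s]
```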
Infinite Horizon and Discounting
There are mainly two kinds of problems with respect to the horizon.
One is the finite horizon problem, in which there is a fixed number of time steps $N$, and after $N$ steps nothing matters anymore. In this case, the value/return is usually the sum of the rewards over the sequence:
$$U_h([s_0, s_1, \dots, s_N]) = R(s_0) + R(s_1) + \dots + R(s_N)$$
The finite horizon problem is usually more difficult, because the optimal action at a given state can change depending on how many steps remain to be executed, so the optimal policy is nonstationary.
The infinite horizon problem, with no fixed limit on the number of time steps, is more common, and we will focus on this kind of problem. For an infinite horizon problem, the optimal action depends only on the current state, so the optimal policy is stationary. However, the utilities can be difficult to compute, since they are an infinite accumulation. To solve this problem, we introduce the discounted reward:
$$U_h([s_0, s_1, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots = \sum_{t=0}^{\infty} \gamma^t R(s_t)$$
where $\gamma$ is a discount factor between 0 and 1. The idea is that states closer to the current one matter more, so they receive a higher weight in the accumulation. Moreover, with discounted rewards, $\gamma < 1$, and rewards bounded by $\pm R_{max}$, the utility is always finite:
$$U_h([s_0, s_1, \dots]) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \leq \sum_{t=0}^{\infty} \gamma^t R_{max} = \frac{R_{max}}{1 - \gamma}$$
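A small sketch of computing the discounted return of an observed reward sequence and comparing it with the $R_{max}/(1-\gamma)$ bound; the reward values here are placeholders.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over an observed reward sequence (t starting at 0)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.5, 1.0, 0.0, 1.0]        # placeholder reward sequence
R_max, gamma = 1.0, 0.9
print(discounted_return(rewards, gamma))    # always <= R_max / (1 - gamma)
print(R_max / (1 - gamma))                  # the geometric-series bound: 10.0
```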
Suppose that the sequence of states $[s_0, s_1, \dots]$ is determined by a policy $\pi$ and the current state $s$. Then the expected utility of executing $\pi$ starting from $s$ is:
$$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$$
where $s_0 = s$ and the expectation is taken with respect to the distribution of state sequences determined by the policy $\pi$. Obviously, the optimal policy is the policy with the maximum expected utility:
$$\pi^*_s = \arg\max_{\pi} U^\pi(s)$$
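Putting the pieces together, $U^\pi(s)$ can be approximated by averaging the discounted return over many trajectories sampled under $\pi$. This Monte Carlo sketch reuses the hypothetical `R`, `sample_next_state`, and `pi` defined above and truncates the infinite horizon at a fixed depth, which is an approximation.

```python
def estimate_value(s, pi, gamma=0.9, n_rollouts=1000, horizon=100):
    """Monte Carlo estimate of U^pi(s): average discounted return of sampled trajectories."""
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):                  # truncate the infinite horizon
            ret += discount * R(state)            # accumulate gamma^t * R(s_t)
            discount *= gamma
            state = sample_next_state(state, pi(state))
        total += ret
    return total / n_rollouts

print(estimate_value((0, 0), pi))  # estimated value of the start state under pi
```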