[Chapter 2] Value Iteration and Policy Iteration

Posted 超级超级小天才

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了[Chapter 2] Value Iteration and Policy Iteration相关的知识,希望对你有一定的参考价值。

We now know the most important thing for computing an optimal policy is to compute the value function. But how? (The following contents are all based on infinite horizon problems.) The solution to this problem can be roughly divided into two categories: Value Iteration and Policy Iteration.

Value Iteration

Based on the formula of expected utility with discounted factor:

U π ( s ) = E [ ∑ t = 1 ∞ γ t R ( s t ) ] U^\\pi(s)=E[\\sum_t=1^\\infty\\gamma^t R(s_t)] Uπ(s)=E[t=1γtR(st)]

and the transition function P ( s ′ ∣ s , a ) P(s^′|s,a) P(ss,a) defined in MDP model, there is an equation for the value function to intutively statisfy:

V ( s ) = R ( s ) + γ m a x a ∈ A ( s ) ⁡ ∑ s ′ P ( s ′ │ s , a ) V ( s ′ ) V(s)=R(s)+\\gamma max_a \\in A(s)⁡\\sum_s^′P(s^′│s,a)V(s^′) V(s)=R(s)+γmaxaA(s)sP(ss,a)V(s)

which is called Bellman equation.

Unfortunately, it’s very difficult to solve the Bellman euqation, since there is a m a x max max operator, so that’s why we need value iteration method.

Based on the Bellman equation, we can get the Bellman updata:

U t + 1 ( s ) ← R ( s ) + γ m a x a ∈ A ( s ) ⁡ ∑ s ′ P ( s ′ │ s , a ) U t ( s ′ ) U_t+1(s) \\leftarrow R(s)+\\gamma max_a \\in A(s)⁡ \\sum_s^′P(s^′│s,a)U_t(s^′) Ut+1(s)R(s)+γmaxaA(s)sP(ss,a)Ut(s)

Where t t t represents the iteration time steps.

The value iteration algorithm can be described as following:

We can initialize all utilities for all states as 0, and using the Bellman update formula to compute new utilities step by step until it converges (all values reach unique fixed points. This will save much more time than solve the Bellman equations directly.

Policy Iteration

Now, think about this, we are updating the values/utilities for each state in the first method, but in policy iteration, we initialize and update the policies. This is based on that sometimes, to find the optimal policy, we don’t really need to find the highly accurate value function, e.g. if one action is clearly better than others. So we give up computing the values using Bellman update, we initialize the policies as π 0 \\pi_0 π0 and then update them. To do this, we alternate the following two main steps:

  • Policy evaluation: given the current policy π t \\pi_t πt, calculate the next-step utilities U t + 1 U_t+1 Ut+1:

U t + 1 ( s ) ← R ( s ) + γ ∑ s ′ P ( s ′ │ s , π t ( s ) ) U t ( s ′ ) U_t+1(s) \\leftarrow R(s)+\\gamma \\sum_s^′P(s^′│s,\\pi_t(s))U_t(s^′) Ut+1(s)R(s)+γsP(ss,πt(s))Ut(s)

This update formula is similar but simpler than Bellman update formula, there is no need to compute the maximum value among all possible actions, but using the actions given by the policy at time step t t t.

  • Policy inrovement: Using the calculated one step look-ahead utilities U t + 1 ( s ) U_t+1(s) Ut+1(s) to compute new policy π t + 1 \\pi_t+1 πt+1.

To improve the policy, we need to choose another better policy to replace the current one, to do so, we need to introduce the action-value function or Q-function for policy π \\pi π:

Q π ( s , a ) = R ( s ) + ∑ s ′ P ( s ′ │ s , a ) U π ( s ′ ) Q^\\pi(s,a)=R(s)+\\sum_s^′P(s^′│s,a)U^\\pi(s^′) Qπ(s,a)=R(s)+sP(ss,a)Uπ(s)

The main different for Q-function in comparison to the value function is that the Q-function is the expected utility after determining the exact action a a a. Suppose the size of the state space and action space is ∣ S ∣ |S| S and ∣ A ∣ |A| A respectively, then for each policy π \\pi π, there will be ∣ S ∣ |S| S value functions, one for each state, and will be ∣ S ∣ × ∣ A ∣ |S| \\times |A| S×A Q-functions, one for each state-action pair.

The policy improvement theorem can be very easy then: suppose a new policy π ′ \\pi′ π that for all s ∈ S s \\in S sS:

Q π ( s , π ′ ( s ) ) ≥ U π ( s ) Q^\\pi(s,\\pi′(s)) \\geq U^\\pi(s) Qπ(s,π(s))Uπ(s)

Then for all s ∈ S s \\in S sS, there will be:

U π ′ ( s ) ≥ U π ( s ) U^\\pi′(s) \\geq U^\\pi(s) Uπ(s)Uπ(s)

In this case, we can say that π ′ \\pi′ π is better than π \\pi π, so we can improve the policy from π \\pi π to π ′ \\pi′ π.

Alternatively do the above two steps until there is no change in policy, we can get the optimal policy, this is the policy iteration algorithm, describing as following:

More Accurate Policy Iteration Methods

The policy iteration method stated above with both doing policy evaluation and policy improvement one step is called generalized policy iteration, which is a very general and common method. However, there are some other more accurate methods based on more accurate policy evaluation.

  • For each step of policy evaluation, not only update the utilities one time step, but using the following set of equations to solve the accurate utilities:

U t ( s ) = R ( s ) + γ ∑ s ′ P ( s ′ │ s , π t ( s ) ) U t ( s ′ ) U_t(s)=R(s)+\\gamma \\sum_s^′P(s^′│s,\\pi_t(s))U_t(s^′) Ut(s)=R(s)+γsP(ss,πt(s))Ut(s)

For N N N states, there are N N N equations and they can be solved in O ( N 3 ) O(N^3) O(N3) time using some basic linear algebra knowledge. This will be much more complex but much more accurate, however, it’s a bit too much for almost all problems, we ususlly don’t need the most accurate utilities.