Pytorch Note50 Introduction to Gym

Posted by 风信子的猫Redamancy





Index of all notes in this series: Pytorch Note 快乐星球

In the previous notes we briefly walked through some reinforcement learning examples, but we found that building a reinforcement learning environment ourselves is very cumbersome and eats up a lot of time. This is where we can turn to an open-source toolkit called Gym, developed by OpenAI.

The library provides all kinds of game environments, from simple grid worlds all the way up to Doom, so you can drop your own AI in to play. The name "gym" is also rather apt: picture a group of AIs working out in a gymnasium, honing their skills.

It is also very easy to use. First, install it by entering the following commands in a terminal.

# Option 1: install from the GitHub source
git clone https://github.com/openai/gym
cd gym
pip install -e .[all]

# Option 2: install the gym package directly from PyPI
pip install gym[all]
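Note that in some shells (zsh, for example) square brackets are treated as glob patterns, so the extras may need to be quoted, for instance:

pip install "gym[all]"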

We can visit the environments page on the Gym website to see the environments Gym includes, together with their descriptions.
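Before diving into a specific task, here is a minimal sketch of the standard Gym interaction loop, using the classic reset/step interface that the MountainCar script below also relies on (newer Gym releases changed this API). The environment name 'CartPole-v0' is just an arbitrary example.

import gym

env = gym.make('CartPole-v0')          # any built-in environment works here
obs = env.reset()                      # initial observation
total_reward = 0
done = False
while not done:
    action = env.action_space.sample() # a random action, purely for illustration
    obs, reward, done, info = env.step(action)
    total_reward += reward
print('episode reward:', total_reward)
env.close()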

As that page shows, Gym ships with many built-in environments. Let's use the Q-learning method covered in an earlier note to try a small example from Gym: Mountain Car.

Mountain Car

The problem is this: a car sits on a one-dimensional track between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to climb it in a single pass. The only way to succeed, therefore, is to drive back and forth to build up momentum.

In Mountain Car, what we observe is the car's state in the environment, namely its position along the track and its velocity, and the actions we can take are to push the car to the left, apply no push, or push it to the right (three discrete actions).
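You can confirm this by inspecting the environment's spaces directly; a quick sketch (the exact form of the printed bounds depends on your Gym version):

import gym

env = gym.make('MountainCar-v0')
print(env.observation_space)       # Box with 2 dimensions: [position, velocity]
print(env.observation_space.low)   # lower bounds of the two observation dimensions
print(env.observation_space.high)  # upper bounds
print(env.action_space)            # Discrete(3): push left, no push, push right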

To apply Q-learning we have to build a Q table, but the state space here is continuous and uncountable, so we need to discretize the continuous space: each of the two observation dimensions (called x and y in the code; they are in fact the car's position and velocity) is split evenly into a number of bins. For the concrete implementation, run mount-car.py and look at the results.
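For reference, the tabular update applied at every step of the script below is the standard Q-learning temporal-difference rule; it corresponds directly to the q_table[x, y, action] = ... line in the code:

$$Q(s, a) \leftarrow Q(s, a) + \eta \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$

where $s = (x, y)$ is the discretized state, $a$ the chosen action, $r$ the reward, $s'$ the next discretized state, $\eta$ the (decaying) learning rate, and $\gamma$ the discount factor.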

The code is as follows.

import numpy as np
import gym

n_states = 40  # discretize each observation dimension into 40 bins
iter_max = 10000

initial_lr = 1.0  # Learning rate
min_lr = 0.003
gamma = 1.0
t_max = 10000
eps = 0.02


def run_episode(env, policy=None, render=False):
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    for _ in range(t_max):
        if render:
            env.render()
        if policy is None:  # without a policy, sample a random action
            action = env.action_space.sample()
        else:
            a, b = obs_to_state(env, obs)
            action = policy[a][b]
        obs, reward, done, _ = env.step(action)
        total_reward += gamma ** step_idx * reward
        step_idx += 1
        if done:
            break
    return total_reward


def obs_to_state(env, obs):
    """
    Map the continuous observation to discrete state indices.
    """
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    env_dx = (env_high - env_low) / n_states
    a = int((obs[0] - env_low[0]) / env_dx[0])
    b = int((obs[1] - env_low[1]) / env_dx[1])
    return a, b


if __name__ == '__main__':
    env_name = 'MountainCar-v0'
    env = gym.make(env_name)
    env.seed(0)
    np.random.seed(0)
    print('----- using Q Learning -----')
    q_table = np.zeros((n_states, n_states, 3))
    for i in range(iter_max):
        obs = env.reset()
        total_reward = 0
        # eta: learning rate, decayed as training progresses
        eta = max(min_lr, initial_lr * (0.85 ** (i // 100)))
        for j in range(t_max):
            x, y = obs_to_state(env, obs)
            if np.random.uniform(0, 1) < eps:  # epsilon-greedy: explore with a random action with probability eps
                action = np.random.choice(env.action_space.n)
            else:
                logits = q_table[x, y, :]
                logits_exp = np.exp(logits)
                probs = logits_exp / np.sum(logits_exp)  # softmax over Q-values: probabilities for the three actions
                action = np.random.choice(env.action_space.n, p=probs)  # sample an action according to these probabilities
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            # update the Q table with the temporal-difference rule
            x_, y_ = obs_to_state(env, obs)
            q_table[x, y, action] = q_table[x, y, action] + eta * (
                    reward + gamma * np.max(q_table[x_, y_, :]) -
                    q_table[x, y, action])
            if done:
                break
        if i % 100 == 0:
            print('Iteration #%d -- Total reward = %d.' % (i + 1,
                                                           total_reward))
    solution_policy = np.argmax(q_table, axis=2)  # greedy policy: for each state, take the action with the largest Q-value
    solution_policy_scores = [
        run_episode(env, solution_policy, False) for _ in range(100)
    ]
    print("Average score of solution = ", np.mean(solution_policy_scores))
    # Animate it
    run_episode(env, solution_policy, True)
Running the script prints a training log like the following:

----- using Q Learning -----
Iteration #1 -- Total reward = -200.
Iteration #101 -- Total reward = -200.
Iteration #201 -- Total reward = -200.
Iteration #301 -- Total reward = -200.
Iteration #401 -- Total reward = -200.
Iteration #501 -- Total reward = -200.
Iteration #601 -- Total reward = -200.
Iteration #701 -- Total reward = -200.
Iteration #801 -- Total reward = -200.
Iteration #901 -- Total reward = -200.
Iteration #1001 -- Total reward = -200.
Iteration #1101 -- Total reward = -200.
Iteration #1201 -- Total reward = -200.
Iteration #1301 -- Total reward = -200.
Iteration #1401 -- Total reward = -200.
Iteration #1501 -- Total reward = -200.
Iteration #1601 -- Total reward = -200.
Iteration #1701 -- Total reward = -200.
Iteration #1801 -- Total reward = -200.
Iteration #1901 -- Total reward = -200.
Iteration #2001 -- Total reward = -200.
Iteration #2101 -- Total reward = -200.
Iteration #2201 -- Total reward = -200.
Iteration #2301 -- Total reward = -200.
Iteration #2401 -- Total reward = -200.
Iteration #2501 -- Total reward = -200.
Iteration #2601 -- Total reward = -200.
Iteration #2701 -- Total reward = -200.
Iteration #2801 -- Total reward = -200.
Iteration #2901 -- Total reward = -200.
Iteration #3001 -- Total reward = -200.
Iteration #3101 -- Total reward = -200.
Iteration #3201 -- Total reward = -200.
Iteration #3301 -- Total reward = -200.
Iteration #3401 -- Total reward = -200.
Iteration #3501 -- Total reward = -200.
Iteration #3601 -- Total reward = -200.
Iteration #3701 -- Total reward = -200.
Iteration #3801 -- Total reward = -200.
Iteration #3901 -- Total reward = -200.
Iteration #4001 -- Total reward = -200.
Iteration #4101 -- Total reward = -200.
Iteration #4201 -- Total reward = -200.
Iteration #4301 -- Total reward = -200.
Iteration #4401 -- Total reward = -200.
Iteration #4501 -- Total reward = -200.
Iteration #4601 -- Total reward = -200.
Iteration #4701 -- Total reward = -200.
Iteration #4801 -- Total reward = -200.
Iteration #4901 -- Total reward = -200.
Iteration #5001 -- Total reward = -200.
Iteration #5101 -- Total reward = -200.
Iteration #5201 -- Total reward = -200.
Iteration #5301 -- Total reward = -200.
Iteration #5401 -- Total reward = -200.
Iteration #5501 -- Total reward = -200.
Iteration #5601 -- Total reward = -200.
Iteration #5701 -- Total reward = -200.
Iteration #5801 -- Total reward = -200.
Iteration #5901 -- Total reward = -200.
Iteration #6001 -- Total reward = -200.
Iteration #6101 -- Total reward = -200.
Iteration #6201 -- Total reward = -200.
Iteration #6301 -- Total reward = -200.
Iteration #6401 -- Total reward = -200.
Iteration #6501 -- Total reward = -200.
Iteration #6601 -- Total reward = -200.
Iteration #6701 -- Total reward = -200.
Iteration #6801 -- Total reward = -200.
Iteration #6901 -- Total reward = -200.
Iteration #7001 -- Total reward = -200.
Iteration #7101 -- Total reward = -200.
Iteration #7201 -- Total reward = -200.
Iteration #7301 -- Total reward = -200.
Iteration #7401 -- Total reward = -200.
Iteration #7501 -- Total reward = -200.
Iteration #7601 -- Total reward = -200.
Iteration #7701 -- Total reward = -200.
Iteration #7801 -- Total reward = -200.
Iteration #7901 -- Total reward = -200.
Iteration #8001 -- Total reward = -198.
Iteration #8101 -- Total reward = -200.
Iteration #8201 -- Total reward = -200.
Iteration #8301 -- Total reward = -200.
Iteration #8401 -- Total reward = -200.
Iteration #8501 -- Total reward = -200.
Iteration #8601 -- Total reward = -200.
Iteration #8701 -- Total reward = -200.
Iteration #8801 -- Total reward = -200.
Iteration #8901 -- Total reward = -200.
Iteration #9001 -- Total reward = -200.
Iteration #9101 -- Total reward = -200.
Iteration #9201 -- Total reward = -200.
Iteration #9301 -- Total reward = -200.
Iteration #9401 -- Total reward = -200.
Iteration #9501 -- Total reward = -200.
Iteration #9601 -- Total reward = -200.
Iteration #9701 -- Total reward = -200.
Iteration #9801 -- Total reward = -200.
Iteration #9901 -- Total reward = -200.
Average score of solution =  -129.96

After the run finishes, you can see that the Q table converges very slowly: the reward barely changes for a long time, and it takes a great many episodes before the car can be pushed to the goal. At this point we need a more powerful weapon, the deep Q network.
