DRL Hands-on book

Posted 2021-12-07 zerotensor

Code: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On

Chapter 1 What is Reinforcement Learning

Learning - supervised, unsupervised, and reinforcement

Unlike an unsupervised learning setup, RL is not completely blind: we have a reward system.

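(1) Observations depend on the agent's own behavior. An agent that keeps acting poorly may conclude that no larger reward is possible ("life is suffering"), which could be totally wrong. In machine learning terms, this can be rephrased as having non-i.i.d. data.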

(2) The exploration/exploitation dilemma is one of the open fundamental questions in RL.

(3) The third complicating factor lies in the fact that reward can be seriously delayed from actions.

RL formalisms and relations

RL entities and their communications

Agent and Environment are the two nodes of the graph.
Actions are the edge pointing from Agent to Environment.
Rewards and Observations are the edges pointing from Environment to Agent.
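
The graph above can be sketched as a small interaction loop. This is a minimal toy illustration, not the book's code: the class names `Environment` and `Agent`, their methods, the fixed observation, and the random reward are all assumptions made for the example.

```python
import random


class Environment:
    """Toy environment (hypothetical): the episode ends after max_steps actions."""

    def __init__(self, max_steps=10):
        self.steps_left = max_steps

    def get_observation(self):
        # Edge Environment -> Agent: a dummy, constant observation.
        return [0.0, 0.0, 0.0]

    def get_actions(self):
        # Two discrete actions are available at every step.
        return [0, 1]

    def is_done(self):
        return self.steps_left == 0

    def action(self, action):
        # Edge Agent -> Environment; the return value is the reward edge back.
        if self.is_done():
            raise RuntimeError("Episode is over")
        self.steps_left -= 1
        return random.random()  # assumed reward in [0, 1)


class Agent:
    """Toy agent (hypothetical): ignores observations and acts randomly."""

    def __init__(self):
        self.total_reward = 0.0

    def step(self, env):
        obs = env.get_observation()  # unused by this random agent
        reward = env.action(random.choice(env.get_actions()))
        self.total_reward += reward


def run_episode(max_steps=10, seed=0):
    random.seed(seed)
    env = Environment(max_steps)
    agent = Agent()
    while not env.is_done():
        agent.step(env)
    return agent.total_reward


total = run_episode()
```

A real agent would use the observation to choose its action; the random choice here only keeps the loop short.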

Reward

We don't define how frequently the agent receives this reward. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.

The agent

The environment

Action

There are two types of actions: discrete and continuous.

Observations

Markov decision process

It is the theoretical foundation of RL, which makes it possible to start moving toward the methods used to solve the RL problem.

We start from the simplest case of a Markov process (also known as a Markov chain), then extend it with rewards, which turns it into a Markov reward process. Then we'll put this idea into one more envelope by adding actions, which leads us to Markov decision processes.

Markov process

You can always make your model more complex by extending your state space, which allows you to capture more dependencies in the model at the cost of a larger state space.

You can capture transition probabilities with a transition matrix, which is a square matrix of size NxN, where N is the number of states in your model.

The transition matrix can be estimated from observed episodes.
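
One simple way to do this estimation is to count observed transitions and normalize each row. The sketch below assumes episodes are given as lists of state names; the function name `estimate_transition_matrix` and the two-state weather example are made up for illustration.

```python
def estimate_transition_matrix(episodes, states):
    """Estimate an NxN transition matrix by counting observed transitions."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for episode in episodes:
        # Each consecutive pair (src, dst) is one observed transition.
        for src, dst in zip(episode, episode[1:]):
            counts[index[src]][index[dst]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for states never left in the data stay all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical observed episodes over two states.
states = ["sunny", "rainy"]
episodes = [
    ["sunny", "sunny", "rainy", "sunny"],
    ["rainy", "rainy", "sunny", "sunny"],
]
P = estimate_transition_matrix(episodes, states)
# P[i][j] is the estimated probability of moving from states[i] to states[j].
```

With more episodes, the row frequencies approach the true transition probabilities, provided every state is visited often enough.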

Markov reward process

The first step is to add reward to the Markov process model.

Representation: either a full reward matrix over transitions, or a more compact form that is applicable only if the reward value depends solely on the target state, which is not always the case.

The second step is to add a discount factor gamma (from 0 to 1).
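
To see what the discount factor does, here is a small sketch that computes the discounted return of one episode, i.e. the sum of rewards where the reward k steps in the future is weighted by gamma to the power k. The function name `discounted_return` and the reward sequences are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, 0.0)      # only the immediate reward counts
far_sighted = discounted_return(rewards, 1.0)  # plain sum of all rewards
halfway = discounted_return(rewards, 0.5)      # 1 + 0.5 + 0.25

# Once-in-a-lifetime reward: everything is zero except the last step.
delayed = discounted_return([0.0, 0.0, 1.0], 0.9)
```

With gamma close to 0 the return is dominated by immediate reward; with gamma close to 1 distant rewards count almost as much as near ones.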

Markov decision process

Add an extra "action" dimension to the transition matrix.
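
A sketch of what the extra dimension looks like in practice: transition probabilities are now indexed by (state, action, next state) rather than (state, next state). The states, actions, and probabilities below are invented for illustration, and a nested dict stands in for the three-dimensional array.

```python
import random

# transitions[state][action] -> {next_state: probability}
# All names and numbers here are hypothetical.
transitions = {
    "idle": {
        "wait": {"idle": 0.9, "busy": 0.1},
        "work": {"idle": 0.2, "busy": 0.8},
    },
    "busy": {
        "wait": {"idle": 0.5, "busy": 0.5},
        "work": {"idle": 0.1, "busy": 0.9},
    },
}


def sample_next_state(state, action, rng):
    """Sample the next state given the current state and the chosen action."""
    dist = transitions[state][action]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
nxt = sample_next_state("idle", "work", rng)
```

Fixing the action for every state selects one slice of this structure, which is again an ordinary NxN transition matrix.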

Chapter 2 OpenAI Gym

Chapter 3 Deep Learning with PyTorch

Chapter 4 The Cross-Entropy Method

Taxonomy of RL methods

Model-free or model-based
Value-based or policy-based
On-policy or off-policy

Practical cross-entropy
