[Machine Learning for Trading] {ud501} Lesson 25: 03-05 Reinforcement learning | Lesson 26: 03-06 Q-Learning

Posted by ecoflex

The RL problem

[slide]

Trading as an RL problem 

[slide]

Mapping trading to RL 

[slide]

Markov decision problems 

[slide]

Unknown transitions and rewards

[slide]

What to optimize?

[slides]

Learning Procedure

[slide]

Update Rule

[slide]

The formula for computing Q for any state-action pair <s, a>, given an experience tuple <s, a, s', r>, is (a code sketch of this update follows the list of terms below):

Q'[s, a] = (1 − α) · Q[s, a] + α · (r + γ · Q[s', argmax_a'(Q[s', a'])])

Here:

    • r = R[s, a] is the immediate reward for taking action a in state s,
    • γ ∈ [0, 1] (gamma) is the discount factor used to progressively reduce the value of future rewards,
    • s' is the resulting next state,
    • argmax_a'(Q[s', a']) is the action that maximizes the Q-value among all possible actions a' from s', and
    • α ∈ [0, 1] (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values.
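Here is a minimal sketch of this update in Python/NumPy, assuming a tabular Q indexed as Q[state, action]; the function name, table shape, and hyperparameter values are illustrative, not taken from the lecture.

```python
import numpy as np

def q_update(Q, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    """One tabular Q-Learning update for the experience tuple <s, a, s', r>."""
    # Q[s', argmax_a' Q[s', a']] is simply the maximum Q-value at the next state.
    best_future = np.max(Q[s_prime])
    # Blend the old estimate with the new one, weighted by the learning rate alpha.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_future)
    return Q

# Example: 10 discretized states, 3 actions (e.g. BUY, HOLD, SELL).
Q = np.zeros((10, 3))
Q = q_update(Q, s=4, a=0, r=1.5, s_prime=7)
```

The update modifies the table in place; returning it just makes the call convenient to chain.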

 

 

 

Two Finer Points

[slide]

The Trading Problem: Actions 

[slide]

[slide]

A reward at each step allows the learning agent to get feedback on each individual action it takes (including doing nothing).
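As a hedged illustration of such a per-step reward (not necessarily the lecture's exact definition), one common choice is the daily return scaled by the position held over that day; the prices and positions below are made up.

```python
import pandas as pd

# Hypothetical prices and positions (+1 long, 0 flat, -1 short) chosen at each step.
prices = pd.Series([100.0, 101.0, 99.5, 102.0])
position = pd.Series([1, 1, 0, -1])

daily_return = prices.pct_change().fillna(0)           # r_t = p_t / p_{t-1} - 1
# Today's reward comes from the position held since yesterday, so the agent
# gets feedback on every action, including doing nothing (reward of 0).
reward = position.shift(1).fillna(0) * daily_return
```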

 

 

 

[slide]

SMA: simple moving average. Different stocks trade at very different price levels, so raw prices are not directly comparable across stocks.

=> adjusted close / SMA is a good normalized feature.
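A short sketch of that normalization with pandas; the 20-day window is an arbitrary assumption, not a value from the lecture.

```python
import pandas as pd

def price_to_sma(adj_close: pd.Series, window: int = 20) -> pd.Series:
    """Adjusted close divided by its simple moving average."""
    sma = adj_close.rolling(window).mean()   # simple moving average over `window` days
    return adj_close / sma                   # ~1.0 means the price sits right at its SMA

# Example: a $10 stock and a $500 stock become directly comparable on this scale.
ratio = price_to_sma(pd.Series(range(100, 160), dtype=float))
```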

 

 

 

Creating the State 

[slide]

Discretizing 

[slide]

Q-Learning Recap

[slide]

 

 

 

Summary

Advantages

  • The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where all states and/or transitions are not fully defined.
  • As a result, we do not need additional data structures to store transitions T(s, a, s') or rewards R(s, a).
  • Also, the Q-value for any state-action pair takes into account future rewards. Thus, it encodes both the best possible value of a state (max_a Q(s, a)) as well as the best policy in terms of the action that should be taken (argmax_a Q(s, a)); see the short sketch after this list.
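A tiny sketch of that last point, using an arbitrary table shape, showing how both the value of a state and the greedy action fall directly out of the Q-table:

```python
import numpy as np

Q = np.random.rand(10, 3)       # stand-in for a learned table: 10 states x 3 actions

s = 4
state_value = Q[s].max()        # max_a Q(s, a): best achievable value from state s
best_action = Q[s].argmax()     # argmax_a Q(s, a): the greedy policy's action at s
```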

Issues

  • The biggest challenge is that the reward (e.g. for buying a stock) often comes in the future; representing that properly requires look-ahead and careful weighting.
  • Another problem is that taking random actions (such as trades) just to learn a good strategy is not really feasible (you'll end up losing a lot of money!).
  • In the next lesson, we will discuss an algorithm that tries to address this second problem by simulating the effect of actions based on historical data.

 

 

 

 


Dyna-Q Big Picture

Dyna-Q was invented by Richard Sutton.

[slides]

 

 

 

Learning T

[slide]

 

 

 

How to Evaluate T? 

[slide]

 


Correction: The expression should be:

T[s, a, s'] = T_c[s, a, s'] / Σ_i T_c[s, a, i]

In the denominator shown in the video, T is missing the subscript c.
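A small sketch of this count-based estimate of T, assuming NumPy arrays and a tiny initial count so the normalization never divides by zero; the array sizes and prior value are assumptions.

```python
import numpy as np

num_states, num_actions = 10, 3
# Tc counts how many times each <s, a, s'> transition has actually been observed;
# the small initial value keeps the normalization below from dividing by zero.
Tc = np.full((num_states, num_actions, num_states), 1e-5)

def observe(s, a, s_prime):
    """Record one real experience tuple in the count table."""
    Tc[s, a, s_prime] += 1

def transition_prob(s, a):
    """T[s, a, s'] = Tc[s, a, s'] / sum_i Tc[s, a, i]."""
    return Tc[s, a] / Tc[s, a].sum()

observe(2, 1, 5)
probs = transition_prob(2, 1)   # probs[5] is now very close to 1
```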

 

 

 

Learning R

[slide]

 

 

 

Dyna-Q Recap

[slide]

 

 

 

 

Summary

The Dyna architecture consists of a combination of:

  • direct reinforcement learning from real experience tuples gathered by acting in an environment,
  • updating an internal model of the environment, and
  • using the model to simulate experiences (a code sketch of this loop follows the list).
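Below is a compact sketch of that loop with tabular NumPy models; the array sizes, hyperparameters, and number of simulated updates are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 10, 3
Q = np.zeros((nS, nA))                     # value table
Tc = np.full((nS, nA, nS), 1e-5)           # transition counts (model of T)
R = np.zeros((nS, nA))                     # expected-reward model
alpha, gamma, n_sim = 0.2, 0.9, 200

def dyna_step(s, a, s_prime, r):
    # 1) Direct RL: Q-update from the real experience tuple <s, a, s', r>.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max())
    # 2) Update the internal model: transition counts and expected reward.
    Tc[s, a, s_prime] += 1
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r
    # 3) Planning: replay simulated ("hallucinated") experiences drawn from the model.
    for _ in range(n_sim):
        ss, aa = rng.integers(nS), rng.integers(nA)
        sp = rng.choice(nS, p=Tc[ss, aa] / Tc[ss, aa].sum())
        Q[ss, aa] = (1 - alpha) * Q[ss, aa] + alpha * (R[ss, aa] + gamma * Q[sp].max())
```

The simulated updates are cheap because they never touch the real environment, which is the point of Dyna.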

[slide]

Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]

 

 

 

 

 

Resources

  • Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 1990. [pdf]
  • Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
  • RL course by David Silver (videos, slides)
    • Lecture 8: Integrating Learning and Planning [pdf]

 

 

 






 

 

 

Interview with Tammer Kamel

Tammer Kamel is the founder and CEO of Quandl, a data platform that makes financial and economic data available through easy-to-use APIs.

Listen to this two-part interview with him.

  • Part 1: The Quandl Data Platform (08:18)
  • Part 2: Trading Strategies and Nuances (10:53)

Note: The interview is audio-only; closed captioning is available (CC button in the player).

 
