Deep Learning and Graph Neural Networks Core Technology Practical Workshop - Day 2: Stock Prediction (stock_prediction)

Posted ZSYL


1. Stock Prediction Background

  • Stock prices are a typical example of time-series data, shaped by many complex factors: the economic environment, government policy, human manipulation, and more.
  • Unlike weather data, they show no obvious daily or seasonal patterns, such as the temperature cycle within a day or across a year.
  • This session uses stock prices as a running example of how to forecast time-series data.

2. Data Source

The S&P 500 price data was scraped from the Google Finance API (since discontinued), and missing values have already been handled.

3. Data Preprocessing

Read the CSV file into a pandas DataFrame, then use pandas to inspect the distribution and summary statistics of the features.

  • The data has 41,266 rows and 502 columns:

    • DATE: the Unix timestamp of each row;
    • SP500: the overall market index;
    • the remaining 500 columns: the prices of 500 individual stocks.
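Since DATE holds Unix timestamps in seconds, pandas can convert them to readable datetimes directly. A minimal sketch with two hypothetical timestamps from the data's date range (not actual rows from the file):

```python
import pandas as pd

# Two hypothetical timestamps standing in for the DATE column
# (the real file has 41,266 rows and 502 columns).
df = pd.DataFrame({
    "DATE": [1491226200, 1504272600],
    "SP500": [2363.6101, 2490.6499],
})

# DATE holds Unix timestamps in seconds; convert to datetimes.
df["DATE"] = pd.to_datetime(df["DATE"], unit="s")
print(df["DATE"].min().date(), df["DATE"].max().date())  # 2017-04-03 2017-09-01
```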

4. Synchronous Prediction

Synchronous prediction: use the prices of the 500 individual stocks at the current time step to predict the market index at the same time step. This is a regression problem with 500 input features and a 1-dimensional output, i.e. [None, 500] ⇒ [None, 1].

The synchronous model is implemented in TensorFlow as a multilayer perceptron (MLP), trained with mean squared error (MSE) loss.
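MSE, the loss used here, is simply the mean of the squared residuals; a minimal NumPy check with toy numbers (not from the dataset):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# One residual of 2 among three points: (0 + 0 + 4) / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # → 1.3333333333333333
```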

5. Synchronous Prediction Results

(The per-epoch prediction-vs-actual plots are produced by the callback in the code below.)

6. Asynchronous Prediction

Asynchronous prediction: use the market index (or related features) from several historical time steps to predict the index at the current time step, which better matches the usual meaning of forecasting.

For example, use the 500 individual stock prices from the previous 5 time steps to predict the current index, i.e. [None, 5, 500] ⇒ [None, 1]. (Appending the historical index itself as a 501st feature would give [None, 5, 501]; the code below uses only the 500 stock columns.)
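The sliding-window construction described above can be sketched with NumPy on a toy matrix (3 features standing in for the 500 stock columns; purely illustrative):

```python
import numpy as np

# Toy stand-ins for the scaled data: 8 time steps, 3 features,
# plus a 1-D target series (the index).
feats = np.arange(24, dtype=float).reshape(8, 3)
target = np.arange(8, dtype=float)

seq_len = 5
# Sample i: features from steps i .. i+seq_len-1; label: target at step i+seq_len.
X = np.array([feats[i : i + seq_len] for i in range(len(feats) - seq_len)])
y = np.array([target[i + seq_len] for i in range(len(feats) - seq_len)])
print(X.shape, y.shape)  # (3, 5, 3) (3,)
```

With 8 time steps and a window of 5, only 3 (sample, label) pairs fit, which is why the real training set below shrinks from 33,012 rows to 33,007 samples.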

The asynchronous model is implemented in Keras with an LSTM, a recurrent neural network (RNN) layer. (The code below uses CuDNNLSTM, which requires a GPU; in TensorFlow 2.x the plain keras.layers.LSTM selects the cuDNN kernel automatically when one is available.)
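The cell that CuDNNLSTM accelerates can be written out by hand; below is a NumPy sketch of a single LSTM time step with random, purely illustrative weights. With D = 500 inputs and H = 128 hidden units, 4·(H·D + H² + 2H) = 322,560 parameters, matching the model summary below (the 2H term reflects cuDNN's two bias vectors per gate).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    Gate order here: input, forget, cell candidate, output."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    g = np.tanh(z[2 * H:3 * H])   # candidate cell state
    o = sigmoid(z[3 * H:])        # output gate
    c_new = f * c + i * g         # update cell state
    h_new = o * np.tanh(c_new)    # emit hidden state
    return h_new, c_new

D, H = 500, 128                   # 500 features, 128 units, as in the model below
x = rng.standard_normal(D)
h, c = np.zeros(H), np.zeros(H)
W = 0.01 * rng.standard_normal((4 * H, D))
U = 0.01 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (128,)
```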

7. Complete Code

# https://github.com/sebastianheinz/stockprediction
# https://medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-prediction-using-tensorflow-30505541d877
# select gpu
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '3'
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler
import time
from tensorflow import keras
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.losses import MSE
from tensorflow.keras.optimizers import Adam
# read stock data
data = pd.read_csv('data_stocks.csv')
data.describe()
(output truncated to the first two of 502 columns)
               DATE         SP500
count  4.126600e+04  41266.000000
mean   1.497749e+09   2421.537882
std    3.822211e+06     39.557135
min    1.491226e+09   2329.139900
25%    1.494432e+09   2390.860100
50%    1.497638e+09   2430.149900
75%    1.501090e+09   2448.820100
max    1.504210e+09   2490.649900

8 rows × 502 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41266 entries, 0 to 41265
Columns: 502 entries, DATE to NYSE.ZTS
dtypes: float64(501), int64(1)
memory usage: 158.0 MB
data.head()
(output truncated to the first four of 502 columns)
         DATE      SP500  NASDAQ.AAL  NASDAQ.AAPL
0  1491226200  2363.6101     42.3300     143.6800
1  1491226260  2364.1001     42.3600     143.7000
2  1491226320  2362.6799     42.3100     143.6901
3  1491226380  2364.3101     42.3700     143.6400
4  1491226440  2364.8501     42.5378     143.6600

5 rows × 502 columns

print('begin', time.strftime('%Y-%m-%d', time.localtime(data['DATE'].min())))
print('end', time.strftime('%Y-%m-%d', time.localtime(data['DATE'].max())))
begin 2017-04-03
end 2017-09-01
plt.plot(data['SP500'])

# train : test = 0.8 : 0.2
data.drop('DATE', axis=1, inplace=True)
data_train = data.iloc[:int(data.shape[0] * 0.8), :]
data_test = data.iloc[int(data.shape[0] * 0.8):, :]
print('training data shape:', data_train.shape)
print('test data shape', data_test.shape)
training data shape: (33012, 501)
test data shape (8254, 501)
# scale data into the range from -1 to 1
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]
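Fitting the scaler on the training split only, then transforming both splits, avoids look-ahead leakage; note that test values outside the training range can land outside [-1, 1]. A toy scikit-learn check with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [10.0]])   # training range: 0 .. 10
test = np.array([[5.0], [20.0]])    # 20 lies outside the training range

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(train)                   # statistics come from train only
print(scaler.transform(test).ravel())  # [0. 3.] -- can exceed the [-1, 1] range
```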
# seq_len = 5
# X_train = np.array([data_train[i : i + seq_len, 0] for i in range(data_train.shape[0] - seq_len)]).squeeze()
# y_train = np.array([data_train[i + seq_len, 0] for i in range(data_train.shape[0] - seq_len)])
# X_test = np.array([data_test[i : i + seq_len, 0] for i in range(data_test.shape[0] - seq_len)]).squeeze()
# y_test = np.array([data_test[i + seq_len, 0] for i in range(data_test.shape[0] - seq_len)])
model = Sequential(layers=[
    Dense(1024, activation='relu', input_shape=X_train.shape[1:]),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(1, activation='linear'),
])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1024)              513024    
_________________________________________________________________
dense_1 (Dense)              (None, 512)               524800    
_________________________________________________________________
dense_2 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_3 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
=================================================================
Total params: 1,202,177
Trainable params: 1,202,177
Non-trainable params: 0
_________________________________________________________________
model.compile(optimizer=Adam(lr=0.001), loss=MSE)
from tensorflow.keras.callbacks import LambdaCallback
def on_epoch_end(epoch, logs):
    y_pred = model.predict(X_test)
    plt.plot(y_test, label='test')
    plt.plot(y_pred, label='pred')
    plt.legend()
    plt.show()
model.fit(X_train, y_train,
          epochs=5,
          batch_size=256,
          shuffle=True,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])
Epoch 1/5
32512/33012 [============================>.] - ETA: 0s - loss: 0.0101

33012/33012 [==============================] - 2s 73us/step - loss: 0.0099
Epoch 2/5
32768/33012 [============================>.] - ETA: 0s - loss: 1.7015e-04

33012/33012 [==============================] - 1s 42us/step - loss: 1.6941e-04
Epoch 3/5
32000/33012 [============================>.] - ETA: 0s - loss: 2.1466e-04

33012/33012 [==============================] - 1s 42us/step - loss: 2.1038e-04
Epoch 4/5
31488/33012 [===========================>..] - ETA: 0s - loss: 1.1966e-04

33012/33012 [==============================] - 1s 44us/step - loss: 1.1896e-04
Epoch 5/5
32000/33012 [============================>.] - ETA: 0s - loss: 2.0731e-04

33012/33012 [==============================] - 1s 44us/step - loss: 2.0534e-04
from tensorflow.keras.layers import Input, Dense, CuDNNLSTM, InputLayer
from tensorflow.keras.models import Sequential
seq_len = 5
X_train = np.array([data_train[i : i + seq_len, 1:] for i in range(data_train.shape[0] - seq_len)])#[:, :, np.newaxis]
y_train = np.array([data_train[i + seq_len, 0] for i in range(data_train.shape[0] - seq_len)])
X_test = np.array([data_test[i : i + seq_len, 1:] for i in range(data_test.shape[0] - seq_len)]) #[:, :, np.newaxis]
y_test = np.array([data_test[i + seq_len, 0] for i in range(data_test.shape[0] - seq_len)])
X_train.shape
(33007, 5, 500)
lstm_model = Sequential(layers=[
    InputLayer(input_shape=X_train.shape[1:]),
    CuDNNLSTM(128),
    Dense(1, activation='linear'),
])
lstm_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
cu_dnnlstm (CuDNNLSTM)       (None, 128)               322560    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129       
=================================================================
Total params: 322,689
Trainable params: 322,689
Non-trainable params: 0
_________________________________________________________________
lstm_model.compile(loss=MSE, optimizer=Adam())
def on_epoch_end(epoch, logs):
    y_pred = lstm_model.predict(X_train)
    plt.plot(y_train, label='train')
    plt.plot(y_pred, label='pred')
    plt.legend()
    plt.show()
lstm_model.fit(X_train, y_train,
          epochs=5,
          batch_size=256,
          shuffle=True,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)]
              )
Epoch 1/5
32512/33007 [============================>.] - ETA: 0s - loss: 0.0197

33007/33007 [==============================] - 3s 101us/step - loss: 0.0194
Epoch 2/5
32256/33007 [============================>.] - ETA: 0s - loss: 3.2466e-04

33007/33007 [==============================] - 3s 80us/step - loss: 3.2286e-04
Epoch 3/5
31744/33007 [===========================>..] - ETA: 0s - loss: 2.3210e-04

33007/33007 [==============================] - 3s 80us/step - loss: 2.3420e-04
Epoch 4/5
32512/33007 [============================>.] - ETA: 0s - loss: 2.0556e-04

33007/33007 [==============================] - 3s 76us/step - loss: 2.0526e-04
Epoch 5/5
31488/33007 [===========================>..] - ETA: 0s - loss: 1.9163e-04

33007/33007 [==============================] - 3s 80us/step - loss: 1.9080e-04

<tensorflow.python.keras.callbacks.History at 0x7f7f2852bf98>

Appendix:

%matplotlib inline  # display figures inline in a Jupyter notebook
