Forecasting stocks with LSTM in Keras (Python 3.7, Tensorflow 2.1.0)

Posted: 2020-10-16 15:17:03

Question:

I am trying to use an LSTM to forecast how the Dow Jones Industrial Average will behave over the next few months. I think treating this as a time-series problem is appropriate, since the DJIA behaves like a stock and my data points are evenly spaced in time. I am only a beginner, so I am starting with a single feature (the daily close). I know stocks are highly stochastic and hard to predict well, and that the close value alone does not carry much information... but I will add other features later.

Dataset: DJIA historical data, January 28, 1985 - June 24, 2020, downloadable here: https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI.

Visualization with matplotlib:

I use a series of close values (number = 'sequence_length') to predict the close value immediately following the series (sequence_length + 1). For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, and so on. In other words, I partition the data so that x_train[0] contains the close values for days 0-29 and y_train[0] contains the single close value for day 30. OK. So here is what I get after running the model on the test data:
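As a minimal sketch of that windowing scheme (an illustration only; the array names here are hypothetical stand-ins, not the code I actually ran):

import numpy as np

closes = np.arange(100.0)   # toy stand-in for the scaled close values
seq_len = 30
windows = np.array([closes[i - seq_len:i] for i in range(seq_len, len(closes))])
labels = np.array([closes[i] for i in range(seq_len, len(closes))])
# windows[0] holds days 0-29 and labels[0] holds day 30; windows[1] holds days 1-30 and labels[1] holds day 31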

It looks fine on the surface, but I wonder whether the whole concept is flawed: is the model just echoing the data it sees, without learning any underlying pattern? See below for the DJIA close prediction for July 2020 through April 2021. It seems to me that the prediction curve mimics the exact shape of the test data, just sitting below 20,000 points, and all that...

Questions

    1. Is this model valid, or do I need to change parameters or reformat the data?
    2. How do you evaluate a model like this? Clearly "accuracy" is not a valid metric here. See the loss curve below.
    3. It was suggested that instead of a scalar close value, the labels should be sequences. For example, x_train[0] might include the close values for days 0-29, while y_train[0] might include the close values for days 30-60. I have been trying in vain to make this work, but clearly don't know how. I tried to make the y_test and y_train Numpy arrays contain arrays of sequence data - like this:
y_train, y_test = [], []
    
for i in range(sequence_length, len(training_set_scaled)):
    y_train.append(training_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    y_test.append(testing_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    
y_train = np.array(list(y_item for y_item in y_train))
y_test = np.array(list(y_item for y_item in y_test))
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
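My best guess at the cause: near the end of the array the slices run out of data, so the list holds arrays of unequal lengths and np.array() builds an object-dtype array that Keras cannot convert to a tensor (the y_test loop also reuses the training-set length for its range). A minimal sketch of one way to keep every label a full sequence_length long - just an illustration, and the function name is made up:

def make_sequence_labels(scaled, sequence_length):
    # scaled: 2D array of shape (n, 1), e.g. training_set_scaled
    x, y = [], []
    for i in range(sequence_length, len(scaled) - sequence_length + 1):
        x.append(scaled[i - sequence_length:i, 0])    # days i-sequence_length .. i-1
        y.append(scaled[i:i + sequence_length, 0])    # days i .. i+sequence_length-1
    return np.array(x)[..., np.newaxis], np.array(y)  # shapes (m, sequence_length, 1) and (m, sequence_length)

With labels shaped (m, sequence_length), the final layer would presumably also need sequence_length outputs (e.g. Dense(sequence_length)) instead of Dense(1).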

Any help is greatly appreciated - maybe we can all benefit ($). Kidding... sort of.

The Code

# Imports assumed by the code below (not shown in the original post)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from math import ceil
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv('DJIA_historical_data.csv') # 2D. Shape: (8924 examples, 7 features)
close_data = df['Close'] # 1D (examples, )
dates = df['Date'] # 1D (examples, )
adj_dates = mdates.datestr2num(dates) # Convert date strings to matplotlib date numbers for plotting

# Important parameter
sequence_length: int = 90 # Aka 'timesteps', or number of close values used to make each new prediction

# Split off the training set and scale it. 
percent_training: float = 0.80
num_training_samples = ceil(percent_training*len(df)) # A whole number
training_set = df.iloc[:num_training_samples, 5:6].values # 2D, shape: (samples, 1 feature)
scaler = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = scaler.fit_transform(training_set) #Shape is 2D: (num_training_samples, 1)

# Build 3D training set. Final shape: (examples, sequence_length, 1) 
x_train = np.array([training_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(training_set_scaled))]) 
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))

# Build test sets
num_testing_samples: int = len(df) - x_train.shape[0] # Scalar value
testing_set = df.iloc[-num_testing_samples:, 5:6].values # 2D (examples, 1)
testing_set_scaled = scaler.fit_transform(testing_set) # 2D ndarray (examples, 1). Note: re-fitting the scaler on the test set leaks test-set statistics; scaler.transform(testing_set) is the usual choice

x_test = np.array([testing_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(testing_set_scaled))])
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1)) #3D shape: (examples-sequence_length, sequence_length, 1). 

# Build 1D training labels (examples, )
y_train = np.array([training_set_scaled[i, 0] for i in range(sequence_length, len(training_set_scaled))])
y_test = np.array([testing_set_scaled[i, 0] for i in range(sequence_length, len(testing_set_scaled))]) # 1D (examples-sequence_length, )
y_test = np.reshape(y_test, (y_test.shape[0])) #1D (examples, )
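
# Hypothetical sanity checks (not in the original post): the shapes Keras expects at this point
assert x_train.shape == (len(training_set_scaled) - sequence_length, sequence_length, 1)
assert y_train.shape == (len(training_set_scaled) - sequence_length,)
assert x_test.shape[1:] == (sequence_length, 1)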

# Build Model
epochs: int = 150
batch_size: int = 32

LSTM_1 = LSTM(
    units = 5, # I reduced model complexity because I thought it would reduce overfitting. No such luck
    input_shape = (x_train.shape[1], 1),
    return_sequences = False,
    )

LSTM_2 = LSTM(
    units = 10
    )

model = Sequential()
model.add(LSTM_1) # Output shape: (batch_size, units), because return_sequences = False
model.add(Dropout(0.4))
# model.add(LSTM_2) # Stacking a second LSTM would require return_sequences = True in LSTM_1
# model.add(Dropout(0.2))

model.add(Dense(1)) # Is linear activation appropriate here?
model.compile(loss = 'mean_squared_error', 
             optimizer = 'adam', 
             )

early_stopping = EarlyStopping(monitor='val_loss', 
                               mode='min', 
                               verbose = 1, 
                               patience = 9,
                               restore_best_weights = False
                               )

history = model.fit(x_train,
          y_train,
          epochs = epochs, 
          batch_size = batch_size,
          verbose = 2, 
          validation_split = 0.20,
          # validation_data = (x_test, y_test),
          callbacks = [early_stopping],
          )
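
# Note (added): validation_split = 0.20 holds out the last 20% of x_train / y_train
# (the most recent training windows) as validation data; Keras takes it from the end, before any shuffling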

# Evaluate performance 
model.summary()
loss = model.evaluate(x_test, y_test, batch_size = batch_size)

# early_stopping.stopped_epoch returns 0 if training didn't stop early. 
print('Training stopped after',early_stopping.stopped_epoch,'epochs.')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss vs. Epoch')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left') # the second curve is validation loss, not test loss
plt.show()

prediction = model.predict(x_test)
prediction = scaler.inverse_transform(prediction)

y_test2 = np.reshape(y_test, (y_test.shape[0], 1))
y_test = scaler.inverse_transform(y_test2)
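
# Added sketch (not in the original post): report errors in actual index points,
# which are easier to interpret than the scaled MSE returned by model.evaluate()
test_mae = np.mean(np.abs(prediction - y_test))
test_rmse = np.sqrt(np.mean((prediction - y_test) ** 2))
print(f'Test MAE: {test_mae:.2f} points, RMSE: {test_rmse:.2f} points')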

test_dates = adj_dates[-x_test.shape[0]:]

# Visualizing the results
plt.plot_date(test_dates, y_test, '-', linewidth = 2, color = 'red', label = 'Real DJIA Close')
plt.plot(test_dates, prediction, color = 'blue', label = 'Predicted Close')
plt.title('Close Prediction')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()

# Generate future data 
time_horizon = sequence_length
# future_lookback = adj_dates[-time_horizon:]

last_n = x_test[-time_horizon:,:,:] # Find last n number of days
future_prediction = model.predict(last_n)
future_prediction2 = np.reshape(future_prediction, (future_prediction.shape[0], 1))
future_prediction3 = scaler.inverse_transform(future_prediction2)
future_prediction3 = np.reshape(future_prediction3, (future_prediction3.shape[0]))
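
# Added sketch (an assumption about the goal, not part of the original post): last_n above yields
# one-step-ahead predictions for windows that all end on dates already in the dataset, so it is not
# a true multi-month forecast. A recursive forecast feeds each prediction back into the window:
window = x_test[-1:, :, :].copy()                 # most recent known window, shape (1, sequence_length, 1)
recursive_forecast = []
for _ in range(time_horizon):
    next_scaled = model.predict(window)           # shape (1, 1)
    recursive_forecast.append(next_scaled[0, 0])
    window = np.concatenate([window[:, 1:, :], next_scaled.reshape(1, 1, 1)], axis = 1)  # slide the window forward
recursive_forecast = scaler.inverse_transform(np.array(recursive_forecast).reshape(-1, 1)).ravel()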
 
full_dataset_numpy = np.array(close_data)
all_data = np.append(full_dataset_numpy, future_prediction3)
plt.plot(all_data, color = 'blue', label = 'All data')
plt.title('All data including predictions')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()

# Generate dates for future predictions
# Begin at the last date in the dataset, then add 'time_horizon' many new dates
last_date = dates.iloc[-1] # String
timestamp_list = pd.date_range(last_date, periods = time_horizon).tolist() #List of timestamps

# Convert list of timestamps to list of strings 
datestring_list = [i.strftime("%Y-%m-%d") for i in timestamp_list] #List of strings

# Note: the first generated date duplicates the last date already in the dataset (the code as posted does not actually clip it)
datestring2 = mdates.datestr2num(datestring_list)

plt.plot_date(datestring2, future_prediction3, '-', color = 'blue', label = 'Predicted Close')
plt.title('DJIA Close Prediction')
plt.xlabel('Date')
plt.ylabel('Predicted Close')
plt.xticks(rotation = 45)
plt.legend()
plt.show()





Answer 1:

Case 1: At the beginning of the question you mention, "For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc.".

Case 2: But in Question 3 you mention, "For example, x_train[0] might include the close values for days 0-29, while y_train[0] might include the close values for days 30-60.".

Do you want to predict the Close Value of the Next Day, or the Close Values of the Next 30 Days?

To generate the X and Y data (for both training and testing), you can use the function below:

def univariate_data(dataset, start_index, end_index, history_size, target_size):
  data = []
  labels = []

  start_index = start_index + history_size
  if end_index is None:
    end_index = len(dataset) - target_size

  for i in range(start_index, end_index):
    indices = range(i-history_size, i)
    # Reshape data from (history_size,) to (history_size, 1)
    data.append(np.reshape(dataset[indices], (history_size, 1)))
    labels.append(dataset[i+target_size])
  return np.array(data), np.array(labels)

For the parameter values, history_size would be 30, and target_size would be 1 for Case 1 and 30 for Case 2 (as described above).

You need to call the function once for training and once for testing, as shown below:

univariate_past_history = 30

univariate_future_target = 1  # Case 1; set this to 30 for Case 2

x_train_uni, y_train_uni = univariate_data(data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
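One caveat worth flagging (my reading, not stated explicitly above): with target_size = 30, univariate_data as written still returns a single label taken 30 steps ahead (dataset[i + target_size]), not a 30-value sequence. If Case 2 really means predicting the full next 30 days, the label line can take a slice instead - a minimal variant under that assumption:

def multistep_data(dataset, start_index, end_index, history_size, target_size):
  data = []
  labels = []

  start_index = start_index + history_size
  if end_index is None:
    end_index = len(dataset) - target_size

  for i in range(start_index, end_index):
    indices = range(i - history_size, i)
    data.append(np.reshape(dataset[indices], (history_size, 1)))
    labels.append(dataset[i:i + target_size])  # a target_size-long sequence instead of one value
  return np.array(data), np.array(labels)

The model's last layer then needs target_size outputs (for example Dense(target_size)) so the predictions match the label shape.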

Please see this Tensorflow Tutorial, which comprehensively explains Univariate (one column) and Multivariate (multiple columns) Time Series Analysis with step-by-step code.

Answering your questions in the order you asked them:

    Yes. Following the recommended Tutorial will help.

    Yes, Accuracy is an invalid metric. You can use MAE or MSE instead, as shown below:

    simple_lstm_model.compile(optimizer='adam', loss='mae')

    We should use Numpy Arrays rather than Sequences.

If you run into any other issues, please let me know - happy to help.

