Forecasting stocks with LSTM in Keras (Python 3.7, Tensorflow 2.1.0)
Posted: 2020-10-16
Question: I am trying to use an LSTM to forecast how the Dow Jones Industrial Average will perform over the coming months. I think it is reasonable to treat this as a time-series problem, since the DJIA behaves like a stock and my data values are evenly spaced in time. I'm a beginner, so I'm starting with a single feature (the daily close value). I know stocks are very random and hard to predict well, and the close value alone doesn't carry much information... but I'll add other features later.
Dataset: DJIA historical data, January 28, 1985 to June 24, 2020, downloadable here: https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI.
Visualized with matplotlib:
I use a series of close values (number = 'sequence_length') to predict the close value that immediately follows the series (sequence_length + 1). For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, and so on. In other words, I partition the data so that x_train[0] contains the close values for days 0-29, and y_train[0] contains the single close value for day 30.
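A minimal sketch of that windowing on placeholder data (just to illustrate the indexing; my actual code is further down):

import numpy as np

close = np.arange(100.0)        # placeholder: one close value per day
sequence_length = 30

x, y = [], []
for i in range(sequence_length, len(close)):
    x.append(close[i - sequence_length:i])  # days 0-29, then days 1-30, ...
    y.append(close[i])                      # day 30, then day 31, ...

x = np.array(x)   # shape (70, 30)
y = np.array(y)   # shape (70,)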
OK, so here is the result I get after running the model on the test data. Superficially it looks fine, but I wonder whether the whole concept is flawed: is the model just looking back at the data over and over, without learning any underlying pattern? See below for the model's DJIA close predictions for July 2020 through April 2021. To my eye, the prediction curve simply mimics the exact shape of the test data, just sitting below 20,000 points, and that's about it...
Questions
1. Is this model valid? Should I change the parameters, or reformat the data?
2. How do you evaluate a model like this? Obviously 'accuracy' is an invalid metric. See the loss curve below.
3. It has been suggested that, rather than using a scalar close value for the labels, I should use sequences. For example, x_train[0] might include the close values for days 0-29, while y_train[0] might include the close values for days 30-60. I have been trying in vain to make this work and clearly don't know how. I tried making y_test and y_train NumPy arrays containing arrays of sequence data, like this:
y_train, y_test = [], []
for i in range(sequence_length, len(training_set_scaled)):
    y_train.append(training_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    y_test.append(testing_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
y_train = np.array(list(y_item for y_item in y_train))
y_test = np.array(list(y_item for y_item in y_test))
This fails with: ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
Any help is greatly appreciated, and maybe we can all profit ($). Kidding... sort of.
The code:
from math import ceil

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv('DJIA_historical_data.csv') # 2D. Shape: (8924 examples, 7 features)
close_data = df['Close'] # 1D (examples, )
dates = df['Date'] # 1D (examples, )
adj_dates = mdates.datestr2num(dates) # Convert Pandas series to np array so matplotlib can plot
# Important parameter
sequence_length: int = 90 # Aka 'timesteps', or number of close values used to make each new prediction
# Split off the training set and scale it.
percent_training: float = 0.80
num_training_samples = ceil(percent_training*len(df)) # A whole number
training_set = df.iloc[:num_training_samples, 5:6].values # 2D, shape: (samples, 1 feature)
scaler = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = scaler.fit_transform(training_set) #Shape is 2D: (num_training_samples, 1)
# Build 3D training set. Final shape: (examples, sequence_length, 1)
x_train = np.array([training_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(training_set_scaled))])
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
# Build test sets
num_testing_samples: int = len(df) - x_train.shape[0] # Scalar value
testing_set = df.iloc[-num_testing_samples:, 5:6].values # 2D (examples, 1)
testing_set_scaled = scaler.fit_transform(testing_set) # 2D ndarray (examples, 1)
x_test = np.array([testing_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(testing_set_scaled))])
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1)) #3D shape: (examples-sequence_length, sequence_length, 1).
# Build 1D training labels (examples, )
y_train = np.array([training_set_scaled[i, 0] for i in range(sequence_length, len(training_set_scaled))])
y_test = np.array([testing_set_scaled[i, 0] for i in range(sequence_length, len(testing_set_scaled))]) # (examples-sequence_length, 1)
y_test = np.reshape(y_test, (y_test.shape[0])) #1D (examples, )
# Build Model
epochs: int = 150
batch_size: int = 32
LSTM_1 = LSTM(
    units = 5, # I reduced model complexity because I thought it would reduce overfitting. No such luck
    input_shape = (x_train.shape[1], 1),
    return_sequences = False,
)
LSTM_2 = LSTM(
    units = 10
)
model = Sequential()
model.add(LSTM_1) # Output shape: (batch_size, units), since return_sequences = False
model.add(Dropout(0.4))
# model.add(LSTM_2) # Output shape: ?
# model.add(Dropout(0.2))
model.add(Dense(1)) # Is linear activation appropriate here?
model.compile(loss = 'mean_squared_error',
              optimizer = 'adam',
              )
early_stopping = EarlyStopping(monitor = 'val_loss',
                               mode = 'min',
                               verbose = 1,
                               patience = 9,
                               restore_best_weights = False
                               )
history = model.fit(x_train,
                    y_train,
                    epochs = epochs,
                    batch_size = batch_size,
                    verbose = 2,
                    validation_split = 0.20,
                    # validation_data = (x_test, y_test),
                    callbacks = [early_stopping],
                    )
# Evaluate performance
model.summary()
loss = model.evaluate(x_test, y_test, batch_size = batch_size)
# early_stopping.stopped_epoch returns 0 if training didn't stop early.
print('Training stopped after',early_stopping.stopped_epoch,'epochs.')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss vs. Epoch')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left') # val_loss comes from validation_split, not the test set
plt.show()
prediction = model.predict(x_test)
prediction = scaler.inverse_transform(prediction)
y_test2 = np.reshape(y_test, (y_test.shape[0], 1))
y_test = scaler.inverse_transform(y_test2)
test_dates = adj_dates[-x_test.shape[0]:]
# Visualizing the results
plt.plot_date(test_dates, y_test, '-', linewidth = 2, color = 'red', label = 'Real DJIA Close')
plt.plot(test_dates, prediction, color = 'blue', label = 'Predicted Close')
plt.title('Close Prediction')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate future data
time_horizon = sequence_length
# future_lookback = adj_dates[-time_horizon:]
last_n = x_test[-time_horizon:,:,:] # Find last n number of days
future_prediction = model.predict(last_n)
future_prediction2 = np.reshape(future_prediction, (future_prediction.shape[0], 1))
future_prediction3 = scaler.inverse_transform(future_prediction2)
future_prediction3 = np.reshape(future_prediction3, (future_prediction3.shape[0]))
full_dataset_numpy = np.array(close_data)
all_data = np.append(full_dataset_numpy, future_prediction3)
plt.plot(all_data, color = 'blue', label = 'All data')
plt.title('All data including predictions')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate dates for future predictions
# Begin at the last date in the dataset, then add 'time_horizon' many new dates
last_date = dates.iloc[-1] # String
timestamp_list = pd.date_range(last_date, periods = time_horizon).tolist() #List of timestamps
# Convert list of timestamps to list of strings
datestring_list = [i.strftime("%Y-%m-%d") for i in timestamp_list] #List of strings
# Clip first value, which is already included in the dataset
datestring2 = mdates.datestr2num(datestring_list)
plt.plot_date(datestring2, future_prediction3, '-', color = 'blue', label = 'Predicted Close')
plt.title('DJIA Close Prediction')
plt.xlabel('Date')
plt.ylabel('Predicted Close')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Answer 1:
Case 1: At the start of the question you say: "For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, and so on."
Case 2: But in question 3 you say: "For example, x_train[0] might include the close values for days 0-29, while y_train[0] might include the close values for days 30-60."
So do you want to predict the close value for the next day, or the close values for the next 30 days?
To generate the X and Y data (for both training and testing), you can use the function below:
def univariate_data(dataset, start_index, end_index, history_size, target_size):
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i-history_size, i)
        # Reshape data from (history_size,) to (history_size, 1)
        data.append(np.reshape(dataset[indices], (history_size, 1)))
        labels.append(dataset[i+target_size])
    return np.array(data), np.array(labels)
For the arguments: history_size would be 30, and target_size would be 1 for Case 1 or 30 for Case 2 (as described above).
You need to call the function once for the training data and once for the test data, as shown below:
univariate_past_history = 30
univariate_future_target = 1   # use 1 for Case 1, 30 for Case 2
x_train_uni, y_train_uni = univariate_data(data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
Please see this Tensorflow Tutorial, which explains univariate (single-column) and multivariate (multi-column) time series analysis in detail, with step-by-step code.
To answer your questions in the order you asked them:
Regarding question 1: yes. The recommended tutorial will help.
Regarding question 2: yes, accuracy is an invalid metric here. You can use MAE or MSE instead, as shown below:

simple_lstm_model.compile(optimizer='adam', loss='mae')
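Beyond the training-loss curve, you can also sanity-check the fit by scoring the predictions back in the original price scale. A rough sketch, reusing model, scaler, x_test and the scaled 1-D y_test from your code (before it is inverse-transformed):

import numpy as np

pred = scaler.inverse_transform(model.predict(x_test))   # back to DJIA points, shape (examples, 1)
true = scaler.inverse_transform(y_test.reshape(-1, 1))   # same scale and shape

mae = np.mean(np.abs(pred - true))               # mean absolute error, in index points
rmse = np.sqrt(np.mean((pred - true) ** 2))      # root mean squared error, in index points
print(f'MAE: {mae:.1f} points, RMSE: {rmse:.1f} points')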
Regarding question 3: we should use plain NumPy arrays rather than sequences.
Please let me know if you run into any other issues; happy to help.