How to Convert Pandas Dataframe to Keras RNN for Multivariate classification Problems

【Title】: How to Convert Pandas Dataframe to Keras RNN for Multivariate classification Problems 【Posted】: 2021-01-25 12:49:15 【Question】:

I have a pandas DataFrame and want to build a recurrent neural network (RNN) model on it. Can anyone explain how to convert a pandas DataFrame into sequences?

Every resource I checked only explains how an RNN works with plain arrays, not with pandas DataFrames. My target variable is the "Label" column, which has 5 classes.

Below is my code. I get an error when I try to run model.fit; I have attached an image of the error.

import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
from sklearn import metrics
# fix random seed for reproducibility
numpy.random.seed(7)

AllDataSelFeLabEnc
    Flow_IAT_Max    Fwd_IAT_Std   Pkt_Len_Max   Fwd_Pkt_Len_Std   Label
0   591274.0        11125.35538   32             0.0                3
1   633973.0        12197.74612   32             0.0                3
2   591242.0        12509.82212   32             0.0                3
3   2.0             0.0           0              0.0                2
4   1.0             0.0           0              0.0                2
5   460.0           0.000000      0              0.000000           1
6   10551.0         311.126984    326            188.216188         1
7   476.0           0.000000      0              0.000000           1
8   4380481.0       2185006.405   935            418.144712         0
9   4401241.0       2192615.483   935            418.144712         0
10  3364844.0       1675797.985   935            418.144712         0
11  4380481.0       2185006.405   935            418.144712         0
12  43989.0         9929.900528    0             0.0                4

# define y variable, i.e., what I want to predict
y_col='Label' 

X = AllDataSelFeLabEnc.drop(y_col,axis=1).copy()
y = AllDataSelFeLabEnc[[y_col]].copy() 
# the double brackets keep y as a DataFrame; with single brackets it would be a pandas Series
print(X.shape,y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

length = 500


n_input = 25 #how many samples/rows/timesteps to look in the past in order to forecast the next sample
n_features= X_train.shape[1] # how many predictors/Xs/features we have to predict y
b_size = 32 # Number of timeseries samples in each batch


# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(5000, embedding_vecor_length, input_length=length))
model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())


model.fit(X_train, y_train, epochs=3, batch_size=64)

[Image: the error I'm getting]


# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


y_pred = model.predict(X_test)

# Print the confusion matrix
print(metrics.confusion_matrix(y_test,y_pred))

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_test, y_pred, digits=3))

【Answer 1】:

From the Keras documentation for LSTM:

Inputs: a 3D tensor with shape [batch, timesteps, feature].

So in your case the required shape is [32, 25, 4], i.e. [b_size, n_input, n_features].

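I don't think a DataFrame can represent this directly, unless the input data is turned into an array of DataFrames.

So here is a way to do it with NumPy, which I think is the simplest approach that works: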

import numpy as np

# Keep only as many rows as fit into whole windows of n_input timesteps.
# Leftover rows at the end are dropped when the number of samples is not a multiple of n_input.
# Positional slicing (.iloc) is used because train_test_split shuffles the index labels.
n_windows = len(X_train) // n_input
x = X_train.iloc[:n_windows * n_input].to_numpy()
X_train = np.reshape(x, (n_windows, n_input, n_features))
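
The reshape only covers the features: `y_train` still holds one label per row, while the model now receives one sample per window. Below is a minimal sketch of one way to align the two. It assumes, which the answer does not state, that each window takes the label of its last row. The array `labels` is a hypothetical stand-in for `y_train.to_numpy().ravel()`.

```python
import numpy as np

n_input = 25  # timesteps per window, as in the question

# Hypothetical stand-in for y_train.to_numpy().ravel(): 60 row-level labels.
labels = np.arange(60) % 5

# Drop leftover rows, exactly as done for the features.
n_windows = len(labels) // n_input
trimmed = labels[:n_windows * n_input]

# Assumption: each window is labelled by its last row.
y_windows = trimmed.reshape(n_windows, n_input)[:, -1]
print(y_windows.shape)  # one label per window
```

Keep in mind that `train_test_split` shuffles rows by default, so windows built after such a split are not contiguous in time. Building the windows before splitting, or passing `shuffle=False`, is probably closer to what a sequence model expects.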

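Note

The reshape code above does not produce rolling windows; it slices the data into non-overlapping windows. For example, with 50 samples and n_input = 25 you get only 2 windows (rows 1-25 and 26-50), not the 26 rolling windows 1-25, 2-26, 3-27, and so on up to 26-50.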
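
If rolling windows are what you need, the following sketch shows one way to build them with plain NumPy. The helper `rolling_windows` and the dummy array `x` are hypothetical names introduced for illustration; `x` stands in for `X_train.to_numpy()`.

```python
import numpy as np

def rolling_windows(values, n_input):
    """Stack every run of n_input consecutive rows into a 3D array."""
    n_windows = len(values) - n_input + 1
    return np.stack([values[i:i + n_input] for i in range(n_windows)])

# Hypothetical stand-in for X_train.to_numpy(): 50 rows, 4 features.
x = np.arange(50 * 4, dtype=float).reshape(50, 4)

windows = rolling_windows(x, n_input=25)
print(windows.shape)  # (26, 25, 4)
```

Consecutive windows overlap in all but one row, so this copies the data roughly n_input times. For large inputs, `numpy.lib.stride_tricks.sliding_window_view` (NumPy 1.20 and later) may be a more memory-friendly alternative, though it places the window axis last and so needs a transpose.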
