ResourceExhausted Error or OOM when running LSTM in Tensorflow

Posted: 2018-10-14 23:53:53

Question:

I am training my LSTM network in Tensorflow with the following code:

import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from scipy import stats
import tensorflow as tf
import seaborn as sns
from pylab import rcParams
from sklearn import metrics
from sklearn.model_selection import train_test_split

%matplotlib inline

sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 14, 8

RANDOM_SEED = 42

columns = ['user','activity','timestamp', 'x-axis', 'y-axis', 'z-axis']
df = pd.read_csv('data/WISDM_ar_v1.1_raw.txt', header = None, names = columns)
df = df.dropna()

df.head()

df.info()

##df['activity'].value_counts().plot(kind='bar', title='Training examples by activity type');
##df['user'].value_counts().plot(kind='bar', title='Training examples by user');

def plot_activity(activity, df):
    data = df[df['activity'] == activity][['x-axis', 'y-axis', 'z-axis']][:200]
    axis = data.plot(subplots=True, figsize=(16, 12), 
                     title=activity)
    for ax in axis:
        ax.legend(loc='lower left', bbox_to_anchor=(1.0, 0.5))


##plot_activity("Sitting", df)
##plot_activity("Standing", df)
##plot_activity("Walking", df)
##plot_activity("Jogging", df)


N_TIME_STEPS = 200
N_FEATURES = 3
step = 20
segments = []
labels = []
for i in range(0, len(df) - N_TIME_STEPS, step):
    xs = df['x-axis'].values[i: i + N_TIME_STEPS]
    ys = df['y-axis'].values[i: i + N_TIME_STEPS]
    zs = df['z-axis'].values[i: i + N_TIME_STEPS]
    label = stats.mode(df['activity'][i: i + N_TIME_STEPS])[0][0]
    segments.append([xs, ys, zs])
    labels.append(label)

np.array(segments).shape

reshaped_segments = np.asarray(segments, dtype= np.float32).reshape(-1, N_TIME_STEPS, N_FEATURES)
labels = np.asarray(pd.get_dummies(labels), dtype = np.float32)

reshaped_segments.shape
labels[0]

X_train, X_test, y_train, y_test = train_test_split(
        reshaped_segments, labels, test_size=0.2, random_state=RANDOM_SEED)

len(X_train)
len(X_test)

N_CLASSES = 6
N_HIDDEN_UNITS = 64


def create_LSTM_model(inputs):
    W = {
        'hidden': tf.Variable(tf.random_normal([N_FEATURES, N_HIDDEN_UNITS])),
        'output': tf.Variable(tf.random_normal([N_HIDDEN_UNITS, N_CLASSES]))
    }
    biases = {
        'hidden': tf.Variable(tf.random_normal([N_HIDDEN_UNITS], mean=1.0)),
        'output': tf.Variable(tf.random_normal([N_CLASSES]))
    }

    X = tf.transpose(inputs, [1, 0, 2])
    X = tf.reshape(X, [-1, N_FEATURES])
    hidden = tf.nn.relu(tf.matmul(X, W['hidden']) + biases['hidden'])
    hidden = tf.split(hidden, N_TIME_STEPS, 0)

    # Stack 2 LSTM layers
    lstm_layers = [tf.contrib.rnn.BasicLSTMCell(N_HIDDEN_UNITS, forget_bias=1.0) for _ in range(2)]
    lstm_layers = tf.contrib.rnn.MultiRNNCell(lstm_layers)

    outputs, _ = tf.contrib.rnn.static_rnn(lstm_layers, hidden, dtype=tf.float32)

    # Get output for the last time step
    lstm_last_output = outputs[-1]

    return tf.matmul(lstm_last_output, W['output']) + biases['output']


tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, N_TIME_STEPS, N_FEATURES], name="input")
Y = tf.placeholder(tf.float32, [None, N_CLASSES])


pred_Y = create_LSTM_model(X)

pred_softmax = tf.nn.softmax(pred_Y, name="y_")

loss = -tf.reduce_sum(Y * tf.log(pred_softmax))
LEARNING_RATE = 0.0025  # LEARNING_RATE was never defined in the snippet as posted; 0.0025 is an assumed value
optimizer = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss)

correct_prediction = tf.equal(tf.argmax(pred_softmax,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

cost_history = np.empty(shape=[1],dtype=float)
saver = tf.train.Saver()

session = tf.Session()
session.run(tf.global_variables_initializer())

batch_size = 10
total_batches = X_train.shape[0] // batch_size


for epoch in range(8):
        for b in range(total_batches):    
            offset = (b * batch_size) % (y_train.shape[0] - batch_size)
            batch_x = X_train[offset:(offset + batch_size), :]
            batch_y = y_train[offset:(offset + batch_size), :]
            _, c = session.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y})
            cost_history = np.append(cost_history,c)
        print("Epoch: ",epoch," Training Loss: ",c," Training Accuracy: ",\
              session.run(accuracy, feed_dict=X: X_train, Y: y_train))

The dataset I am using comes from http://www.cis.fordham.edu/wisdm/dataset.php

WISDM_ar_txtv1.1_raw

However, when I run it, I get a ResourceExhausted or OOM error:

Traceback(最近一次调用最后一次):文件 "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", 第 1350 行,在 _do_call 中 返回 fn(*args) 文件 "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", 第 1329 行,在 _run_fn 状态,run_metadata)文件“C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py”, 第 473 行,在 退出 c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM 当使用 shape[8784000,64] 分配张量并输入 float on /job:localhost/replica:0/task:0/device:GPU:0 分配器 GPU_0_bfc [[节点:MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 转置_b=假, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]] 提示:如果你想查看已分配张量的列表 当 OOM 发生时,将 report_tensor_allocations_upon_oom 添加到 RunOptions 获取当前分配信息。

  [[Node: add_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9637_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 9, in <module>
    session.run(accuracy, feed_dict={X: X_train, Y: y_train}))
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run
    options, run_metadata)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8784000,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  [[Node: add_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9637_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'MatMul', defined at:
  File "", line 1, in <module>
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\idlelib\run.py", line 130, in main
    ret = method(*args, **kwargs)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\idlelib\run.py", line 357, in runcode
    exec(code, self.locals)
  File "", line 1, in <module>
  File "", line 13, in create_LSTM_model
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2022, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2799, in _mat_mul
    name=name)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
    op_def=op_def)
  File "C:\Users\Chaine\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[8784000,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  [[Node: add_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9637_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
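(The hint in the message can be followed literally. A minimal sketch, assuming the session and training loop above; report_tensor_allocations_upon_oom is a field of the TF 1.x RunOptions protobuf accepted by session.run:)

# Ask TensorFlow to list the live tensors if this run hits an OOM.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
_, c = session.run([optimizer, loss],
                   feed_dict={X: batch_x, Y: batch_y},
                   options=run_options)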

What could be causing this error?

UPDATE: I ran my code on another machine and it did not give the error.


Answer 1:

There is a major issue in your code. You are facing this problem because you do not have a static graph: new ops keep being added to the graph while the for loop executes. If you trace how your loss value is evaluated in

session.run([loss]),

you will notice that you are running the

pred_Y = create_LSTM_model(X)

part of your code multiple times while the for loop executes.

You do not want that. You should modify your code so that you can fetch the loss value from the graph without re-creating the graph every time.
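A minimal sketch of the pattern described above, reusing create_LSTM_model and the placeholders from the question (the learning rate of 0.0025 is an assumed value, not one from the answer): every op is created exactly once, before the loop, and the loop body does nothing but execute the already-built graph.

# Build the graph exactly once, before any training loop.
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, N_TIME_STEPS, N_FEATURES], name="input")
Y = tf.placeholder(tf.float32, [None, N_CLASSES])

pred_Y = create_LSTM_model(X)                    # called once, never inside a loop
pred_softmax = tf.nn.softmax(pred_Y, name="y_")
loss = -tf.reduce_sum(Y * tf.log(pred_softmax))
train_op = tf.train.GradientDescentOptimizer(0.0025).minimize(loss)

session = tf.Session()
session.run(tf.global_variables_initializer())

for epoch in range(8):
    for b in range(total_batches):
        batch_x = X_train[b * batch_size:(b + 1) * batch_size]
        batch_y = y_train[b * batch_size:(b + 1) * batch_size]
        # Inside the loop: only session.run calls, no new graph construction.
        _, c = session.run([train_op, loss], feed_dict={X: batch_x, Y: batch_y})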

Hope this helps.

Comments:

Hi, I ran my code on another laptop and it worked fine. Why is that?
