CuDNN -- Status Not Initialized (Keras/TensorFlow + Nvidia P100 + Linux)

Posted: 2019-02-15 07:35:58

Question:

I can't get my (working) LSTM model to use CuDNN via Keras with the TensorFlow backend. I am using:

TensorFlow 1.10.1
tensorflow-gpu 1.10.1
Keras 2.2.2
CUDA 9.2
cuDNN 7.2.1 (fairly certain)
NVIDIA P100 GPU (driver 390.87)
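A quick way to double-check those versions and that the GPU is actually visible from Python (device_lib is TensorFlow 1.x's local-device helper):

import tensorflow as tf
import keras
from tensorflow.python.client import device_lib

# Print the versions the interpreter actually resolves to.
print("TensorFlow:", tf.__version__)
print("Keras:", keras.__version__)

# List the devices TensorFlow can see; a working setup should show a
# /device:GPU:0 entry alongside the CPU.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)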

Code sample:

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, CuDNNLSTM

def build_lstm(num_neurons, dropout, recurent_dropout):
    # Plain LSTM with dropout applied inside the layer.
    model = Sequential()
    model.add(LSTM(num_neurons, input_shape=(12, 1), dropout=dropout,
                   recurrent_dropout=recurent_dropout, unroll=True))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

def build_cudnnlstm(num_neurons, dropout, recurent_dropout):
    # CuDNN-backed LSTM; it takes no dropout/recurrent_dropout arguments,
    # so dropout is applied as a separate layer afterwards.
    model = Sequential()
    model.add(CuDNNLSTM(num_neurons, input_shape=(12, 1)))
    model.add(Dropout(dropout))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

However, when I swap build_cudnnlstm in for build_lstm, I get the following error:

Epoch 1/5 2018-09-10 15:58:53.726819: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-09-10 15:58:54.001406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:17:00.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-09-10 15:58:54.001491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-10 15:58:54.475955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-10 15:58:54.476019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-09-10 15:58:54.476036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-09-10 15:58:54.476408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15123 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:17:00.0, compute capability: 6.0)
2018-09-10 15:58:55.098145: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-09-10 15:58:55.098409: E tensorflow/stream_executor/cuda/cuda_dnn.cc:360] Possibly insufficient driver version: 390.87.0
2018-09-10 15:58:55.098496: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/keras/engine/training.py", line 1037, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/usr/local/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2666, in __call__
    return self._call(inputs)
  File "/usr/local/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2636, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1382, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
         [[Node: cu_dnnlstm_1/CudnnRNN = CudnnRNN[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/cu_dnnlstm_1/CudnnRNN_grad/CudnnRNNBackprop"], direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="lstm", seed=87654321, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnnlstm_1/transpose, cu_dnnlstm_1/ExpandDims_1, cu_dnnlstm_1/ExpandDims_1, cu_dnnlstm_1/concat_1)]]
         [[Node: loss/mul/_79 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_782_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

The error is printed during the fit call:

model.fit(samples, targets_1q, epochs=epochs, shuffle=True, verbose=2)
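A minimal way to reproduce the call above, with random dummy data standing in for samples/targets_1q and placeholder hyperparameter values (shapes assumed from input_shape=(12, 1)):

import numpy as np

# Hypothetical stand-in data: 1000 sequences of length 12 with one feature,
# matching the input_shape=(12, 1) used by the builder functions above.
samples = np.random.rand(1000, 12, 1).astype('float32')
targets_1q = np.random.rand(1000, 1).astype('float32')

model = build_cudnnlstm(num_neurons=64, dropout=0.2, recurent_dropout=0.2)
model.fit(samples, targets_1q, epochs=5, shuffle=True, verbose=2)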

Any help would be greatly appreciated!

Comments:

Answer 1:

Maybe you should upgrade your driver; as I recall, 396.37 is the driver version that corresponds to CUDA 9.2.
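A quick sketch for confirming which driver is actually loaded, assuming nvidia-smi is on the PATH:

import subprocess

# Query the loaded NVIDIA driver version; CUDA 9.2 generally expects a
# 396.xx or newer driver, so an older one (e.g. 390.87) is suspect.
driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip()
print("NVIDIA driver:", driver)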

Comments:

Finally got around to updating, but yes: CUDA 9.2 needs at least the 396 driver. The NVIDIA team confirmed this as the fix.

Answer 2:

Looking back at my notes, it seems I ran into this problem once and fixed it with:

pip3 install --upgrade tensorflow

pip3 install --upgrade tensorflow-gpu

Your mileage may vary.

Checking your cuDNN version is straightforward: do you know where CUDA is installed? If so, look at the cuDNN header you copied into that directory.
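For example, a small sketch that pulls the version defines out of cudnn.h (the path below is an assumption; point it at wherever your header actually lives):

import re

# Assumed location of the cuDNN header; adjust if your CUDA install differs.
CUDNN_HEADER = "/usr/local/cuda/include/cudnn.h"

with open(CUDNN_HEADER) as f:
    header = f.read()

# cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain ints.
version = {
    key: re.search(r"#define CUDNN_%s\s+(\d+)" % key, header).group(1)
    for key in ("MAJOR", "MINOR", "PATCHLEVEL")
}
print("cuDNN %s.%s.%s" % (version["MAJOR"], version["MINOR"], version["PATCHLEVEL"]))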

Comments:

Unfortunately, the upgrade didn't fix it. which nvcc points to /usr/local/cuda/bin/nvcc, and cudnn.h is in /usr/local/cuda-9.2/include/. Could it be a problem that those two directories aren't the same?

No, the cuDNN header is supposed to live in your CUDA directory, so I don't think that's the issue. Sorry I can't be more help.
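On most installs /usr/local/cuda is just a symlink to the versioned directory, which a couple of lines can confirm (paths assumed from the comment above):

import os

# If both paths resolve to the same directory, nvcc under /usr/local/cuda/bin
# and the header under /usr/local/cuda-9.2/include belong to the same install.
print(os.path.realpath("/usr/local/cuda"))
print(os.path.realpath("/usr/local/cuda-9.2"))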
