tensorflow: Fail to find dnn implementation
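【Title】: tensorflow: Fail to find dnn implementation 【Posted】: 2019-10-13 11:04:47 【Question】: I am trying to run my code, which uses the Keras CuDNNGRU layer, on TensorFlow with a GPU, but it always fails with the error "Fail to find the dnn implementation", even though I have installed CUDA and cuDNN.
I have reinstalled CUDA and cuDNN several times and upgraded cuDNN from 7.2.1 to 7.5.0, but that did not fix anything. I also tried running my code both in a Jupyter Notebook and with the Python interpreter (in a terminal), and the result was the same. Here are the details of my hardware and software.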
Tesla V100 PCIE 16GB
Ubuntu 18.04
NVIDIA-SMI 384.183
CUDA 9.0
CuDNN 7.5.0
Miniconda 3
Python 3.6
TensorFlow 1.12
Keras 2.1.6
Here is my code.
encoder_LSTM = tf.keras.layers.CuDNNGRU(hidden_unit,return_sequences=True,return_state=True)
encoder_LSTM_rev=tf.keras.layers.CuDNNGRU(hidden_unit,return_state=True,return_sequences=True,go_backwards=True)
encoder_outputs, state_h = encoder_LSTM(x)
encoder_outputsR, state_hR = encoder_LSTM_rev(x)
Here is the error message.
2019-05-27 19:08:06.814896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-27 19:08:06.814956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-27 19:08:06.814971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-27 19:08:06.814978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-27 19:08:06.815279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14678 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0)
2019-05-27 19:08:08.050226: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-05-27 19:08:08.050350: E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
2019-05-27 19:08:08.050378: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: Fail to find the dnn implementation.
2019-05-27 19:08:08.050483: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-05-27 19:08:08.050523: E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
2019-05-27 19:08:08.050541: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[node cu_dnngru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
[[node mean_squared_error/value/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ta_skenario1.py", line 271, in <module>
losss, op = sess.run([loss, optimizer], feed_dict={x:data,y_label:label,initial_input:begin_sentence})
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[node cu_dnngru/CudnnRNN (defined at ta_skenario1.py:205) = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
[[node mean_squared_error/value/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'cu_dnngru/CudnnRNN', defined at:
File "ta_skenario1.py", line 205, in <module>
encoder_outputs, state_h = encoder_LSTM(x)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 619, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/cudnn_recurrent.py", line 109, in call
output, states = self._process_batch(inputs, initial_state)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/cudnn_recurrent.py", line 299, in _process_batch
rnn_mode='gru')
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 116, in cudnn_rnn
is_training=is_training, name=name)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Fail to find the dnn implementation.
[[node cu_dnngru/CudnnRNN (defined at ta_skenario1.py:205) = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
[[node mean_squared_error/value/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Any ideas? Thanks.
Update: I tried downgrading cuDNN from 7.5.0 to 7.1.4, but the result stayed the same.
【Question comments】:
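The log above only says the driver is "possibly" insufficient, so it can help to check the reported version against the minimum required by the installed CUDA toolkit. The following is a minimal sketch, not a definitive diagnostic. The minimum of 384.81 for CUDA 9.0 on Linux is an assumption taken from NVIDIA's CUDA release notes and should be verified for your exact toolkit; `driver_version_from_log` is a hypothetical helper name.

```python
import re

# Assumption: minimum Linux driver for CUDA 9.0 as listed in NVIDIA's
# CUDA release notes; verify it against the notes for your toolkit.
MIN_DRIVER_FOR_CUDA_9_0 = (384, 81)


def driver_version_from_log(line):
    """Extract (major, minor) from TensorFlow's 'insufficient driver' log line."""
    match = re.search(r"driver version: (\d+)\.(\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log_line = ("E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] "
            "Possibly insufficient driver version: 384.183.0")
version = driver_version_from_log(log_line)

# Tuples compare element by element, so (384, 183) >= (384, 81) holds.
meets_minimum = version is not None and version >= MIN_DRIVER_FOR_CUDA_9_0
print(version, meets_minimum)
```

If the check passes, the driver is less likely to be the cause, and other explanations, such as GPU memory already being claimed by another process, are worth investigating.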
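【Answer 1】: Configuring the GPU to allow memory growth worked for me with TF 2.0. I found this solution in another question a few months ago, when I ran into the problem while running TF 2.0. I do not remember where.
Adding the following may help.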
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
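The question uses TensorFlow 1.12, where the same option is available directly on the top-level `tf` module rather than through `tensorflow.compat.v1`. A minimal sketch, assuming a TF 1.x installation; `make_growth_session` is a hypothetical helper name, not part of TensorFlow.

```python
def make_growth_session():
    """Create a TF 1.x session that allocates GPU memory on demand."""
    # Imported lazily so the helper can be defined without loading TensorFlow.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Grow the GPU allocation as needed instead of reserving nearly all
    # of the device memory at start-up.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
```

The returned session would then replace a plain `tf.Session()` in the training script, for example `sess = make_growth_session()` before the first `sess.run(...)` call.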
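【Comments】:
【Answer 2】: Not sure whether this helps, but in my case the problem was caused by using multiple Jupyter notebook files.
I was writing simple code for a neural network and decided to split it into two notebooks, one for training and one for prediction (I provide my model saved in a file in case you do not have the resources or time to train the network).
If I ran the two notebooks "together", that is, training first and then prediction without shutting down the kernel of the first one, I got this error.
Shutting down the kernel of the first Jupyter notebook before using the second one solved my problem.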
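A likely explanation is that the first kernel still holds GPU memory, since TensorFlow by default reserves nearly all of it. One way to see which processes currently hold GPU memory is to query `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the PATH; `parse_compute_apps` and `gpu_processes` are hypothetical helper names.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into tuples."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # Skip blank or malformed rows rather than failing on them.
    return [(int(r[0]), r[1], r[2]) for r in rows if len(r) == 3]


def gpu_processes():
    """Return the processes that currently hold GPU memory."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_compute_apps(result.stdout)
```

If `gpu_processes()` lists a leftover notebook kernel, shutting that kernel down before starting the next job should release its memory.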
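【Comments】:
For me it was a similar issue. I was running a notebook, then tried to run a Python script and got this error. After shutting down the notebook kernel, the script worked as expected.
【Answer 3】: This worked for me in TensorFlow 2, as suggested here: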
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
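On a machine with more than one GPU, TensorFlow expects the memory-growth setting to be the same on every visible GPU, so enabling it on `physical_devices[0]` alone can raise an error. A minimal sketch that applies the setting to every device, assuming TensorFlow 2.1 or later; `enable_memory_growth_on_all_gpus` is a hypothetical helper name.

```python
def enable_memory_growth_on_all_gpus():
    """Enable memory growth on every visible GPU and return their names."""
    # Imported lazily so the helper can be defined without loading TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    for gpu in gpus:
        # Must run before any GPU has been initialized by TensorFlow.
        tf.config.experimental.set_memory_growth(gpu, True)
    return [gpu.name for gpu in gpus]
```

The helper should be called at the very top of the program, before any tensors or models are created.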
【Comments】:
.set_memory_growth() gave me an error with two GPUs, so I used .set_visible_devices(physical_devices[0], device_type='GPU') instead, which worked well for me.
【Answer 4】:
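Have you tested your installation (CUDA, cuDNN, tensorflow-gpu)?
Testing CUDA: first check that
$ nvcc -V
shows the correct version of your CUDA toolkit. You can then test it with the following procedure.
First (this takes a few minutes):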
$ cd ~/NVIDIA_CUDA-9.0_Samples
$ make
Then:
$ cd ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release
$ ./deviceQuery
If the output ends with "Result = PASS", you are fine!
Testing cuDNN:
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
$ make clean && make
$ ./mnistCUDNN
The result should be: 'Test passed!'
Testing tensorflow-gpu:
If CUDA and cuDNN are working, you can test your TensorFlow installation with:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
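The full device list also contains the CPU, so it can be easier to read if only the GPU entries are kept. A minimal sketch; `gpu_device_names` and `local_gpu_names` are hypothetical helper names, and the filter relies on the `name` and `device_type` fields of the entries returned by `list_local_devices()`.

```python
def gpu_device_names(devices):
    """Keep only the names of entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_gpu_names():
    """List the GPUs that TensorFlow can see on this machine."""
    # Imported lazily so the helper can be defined without loading TensorFlow.
    from tensorflow.python.client import device_lib

    return gpu_device_names(device_lib.list_local_devices())
```

An empty list from `local_gpu_names()` would suggest that TensorFlow is running on the CPU only, for example because the CPU-only package is installed.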
I recommend installing TensorFlow in a conda environment:
conda create --name tf_gpu tensorflow-gpu
For me (after a lot of trouble) this worked well.
Sources: gpu installation for Ubuntu 18.04, tensorflow-gpu installation
【Comments】:
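I tried all your suggestions and every test passed, but the error remained. So I tried installing tensorflow-gpu outside conda, and now it works. Thanks for your answer.
【Answer 5】: For anyone hitting this with TF 2.0 and CUDA 10.0 using cuDNN 7: you may be seeing it because you accidentally upgraded cuDNN from 7.6.2 to >7.6.5. Although the TF documentation states that any version >=7.4.1 works, that is not the case! Downgrade cuDNN as follows: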
sudo apt-get install --no-install-recommends \
cuda-10-0 \
libcudnn7=7.6.2.24-1+cuda10.0 \
libcudnn7-dev=7.6.2.24-1+cuda10.0
To prevent this in the future, you can stop Ubuntu/Debian from updating cuDNN by putting the packages on hold:
sudo apt-mark hold libcudnn7 libcudnn7-dev
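To confirm which cuDNN version is actually installed after the downgrade, the version macros in the cuDNN 7 header can be read directly. A minimal sketch; the header path `/usr/include/cudnn.h` is an assumption that depends on how cuDNN was installed, and `parse_cudnn_version` and `installed_cudnn_version` are hypothetical helper names.

```python
import re


def parse_cudnn_version(header_text):
    """Read CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL from header text."""
    parts = dict(re.findall(
        r"#define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", header_text))
    if len(parts) != 3:
        return None  # The header does not define all three macros.
    return tuple(int(parts[key]) for key in ("MAJOR", "MINOR", "PATCHLEVEL"))


def installed_cudnn_version(header_path="/usr/include/cudnn.h"):
    """Assumed default path; adjust it to your installation."""
    with open(header_path) as header:
        return parse_cudnn_version(header.read())
```

After the downgrade shown above, `installed_cudnn_version()` would be expected to report major version 7, minor version 6 and patch level 2, provided the header path matches your system.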
【Comments】: