Cannot run tensorflow on GPU
【Title】: Cannot run tensorflow on GPU
【Posted】: 2018-04-12 22:39:54
【Question】:
I want to run tensorflow code on my GPU, but it is not working. I have installed CUDA and cuDNN, and I have a compatible GPU.
I took this example from the GPU tutorial on the official Tensorflow website (Tensorflow tutorial for GPU):
# TensorFlow 1.x API (the question uses tensorflow-gpu 1.3.0).
import tensorflow as tf

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
This is my output:
Device mapping: no known devices.
2017-10-31 16:15:40.298845: I tensorflow/core/common_runtime/direct_session.cc:300] Device mapping:
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
2017-10-31 16:15:56.895802: I tensorflow/core/common_runtime/simple_placer.cc:872] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-31 16:15:56.895910: I tensorflow/core/common_runtime/simple_placer.cc:872] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a_1: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-31 16:15:56.895961: I tensorflow/core/common_runtime/simple_placer.cc:872] a_1: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-31 16:15:56.896006: I tensorflow/core/common_runtime/simple_placer.cc:872] a: (Const)/job:localhost/replica:0/task:0/cpu:0
[[ 22. 28.]
[ 49. 64.]]
Nothing is placed on my GPU. I tried to force it onto the GPU manually with:
with tf.device('/gpu:0'):
...
It gives a bunch of errors:
Traceback (most recent call last):
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1297, in _run_fn
self._extend_graph()
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1358, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "/home/abhor/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'MatMul_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
[[Node: MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:0"](a_2, b_1)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'MatMul_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
[[Node: MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:0"](a_2, b_1)]]
Caused by op 'MatMul_1', defined at:
File "<stdin>", line 4, in <module>
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in _mat_mul
transpose_b=transpose_b, name=name)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/abhor/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'MatMul_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
[[Node: MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:0"](a_2, b_1)]]
I can see that some lines say only the CPU is available.
Here are my graphics card details and CUDA version.
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 940MX Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P0 N/A / N/A | 274MiB / 2002MiB | 10% Default |
+-------------------------------+----------------------+----------------------+
Output of nvcc -V:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
I don't know how to check cuDNN, but I installed it the way the official documentation describes, so I assume it should work too.
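One way to check cuDNN is to read the version macros (CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL) from its header file. The sketch below is only an illustration: the header paths in CANDIDATE_HEADERS are assumptions about a typical Linux install, and the function names are made up for this example, so adjust them to wherever cuDNN was actually unpacked.

```python
import re
from pathlib import Path

# Assumed header locations; change these to match your own cuDNN install.
CANDIDATE_HEADERS = [
    "/usr/local/cuda/include/cudnn.h",
    "/usr/local/cuda/include/cudnn_version.h",  # cuDNN 8 and later keep the macros here
    "/usr/include/cudnn.h",
    "/usr/include/cudnn_version.h",
]


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cuDNN header text, or None if a macro is missing."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#\s*define\s+" + key + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def find_cudnn_version(paths=CANDIDATE_HEADERS):
    """Return (path, version) for the first header that declares a version, else None."""
    for path in paths:
        header = Path(path)
        if header.is_file():
            version = parse_cudnn_version(header.read_text(errors="ignore"))
            if version is not None:
                return path, version
    return None


if __name__ == "__main__":
    print(find_cudnn_version())
```

If this prints None, either cuDNN is not installed or its header lives somewhere other than the assumed paths.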
Edit:
Output of pip3 list | grep tensorflow:
tensorflow-gpu (1.3.0)
tensorflow-tensorboard (0.1.8)
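Since the traceback above comes from an Anaconda Python while the package list comes from pip3, it may be worth confirming that both refer to the same interpreter. A minimal sketch, assuming Python 3.8 or newer (importlib.metadata is not in the standard library of Python 3.6); the function name installed_matching is made up for this example:

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onward


def installed_matching(prefix, distributions=None):
    """Return {name: version} for distributions whose name starts with prefix.

    `distributions` is an optional iterable of (name, version) pairs; by default
    the packages visible to the running interpreter are used.
    """
    if distributions is None:
        distributions = (
            (dist.metadata.get("Name"), dist.version)
            for dist in metadata.distributions()
        )
    found = {}
    for name, version in distributions:
        if name and name.lower().startswith(prefix.lower()):
            found[name] = version
    return found


if __name__ == "__main__":
    print(sys.executable)                    # which Python is actually running
    print(installed_matching("tensorflow"))  # tensorflow packages it can see
```

Run it with the same interpreter that produces the error. If both tensorflow and tensorflow-gpu show up, or neither does, the pip3 output may describe a different environment.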
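【Comments】:
Can you add the output of the command pip list | grep tensorflow?
I have added it to the question.
Any update on your error?
It was most likely a compatibility issue caused by old packages. I tried installing on a fresh Ubuntu system and it worked.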
【Answer 1】:
Try this code:
sess = tf.Session(config=tf.ConfigProto(
allow_soft_placement=True, log_device_placement=True))
【Comments】:
Can you explain why it works?
【Answer 2】:
Actually, tensorflow cannot find a CUDA GPU in your case.
Look at the device list in the error output:
Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]
This means no GPU was found. You can use the code from "How to get current available GPUs in tensorflow?" to list the GPUs that tensorflow can actually find.
from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
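You have to make sure this returns the GPU(s) you expect; only then can tensorflow use the GPU device.
There are many possible reasons for a GPU not being found, including but not limited to the CUDA installation/setup, the tensorflow version, and the GPU model, especially its compute capability. Check which CUDA and cuDNN versions your tensorflow release supports, and check the compute capability of your GPU model (for NVidia GPUs).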
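For this question in particular, pip3 reports tensorflow-gpu 1.3.0 while nvcc reports CUDA 9.0. As far as I know, the prebuilt 1.3.0 wheels were built against CUDA 8.0 and cuDNN 6, which would be consistent with tensorflow seeing only the CPU. Here is a small sketch of that check. The TESTED_CONFIGS table is a hand-copied excerpt from TensorFlow's "tested build configurations" list and should be verified against the official page, and check_cuda_match is a made-up helper name.

```python
# Excerpt of tested build configurations for tensorflow-gpu on Linux.
# Assumption: hand-copied from the official install guide; verify before relying on it.
TESTED_CONFIGS = {
    "1.3.0": {"cuda": "8.0", "cudnn": "6"},
    "1.4.0": {"cuda": "8.0", "cudnn": "6"},
    "1.5.0": {"cuda": "9.0", "cudnn": "7"},
}


def check_cuda_match(tf_version, installed_cuda):
    """Compare the installed CUDA release with the one a tensorflow-gpu wheel expects."""
    expected = TESTED_CONFIGS.get(tf_version)
    if expected is None:
        return "unknown: no table entry for tensorflow-gpu {}".format(tf_version)
    if expected["cuda"] == installed_cuda:
        return "ok: tensorflow-gpu {} matches CUDA {}".format(tf_version, installed_cuda)
    return "mismatch: tensorflow-gpu {} expects CUDA {} (cuDNN {}), found CUDA {}".format(
        tf_version, expected["cuda"], expected["cudnn"], installed_cuda
    )


if __name__ == "__main__":
    # Values taken from the question: pip3 shows 1.3.0, nvcc shows release 9.0.
    print(check_cuda_match("1.3.0", "9.0"))
```

If the result is a mismatch, the usual options are to install the CUDA/cuDNN versions that the wheel expects or to move to a tensorflow-gpu release built for the CUDA version already installed.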
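【Comments】:
【Answer 3】:
In general, my advice is to use a conda environment. That way you can create a brand-new environment and try installing tensorflow or any other tool from scratch without reinstalling the whole operating system. As a bonus, you can keep several environments on the same PC.
【Comments】: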