tensorflow 和 torch.cuda 可以找到 GPU，但 Keras 不能

Posted 2023-02-23

技术标签:

【中文标题】tensorflow 和 torch.cuda 可以找到 GPU，但 Keras 不能【英文标题】：tensorflow and torch.cuda can find GPU but Keras only can't 【发布时间】：2019-12-15 19:01:47 【问题描述】：

我尝试使用How do I check if keras is using gpu version of tensorflow? 回答。但我认识到只有 keras 没有看到 GPU。

我重新安装了整个需求，包括 tensorflow-gpu、keras 模块，甚至 CUDA。

我正在使用 Jupyter 远程 ipython。

以下列表是我安装的模块版本

...
keras                     2.2.4
keras-applications        1.0.8
keras-preprocessing       1.1.0
...
tensorflow-gpu            1.14.0
...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

我检查了以下内容：

print(device_lib.list_local_devices())
print()

from keras import backend
print(backend.tensorflow_backend._get_available_gpus())
print()

from torch import cuda
print(cuda.is_available())
print(cuda.device_count())
print(cuda.get_device_name(cuda.current_device()))
print()

结果：

device_type: "CPU"
memory_limit: 268435456
locality 

incarnation: 15355337614284368930
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality 

incarnation: 5758691101165968939
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality 

incarnation: 17050701241022830982
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality 

incarnation: 15949544090620437264
physical_device_desc: "device: XLA_GPU device"
]

[]

True
2
GeForce GTX 1080 Ti

==========添加===========

我也跟着How to tell if tensorflow is using gpu acceleration from inside python shell? 在终端回答。我试过了：

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print (sess.run(c))

结果：

2019-08-08 16:16:57.060679: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-08 16:16:57.075040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:05:00.0
2019-08-08 16:16:57.076003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:0a:00.0
2019-08-08 16:16:57.076256: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-08 16:16:57.078074: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-08 16:16:57.080007: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-08 16:16:57.080436: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-08 16:16:57.083506: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-08 16:16:57.085629: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-08 16:16:57.086483: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/tink/dlgks224/conda/lib:
2019-08-08 16:16:57.086537: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-08-08 16:16:57.087195: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-08 16:16:57.117070: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2198685000 Hz
2019-08-08 16:16:57.119097: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55eab648cdc0 executing computations on platform Host. Devices:
2019-08-08 16:16:57.119231: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-08 16:16:57.119383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-08 16:16:57.119397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
2019-08-08 16:16:57.483390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55eab653adf0 executing computations on platform CUDA. Devices:
2019-08-08 16:16:57.483443: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-08-08 16:16:57.483454: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
Traceback (most recent call last):
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
    self._extend_graph()
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: node MatMulwas explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
     [[MatMul]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/tink/dlgks224/conda/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: node MatMul (defined at <stdin>:4) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
     [[MatMul]]

Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
 a (defined at <stdin>:2)   
 b (defined at <stdin>:3)

【问题讨论】：

Keras 没有特定的 GPU 支持，它是通过 tensorflow 发生的，所以这一切都没有意义。你在训练时看到消息中的 GPU 了吗？ @MatiasValdenegro 是的，我刚刚通过 nvidia-smi 确认进程正在运行（我无法确保它是否正常运行）。 【参考方案1】：

解决了！

这是一个非常愚蠢的问题。

错误一直告诉我它是什么。

我再次检查了libcudnn.so.7，它安装在错误的位置。

遇到类似错误时请验证！

2019-08-08 16:16:57.086483: I tensorflow/stream_executor/platform/default/dso_loader.cc:53]
Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7:
cannot open shared object file: No such file or directory;
LD_LIBRARY_PATH: usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/tink/dlgks224/conda/lib:

【讨论】：

以上是关于tensorflow 和 torch.cuda 可以找到 GPU，但 Keras 不能的主要内容，如果未能解决你的问题，请参考以下文章