tensorflow does not use gpu, but cuda does

Posted 2023-04-15

Published: 2017-07-26 07:47:39

Question:

tensorflow does not see my GPU. I am using an Optimus setup.

nvidia-smi shows my card:

[user@system bal]$ optirun nvidia-smi 
Mon Mar  6 13:24:05 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13                 Driver Version: 378.13                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K1100M       Off  | 0000:01:00.0     Off |                  N/A |
| N/A   40C    P0    N/A /  N/A |      7MiB /  1999MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1847    G   /usr/lib/xorg-server/Xorg                        7MiB |
+-----------------------------------------------------------------------------+

CUDA sees the GPU. This is the deviceQuery output:

[user@system release]$ optirun ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro K1100M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1999 MBytes (2096300032 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            706 MHz (0.71 GHz)
  Memory Clock rate:                             1400 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K1100M
Result = PASS

But tensorflow does not use the GPU:

import tensorflow as tf

# Creates a graph.
#with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

The output seems to indicate that only the CPU is used:

[user@system bal]$ optirun python ex.py
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
[[ 22.  28.]
 [ 49.  64.]]

So what can I do to make tensorflow see my GPU? I am using Arch Linux, and I assume I have the latest versions of everything. Is there anything I can check?

Comments:

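It is always the same. [user@system bal]$ TF_CPP_MIN_LOG_LEVEL=0 python ex.py ... I tensorflow/core/common_runtime/gpu/gpu_device.cc:948] Ignoring visible gpu device (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.

Great. You can build tf yourself to use 3.0.

I am currently trying that, but building tensorflow takes about 45 minutes on my system. I hope I gave it the right build parameters.

How many streaming multiprocessors does the card have? TensorFlow has a restriction that excludes GPUs with too few streaming multiprocessors, because those GPUs would run slower than the CPU. There is an environment variable to configure this behavior: you can try setting TF_MIN_GPU_MULTIPROCESSOR_COUNT to a lower value to make sure your GPU is included.

This post suggests "disabling the integrated Intel graphics in the BIOS": devtalk.nvidia.com/default/topic/977952/…

Answer 1: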

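The official minimum CUDA compute capability is 3.5. Your card has 3.0. Reportedly some people have managed to compile tensorflow for compute capability 3.0, but that requires applying unofficial patches to TF. See also: "The minimum required Cuda capability is 3.5".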
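
The comparison this answer makes can be sketched in a few lines of Python. This is only an illustration: the 3.5 minimum is the figure quoted in the answer, the sample text is copied from the deviceQuery output in the question, and the names `parse_capability` and `meets_minimum` are hypothetical helpers, not part of tensorflow or CUDA.

```python
import re

# Minimum compute capability quoted in the answer (an assumption for other builds).
MIN_CAPABILITY = (3, 5)


def parse_capability(device_query_output):
    """Return (major, minor) read from deviceQuery text, or None if absent."""
    match = re.search(
        r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)",
        device_query_output,
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Compare (major, minor) tuples rather than floats."""
    return capability >= minimum


# Two lines copied from the deviceQuery output in the question.
sample = (
    'Device 0: "Quadro K1100M"\n'
    "  CUDA Capability Major/Minor version number:    3.0\n"
)

capability = parse_capability(sample)
print(capability, meets_minimum(capability))  # prints: (3, 0) False
```

Comparing tuples instead of floats avoids treating a hypothetical 3.10 as smaller than 3.5.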

Discussion:
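
For anyone who wants to repeat the check from the comments, here is a rough sketch of how the rerun could be prepared from Python. The environment variable names TF_CPP_MIN_LOG_LEVEL and TF_MIN_GPU_MULTIPROCESSOR_COUNT come from the comments, and `optirun python ex.py` is the command used in the question. The helper names `build_rerun` and `skipped_gpu_lines` are hypothetical, and the substring searched for assumes the English wording of the log message quoted in the comments.

```python
import os


def build_rerun(script="ex.py", min_multiprocessors=None, base_env=None):
    """Build the command and environment for a verbose rerun of the script."""
    env = dict(os.environ if base_env is None else base_env)
    # Log level 0 makes tensorflow print its informational messages.
    env["TF_CPP_MIN_LOG_LEVEL"] = "0"
    if min_multiprocessors is not None:
        # Variable suggested in the comments for GPUs with few multiprocessors.
        env["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = str(min_multiprocessors)
    # optirun is specific to the Optimus setup described in the question.
    command = ["optirun", "python", script]
    return command, env


def skipped_gpu_lines(log_text):
    """Return the log lines in which tensorflow reports ignoring a GPU."""
    marker = "Ignoring visible gpu device"  # assumed wording, see lead-in
    return [line for line in log_text.splitlines() if marker in line]


# Possible usage (not executed here):
#   import subprocess
#   command, env = build_rerun(min_multiprocessors=2)
#   result = subprocess.run(command, env=env, stderr=subprocess.PIPE,
#                           universal_newlines=True)
#   print(skipped_gpu_lines(result.stderr))
```

The log line quoted in the comments points at compute capability rather than multiprocessor count, so lowering TF_MIN_GPU_MULTIPROCESSOR_COUNT alone may not be enough for this card.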
