TensorFlow takes a long time to connect to the Nvidia drivers

Posted 2023-04-15

Asked: 2020-01-08 06:42:11

Question:

I have installed the Nvidia drivers on my remote server, and

$ nvidia-smi also returns the list of devices:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 36%   64C    P2    78W / 300W |   9813MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   45C    P8     8W / 300W |    774MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       333      C   python                                      9039MiB |
|    0     11773      C   python                                       191MiB |
|    0     12111      C   python                                       191MiB |
|    0     12591      C   python                                       191MiB |
|    0     12999      C   python                                       191MiB |
|    1     11773      C   python                                       191MiB |
|    1     12111      C   python                                       191MiB |
|    1     12591      C   python                                       191MiB |
|    1     12999      C   python                                       191MiB |
+-----------------------------------------------------------------------------+

But when I try to run

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()

it takes a very long time to return the device list:

Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2019-09-05 15:56:27.568310: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-09-05 15:56:27.596133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-09-05 15:56:27.597408: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55da9a2fe370 executing computations on platform Host. Devices:
2019-09-05 15:56:27.597421: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-09-05 15:56:27.598606: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-05 15:59:27.755714: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.757588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.758287: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55da9c6d9140 executing computations on platform CUDA. Devices:
2019-09-05 15:59:27.758299: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-05 15:59:27.758303: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-05 15:59:27.758508: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.758883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
2019-09-05 15:59:27.758919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.759313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:02:00.0
2019-09-05 15:59:27.759441: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-05 15:59:27.760153: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-05 15:59:27.760803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-05 15:59:27.760941: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-05 15:59:27.761755: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-05 15:59:27.762395: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-05 15:59:27.764370: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-05 15:59:27.764435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.764846: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.765262: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.765638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.766026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-09-05 15:59:27.766046: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-05 15:59:27.767185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-05 15:59:27.767194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1
2019-09-05 15:59:27.767197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y
2019-09-05 15:59:27.767200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N
2019-09-05 15:59:27.767370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.767755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.768175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.768546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 950 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-09-05 15:59:27.768811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.769219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 9703 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 14451063918691325445
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 3472161188084064797
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1975372846861552523
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 8192574289833793917
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 996802560
locality {
  bus_id: 1
  links {
    link {
      device_id: 1
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 9560317274260358344
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 10175270093
locality {
  bus_id: 1
  links {
    link {
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 16877706551631715197
physical_device_desc: "device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5"
]
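
In the log above, the timestamps jump from 15:56:27 to 15:59:27 immediately after libcuda.so.1 is opened, so nearly all of the delay falls in a single step. For longer logs, a small script can locate such a pause automatically. The following is only a sketch: the function name largest_gap and the regular expression are illustrative choices, and it assumes every log line of interest starts with a timestamp in the format shown above.

```python
import re
from datetime import datetime

# TensorFlow log lines start with a timestamp such as "2019-09-05 15:56:27.598606:".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def largest_gap(log_text):
    """Return (seconds, line_before, line_after) for the longest pause, or None."""
    previous = None  # (timestamp, line) of the last timestamped line seen
    best = None
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if not match:
            continue  # continuation lines such as "pciBusID: ..." carry no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best


# Two consecutive lines copied from the log in the question.
sample = """\
2019-09-05 15:56:27.598606: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-05 15:59:27.755714: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
"""

gap, before, after = largest_gap(sample)
print(f"{gap:.1f} s pause after: {before[:90]}")
```

On these two lines the script reports a pause of roughly 180 seconds following the line that opens libcuda.so.1, which suggests the time is spent initializing the driver rather than inside TensorFlow itself.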

When I try to check the info for use with nvidia-docker:

sudo nvidia-container-cli -k -d /dev/tty info

it gives the following output:

I0905 09:00:43.600569 9993 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out
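
The nvidia-smi table in the question shows Persistence-M as Off for both GPUs, and persistence is also raised in the comments, so it may be worth ruling out as a cause of slow driver initialization. The sketch below reads the persistence mode of each GPU through nvidia-smi's CSV query interface. The helper names are hypothetical, and the parser assumes the output has one "index, Enabled" or "index, Disabled" line per GPU; the sample string is made up to match that assumed format.

```python
import subprocess


def parse_persistence(csv_text):
    """Map GPU index -> True if persistence mode is reported as Enabled."""
    modes = {}
    for line in csv_text.strip().splitlines():
        # Assumed line shape: "<index>, <Enabled|Disabled>"
        index, mode = [field.strip() for field in line.split(",")]
        modes[int(index)] = (mode == "Enabled")
    return modes


def query_persistence():
    """Ask nvidia-smi for each GPU's persistence mode (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_persistence(result.stdout)


# Hypothetical sample in the assumed format, matching the "Off" shown in the table above.
print(parse_persistence("0, Disabled\n1, Disabled\n"))
```

On a machine with the driver installed, calling query_persistence() would run the real query. If every GPU reports False, enabling the nvidia-persistenced service is one thing to try, though this sketch cannot confirm that it is the cause of the delay.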

Comments:

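The nvidia-persistenced service should be running to keep the NVIDIA driver loaded at all times. For how to enable it, see docs.nvidia.com/deploy/driver-persistence/index.html, or the NVIDIA driver documentation for your operating system.

I have the same problem. Were you able to solve it?

Check whether the devices are up, as the nvlink is up and working fine. Mostly dust was causing the problem.

Answer 1: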

It seems there was a problem with my NVLink.

So if your training starts taking too much time,

please check whether the links are up.

The command is as follows.

$ nvidia-smi nvlink --status

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-797d7153-ea28-d678-dc38-859b914d6dd7)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-8807c553-7571-582d-c2ee-02993527b0a6)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s

Thanks.
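
The link check in this answer can also be scripted. The following is a minimal sketch, assuming the output keeps the layout shown above (a "GPU n:" line followed by indented "Link n:" lines) and that a healthy link reports a speed in GB/s. The function name inactive_links is hypothetical, and anything that is not a speed is simply flagged rather than interpreted.

```python
import re

GPU_LINE = re.compile(r"^GPU (\d+):")
LINK_LINE = re.compile(r"^\s*Link (\d+):\s*(.+)$")
SPEED = re.compile(r"^[\d.]+ GB/s$")


def inactive_links(status_text):
    """Return [(gpu, link, reported_value)] for links that do not report a speed."""
    problems = []
    gpu = None  # index of the GPU whose links are currently being read
    for line in status_text.splitlines():
        gpu_match = GPU_LINE.match(line)
        if gpu_match:
            gpu = int(gpu_match.group(1))
            continue
        link_match = LINK_LINE.match(line)
        if link_match and gpu is not None:
            value = link_match.group(2).strip()
            if not SPEED.match(value):
                problems.append((gpu, int(link_match.group(1)), value))
    return problems


# Output copied from the answer above; both links on both GPUs report a speed.
healthy = """\
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-797d7153-ea28-d678-dc38-859b914d6dd7)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-8807c553-7571-582d-c2ee-02993527b0a6)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s
"""

print(inactive_links(healthy))
```

An empty list means every link reported a speed. To check a live system, the text could be obtained by running nvidia-smi nvlink --status through subprocess.run and passing its stdout to inactive_links.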
