Tensorflow takes long time to connect to Nvidia-Drivers
Posted: 2020-01-08 06:42:11
Question: I have installed the Nvidia drivers on my remote server, and
$ nvidia-smi
also returns the list of devices:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 36% 64C P2 78W / 300W | 9813MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |
| 0% 45C P8 8W / 300W | 774MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 333 C python 9039MiB |
| 0 11773 C python 191MiB |
| 0 12111 C python 191MiB |
| 0 12591 C python 191MiB |
| 0 12999 C python 191MiB |
| 1 11773 C python 191MiB |
| 1 12111 C python 191MiB |
| 1 12591 C python 191MiB |
| 1 12999 C python 191MiB |
+-----------------------------------------------------------------------------+
But when I try to run
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
it takes a long time to return the list of devices:
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2019-09-05 15:56:27.568310: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-09-05 15:56:27.596133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-09-05 15:56:27.597408: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55da9a2fe370 executing computations on platform Host. Devices:
2019-09-05 15:56:27.597421: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-09-05 15:56:27.598606: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-05 15:59:27.755714: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.757588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.758287: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55da9c6d9140 executing computations on platform CUDA. Devices:
2019-09-05 15:59:27.758299: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-05 15:59:27.758303: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-05 15:59:27.758508: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.758883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
2019-09-05 15:59:27.758919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.759313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:02:00.0
2019-09-05 15:59:27.759441: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-05 15:59:27.760153: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-05 15:59:27.760803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-05 15:59:27.760941: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-05 15:59:27.761755: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-05 15:59:27.762395: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-05 15:59:27.764370: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-05 15:59:27.764435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.764846: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.765262: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.765638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.766026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-09-05 15:59:27.766046: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-05 15:59:27.767185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-05 15:59:27.767194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2019-09-05 15:59:27.767197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2019-09-05 15:59:27.767200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2019-09-05 15:59:27.767370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.767755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.768175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.768546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 950 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-09-05 15:59:27.768811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.769219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 9703 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality
incarnation: 14451063918691325445
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality
incarnation: 3472161188084064797
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality
incarnation: 1975372846861552523
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality
incarnation: 8192574289833793917
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 996802560
locality
bus_id: 1
links
link
device_id: 1
type: "StreamExecutor"
strength: 1
incarnation: 9560317274260358344
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 10175270093
locality
bus_id: 1
links
link
type: "StreamExecutor"
strength: 1
incarnation: 16877706551631715197
physical_device_desc: "device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5"
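The timestamps in the log above jump from 15:56:27 to 15:59:27 right after libcuda.so.1 is opened, so roughly three minutes pass before the first GPU line appears. The following is a minimal sketch for locating such a pause in a saved log, assuming every relevant line starts with the timestamp format shown above. `largest_gap` and `sample` are illustrative names and not part of TensorFlow.

```python
import re
from datetime import datetime

# TensorFlow log lines above begin with "YYYY-MM-DD HH:MM:SS.ffffff:".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def largest_gap(log_text):
    """Return (seconds, line_before, line_after) for the longest pause between
    consecutive timestamped lines, or None if fewer than two are found."""
    stamped = []
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((when, line))
    best = None
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best


# Three lines copied from the log above.
sample = """\
2019-09-05 15:56:27.597421: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-09-05 15:56:27.598606: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-05 15:59:27.755714: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
"""

gap, before, after = largest_gap(sample)
print(f"{gap:.1f} s pause after: {before[:100]}")
```

The last line logged before the pause points at the step that stalls. Here that is the load of libcuda.so.1, which suggests the time is spent in driver initialization rather than in TensorFlow itself.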
When I try to check the info for use with nvidia-docker:
sudo nvidia-container-cli -k -d /dev/tty info
it gives the following output:
I0905 09:00:43.600569 9993 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out
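The Persistence-M column in the nvidia-smi table above shows Off for both GPUs. Disabled persistence mode is one possible reason for slow driver initialization, though it may not be the cause on this machine. The following is a rough sketch for checking it from Python, assuming that `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader` prints one `index, Enabled` or `index, Disabled` line per GPU. The helper name and the sample text are hypothetical.

```python
import shutil
import subprocess

# Query command; the exact output wording ("Enabled"/"Disabled") is an assumption.
QUERY = ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]


def gpus_without_persistence(csv_text):
    """Return the indices of GPUs whose persistence_mode is not 'Enabled'."""
    disabled = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        if mode != "Enabled":
            disabled.append(int(index))
    return disabled


# Hypothetical sample shaped like the query output for the two GPUs above.
sample = "0, Disabled\n1, Disabled\n"
print(gpus_without_persistence(sample))

# Only query the real driver when nvidia-smi is actually installed.
if shutil.which("nvidia-smi"):
    live = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if live.returncode == 0:
        print(gpus_without_persistence(live.stdout))
```

If the list is not empty, persistence mode can be turned on with `sudo nvidia-smi -pm 1` or by running the nvidia-persistenced daemon. Whether that removes the delay depends on the actual cause.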
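Comments:
The nvidia-persistenced service should be running to keep the NVIDIA driver loaded at all times. For how to enable it, see docs.nvidia.com/deploy/driver-persistence/index.html, or the NVIDIA driver documentation for your operating system.
I have the same problem. Were you able to solve it?
Check whether the device is up, as in whether nvlink is up and working. Dust is usually what causes the problem.
Answer 1:
It seems there was a problem with my nvlink. So if your training starts taking too much time, check whether the link is working properly, using the following command.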
$ nvidia-smi nvlink --status
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-797d7153-ea28-d678-dc38-859b914d6dd7)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-8807c553-7571-582d-c2ee-02993527b0a6)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
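The check above can also be scripted. The following is a minimal sketch that scans the text printed by `nvidia-smi nvlink --status` and reports every link that does not show a speed. It assumes the layout shown above; the `<inactive>` status in the second sample is an assumed example of an unhealthy link, not output from this machine. `inactive_links` is an illustrative helper name.

```python
import re

GPU_LINE = re.compile(r"^GPU (\d+):")
LINK_LINE = re.compile(r"^\s*Link (\d+):\s*(.+?)\s*$")
SPEED = re.compile(r"^[\d.]+ GB/s$")


def inactive_links(status_text):
    """Return (gpu, link, status) for every link that does not report a speed."""
    problems = []
    gpu = None
    for line in status_text.splitlines():
        gpu_match = GPU_LINE.match(line)
        if gpu_match:
            gpu = int(gpu_match.group(1))
            continue
        link_match = LINK_LINE.match(line)
        if link_match and gpu is not None:
            status = link_match.group(2)
            if not SPEED.match(status):
                problems.append((gpu, int(link_match.group(1)), status))
    return problems


# Output copied from the answer above: all four links report a speed.
healthy = """\
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-797d7153-ea28-d678-dc38-859b914d6dd7)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-8807c553-7571-582d-c2ee-02993527b0a6)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
"""

# Hypothetical output with one link down (status wording is an assumption).
broken = healthy.replace("Link 0: 25.781 GB/s", "Link 0: <inactive>", 1)

print(inactive_links(healthy))
print(inactive_links(broken))
```

An empty list means every link reported a speed. Any entry names the GPU and link worth inspecting physically.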
Thanks