Tensorflow trains on CPU instead of RTX 3000 series GPU

【Posted】: 2021-03-11 08:08:20
【Question】:

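I am trying to train my tensorflow model on my RTX 3070 GPU. I am using an anaconda virtual environment, and the prompt shows that the GPU was detected successfully with no errors or warnings, but whenever the model starts training it uses the CPU instead.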

My Anaconda prompt:

2020-11-28 19:38:17.373117: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-28 19:38:17.378626: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-28 19:38:17.378679: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-28 19:38:17.381802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-11-28 19:38:17.382739: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-11-28 19:38:17.389401: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-11-28 19:38:17.391830: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-11-28 19:38:17.392332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-28 19:38:17.392422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1866] Adding visible gpu devices: 0
2020-11-28 19:38:26.072912: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-28 19:38:26.073904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1724] Found device 0 with properties:
pciBusID: 0000:08:00.0 name: GeForce RTX 3070 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-11-28 19:38:26.073984: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-28 19:38:26.074267: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-28 19:38:26.074535: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-28 19:38:26.074775: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-11-28 19:38:26.075026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-11-28 19:38:26.075275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-11-28 19:38:26.075646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-11-28 19:38:26.075871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-28 19:38:26.076139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1866] Adding visible gpu devices: 0
2020-11-28 19:38:26.738596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1265] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-28 19:38:26.738680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1271]      0
2020-11-28 19:38:26.739375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1284] 0:   N
2020-11-28 19:38:26.740149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1410] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6589 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:08:00.0, compute capability: 8.6)
2020-11-28 19:38:26.741055: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-11-28 19:38:28.028828: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:126] None of the MLIR optimization passes are enabled (registered 2)
2020-11-28 19:38:32.428408: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-28 19:38:33.305827: I tensorflow/stream_executor/cuda/cuda_dnn.cc:344] Loaded cuDNN version 8004
2020-11-28 19:38:33.753275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-28 19:38:34.603341: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-28 19:38:34.610934: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

My model code:

from tensorflow import keras
from tensorflow.keras import layers

# max_features, x_train, y_train, x_val and y_val are defined elsewhere (not shown in the question).
inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(max_features, 128)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

I am using:

tensorflow nightly gpu 2.5.0.dev20201111 (installed in an anaconda virtual environment)
CUDA 11.1 (cuda_11.1.1_456.81)
CUDNN v8.0.4.30 (for CUDA 11.1)
python 3.8

I know my GPU is not being used because its utilization is at 1% while my CPU is at 60%, and the top CPU process is python.
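The question does not say which tool reported the 1% figure. One way to cross-check it, assuming the NVIDIA driver's `nvidia-smi` command is on the PATH, is to query utilization and memory use while `model.fit` is running. The sketch below is only illustrative: `parse_gpu_stats` and `query_gpu_stats` are hypothetical helper names, not part of TensorFlow or of the original post.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_stats(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'utilization, memory' CSV rows into (percent, MiB) tuples, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        # Each row looks like "1, 512" when noheader,nounits is requested.
        # Fields such as "[N/A]" are not handled in this sketch.
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def query_gpu_stats() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for current GPU utilization (%) and used memory (MiB)."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_stats(output)
```

Calling `query_gpu_stats()` from a second prompt during training gives one tuple per GPU. If used memory is high but utilization stays low, the model is likely placed on the GPU and the bottleneck may lie elsewhere.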

Can anyone help me get the model to train on the GPU?

【Comments】:

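Try restarting. I once had a similar problem where tensorflow did not initialize properly after I uninstalled regular tf and installed tf-gpu ***.com/questions/44829085/…

@ShivamMiglani I have already tried restarting. It didn't fix anything, but thanks for the suggestion.

Have you checked this? ***.com/a/52905362/7363404

【Answer 1】: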

Most likely you are using the CPU build of Tensorflow rather than the GPU build. Run "pip uninstall tensorflow" and "pip install tensorflow-gpu" to install the one suited to GPU use.
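Whether the installed package is a CPU-only build can be checked from Python rather than guessed. The following is a minimal sketch using `tf.test.is_built_with_cuda()` and `tf.sysconfig.get_build_info()`. The helper names `describe_build` and `check_build` are made up for illustration, and the `cuda_version` / `cudnn_version` keys are assumed to be present, which may not hold for every build, so they are read with a fallback.

```python
def describe_build(is_cuda_build: bool, build_info: dict) -> str:
    """Summarize whether a TensorFlow build targets CUDA, given its build info."""
    if not is_cuda_build:
        return "CPU-only build"
    # These keys may be missing on some builds, hence the fallback value.
    cuda = build_info.get("cuda_version", "unknown")
    cudnn = build_info.get("cudnn_version", "unknown")
    return f"CUDA build (CUDA {cuda}, cuDNN {cudnn})"


def check_build() -> str:
    """Inspect the TensorFlow installation in the current environment."""
    # Imported lazily so describe_build() can be used without TensorFlow installed.
    import tensorflow as tf

    return describe_build(
        tf.test.is_built_with_cuda(),
        dict(tf.sysconfig.get_build_info()),
    )
```

Running `print(check_build())` inside the anaconda environment shows which kind of build is active there. A CUDA build would rule out the CPU-package explanation.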

【Comments】:

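My bad, I didn't specify that I am using tf-nightly-gpu.

No problem. I have these 2 suggestions for you:
1) Check that you have loaded CUDA into your environment.
2) After importing TF, add the following line and print the variable "gpus" to check whether the device can be found from code:
    gpus = tf.config.experimental.list_physical_devices('GPU')
Or add this to your code to print the list of available devices:
    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())

I am sure CUDA is installed correctly, because there are no CUDA-related errors in my anaconda prompt. The gpu can be found from code, both when I print the gpus variable and when I try tf.config.list_physical_devices('GPU').

OK. Did you install TF yourself? From what you pasted in the question, TF seems to be able to see your device, but one warning caught my attention: "This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags."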
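Listing devices, as suggested in the comments above, only shows that TensorFlow can see the GPU. It does not show where operations actually run. A minimal sketch of checking placement directly, assuming TF 2.x eager execution, uses `tf.debugging.set_log_device_placement` and the `.device` attribute of a result tensor. The helper names `on_gpu` and `report_placement` are hypothetical.

```python
def on_gpu(device_name: str) -> bool:
    """Return True if a TensorFlow device string such as '.../device:GPU:0' names a GPU."""
    return "/device:gpu:" in device_name.lower()


def report_placement() -> None:
    """Print the visible GPUs and the device a small matmul was executed on."""
    # Imported lazily so on_gpu() can be used without TensorFlow installed.
    import tensorflow as tf

    # Log the device assigned to each op; call this at program start.
    tf.debugging.set_log_device_placement(True)

    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

    a = tf.random.normal((1024, 1024))
    b = tf.matmul(a, a)
    print("matmul ran on:", b.device, "| on GPU:", on_gpu(b.device))
```

Calling `report_placement()` at the top of the training script, before the model is built, prints one placement line per op. If the matmul lands on `GPU:0`, the model's ops are probably being placed there as well, and the low utilization would need a different explanation.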
