Running TensorFlow/Keras Using GPU with CUDA, cuDNN, Anaconda, RTX 3060 Ti

Posted 2023-03-16

[Published]: 2021-03-28 11:00:22

[Question]:

This is my first time trying to train a neural network on my new RTX 3060 Ti, and I have hit a tricky error. Here is the error message:

2020-12-17 12:45:09.600373: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "marge.py", line 365, in <module>
    MARGE(*sys.argv[1:])
  File "marge.py", line 357, in MARGE
    filters, filt2um)
  File "lib\NN.py", line 712, in driver
    nn.train(train_batches, valid_batches, epochs, patience)
  File "lib\NN.py", line 335, in train
    model_checkpoint])
  File "C:\Users\Nick\anaconda3\envs\marge\lib\site-packages\keras\engine\training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "C:\Users\Nick\anaconda3\envs\marge\lib\site-packages\keras\engine\training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "C:\Users\Nick\anaconda3\envs\marge\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\Nick\anaconda3\envs\marge\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\Users\Nick\anaconda3\envs\marge\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(256, 7), b.shape=(7, 4096), m=256, n=4096, k=7
         [[node dense_1/MatMul]]
         [[loss/mul/_125]]
  (1) Internal: Blas GEMM launch failed : a.shape=(256, 7), b.shape=(7, 4096), m=256, n=4096, k=7
         [[node dense_1/MatMul]]
0 successful operations.
0 derived errors ignored.

I am working in a Python 3.7.2 Anaconda environment on Windows 10. Here are (what I believe to be) the relevant packages installed in this environment:

cudnn                     7.6.5
keras                     2.2.4
keras-applications        1.0.8
keras-base                2.2.4
keras-preprocessing       1.1.0
tensorflow                1.14.0
tensorflow-base           1.14.0
tensorflow-estimator      1.14.0
tensorflow-gpu            1.14.0

Things I have tried:

Installing NVIDIA's build of TensorFlow, but the commands pip install --user nvidia-pyindex and pip install --user nvidia-tensorflow[horovod] resulted in errors.

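Adding the following to the top of my code:

import tensorflow as tf

# TF 1.x API: let the GPU allocator grow on demand instead of reserving all memory up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)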

This did not work either ("failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED"), although my GPU does appear to be recognized:

2020-12-17 12:27:24.445007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 3060 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:06:00.0
2020-12-17 12:27:24.445413: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-12-17 12:27:24.445639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-12-17 12:27:24.445772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-17 12:27:24.445881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-12-17 12:27:24.445971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-12-17 12:27:24.446221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6712 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Ti, pci bus id: 0000:06:00.0, compute capability: 8.6)

Is it possible that the RTX 3060 Ti is not yet supported for this use?

Please let me know if I can provide any other information. Thanks in advance for your help!

EDIT: I have also tried the suggestions in these tutorials (in hindsight, installing CUDA and cuDNN seems important). I also ran the commands:

tf.test.is_built_with_cuda()
tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)

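Both return "True". However, I still get the "failed to run cuBLAS routine" error. I also noticed that the 3060 Ti does not appear on NVIDIA's list of CUDA-compatible GPUs, so maybe I am just out of luck...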
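The device log above already reports the card's compute capability, which is the detail that determines which CUDA toolkits can target it. As a minimal sketch (the helper name `parse_compute_capability` and the small lookup table are illustrative assumptions, not part of TensorFlow), the value can be pulled out of the log line and compared against the first CUDA release that added that architecture:

```python
import re

# First CUDA toolkit release that added each architecture, per NVIDIA's
# release notes. Only the entries relevant to this thread are listed.
FIRST_CUDA_FOR_CC = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
}


def parse_compute_capability(log_line):
    """Return (major, minor) from a TensorFlow device log line, or None."""
    match = re.search(r"compute capability: (\d+)\.(\d+)", log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The "Created TensorFlow device" line from the log in the question.
line = (
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
    "with 6712 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Ti, "
    "pci bus id: 0000:06:00.0, compute capability: 8.6)"
)
cc = parse_compute_capability(line)
print(cc, FIRST_CUDA_FOR_CC.get(cc))  # (8, 6) (11, 1)
```

Read this way, the card is CUDA-capable, but a TensorFlow build compiled against a CUDA 10.x toolkit predates compute capability 8.6, which may explain why the device is detected and the cuBLAS call still fails.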

[Answer 1]:

I also had some trouble when I tried to run on my GPU, just like you. I don't remember the details well, so I'm afraid I may not be of much help.

I see that you are using a TensorFlow version below 2. You must install the CUDA and cuDNN versions that match the TensorFlow 1.14 you are using.

Here are the links to everything you should try installing:

GPU driver: https://www.nvidia.com/Download/driverResults.aspx/167753/en-us
Correct CUDA version: https://developer.nvidia.com/cuda-10.0-download-archive
Correct cuDNN version (requires creating an account; download version 7.4): https://developer.nvidia.com/cudnn
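The pairing of a TensorFlow release with one CUDA/cuDNN combination can be written down as a small lookup. The sketch below is only illustrative: `TESTED_BUILDS` and `required_toolkit` are hypothetical names, and the table holds just the two releases discussed in this thread, with values taken from TensorFlow's published list of tested GPU build configurations.

```python
# Tested GPU build configurations for the two TensorFlow releases
# mentioned in this thread (values from TensorFlow's install docs).
TESTED_BUILDS = {
    "1.14": {"cuda": "10.0", "cudnn": "7.4"},
    "2.4": {"cuda": "11.0", "cudnn": "8.0"},
}


def required_toolkit(tf_version):
    """Look up the CUDA/cuDNN pair for a version string such as '1.14.0'.

    Returns None for releases that are not in the table.
    """
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_BUILDS.get(major_minor)


print(required_toolkit("1.14.0"))  # {'cuda': '10.0', 'cudnn': '7.4'}
```

Keying on the major.minor prefix means patch releases such as 1.14.0 resolve to the same row; any release missing from the table returns None rather than a guess.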

I hope this works. If it does, let me know. Good luck!

[Comments]:

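Thanks for your reply. I am starting to think that my problem may be related to the versions of the software I have installed. I read elsewhere (reddit.com/r/nvidia/comments/hg45ux/…) that I need to install TF 2.4, so I have done that. From that Reddit thread, it appears that TF 2.4 with 3000-series cards requires CUDA 11.0, whereas I have CUDA 11.1 installed. Next I will try downgrading my CUDA installation.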
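Before downgrading, it can help to confirm which CUDA toolkit is actually on the PATH. The following is a rough sketch, assuming `nvcc` is installed with the toolkit; the function names are made up for illustration, and the sample string mirrors the "release X.Y" line that `nvcc --version` prints.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run `nvcc --version`; return None if nvcc cannot be executed."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    return parse_nvcc_release(result.stdout)


sample = "Cuda compilation tools, release 11.1, V11.1.105"
print(parse_nvcc_release(sample))  # (11, 1)
```

Note that `nvcc` reports the system-wide toolkit, which can differ from a `cudatoolkit` package installed inside a conda environment, so the result is a hint rather than proof of what TensorFlow loads.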
