Google Colab - Tensorflow model_main_tf2: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

Posted: 2021-12-24 23:35:47

Question:

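I have been running this object detection model evaluation on Google Colab without errors. Now it has suddenly stopped working when I run the following script.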

# RUN MODEL EVALUATION
# The {} placeholder in each path is filled in with the selected model's name
PIPELINE_CONFIG_PATH="./object_detection/checkpoints/detection/{}/pipeline.config".format(selected_model)
MODEL_DIR="./object_detection/checkpoints/detection/{}/checkpoint/".format(selected_model)
CHECKPOINT_DIR="./object_detection/checkpoints/detection/{}/checkpoint/".format(selected_model)

# Colab substitutes Python variables written as {NAME} into the shell command
!python ./object_detection/model_main_tf2.py \
  --pipeline_config_path={PIPELINE_CONFIG_PATH} \
  --model_dir={MODEL_DIR} \
  --checkpoint_dir={CHECKPOINT_DIR} \
  --eval_timeout=5 \
  --alsologtostderr

It fails with the following error:

I1112 16:05:22.433352 139759485175680 checkpoint_utils.py:149] Found new checkpoint at ./object_detection/checkpoints/detection/ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0
/usr/local/lib/python3.7/dist-packages/keras/backend.py:401: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
2021-11-12 16:05:22.520333: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
INFO:tensorflow:depth of additional conv before box predictor: 0
I1112 16:05:31.542140 139759485175680 convolutional_keras_box_predictor.py:154] depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
I1112 16:05:31.542605 139759485175680 convolutional_keras_box_predictor.py:154] depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
I1112 16:05:31.542898 139759485175680 convolutional_keras_box_predictor.py:154] depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
I1112 16:05:31.543214 139759485175680 convolutional_keras_box_predictor.py:154] depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
I1112 16:05:31.543522 139759485175680 convolutional_keras_box_predictor.py:154] depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
I1112 16:05:31.543864 139759485175680 convolutional_keras_box_predictor.py:154] depth of additional conv before box predictor: 0
2021-11-12 16:06:17.471428: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-11-12 16:06:17.474623: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
INFO:tensorflow:Encountered 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D (defined at /usr/local/lib/python3.7/dist-packages/object_detection/models/ssd_mobilenet_v2_keras_feature_extractor.py:161) ]]
     [[Identity_18/_1166]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D (defined at /usr/local/lib/python3.7/dist-packages/object_detection/models/ssd_mobilenet_v2_keras_feature_extractor.py:161) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_24301]

Errors may have originated from an input operation.
Input Source operations connected to node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D:
 features_1 (defined at /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:932)

Input Source operations connected to node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D:
 features_1 (defined at /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:932)

Function call stack:
compute_eval_dict -> compute_eval_dict
 exception.
I1112 16:06:19.558837 139759485175680 model_lib_v2.py:934] Encountered 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D (defined at /usr/local/lib/python3.7/dist-packages/object_detection/models/ssd_mobilenet_v2_keras_feature_extractor.py:161) ]]
     [[Identity_18/_1166]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D (defined at /usr/local/lib/python3.7/dist-packages/object_detection/models/ssd_mobilenet_v2_keras_feature_extractor.py:161) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_24301]

Errors may have originated from an input operation.
Input Source operations connected to node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D:
 features_1 (defined at /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:932)

Input Source operations connected to node ssd_mobile_net_v2keras_feature_extractor/model/Conv1/Conv2D:
 features_1 (defined at /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:932)

Function call stack:
compute_eval_dict -> compute_eval_dict
 exception.
INFO:tensorflow:A replica probably exhausted all examples. Skipping pending examples on other replicas.
I1112 16:06:19.559331 139759485175680 model_lib_v2.py:935] A replica probably exhausted all examples. Skipping pending examples on other replicas.
Traceback (most recent call last):
  File "./object_detection/model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./object_detection/model_main_tf2.py", line 90, in main
    wait_interval=300, timeout=FLAGS.eval_timeout)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 1157, in eval_continuously
    global_step=global_step,
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 1001, in eager_eval_loop
    for evaluator in evaluators:
TypeError: 'NoneType' object is not iterable

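It still worked last week, but for some reason it no longer does. Is anyone else struggling with the same problem? I suspect something is wrong with the Colab environment, but I don't know what I should change. The TF2 Object Detection API is installed, and I have tested that it works.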

Tensorflow 2.6.2
Found GPU at: /device:GPU:0

Answer 1:

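The error occurs because the cuDNN version on Google Colab is wrong: according to the log, the runtime loaded cuDNN 8.0.5, while TensorFlow was compiled against 8.1.0.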
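The compatibility rule quoted in the log message (matching major version, equal or higher minor version) can be sketched in a few lines of Python. This is only an illustration of that rule, not TensorFlow's own check; the helper names `parse_version` and `cudnn_runtime_ok` are hypothetical.

```python
def parse_version(version):
    """Turn a dotted version string such as '8.0.5' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def cudnn_runtime_ok(loaded, compiled):
    """Rule from the log: same major version, loaded minor >= compiled minor."""
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    return loaded_major == compiled_major and loaded_minor >= compiled_minor


# Versions taken from the error message in the question.
print(cudnn_runtime_ok("8.0.5", "8.1.0"))  # False: the runtime library is too old
print(cudnn_runtime_ok("8.1.0", "8.1.0"))  # True: versions match
```

With the versions from the log, the first call returns False, which is consistent with cuDNN failing to initialize.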

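I was able to fix it by downloading the correct cuDNN version from the NVIDIA developer website and installing it in Google Colab. I first copied the cuDNN package from Google Drive into my Google Colab notebook, then installed it with the following commands: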

!dpkg -i libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb
# Check if package has been installed
!ls -l /usr/lib/x86_64-linux-gnu/libcudnn.so.*
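The same check can be done from Python instead of `ls`. The sketch below assumes the library directory used in the command above and assumes the installed files follow the `libcudnn.so.<version>` naming pattern; the helper name `installed_cudnn_versions` is hypothetical.

```python
import glob
import os
import re


def installed_cudnn_versions(lib_dir="/usr/lib/x86_64-linux-gnu"):
    """Return the version suffixes of libcudnn.so.* files found in lib_dir."""
    versions = set()
    for path in glob.glob(os.path.join(lib_dir, "libcudnn.so.*")):
        name = os.path.basename(path)
        match = re.fullmatch(r"libcudnn\.so\.(\d+(?:\.\d+)*)", name)
        if match:
            versions.add(match.group(1))
    # Sort numerically so that, for example, 8.10 comes after 8.2.
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


# Prints an empty list if no matching files exist in lib_dir.
print(installed_cudnn_versions())
```

If the package installed correctly, the list would be expected to contain an 8.1.x entry; an empty list suggests the library is missing or located in a different directory.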
