When I try to train TensorFlow's object detection API I get CUDA_ERROR_ILLEGAL_INSTRUCTION

Posted 2023-02-16

【Title】: When I try to train TensorFlow's object detection API I get CUDA_ERROR_ILLEGAL_INSTRUCTION 【Posted】: 2018-12-31 15:17:31 【Question】:

python model_main.py --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x00000155CFDB5C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [global_step] is not available in checkpoint
D:\ProgramFiles\Anaconda\envs\Eneger\lib\site-packages\tensorflow\python\ops\gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-07-24 05:19:44.209751: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-07-24 05:19:45.534609: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:

name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2018-07-24 05:19:45.571588: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-24 05:19:53.604693: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 05:19:53.615439: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929]      0
2018-07-24 05:19:53.622183: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0:   N
2018-07-24 05:19:53.643331: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4730 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-24 05:20:58.481169: E C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\stream_executor\cuda\cuda_event.cc:49] Error polling for event status: failed to query event: **CUDA_ERROR_ILLEGAL_INSTRUCTION**
2018-07-24 05:20:58.511921: F C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_event_mgr.cc:208] Unexpected Event status: 1

【Comments on the question】:

Could you provide more information, such as the TensorFlow, CUDA, and cuDNN versions and the name of the operating system?

【Answer 1】:

There has been an open issue on the TensorFlow GitHub since March:

https://github.com/tensorflow/tensorflow/issues/17747

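In that issue, however, the error seems to appear randomly rather than on every run. Can you confirm whether that matches what you see? The last reply there was 10 days ago, so I think they are still working on it.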
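One way to answer the "random or every run" question is to save the console output of several training runs and scan each for error and fatal lines. The following is a minimal sketch, not an official TensorFlow tool: it assumes the log layout shown in the question (timestamp, a severity letter `I`/`W`/`E`/`F`, source file and line, then the message), that each log entry sits on a single line, and the names `LOG_LINE` and `find_failures` are hypothetical.

```python
import re

# Assumed layout, taken from the log in the question:
# "<date> <time>: <severity> <source file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) (?P<source>.*?:\d+)\] (?P<message>.*)$"
)


def find_failures(lines):
    """Return (time, severity, source, message) for every E or F log line."""
    failures = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("severity") in ("E", "F"):
            failures.append((
                match.group("time"),
                match.group("severity"),
                match.group("source"),
                match.group("message"),
            ))
    return failures


# Hypothetical usage: one saved log file per training run.
# with open("run1.log", encoding="utf-8", errors="replace") as handle:
#     for failure in find_failures(handle):
#         print(failure)
```

Comparing the timestamps of the first `E` line across runs gives a rough idea of whether the failure happens at the same point each time.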

【Comments】:

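I read about a GTX 1080 Ti that showed the same error randomly when overclocked. Yours is a very similar card, so could it be the same problem?

No overclocking here, and it happens every time.

How about downgrading the GPU driver to a previous version (or upgrading it, if it is not up to date)?

【Answer 2】: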

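I would first check that your CUDA and cuDNN versions are compatible with your TensorFlow version, for example CUDA Toolkit 9.0 rather than CUDA Toolkit 9.1, or cuDNN 7.0 rather than 7.05. After that, I would update the GPU driver from the official Nvidia website.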
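Checking compatibility starts with knowing what is actually installed. The sketch below collects the operating system, Python version, TensorFlow version, and GPU driver version in one place. It is an illustration under stated assumptions: the function name `collect_env_info` is hypothetical, TensorFlow is only imported if it is installed, and the driver version is read from `nvidia-smi` only if that tool is on the PATH. It does not report the CUDA Toolkit or cuDNN versions, which still have to be checked against the installed files.

```python
import importlib.util
import platform
import shutil
import subprocess


def collect_env_info():
    """Gather version details; anything that cannot be determined stays None."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "tensorflow": None,
        "built_with_cuda": None,
        "nvidia_driver": None,
    }

    # Import TensorFlow only if it is installed, and tolerate a broken install.
    if importlib.util.find_spec("tensorflow") is not None:
        try:
            import tensorflow as tf
            info["tensorflow"] = tf.__version__
            info["built_with_cuda"] = tf.test.is_built_with_cuda()
        except Exception:
            pass

    # Ask nvidia-smi for the driver version when the tool is available.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version",
                 "--format=csv,noheader"],
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                universal_newlines=True,
            )
            if result.returncode == 0:
                info["nvidia_driver"] = result.stdout.strip() or None
        except OSError:
            pass

    return info


# Usage: print the collected details so they can be pasted into a question.
# for key, value in collect_env_info().items():
#     print(key, "=", value)
```

The collected TensorFlow version can then be compared by hand with the CUDA and cuDNN versions listed for that release in TensorFlow's tested build configurations.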

【Comments】:

【Answer 3】:

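I would not only check that you have CUDA 9.0 and cuDNN 7.0, but also make sure that the cuDNN build you selected and installed is the one built for CUDA 9.0 (there are cuDNN builds for CUDA 8.0, 9.0, and 9.1).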

【Comments】:

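I am using CUDA 9.0 and cuDNN 7.0, with the cuDNN build that matches that CUDA version. I should say that I can run some neural networks on the GPU without any problem; as far as I can tell, it is specifically the object detection network that fails. It does succeed on the CPU, though.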
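The CPU-versus-GPU comparison mentioned in this comment can be repeated without reinstalling anything by hiding the GPU from the training process. A minimal sketch, assuming the common convention of setting the `CUDA_VISIBLE_DEVICES` environment variable to `-1` before TensorFlow starts; the helper name `cpu_only_env` is hypothetical, and the command in the usage comment is the one from the question.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which no CUDA device is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # An invalid device index hides every GPU from the CUDA runtime.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


# Hypothetical usage: rerun the training command from the question on the CPU.
# import subprocess, sys
# subprocess.run(
#     [sys.executable, "model_main.py",
#      "--model_dir=training/",
#      "--pipeline_config_path=training/faster_rcnn_inception_v2_pets.config"],
#     env=cpu_only_env(),
# )
```

If the same configuration trains on the CPU but fails on the GPU, that points toward the GPU software stack or hardware rather than the pipeline configuration.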
