Kernel stops working after a while, when fitting the model


【Posted】: 2021-10-08 22:59:32 【Question】:

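I am trying to run the code that TensorFlow provides for image classification. I am using exactly the same code as TensorFlow provides, so I am not sharing it here. The code runs perfectly up to the point of fitting the model. It prints "Epoch" once, and then the kernel shuts down with "An error occurred while starting the kernel". This is the output it produces: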

2021???????? 21:19:59.749095: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021???????? 21:20:02.178383: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021???????? 21:20:02.198734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.62GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021???????? 21:20:02.198906: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021???????? 21:20:02.204104: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021???????? 21:20:02.204165: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021???????? 21:20:02.207305: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021???????? 21:20:02.208428: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021???????? 21:20:02.213539: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021???????? 21:20:02.215481: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021???????? 21:20:02.216199: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021???????? 21:20:02.216287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021???????? 21:20:02.216750: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021???????? 21:20:02.217490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.62GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021???????? 21:20:02.217546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021???????? 21:20:02.708850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021???????? 21:20:02.708874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 
2021???????? 21:20:02.708880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N 
2021???????? 21:20:02.709035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5484 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
2021???????? 21:20:04.004652: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021???????? 21:20:05.150123: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll

【Comments】:

【Answer 1】:

I reproduced the same code in Colab. It ran successfully, without any errors. Please find the gist of the code here.

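The messages in your output are only informational, as shown by the I prefix. Errors are prefixed with E and warnings with W, as in the following examples: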

2020-12-30 21:30:27.549172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll

2020-12-30 21:30:27.599977: W tensorflow/core/framework/allocator.cc:101] Allocation of 37171200 exceeds 10% of system memory.

2020-12-30 21:30:27.704083: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
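
The severity letter can also be picked out mechanically. Here is a minimal sketch, assuming each log line follows the layout seen in these examples (date, time, colon, one severity letter, then the source file). `severity_of`, `LOG_LINE` and `SEVERITY` are hypothetical names for illustration and are not part of TensorFlow.

```python
import re

# Assumed layout: "<date> <time>: <LETTER> <source file>:<line>] <message>"
LOG_LINE = re.compile(r"^\S+ \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) ")

# Severity letters used by TensorFlow's C++ logging.
SEVERITY = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def severity_of(line):
    """Return the severity name of a log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return SEVERITY[match.group(1)] if match else None


samples = [
    "2020-12-30 21:30:27.549172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll",
    "2020-12-30 21:30:27.599977: W tensorflow/core/framework/allocator.cc:101] Allocation of 37171200 exceeds 10% of system memory.",
]

for sample in samples:
    print(severity_of(sample), "|", sample[:40])
```

Run over the output in the question, a check like this would label every line as info, which is why that output on its own does not point to the cause of the kernel shutdown.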

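You can suppress the informational and warning messages with the following code, which must run before TensorFlow is imported: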

import os
# '2' hides INFO and WARNING messages; errors are still printed
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
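
For reference, `TF_CPP_MIN_LOG_LEVEL` accepts the values 0 to 3. The sketch below lists what each value filters and wraps the assignment in a small helper. `set_tf_log_level` and `LEVELS` are hypothetical names introduced for illustration. The TensorFlow import is left as a comment so the snippet runs without TensorFlow installed.

```python
import os

# What each TF_CPP_MIN_LOG_LEVEL value filters in TensorFlow's C++ logging.
LEVELS = {
    "0": "show all messages (default)",
    "1": "hide INFO",
    "2": "hide INFO and WARNING",
    "3": "hide INFO, WARNING and ERROR",
}


def set_tf_log_level(level):
    """Hypothetical helper: set the variable and return a description of its effect."""
    level = str(level)
    if level not in LEVELS:
        raise ValueError(f"level must be one of {sorted(LEVELS)}, got {level!r}")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return LEVELS[level]


print(set_tf_log_level(2))

# Import TensorFlow only after the variable is set, otherwise the setting is ignored:
# import tensorflow as tf
```

This only changes what is printed. It does not change how the model trains, so a kernel that shuts down during fitting will still shut down.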

【Comments】:
