Kernel stops working after a while, when fitting the model
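[Posted]: 2021-10-08 22:59:32
[Question]: I am trying to run the code that TensorFlow provides for image classification. I am using exactly the same code as TensorFlow provides, so I am not sharing it here. The code runs fine up to the point where the model is fitted. It prints "Epoch" once, then the kernel shuts down with the message "An error occurred while starting the kernel". This is the output it produces: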
2021???????? 21:19:59.749095: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021???????? 21:20:02.178383: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021???????? 21:20:02.198734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.62GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021???????? 21:20:02.198906: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021???????? 21:20:02.204104: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021???????? 21:20:02.204165: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021???????? 21:20:02.207305: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021???????? 21:20:02.208428: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021???????? 21:20:02.213539: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021???????? 21:20:02.215481: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021???????? 21:20:02.216199: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021???????? 21:20:02.216287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021???????? 21:20:02.216750: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021???????? 21:20:02.217490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.62GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021???????? 21:20:02.217546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021???????? 21:20:02.708850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021???????? 21:20:02.708874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021???????? 21:20:02.708880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021???????? 21:20:02.709035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5484 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
2021???????? 21:20:04.004652: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021???????? 21:20:05.150123: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
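[Answer 1]: I reproduced the same code in Colab. It ran successfully without any errors; the working code is in the related gist.
The messages you posted, however, are only informational, because each one is prefixed with I. An error message would be prefixed with E, and a warning with W, as shown below: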
2020-12-30 21:30:27.549172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll
2020-12-30 21:30:27.599977: W tensorflow/core/framework/allocator.cc:101] Allocation of 37171200 exceeds 10% of system memory.
2021-12-30 21:30:27.704083: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
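You can suppress these warnings with the following code. Set the variable before importing TensorFlow so that it takes effect: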
import os
# 0 = show everything, 1 = hide INFO, 2 = hide INFO and WARNING, 3 = hide INFO, WARNING and ERROR
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
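To check which severity levels actually appear in a captured log, the prefix rule described above can be sketched as a small parser. This is only an illustration: the names `LOG_LINE`, `classify` and `count_severities` are hypothetical helpers, not part of TensorFlow, and the regular expression assumes the line layout seen in the logs above (timestamp, severity letter, source file and line number, message).

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines in this thread:
#   "<date> <HH:MM:SS.ffffff>: <I|W|E|F> <source file>:<line>] <message>"
# The date part is matched loosely because its format may vary or be garbled.
LOG_LINE = re.compile(r"^\S+ \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) (\S+?):(\d+)\] (.*)$")

SEVERITY = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def classify(line):
    """Return (severity, source_file, message), or None for non-log lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    letter, source, _lineno, message = match.groups()
    return SEVERITY[letter], source, message


def count_severities(lines):
    """Count how many log lines fall into each severity level."""
    counts = Counter()
    for line in lines:
        parsed = classify(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2020-12-30 21:30:27.549172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll",
        "2020-12-30 21:30:27.599977: W tensorflow/core/framework/allocator.cc:101] Allocation of 37171200 exceeds 10% of system memory.",
    ]
    for level, count in count_severities(sample).items():
        print(level, count)
```

If a log contains only "info" entries, as the one in the question does, the messages themselves do not explain the kernel shutdown, and the cause has to be looked for elsewhere.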