3D CNN using keras-tensorflow on PyCharm (Process finished with exit code 137 (interrupted by signal 9: SIGKILL))
Posted: 2021-05-06 10:16:11

I am building a 3D CNN to classify the LUNA16 dataset (a CT-scan dataset), and I am using keras-tensorflow on PyCharm.
I am following this example: https://github.com/keras-team/keras-io/blob/master/examples/vision/3D_image_classification.py
I only modified it to fit my (*.mhd) data; this is the version I am running now: https://github.com/Mustafa-MS/3D-CNN-LUNA16/blob/main/3DCNN.py
Every time I run the code, I get a different error and the process stops, but all of the errors are about memory:

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 369098752 exceeds 10% of free system memory.
W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.

My model summary is:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 128, 128, 64, 1)] 0
_________________________________________________________________
conv3d (Conv3D) (None, 126, 126, 62, 64) 1792
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 63, 63, 31, 64) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 63, 63, 31, 64) 256
_________________________________________________________________
conv3d_1 (Conv3D) (None, 61, 61, 29, 64) 110656
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 30, 30, 14, 64) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 30, 14, 64) 256
_________________________________________________________________
conv3d_2 (Conv3D) (None, 28, 28, 12, 128) 221312
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 14, 14, 6, 128) 0
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 14, 6, 128) 512
_________________________________________________________________
conv3d_3 (Conv3D) (None, 12, 12, 4, 256) 884992
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 6, 6, 2, 256) 0
_________________________________________________________________
batch_normalization_3 (Batch (None, 6, 6, 2, 256) 1024
_________________________________________________________________
global_average_pooling3d (Gl (None, 256) 0
_________________________________________________________________
dense (Dense) (None, 512) 131584
_________________________________________________________________
dropout (Dropout) (None, 512) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 513
=================================================================
Total params: 1,352,897
Trainable params: 1,351,873
Non-trainable params: 1,024
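For context, here is a minimal sketch of the code that would produce this summary, following the linked Keras example. The filter counts and kernel sizes are inferred from the parameter counts in the table above; the dropout rate (0.3 in the Keras example) does not show up in a summary, so treat it as an assumption:

from tensorflow import keras
from tensorflow.keras import layers

def get_model(width=128, height=128, depth=64):
    """Build the 3D CNN implied by the summary above."""
    inputs = keras.Input((width, height, depth, 1))

    x = inputs
    for filters in (64, 64, 128, 256):
        # Conv3D(kernel_size=3, no padding) + MaxPool3D(2) + BatchNorm,
        # matching the shape transitions in the summary
        x = layers.Conv3D(filters=filters, kernel_size=3, activation="relu")(x)
        x = layers.MaxPool3D(pool_size=2)(x)
        x = layers.BatchNormalization()(x)

    x = layers.GlobalAveragePooling3D()(x)
    x = layers.Dense(units=512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)  # assumed rate, taken from the Keras example
    outputs = layers.Dense(units=1, activation="sigmoid")(x)

    return keras.Model(inputs, outputs, name="3dcnn")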
The dimensions of each CT scan are:
(128, 128, 64, 1)
The training and validation shapes are: xtrain = (800, 128, 128, 64) / xval = (88, 128, 128, 64) / ytrain = (800,) / yval = (88,)
Batch size = 2
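These shapes explain the allocation warnings exactly. Exit code 137 is 128 + 9, i.e. the process was SIGKILLed, typically by the Linux out-of-memory killer, and 3355443200 bytes is precisely the whole training set held in RAM as a single float32 tensor. A quick sanity check (my own arithmetic, not part of the script):

import numpy as np

# Full training set as one float32 tensor:
# 800 scans x 128 x 128 x 64 voxels x 4 bytes per voxel
train_bytes = 800 * 128 * 128 * 64 * np.dtype(np.float32).itemsize
print(train_bytes)  # 3355443200 -- the exact size in the TF warning

# Validation set, same layout
val_bytes = 88 * 128 * 128 * 64 * np.dtype(np.float32).itemsize
print(val_bytes)    # 369098752 -- the other warned allocation

Since TensorFlow typically stages one or more extra copies of the data while building the input pipeline, a few of these 3.1 GiB tensors plus the rest of the process can plausibly exhaust 32 GB of RAM.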
I am monitoring my model with wandb; you can check the run here: https://wandb.ai/mustafa-ms/monitor-gpu?workspace=user-mustafa-ms It shows the model consuming 100% of both system memory and GPU memory before it stops working (see the screenshots of GPU memory allocated % and system memory allocated %).

I know there are many answers about this same problem, but none of them solved mine. The CNN is not big, the batch size is a tiny 2, and my data is only 888 CT scans! My PC has 32 GB of RAM and an RTX 2080 Ti GPU.

The full log is here:
import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['/home/mustafa/home/mustafa/project/LUNAMASK', '/home/mustafa/home/mustafa/project/LUNAMASK'])
PyDev console: starting.
Python 3.8.7 (default, Dec 21 2020, 20:10:35)
[GCC 7.5.0] on linux
runfile('/home/mustafa/home/mustafa/project/LUNAMASK/3DCNN.py', wdir='/home/mustafa/home/mustafa/project/LUNAMASK')
2021-02-02 05:40:34.999468: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: Currently logged in as: mustafa-ms (use `wandb login --relogin` to force relogin)
2021-02-02 05:40:37.643336: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: Tracking run with wandb version 0.10.15
wandb: Syncing run clean-violet-3
wandb: ⭐️ View project at https://wandb.ai/mustafa-ms/monitor-gpu
wandb: 🚀 View run at https://wandb.ai/mustafa-ms/monitor-gpu/runs/4y03vu5s
wandb: Run data is saved locally in /home/mustafa/home/mustafa/project/LUNAMASK/wandb/run-20210202_054036-4y03vu5s
wandb: Run `wandb offline` to turn off syncing.
y train length 800
y test length 88
xtrain = (800, 128, 128, 64)
xval = (88, 128, 128, 64)
ytrain = (800,)
yval = (88,)
2021-02-02 08:27:05.599099: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-02 08:27:05.606801: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-02 08:27:05.657391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.658293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.605GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-02-02 08:27:05.658325: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-02 08:27:05.667884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-02 08:27:05.667982: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-02 08:27:05.674032: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-02 08:27:05.676356: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-02 08:27:05.684058: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-02 08:27:05.686346: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-02 08:27:05.687068: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-02 08:27:05.687204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.688185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.689043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-02 08:27:05.690061: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-02 08:27:05.690188: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.691084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.605GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-02-02 08:27:05.691117: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-02 08:27:05.691137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-02 08:27:05.691152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-02 08:27:05.691165: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-02 08:27:05.691179: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-02 08:27:05.691192: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-02 08:27:05.691205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-02 08:27:05.691218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-02 08:27:05.691292: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.692206: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.693051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-02 08:27:05.693086: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-02 08:27:06.001440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-02 08:27:06.001467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-02-02 08:27:06.001473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-02-02 08:27:06.001663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:06.002169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:06.002643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:06.003097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9508 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2021-02-02 08:27:06.004312: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
2021-02-02 08:27:06.983079: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 369098752 exceeds 10% of free system memory.
2021-02-02 08:27:07.406900: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
2021-02-02 08:27:09.210752: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-02 08:27:09.229323: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3699750000 Hz
Dimension of the CT scan is: (128, 128, 64, 1)
Model: "3dcnn"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 128, 128, 64, 1)] 0
_________________________________________________________________
conv3d (Conv3D) (None, 126, 126, 62, 64) 1792
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 63, 63, 31, 64) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 63, 63, 31, 64) 256
_________________________________________________________________
conv3d_1 (Conv3D) (None, 61, 61, 29, 64) 110656
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 30, 30, 14, 64) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 30, 14, 64) 256
_________________________________________________________________
conv3d_2 (Conv3D) (None, 28, 28, 12, 128) 221312
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 14, 14, 6, 128) 0
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 14, 6, 128) 512
_________________________________________________________________
conv3d_3 (Conv3D) (None, 12, 12, 4, 256) 884992
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 6, 6, 2, 256) 0
_________________________________________________________________
batch_normalization_3 (Batch (None, 6, 6, 2, 256) 1024
_________________________________________________________________
global_average_pooling3d (Gl (None, 256) 0
_________________________________________________________________
dense (Dense) (None, 512) 131584
_________________________________________________________________
dropout (Dropout) (None, 512) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 513
=================================================================
Total params: 1,352,897
Trainable params: 1,351,873
Non-trainable params: 1,024
_________________________________________________________________
2021-02-02 08:27:26.194010: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 369098752 exceeds 10% of free system memory.
2021-02-02 08:27:26.397041: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
Epoch 1/5
2021-02-02 08:27:30.705650: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-02 08:27:31.247841: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-02 08:27:31.879674: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
400/400 - 105s - loss: 0.6529 - acc: 0.6325 - val_loss: 0.8511 - val_acc: 0.6705
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Answer 1:

The code now runs perfectly on Google Colab.

I think the limitation was the GPU: mine is an RTX 2080 Ti, versus Google Colab's Nvidia T4.

I just preprocessed the data and saved it as numpy arrays, then uploaded the arrays to Google Colab and ran the code on the already-preprocessed data. Everything works fine now!
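For anyone who wants to stay on local hardware, the usual fix for this kind of host-memory blowup is to avoid materializing all 888 volumes at once — for example, save each preprocessed scan as its own .npy file and stream them with tf.data. A rough sketch under those assumptions (the scans/ paths and the zero labels are placeholders, not from the actual project):

import numpy as np
import tensorflow as tf

def load_volume(path, label):
    # Load one preprocessed scan from disk; runs lazily, one file per example.
    def _load(p):
        vol = np.load(p.numpy().decode()).astype(np.float32)  # shape (128, 128, 64)
        return vol[..., np.newaxis]                           # add channel axis
    volume = tf.py_function(_load, [path], tf.float32)
    volume.set_shape((128, 128, 64, 1))
    return volume, label

# Hypothetical layout: one .npy file per scan plus a label array.
train_paths = [f"scans/train_{i:03d}.npy" for i in range(800)]
train_labels = np.zeros(800, dtype=np.float32)  # placeholder labels

train_ds = (
    tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
    .shuffle(len(train_paths))
    .map(load_volume, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)
)

This keeps only the batches currently in flight in memory, instead of a 3 GiB block per split, which is usually enough to stay clear of the OOM killer on a 32 GB machine.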