在 pycharm 上使用 keras-tensorflow 的 3D CNN(进程以退出代码 137 完成(被信号 9:SIGKILL 中断))

Posted

技术标签:

【中文标题】在 pycharm 上使用 keras-tensorflow 的 3D CNN(进程以退出代码 137 完成(被信号 9:SIGKILL 中断))【英文标题】:3D CNN using keras-tensorflow on pycharm ( Process finished with exit code 137 (interrupted by signal 9: SIGKILL) ) 【发布时间】:2021-05-06 10:16:11 【问题描述】:

我正在做一个 3D CNN 来分类 LUNA16 数据集(CT 扫描数据集),我在 pycharm 上使用 keras-tensorflow。

我正在关注此代码 https://github.com/keras-team/keras-io/blob/master/examples/vision/3D_image_classification.py

我刚刚修改它以适合我的 (*.mhd) 数据(这是我现在正在运行的)https://github.com/Mustafa-MS/3D-CNN-LUNA16/blob/main/3DCNN.py

每次我运行代码时都会出现不同的错误并停止该过程!但所有关于内存的错误。

进程以退出代码 137 结束(被信号 9:SIGKILL 中断) 内存不足 W tensorflow/core/framework/cpu_allocator_impl.cc:80] 3355443200 的分配超过了可用系统内存的 10%。 W tensorflow/core/framework/cpu_allocator_impl.cc:80] 369098752 的分配超过了可用系统内存的 10%。 W tensorflow/core/framework/cpu_allocator_impl.cc:80] 3355443200 的分配超过了可用系统内存的 10%。

我的模型总结是

    _________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 128, 128, 64, 1)] 0         
_________________________________________________________________
conv3d (Conv3D)              (None, 126, 126, 62, 64)  1792      
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 63, 63, 31, 64)    0         
_________________________________________________________________
batch_normalization (BatchNo (None, 63, 63, 31, 64)    256       
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 61, 61, 29, 64)    110656    
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 30, 30, 14, 64)    0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 30, 14, 64)    256       
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 28, 28, 12, 128)   221312    
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 14, 14, 6, 128)    0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 14, 6, 128)    512       
_________________________________________________________________
conv3d_3 (Conv3D)            (None, 12, 12, 4, 256)    884992    
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 6, 6, 2, 256)      0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 6, 6, 2, 256)      1024      
_________________________________________________________________
global_average_pooling3d (Gl (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 512)               131584    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
=================================================================
Total params: 1,352,897
Trainable params: 1,351,873
Non-trainable params: 1,024

每次CT扫描的维度是 (128, 128, 64, 1)

训练和验证的形式是xtrain = (800, 128, 128, 64) / xval = (88, 128, 128, 64) / ytrain = (800,) / yval = (88,)

批量大小 = 2

我正在使用 wandb 监控我的模型,你可以在这里查看 https://wandb.ai/mustafa-ms/monitor-gpu?workspace=user-mustafa-ms 它表明模型在停止工作之前正在消耗 100% 的每个系统内存和 gpu 内存。 picture of gpu memory Allocated % picture of system memory allocated % 我知道关于同一个问题有很多答案,但没有一个能解决我的问题。 CNN不大,batch很小2只,我的数据只有888 CT扫描! 我的电脑有 32 GB 内存和 RTX 2080ti gpu。 完整的日志在这里

import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['/home/mustafa/home/mustafa/project/LUNAMASK', '/home/mustafa/home/mustafa/project/LUNAMASK'])
PyDev console: starting.
Python 3.8.7 (default, Dec 21 2020, 20:10:35) 
[GCC 7.5.0] on linux
runfile('/home/mustafa/home/mustafa/project/LUNAMASK/3DCNN.py', wdir='/home/mustafa/home/mustafa/project/LUNAMASK')
2021-02-02 05:40:34.999468: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: Currently logged in as: mustafa-ms (use `wandb login --relogin` to force relogin)
2021-02-02 05:40:37.643336: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: Tracking run with wandb version 0.10.15
wandb: Syncing run clean-violet-3
wandb: ⭐️ View project at https://wandb.ai/mustafa-ms/monitor-gpu
wandb: ???? View run at https://wandb.ai/mustafa-ms/monitor-gpu/runs/4y03vu5s
wandb: Run data is saved locally in /home/mustafa/home/mustafa/project/LUNAMASK/wandb/run-20210202_054036-4y03vu5s
wandb: Run `wandb offline` to turn off syncing.
y train length  800
y test length  88
xtrain =  (800, 128, 128, 64)
xval =  (88, 128, 128, 64)
ytrain =  (800,)
yval = 2021-02-02 08:27:05.599099: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
 (88,)
2021-02-02 08:27:05.606801: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-02 08:27:05.657391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.658293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.605GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-02-02 08:27:05.658325: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-02 08:27:05.667884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-02 08:27:05.667982: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-02 08:27:05.674032: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-02 08:27:05.676356: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-02 08:27:05.684058: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-02 08:27:05.686346: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-02 08:27:05.687068: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-02 08:27:05.687204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.688185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.689043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-02 08:27:05.690061: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-02 08:27:05.690188: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.691084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.605GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-02-02 08:27:05.691117: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-02 08:27:05.691137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-02 08:27:05.691152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-02 08:27:05.691165: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-02 08:27:05.691179: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-02 08:27:05.691192: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-02 08:27:05.691205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-02 08:27:05.691218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-02 08:27:05.691292: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.692206: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:05.693051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-02 08:27:05.693086: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-02 08:27:06.001440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-02 08:27:06.001467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-02-02 08:27:06.001473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-02-02 08:27:06.001663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:06.002169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:06.002643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-02 08:27:06.003097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9508 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2021-02-02 08:27:06.004312: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
2021-02-02 08:27:06.983079: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 369098752 exceeds 10% of free system memory.
2021-02-02 08:27:07.406900: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
2021-02-02 08:27:09.210752: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-02 08:27:09.229323: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3699750000 Hz
Dimension of the CT scan is: (128, 128, 64, 1)
Model: "3dcnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 128, 128, 64, 1)] 0         
_________________________________________________________________
conv3d (Conv3D)              (None, 126, 126, 62, 64)  1792      
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 63, 63, 31, 64)    0         
_________________________________________________________________
batch_normalization (BatchNo (None, 63, 63, 31, 64)    256       
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 61, 61, 29, 64)    110656    
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 30, 30, 14, 64)    0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 30, 14, 64)    256       
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 28, 28, 12, 128)   221312    
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 14, 14, 6, 128)    0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 14, 6, 128)    512       
_________________________________________________________________
conv3d_3 (Conv3D)            (None, 12, 12, 4, 256)    884992    
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 6, 6, 2, 256)      0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 6, 6, 2, 256)      1024      
_________________________________________________________________
global_average_pooling3d (Gl (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 512)               131584    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
=================================================================
Total params: 1,352,897
Trainable params: 1,351,873
Non-trainable params: 1,024
_________________________________________________________________
2021-02-02 08:27:26.194010: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 369098752 exceeds 10% of free system memory.
2021-02-02 08:27:26.397041: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 3355443200 exceeds 10% of free system memory.
Epoch 1/5
2021-02-02 08:27:30.705650: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-02 08:27:31.247841: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-02 08:27:31.879674: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
400/400 - 105s - loss: 0.6529 - acc: 0.6325 - val_loss: 0.8511 - val_acc: 0.6705
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

【问题讨论】:

【参考方案1】:

现在代码在 google colab 上完美运行。

我认为限制在于 GPU(我的 gpu 是 RTX 2080ti)与 google colab gpu Nvidia T4。

我只是将数据预处理并保存为numpy数组,然后将数组上传到google colab,并在预处理后运行代码。 现在一切正常!

【讨论】:

以上是关于在 pycharm 上使用 keras-tensorflow 的 3D CNN(进程以退出代码 137 完成(被信号 9:SIGKILL 中断))的主要内容,如果未能解决你的问题,请参考以下文章

pycharm安装包怎么惠子安装在一个项目上

[jetson][转载]jetson上安装pycharm

在 Linux 上更新 PyCharm

在 docker 内的远程机器上使用 PyCharm 的远程调试器

使用 Robot Framework 在 PyCharm 上运行测试

如何使用 PyCharm 在远程机器上调试从 Python 调用的 C++ 代码?