How can I solve "ran out of memory" while running Keras on HPC (Argon)?

【Posted】: 2018-09-16 00:24:24 【Question】:

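I have a ConvLSTM neural network written in Keras. I submitted the same code to two queues on the cluster (one GPU, the other CPU). The code runs on the CPU, but on the GPU it fails with an error. Below is one line copied from the error file:

"W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.12MiB. Current allocation summary follows."

Error file: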

Using TensorFlow backend.
2018-04-05 17:39:59.059431: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-04-05 17:40:00.220946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:81:00.0
totalMemory: 15.90GiB freeMemory: 332.94MiB
2018-04-05 17:40:00.221266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:81:00.0, compute capability: 6.0)
/opt/apps/python/2.7.14_openmpi-2.1.2_parallel_studio-2017.4/lib/python2.7/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype uint8 was converted to float64 by MinMaxScaler.
  warnings.warn(msg, DataConversionWarning)
2018-04-05 17:40:50.577736: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.12MiB.  Current allocation summary follows.
2018-04-05 17:40:50.578144: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (256):   Total Chunks: 296, Chunks in use: 294. 74.0KiB allocated for chunks. 73.5KiB in use in bin. 9.3KiB client-requested in use in bin.
2018-04-05 17:40:50.578167: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (512):   Total Chunks: 39, Chunks in use: 39. 22.0KiB allocated for chunks. 22.0KiB in use in bin. 16.1KiB client-requested in use in bin.
2018-04-05 17:40:50.578179: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578192: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2048):  Total Chunks: 14, Chunks in use: 14. 36.8KiB allocated for chunks. 36.8KiB in use in bin. 34.5KiB client-requested in use in bin.
2018-04-05 17:40:50.578203: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578216: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (8192):  Total Chunks: 62, Chunks in use: 61. 882.2KiB allocated for chunks. 869.2KiB in use in bin. 857.8KiB client-requested in use in bin.
2018-04-05 17:40:50.578228: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (16384):     Total Chunks: 13, Chunks in use: 12. 223.0KiB allocated for chunks. 198.8KiB in use in bin. 190.1KiB client-requested in use in bin.
2018-04-05 17:40:50.578239: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (32768):     Total Chunks: 46, Chunks in use: 46. 2.53MiB allocated for chunks. 2.53MiB in use in bin. 2.53MiB client-requested in use in bin.
2018-04-05 17:40:50.578251: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (65536):     Total Chunks: 168, Chunks in use: 168. 13.19MiB allocated for chunks. 13.19MiB in use in bin. 13.10MiB client-requested in use in bin.
2018-04-05 17:40:50.578263: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (131072):    Total Chunks: 1, Chunks in use: 1. 135.8KiB allocated for chunks. 135.8KiB in use in bin. 80.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578276: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (262144):    Total Chunks: 243, Chunks in use: 243. 76.74MiB allocated for chunks. 76.74MiB in use in bin. 75.94MiB client-requested in use in bin.
2018-04-05 17:40:50.578287: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (524288):    Total Chunks: 3, Chunks in use: 3. 1.64MiB allocated for chunks. 1.64MiB in use in bin. 960.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578297: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1048576):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578309: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2097152):   Total Chunks: 4, Chunks in use: 4. 12.50MiB allocated for chunks. 12.50MiB in use in bin. 12.50MiB client-requested in use in bin.
2018-04-05 17:40:50.578336: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578348: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (8388608):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578358: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (16777216):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578367: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578376: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (67108864):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578386: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578395: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578406: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin for 3.12MiB was 2.00MiB, Chunk State: 
2018-04-05 17:40:50.578417: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000000 of size 1280
2018-04-05 17:40:50.578426: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000500 of size 256
2018-04-05 17:40:50.578433: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000600 of size 256
2018-04-05 17:40:50.578440: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000700 of size 57600
2018-04-05 17:40:50.578448: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00e800 of size 512
2018-04-05 17:40:50.578456: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ea00 of size 768
2018-04-05 17:40:50.578464: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ed00 of size 256
2018-04-05 17:40:50.578471: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ee00 of size 256
2018-04-05 17:40:50.578478: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ef00 of size 256
2018-04-05 17:40:50.578485: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f000 of size 256
2018-04-05 17:40:50.578493: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f100 of size 256
2018-04-05 17:40:50.578500: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f200 of size 256
2018-04-05 17:40:50.578507: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f300 of size 256
2018-04-05 17:40:50.578514: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f400 of size 256
2018-04-05 17:40:50.578522: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f500 of size 256
2018-04-05 17:40:50.578529: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f600 of size 57600
2018-04-05 17:40:50.578536: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01d700 of size 512
2018-04-05 17:40:50.578544: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01d900 of size 3072
2018-04-05 17:40:50.578551: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01e500 of size 57600
2018-04-05 17:40:50.578559: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02c600 of size 512
2018-04-05 17:40:50.578571: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02c800 of size 768
2018-04-05 17:40:50.578579: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cb00 of size 256
2018-04-05 17:40:50.578586: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cc00 of size 256
2018-04-05 17:40:50.578593: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cd00 of size 256
2018-04-05 17:40:50.578600: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02ce00 of size 256
2018-04-05 17:40:50.578607: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cf00 of size 256
2018-04-05 17:40:50.578614: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d000 of size 256
2018-04-05 17:40:50.578622: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d100 of size 256
2018-04-05 17:40:50.578629: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d200 of size 14592
2018-04-05 17:40:50.578637: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030b00 of size 256
2018-04-05 17:40:50.578644: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030c00 of size 256
2018-04-05 17:40:50.578652: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030d00 of size 256
2018-04-05 17:40:50.578659: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030e00 of size 256
2018-04-05 17:40:50.578666: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030f00 of size 256
2018-04-05 17:40:50.578673: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031000 of size 256
2018-04-05 17:40:50.578681: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031100 of size 256
2018-04-05 17:40:50.578688: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031200 of size 256
2018-04-05 17:40:50.578695: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031300 of size 512
2018-04-05 17:40:50.578702: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031500 of size 14592
2018-04-05 17:40:50.578709: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c034e00 of size 256
2018-04-05 17:40:50.578717: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c034f00 of size 256
2018-04-05 17:40:50.578724: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035000 of size 256
2018-04-05 17:40:50.578731: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035100 of size 256
2018-04-05 17:40:50.578738: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035200 of size 256
2018-04-05 17:40:50.578746: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035300 of size 256
2018-04-05 17:40:50.578753: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035400 of size 256
2018-04-05 17:40:50.578760: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035500 of size 256
2018-04-05 17:40:50.578767: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035600 of size 512
2018-04-05 17:40:50.578775: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035800 of size 23296
2018-04-05 17:40:50.578782: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c03b300 of size 57600
2018-04-05 17:40:50.578789: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c049400 of size 512
2018-04-05 17:40:50.578797: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c049600 of size 57600
2018-04-05 17:40:50.578804: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c057700 of size 57600
2018-04-05 17:40:50.578811: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065800 of size 256
2018-04-05 17:40:50.578823: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065900 of size 256
2018-04-05 17:40:50.578830: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065a00 of size 256
2018-04-05 17:40:50.578838: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065b00 of size 256
2018-04-05 17:40:50.578845: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065c00 of size 256
2018-04-05 17:40:50.578852: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065d00 of size 256
2018-04-05 17:40:50.578859: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065e00 of size 256
2018-04-05 17:40:50.578867: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065f00 of size 256
2018-04-05 17:40:50.578874: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c066000 of size 512
2018-04-05 17:40:50.578881: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c066200 of size 14592
2018-04-05 17:40:50.578888: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c069b00 of size 256
2018-04-05 17:40:50.578896: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c069c00 of size 256

【Answer 1】:

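TensorFlow on the CPU loads the data into host memory, whereas TensorFlow on the GPU needs the data in GPU memory. This is most likely the cause of your error. You can try reducing the batch size.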
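As a rough illustration of the "reduce the batch size" advice, here is a minimal sketch that retries training with a halved batch size whenever an out-of-memory error is raised. The names `fit_with_backoff` and `fake_train` are hypothetical and not from the thread; `fake_train` only simulates a training call so the example runs without a GPU. With TensorFlow you would presumably pass `oom_errors=(tf.errors.ResourceExhaustedError,)` and wrap your own `model.fit(...)` call, and you may need to rebuild the model or session after a failed attempt.

```python
def fit_with_backoff(train_once, batch_size, min_batch_size=1,
                     oom_errors=(MemoryError,)):
    """Call train_once(batch_size), halving batch_size after each OOM error.

    Returns (batch_size_that_worked, result_of_train_once).
    """
    while True:
        try:
            return batch_size, train_once(batch_size)
        except oom_errors:
            if batch_size <= min_batch_size:
                # Cannot shrink further: the model itself may not fit.
                raise
            batch_size = max(min_batch_size, batch_size // 2)


# Hypothetical stand-in for a real call such as model.fit(x, y, batch_size=...).
# It pretends that any batch larger than 8 samples exhausts memory.
def fake_train(batch_size):
    if batch_size > 8:
        raise MemoryError("simulated out-of-memory")
    return "trained"


used_batch_size, result = fit_with_backoff(fake_train, batch_size=64)
print(used_batch_size, result)
```

If the error persists even at the smallest batch size, shrinking the batch further will not help, and the cause is more likely the model size or the amount of GPU memory actually available to the job.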

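【Comments】:

Can you check how much memory is in use on the GPU before you hit the error?

I edited the post and added the first part of the error file. I'm not sure whether it contains the information you asked for; if not, how can I get it?

Well, I don't see anything useful there. Without knowing more about your setup, it's hard to advise beyond saying that your GPU doesn't have enough memory for this task. If you can't reduce the batch size any further, the model may be too big for your GPU. Do you know how much GPU memory the card has?

I'm running it on a cluster owned by my university, and I'm trying to find out. But the model isn't big: it has 4 ConvLSTM layers with 20 filters each.

Just open a tf Session in a variable. Keras will automatically associate itself with it.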
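On the question of how much GPU memory the card has: the device summary near the top of the posted error file appears to carry this already (`totalMemory: 15.90GiB freeMemory: 332.94MiB`). The sketch below pulls those two numbers out of a log; `parse_gpu_memory` is a hypothetical helper, and the pattern assumes the line format shown in this particular log. Read this way, only about 2% of the card was free when TensorFlow started, which would be consistent with another process holding most of the memory, although that is an inference from the log and not something confirmed in the thread.

```python
import re

# Size units as they appear in the TensorFlow device summary line.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}
_PATTERN = re.compile(
    r"totalMemory:\s*([\d.]+)(B|KiB|MiB|GiB)\s+"
    r"freeMemory:\s*([\d.]+)(B|KiB|MiB|GiB)"
)


def parse_gpu_memory(log_text):
    """Return (total_bytes, free_bytes) from the first device summary, or None."""
    match = _PATTERN.search(log_text)
    if match is None:
        return None
    total = float(match.group(1)) * _UNITS[match.group(2)]
    free = float(match.group(3)) * _UNITS[match.group(4)]
    return total, free


# The line copied from the error file in the question.
line = "totalMemory: 15.90GiB freeMemory: 332.94MiB"
total, free = parse_gpu_memory(line)
print("free fraction: {:.1%}".format(free / total))
```

On a cluster, running `nvidia-smi` on the GPU node at job start should show the same totals along with the processes that are using the card.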
