Hardware errors on Memory with a large simulation on 128 cores
【Posted】: 2021-01-11 03:05:41
【Question】: I launched a large astrophysics simulation (the enzo code), running it with MPI on 128 cores as follows:
mpirun -np 128 ./enzo.exe amr_cosmology.enzo
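During the run I get the errors below. They are tagged Hardware Error, so I concluded that one stick of the total memory (1GB) is bad. As you can see, the code does not stop, but these messages keep appearing throughout the run: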
TopGrid dt = 3.705042e-02 time = 1.2350099725762 cycle = 14 z = 834.55610989934
TopGrid dt = 3.816191e-02 time = 1.272060395839 cycle = 15 z = 818.25224654732
TopGrid dt = 3.930675e-02 time = 1.3102223091899 cycle = 16 z = 802.26651295398
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711318] [Hardware Error]: Corrected error, no action required.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711377] [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2041000000011b
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711387] [Hardware Error]: Error Addr: 0x0000001c9f3d4ac0
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711388] [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x0f5940000a801001
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711399] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711407] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711422] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711474] [Hardware Error]: Corrected error, no action required.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711479] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711483] [Hardware Error]: Error Addr: 0x0000001ee2f9b140
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711484] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xda9020000a800d01
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711489] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711492] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711497] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
TopGrid dt = 4.048593e-02 time = 1.3495290567141 cycle = 17 z = 786.59270291163
TopGrid dt = 4.170048e-02 time = 1.3900149827028 cycle = 18 z = 771.22472945212
TopGrid dt = 4.295147e-02 time = 1.4317154617942 cycle = 19 z = 756.15662471201
What kind of error is this: one that is corrected automatically, or a genuine hardware fault? Either way, something is wrong.
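【Answer 1】: This is caused by faulty RAM. Each message in your log reports an error that ECC corrected ("Corrected error, no action required"), which is why the run continues, but frequent ECC corrections, as in your case, indicate failing hardware. The fix is to identify the memory module causing the problem and replace it. If this is not a critical system, you may not need to fix it immediately.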
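To get a feel for how often the corrections occur and whether any of them were uncorrected, the kernel messages can be tallied with a short script. The following is only a sketch: the function names `decode_status` and `summarize` are made up for illustration, and the bit positions are the ones used by the Linux kernel's x86 MCE definitions, which are assumed here to apply to this AMD family 17h machine. The sample lines are copied from the log in the question.

```python
import re
from collections import Counter

# Bit positions in the 64-bit MCA status register, as used by the Linux
# kernel's x86 MCE code (assumption: they apply to this CPU family).
STATUS_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # overflow: another error arrived before this one was read
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # address register is valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register is valid
    "CECC": 46,   # error corrected by ECC
    "UECC": 45,   # error detected but not corrected by ECC
}

STATUS_RE = re.compile(r"MC(\d+)_STATUS\[[^\]]*\]:\s*(0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"Error Addr:\s*(0x[0-9a-fA-F]+)")


def decode_status(value):
    """Return the names of the known flags set in a status value."""
    return [name for name, bit in STATUS_BITS.items() if (value >> bit) & 1]


def summarize(log_text):
    """Count errors per MCA bank, count uncorrected ones, collect addresses."""
    banks = Counter()
    uncorrected = 0
    addresses = []
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            bank = int(match.group(1))
            flags = decode_status(int(match.group(2), 16))
            banks[bank] += 1
            if "UC" in flags or "UECC" in flags:
                uncorrected += 1
            continue
        match = ADDR_RE.search(line)
        if match:
            addresses.append(int(match.group(1), 16))
    return banks, uncorrected, addresses


# Lines taken from the log in the question.
SAMPLE = """\
kernel:[2415943.711377] [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2041000000011b
kernel:[2415943.711387] [Hardware Error]: Error Addr: 0x0000001c9f3d4ac0
kernel:[2415943.711479] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b
kernel:[2415943.711483] [Hardware Error]: Error Addr: 0x0000001ee2f9b140
"""
```

Running `summarize` over the full kernel log gives a count per MCA bank and the list of reported physical addresses. A steadily growing count, or the `Over` flag as in the second message, suggests the errors arrive frequently. Mapping a bank or address to a physical DIMM slot depends on the motherboard and is not attempted here.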
In some cases, RAM that is not running at its intended frequency can also cause this problem.
See the references for details: Ref 1, Ref 2, Ref 3
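One possible way to narrow down which module to replace is to read the per-DIMM corrected-error counters that the Linux EDAC subsystem exposes in sysfs. This is a sketch under assumptions: it presumes an EDAC driver is loaded for the memory controller and that the kernel provides the `mc*/dimm*/dimm_ce_count` and `dimm_label` files under `/sys/devices/system/edac/mc`. The function name `read_edac_counts` is hypothetical, and the labels are only as meaningful as the platform makes them.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return (controller, dimm, label, corrected_count) for each DIMM found.

    Assumes the EDAC sysfs layout mc*/dimm*/dimm_ce_count; returns an empty
    list if no such files exist (for example, when no EDAC driver is loaded).
    """
    rows = []
    for dimm_dir in sorted(Path(root).glob("mc*/dimm*")):
        count_file = dimm_dir / "dimm_ce_count"
        if not count_file.is_file():
            continue
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        count = int(count_file.read_text().strip())
        rows.append((dimm_dir.parent.name, dimm_dir.name, label, count))
    return rows
```

Calling `read_edac_counts()` with no argument reads the live counters. A DIMM whose count is much higher than the others, or keeps climbing during the simulation, is the likely candidate for replacement.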