EDAC DIMM CE Errors Causing Server Reboots


Server 1:

[root@localhost ~]# tailf /var/log/messages
May  8 09:10:59 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
May  8 09:10:59 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 1: 940000000000009f
May  8 09:10:59 localhost kernel: TSC 1c434c49f9ef794 ADDR 6f9326740 MISC 0 PROCESSOR 0:306f2 TIME 1525741859 SOCKET 0 APIC 0
May  8 09:10:59 localhost kernel: EDAC MC0: CE row 1, channel 0, label "CPU_SrcID#0_Channel#1_DIMM#0": 0 Unknown error(s): memory read on FATAL area : cpu=0 Err=0000:009f (ch=15), addr = 0x6f9326740 => socket=0, Channel=1(mask=2), rank=0

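This is a log entry from EDAC (Error Detection And Correction), the kernel subsystem that reports memory errors. CE stands for correctable error.
Following the reference posts linked below, locate the faulty DIMM by checking the per-channel CE counters:

[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count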
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:2
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:4
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
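
The grep above prints every counter, including the zeros. The following is a minimal sketch that keeps only the non-zero counters and prints each DIMM label next to them. It assumes the csrow-style sysfs layout shown in this post. The function name `edac_nonzero_ce` and its optional base-directory argument are illustrative and not part of any EDAC tool.

```shell
#!/bin/sh
# List every EDAC channel whose correctable-error (CE) counter is non-zero,
# together with its DIMM label.
# edac_nonzero_ce is a hypothetical helper name. The optional first argument
# (an alternative sysfs root) exists only so the function can be tried
# against a test directory.
edac_nonzero_ce() {
    base="${1:-/sys/devices/system/edac/mc}"
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue               # glob did not match, or unreadable
        count=$(cat "$f")
        [ "$count" -gt 0 ] 2>/dev/null || continue   # skip zero or non-numeric
        label_file="${f%_ce_count}_dimm_label"
        label="unknown"
        [ -r "$label_file" ] && label=$(cat "$label_file")
        # Output: counter path, CE count, DIMM label (tab-separated)
        printf '%s\t%s\t%s\n' "$f" "$count" "$label"
    done
}
```

Calling `edac_nonzero_ce` with no argument reads the live counters under /sys/devices/system/edac/mc. On the first server it should report only the csrow0 and csrow1 entries of mc0.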

According to the error log:
May 8 09:10:59 localhost kernel: EDAC MC0: CE row 1, channel 0, label "CPU_SrcID#0_Channel#1_DIMM#0": 0 Unknown error(s): memory read on FATAL area : cpu=0 Err=0000:009f (ch=15), addr = 0x6f9326740 => socket=0, Channel=1(mask=2), rank=0

[root@localhost ~]# cat /sys/devices/system/edac/mc/mc0/csrow1/ch0_dimm_label
CPU_SrcID#0_Channel#1_DIMM#0
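
The step from the log line ("MC0: CE row 1, channel 0") to the sysfs path (mc0/csrow1/ch0_dimm_label) can be sketched as a small filter. This is only an illustration: `edac_label_path` is a hypothetical helper name, and the pattern assumes the exact log wording seen in this post, which may differ on other kernel versions.

```shell
#!/bin/sh
# Read kernel log lines on stdin. For every line of the form
#   "... EDAC MC<n>: CE row <r>, channel <c>, ..."
# print the matching ch<c>_dimm_label path under sysfs.
# edac_label_path is a hypothetical helper; the regex is tied to the
# log format shown in this post.
edac_label_path() {
    sed -n 's|.*EDAC MC\([0-9][0-9]*\): CE row \([0-9][0-9]*\), channel \([0-9][0-9]*\),.*|/sys/devices/system/edac/mc/mc\1/csrow\2/ch\3_dimm_label|p'
}
```

For example, `grep 'EDAC MC' /var/log/messages | edac_label_path | sort -u` would list the label files worth reading with cat.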

[root@localhost ~]# dmidecode -t memory | grep 'Locator: DIMM'

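The final step is to pull the memory module out of the faulty F1 slot, or move it to a different slot. After that, the system boots without reporting the error again.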

References:
http://blog.tankywoo.com/2014/12/02/edac-dimm-ce-error.html
http://serverfault.com/questions/648240/how-can-i-find-which-memory-have-ce-error
https://blog.csdn.net/odailidong/article/details/46865255

Server 2: the memory module reports errors:

May  8 17:09:08 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
May  8 17:09:08 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
May  8 17:09:08 localhost kernel: TSC 0 ADDR 4e92c9c0 MISC 4076f686 PROCESSOR 0:206d6 TIME 1525770548 SOCKET 0 APIC 0
May  8 17:09:08 localhost kernel: sbridge: HANDLING MCE MEMORY ERROR
May  8 17:09:08 localhost kernel: CPU 0: Machine Check Exception: 0 Bank 11: 8800004a00800093
May  8 17:09:08 localhost kernel: TSC 0 ADDR 0 MISC 5221001000101400 PROCESSOR 0:206d6 TIME 1525770548 SOCKET 0 APIC 0
May  8 17:09:09 localhost kernel: EDAC MC0: CE row 6, channel 0, label "CPU_SrcID#0_Ha#0_Channel#3_DIMM": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x4e92c9c0 => socket=0, ha=0, Channel=3(mask=8), rank=1
May  8 17:09:09 localhost kernel: 
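
Each Machine Check Exception line carries a TIME field in Unix epoch seconds, which helps correlate it with the syslog timestamp. A minimal sketch of the conversion, assuming GNU date (the `-d @EPOCH` form); `mce_time_utc` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert the TIME field of an MCE log line (Unix epoch seconds) to UTC.
# mce_time_utc is a hypothetical helper; it assumes GNU date.
mce_time_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# TIME value taken from the Server 2 log above
mce_time_utc 1525770548   # prints 2018-05-08 09:09:08
```

The syslog line is stamped 17:09:08, eight hours later than the UTC value, which is consistent with the host logging in a UTC+8 local time zone.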

Following the same reference posts, locate the faulty DIMM:

[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count|wc -l
16
[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:1
/sys/devices/system/edac/mc/mc0/csrow7/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0

[root@localhost ~]# cat /sys/devices/system/edac/mc/mc0/csrow6/ch0_dimm_label
CPU_SrcID#0_Ha#0_Channel#3_DIMM
[root@localhost csrow6]# cat /sys/devices/system/edac/mc/mc0/csrow6/mem_type
Registered-DDR3



This pinpoints the location of the faulty memory module.
