We have a few Dell machines running RHEL 7.6. We replaced the DIMM modules on these machines because of errors we saw in the kernel messages.
After some time we checked the kernel messages again and found the following errors about the RAM (also related to the Red Hat case https://access.redhat.com/solutions/6961932):
[Mon May 8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May 8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Error 0, type: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: fru_text: B6
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: section_type: memory error
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_status: 0x0000000000000400
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: physical_address: 0x000000446e0d5f00
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_type: 2, single-bit ECC
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: MISC 0
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May 9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported
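As a side note, the EDAC line (page:0x446e0d5 offset:0xf00) and the APEI record (physical_address: 0x000000446e0d5f00) describe the same location: the page number shifted by the 4 KiB page size plus the offset gives the physical address. A quick sanity check:

```shell
# Reconstruct the physical address from the EDAC page/offset fields
# (4 KiB pages, so page << 12). Values taken from the log above.
addr=$(( (0x446e0d5 << 12) + 0xf00 ))
printf '0x%x\n' "$addr"   # prints 0x446e0d5f00
```

which matches the APEI physical_address, so both log formats report one and the same corrected error.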
Just to be sure that the messages above are not random, we decided to reboot the machines and see whether the bad messages about memory would be reproduced,
but the error messages about the RAM are still there.
So we are confused about the problem we see in the kernel messages:
how can we still get errors about RAM even though we replaced the DIMM modules?
I must add some information about what we see in iDRAC:
as shown above, iDRAC does not complain about the DIMM modules or the RAM memory at all.
So the question is: how come dmesg (the kernel messages) complains about the RAM even though all DIMMs were replaced?
Is it possible that something else is bad rather than the DIMM modules, for example the motherboard in the Dell machine?
The error you see is a single-bit, ECC-correctable memory error that was corrected by hardware. These do not cause a component to be listed as failed in iDRAC, at least until their number exceeds some internally defined threshold, but you should see the memory error logged in the iDRAC SEL (System Event Log).
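To see how often these corrected errors accumulate, you can read the per-controller counters the edac driver exposes under sysfs. A minimal sketch (the edac_counts helper name and the overridable root argument are mine, added only so the loop can be exercised on a machine without the edac driver loaded):

```shell
# edac_counts: print corrected (CE) and uncorrected (UE) error counts per
# memory controller found under the given sysfs root. On the affected
# machine the default root is the real EDAC location.
edac_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        [ -d "$mc" ] || continue
        printf '%s: CE=%s UE=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
# on the affected machine, simply run: edac_counts
```

The SEL entries can be listed from the OS with `ipmitool sel elist`, or from the iDRAC itself with `racadm getsel`; a steadily rising CE count on one controller points at the same bank the kernel messages name.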
It is not recommended to mix single- and dual-rank modules, but your mileage may vary depending on the processor/motherboard revision.
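If you want to check whether single- and dual-rank modules are mixed, dmidecode reports the rank per slot. A small sketch (the rank_report helper name is mine, and the awk filter assumes dmidecode's usual "Locator:" / "Rank:" field layout in the Memory Device records):

```shell
# rank_report: read `dmidecode -t memory` output on stdin and print each
# DIMM locator with its rank. "Bank Locator" lines are skipped so only
# the slot name itself is captured.
rank_report() {
    awk '/Locator:/ && !/Bank Locator/ { loc = $2 }
         /Rank:/                       { print loc ": rank " $2 }'
}
# run as root on the machine:
#   dmidecode -t memory | rank_report
```

If the output shows a mix of rank 1 and rank 2 modules on the same channel, that is worth correcting regardless of the ECC errors.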