We have a few Dell machines running RHEL 7.6. We replaced the DIMM modules on these machines because of errors we saw in the kernel messages.
After some time we checked the kernel messages again and found the following errors about the RAM (also related to the Red Hat case https://access.redhat.com/solutions/6961932):
[Mon May 8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May 8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Error 0, type: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: fru_text: B6
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: section_type: memory error
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_status: 0x0000000000000400
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: physical_address: 0x000000446e0d5f00
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_type: 2, single-bit ECC
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: MISC 0
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May 9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported
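As a side note, the EDAC line (page:0x446e0d5 offset:0xf00) and the APEI record (physical_address: 0x000000446e0d5f00) describe the same location: the page number shifted by the 4 KiB page size plus the offset gives the physical address. A quick sanity check:

```shell
# Reconstruct the physical address from the EDAC page/offset fields
# (4 KiB pages, so page << 12). Values taken from the log above.
addr=$(( (0x446e0d5 << 12) + 0xf00 ))
printf '0x%x\n' "$addr"   # prints 0x446e0d5f00
```

which matches the APEI physical_address, so both log formats report one and the same corrected error.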
Just to be sure that the messages above are not random, we decided to reboot the machines and see whether the bad messages about memory would be reproduced,
but the error messages about the RAM are still there.
So we are confused about the problem we see in the kernel messages:
how can we still get errors about RAM even though we replaced the DIMM modules?
I must add some information about what we see in iDRAC:
as shown above, iDRAC does not complain about the DIMM modules or the RAM memory at all.
So the question is: how come dmesg (the kernel messages) complains about the RAM even though all DIMMs were replaced?
Is it possible that something else is bad rather than the DIMM modules, for example the motherboard in the Dell machine?
The error you see is a single-bit, ECC-correctable memory error that was corrected by hardware. These do not cause a component to be listed as failed in iDRAC, at least until their number exceeds some internally defined threshold, but you should see the memory error logged in the iDRAC SEL (System Event Log).
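To see how often these corrected errors accumulate, you can read the per-controller counters the edac driver exposes under sysfs. A minimal sketch (the edac_counts helper name and the overridable root argument are mine, added only so the loop can be exercised on a machine without the edac driver loaded):

```shell
# edac_counts: print corrected (CE) and uncorrected (UE) error counts per
# memory controller found under the given sysfs root. On the affected
# machine the default root is the real EDAC location.
edac_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        [ -d "$mc" ] || continue
        printf '%s: CE=%s UE=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
# on the affected machine, simply run: edac_counts
```

The SEL entries can be listed from the OS with `ipmitool sel elist`, or from the iDRAC itself with `racadm getsel`; a steadily rising CE count on one controller points at the same bank the kernel messages name.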
It is not recommended to mix single- and dual-rank modules, but your mileage may vary depending on the processor/motherboard revision.
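If you want to check whether single- and dual-rank modules are mixed, dmidecode reports the rank per slot. A small sketch (the rank_report helper name is mine, and the awk filter assumes dmidecode's usual "Locator:" / "Rank:" field layout in the Memory Device records):

```shell
# rank_report: read `dmidecode -t memory` output on stdin and print each
# DIMM locator with its rank. "Bank Locator" lines are skipped so only
# the slot name itself is captured.
rank_report() {
    awk '/Locator:/ && !/Bank Locator/ { loc = $2 }
         /Rank:/                       { print loc ": rank " $2 }'
}
# run as root on the machine:
#   dmidecode -t memory | rank_report
```

If the output shows a mix of rank 1 and rank 2 modules on the same channel, that is worth correcting regardless of the ECC errors.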