I'm facing an extremely weird issue with one server: it randomly freezes/hangs with no output on the screen, does not respond to keyboard shortcuts, and requires a cold boot. After the cold boot, there are no errors on the boot screen at all.
It is not freezing under heavy load: CPU usage is around 9-20% when it crashes, with a load average of around 2-5 (12-core CPU) and 128 GB of RAM.
We checked the logs, but nothing shows up such as kernel panics or anything else related to the issue itself.
After every freeze and cold boot, when we check the log we do see the normal OOM reaper killing PHP processes (users reaching their limits). It is nothing too abusive, but there is always an OOM entry. Sometimes the last log entries before the freeze carry the time of the crash, and sometimes a few lines with an older date appear after the entries from the time of the crash, and then the server freezes.
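To check whether the OOM kills really are "nothing too abusive", the log can be summarized per process. This is only a minimal sketch: it assumes the kernel logs OOM kills as "Killed process <pid> (<name>)", which varies between kernel versions, and the function name summarize_oom_kills and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: OOM kills appear in the kernel log as
# "... Killed process <pid> (<name>) ..." (exact wording varies by kernel version).
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def summarize_oom_kills(log_lines):
    """Return (kills per process name, list of (pid, name, line)) for OOM kill lines."""
    counts = Counter()
    hits = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            pid, name = int(match.group(1)), match.group(2)
            counts[name] += 1
            hits.append((pid, name, line.rstrip("\n")))
    return counts, hits


# Made-up sample lines, only to show the call; real input would be the kernel log.
sample = [
    "Jan 10 03:12:01 host kernel: Memory cgroup out of memory: Killed process 4242 (php-fpm) total-vm:512000kB",
    "Jan 10 03:12:01 host kernel: oom_reaper: reaped process 4242 (php-fpm), now anon-rss:0kB",
    "Jan 10 03:15:44 host sshd[991]: Accepted publickey for admin",
]
counts, hits = summarize_oom_kills(sample)
print(counts.most_common())
```

On a real system the lines could come from a saved kernel log of the boot that froze (for example the output of journalctl -k -b -1, if the journal is persistent). A count that stays small right up to each freeze would support the view that the OOM kills are unrelated to the hang.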
Nothing in the logs points to a software cause or to heavy load, just normal operation. This is an upgraded machine that replaced an old one, which was stable for years. The freezes are random: they can happen after a week of uptime, or two days, or three weeks, and so on.
We also tried to extract a vmcore dump of the server freeze, but nothing is captured there either.
It just freezes with no screen output. The server is still powered on but not pingable, and SSH is unreachable; the KVM, as I said, shows no output on the screen at all.
Could it be related to faulty hardware? My suspicion is faulty RAM.
I'm extremely lost with this issue. Thanks.
lm-sensors, and check the temps with the sensors command.

We just migrated to another server. After a lot of searching and debugging, it looks like a hardware issue with the motherboard: on some forums about ASRock Rack motherboards with Ryzen CPUs I managed to find a few cases of the same issue, even with Windows 10 or Windows Server getting a blue screen of death. The OS support suggested not changing the motherboard brand in this case, as the system might then refuse to boot, and migrating to a new server instead, which we did. After we migrated to the new server, all issues were resolved, so I guess it was a hardware issue and not software.
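For anyone debugging a similar freeze, the temperature check with sensors mentioned earlier can also be scripted so readings can be logged before the next hang. This is a rough sketch, assuming an lm-sensors version that supports JSON output (sensors -j); the 80 °C limit and the function names hot_readings and read_sensors are hypothetical choices, not recommended values.

```python
import json
import subprocess


def hot_readings(sensors_json, limit_c=80.0):
    """Return (chip, label, temp) tuples at or above limit_c (hypothetical threshold)."""
    hot = []
    for chip, features in sensors_json.items():
        if not isinstance(features, dict):
            continue
        for label, readings in features.items():
            if not isinstance(readings, dict):
                continue  # skips plain strings such as the "Adapter" entry
            for key, value in readings.items():
                # Temperature readings are reported under keys like "temp1_input".
                if key.startswith("temp") and key.endswith("_input") and value >= limit_c:
                    hot.append((chip, label, value))
    return hot


def read_sensors():
    """Run `sensors -j` and parse its JSON output (requires lm-sensors to be installed)."""
    out = subprocess.run(["sensors", "-j"], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)
```

Calling hot_readings(read_sensors()) from a cron job and appending the result to a file would leave a temperature trail on disk, which can help rule overheating in or out after a cold boot.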