I rent a dedicated server (with an Intel Haswell CPU and custom hardware) from a low-cost hosting service and run 64-bit CentOS 6.4 Linux on it (stock kernel: 2.6.32-358.14.1.el6.x86_64).
Every few weeks it hangs and the other customers seem to have similar problems.
In the dmesg output I see (here is the full dmesg output):
CPU0: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz stepping 03
....
NMI watchdog enabled, takes one hw-pmu counter.
....
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh
iTCO_wdt: Found a Lynx Point TCO device (Version=2, TCOBASE=0x1860)
iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
and in the process list I see:
# ps uawwwx|grep [w]atchdog
root 6 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/0]
root 10 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/1]
root 14 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/2]
root 18 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/3]
root 22 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/4]
root 26 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/5]
root 30 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/6]
root 34 0.0 0.0 0 0 ? S Aug22 0:00 [watchdog/7]
Does this mean that a hardware watchdog is already active on my server and will reboot the machine within 30 seconds of it freezing?
(In /etc/sysctl.conf I have set kernel.panic=10, so that it no longer gets stuck in the kdb console.)
Or do I have to install and start the CentOS package watchdog?
Well, there are a few issues to tackle here...
What happens when the server hangs? What's on the screen? What's in the logs? Do you have to engage with the hosting provider to reboot? Can you perform the reset on your own?
Your server should not be hanging, stalling or crashing!! Having worked in environments where low-end, DIY or custom hardware is used, I understand that the service provider's aim is to cut costs. However, if there's a stability issue, the onus is on the provider to remediate it. It's not difficult to build a stable Linux server platform, yet instability happens more often than it should. If the combination of hardware/software/OS/firmware is toxic, that's a bad sign. The provider should be operating at a scale where they can understand problems before they impact multiple clients.
Does your hardware have an IPMI device? Do YOU have IPMI access? Typically, watchdogs are part of your out-of-band management device. For instance, HP ProLiant servers have their Automatic Server Recovery (ASR) feature set to handle this.
The device your system detects is part of the Intel chipset in use. So there is technically a watchdog device and there is generic kernel support for it (it looks like it's in the CentOSPlus kernel, not the one you have). However, the watchdog package can help as a software-level watchdog, outside of the hardware hooks you may have.
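To see whether the chipset watchdog is actually usable, you can check that the driver is loaded and that the device node exists. This is only a minimal sketch: `wd_node_state` is a made-up helper name, and `/dev/watchdog` is assumed to be the node path (it is the conventional one).

```shell
#!/bin/sh
# Report whether a watchdog device node exists (default: /dev/watchdog).
# wd_node_state is a hypothetical helper name, not part of any package.
wd_node_state() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Is the Intel TCO driver loaded? (prints nothing if it is not)
lsmod 2>/dev/null | grep -i itco || true
echo "watchdog node: $(wd_node_state)"
```

A present node only means a driver has registered; the timer is not armed until something opens the device.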
But again, you're treating the symptom here. It's important to get to the root cause. If other customers are encountering these issues, you all need to resolve this with the service provider.
Linux has a generic watchdog interface. You can use it either through the hardware watchdog timer that your iTCO_wdt driver exposes or by installing and configuring a software watchdog that does not depend on the hardware.
CentOS
On Ubuntu
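A minimal sketch of the install step, assuming the package is called `watchdog` on both distributions (it is in the stock CentOS and Ubuntu repositories). The `wd_install` helper and the `RUN` dry-run switch are made up for this example, so nothing is installed unless you ask for it.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# As root, set RUN to the empty string to run them for real.
RUN="${RUN-echo}"

# wd_install is a hypothetical helper; pass the package manager to use.
wd_install() {
    case "$1" in
        yum)     $RUN yum install -y watchdog ;;      # CentOS
        apt-get) $RUN apt-get install -y watchdog ;;  # Ubuntu
        *)       echo "unknown package manager: $1" >&2; return 1 ;;
    esac
}

wd_install yum   # in a dry run this only prints the yum command
```

To apply it for real, run something like `RUN= wd_install yum` as root.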
Then...
Of course you should know that in Vim the colon (:) key opens the menu (or rather, the command line), w tells it to write your changes (w! forces it to), and q quits. (Also that you can use the vi-style cursor keys, hjkl, to move around, the letter d to delete, i to insert and Escape to stop inserting.)
Uncomment:
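If you'd rather not edit by hand, the same change can be sketched with sed. This assumes the file ships with a commented-out `watchdog-device` line, as the stock `/etc/watchdog.conf` normally does; `wd_enable_device` is a made-up helper name.

```shell
#!/bin/sh
# Uncomment the watchdog-device line in a watchdog.conf-style file.
# A backup of the original is kept next to it with a .bak suffix.
# wd_enable_device is a hypothetical helper name.
wd_enable_device() {
    conf="$1"
    sed -i.bak 's|^#[[:space:]]*\(watchdog-device[[:space:]]*=.*\)|\1|' "$conf"
}

# Usage (as root):
#   wd_enable_device /etc/watchdog.conf
```

Only the `watchdog-device` line is touched; every other commented option stays as it was.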
See
For more... when you're done...
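Once the config is in place the daemon has to be enabled and started. A minimal sketch for CentOS 6, assuming the package installs a SysV init script named `watchdog`; the `wd_enable_service` helper and the `RUN` dry-run switch are made up for this example.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# As root, set RUN to the empty string to run them for real.
RUN="${RUN-echo}"

# wd_enable_service is a hypothetical helper name.
wd_enable_service() {
    $RUN chkconfig watchdog on      # CentOS 6: start the daemon at boot
    $RUN service watchdog start     # start the daemon now
}

wd_enable_service
```

On Ubuntu the `service watchdog start` line should work the same way, while boot-time activation is handled by the package's own init setup rather than chkconfig.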
Yes, those processes are related to the watchdog, but unless they're configured properly, they're just sitting there doing nothing.
This should help you cope with problems such as unreliable power supplies, by turning random lock-ups into random reboots.
You can test it with
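One possible test, sketched under assumptions: stop the daemon so the device is free, then open `/dev/watchdog` and close it again without writing the magic `V` character. With a working hardware timer the kernel should leave the timer running and reset the box once the heartbeat (30 sec in the dmesg above) expires. The `wd_trigger_test` helper and the `RUN` dry-run switch are made up for this example. Only try it on a machine you can afford to reboot.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# As root, set RUN to the empty string to run them for real.
RUN="${RUN-echo}"

# wd_trigger_test is a hypothetical helper name.
wd_trigger_test() {
    dev="${1:-/dev/watchdog}"
    # Free the device; a clean daemon shutdown also disarms the timer.
    $RUN service watchdog stop
    # Opening the node arms the timer; closing it without 'V' leaves it armed.
    $RUN sh -c "[ -c $dev ] && echo 1 > $dev"
}

wd_trigger_test
```

If nothing happens after a minute or two, the driver is probably not driving a real timer on your board.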
If it still doesn't work, you might have to sweat a little more and find out what driver your platform supports.
Personally, I would try loading and testing each watchdog timer module individually, with something like this, run as root in the shell:
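A minimal sketch of such a loop, with several assumptions: the modules live under `kernel/drivers/watchdog` as uncompressed `.ko` files, a working driver creates `/dev/watchdog`, and 90 seconds is long enough for any of the timers to fire. The `wd_try_modules` helper and the `RUN` dry-run switch are made up for this example.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# As root, set RUN to the empty string to run them for real.
RUN="${RUN-echo}"
LOG="${LOG:-/var/log/watchdog-test.log}"
MODDIR="${MODDIR:-/lib/modules/$(uname -r)/kernel/drivers/watchdog}"

# wd_try_modules is a hypothetical helper name.
wd_try_modules() {
    for ko in "$MODDIR"/*.ko; do
        [ -e "$ko" ] || continue
        mod=$(basename "$ko" .ko)
        $RUN modprobe "$mod" || continue
        # Record the candidate and flush it to disk before arming the timer.
        $RUN sh -c "echo $mod >> $LOG; sync"
        # Arm the timer only if the driver created the device node, then wait.
        $RUN sh -c "[ -c /dev/watchdog ] && echo 1 > /dev/watchdog && sleep 90"
        $RUN modprobe -r "$mod"
    done
}

# Usage (as root, on a machine you can afford to reboot):
#   RUN= wd_try_modules
```

The last line written to the log before a reboot names the module that was loaded when the timer fired.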
If it just runs through with no delays, then none of the modules seems to have worked. If your PC reboots, then once it has booted up again:
tail -1 /var/log/watchdog-test.log
Will show a likely candidate... Now make sure your server loads it...
Ubuntu seems to use the module you note here:
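A minimal sketch for making the module load at boot, assuming it is `iTCO_wdt` (the driver named in the dmesg output above) and that the system reads module names from `/etc/modules`, as Ubuntu does. `wd_load_at_boot` is a made-up helper name.

```shell
#!/bin/sh
# Append a module name to a modules list file unless it is already listed.
# wd_load_at_boot is a hypothetical helper name.
wd_load_at_boot() {
    mod="$1"
    list="${2:-/etc/modules}"
    grep -qxF "$mod" "$list" 2>/dev/null || echo "$mod" >> "$list"
}

# Usage (as root):
#   wd_load_at_boot iTCO_wdt
```

CentOS 6 does not read `/etc/modules`; there the usual approach is an executable script under `/etc/sysconfig/modules/` that calls `modprobe`.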
I haven't tested this. If you do, come and update this answer.
TODO: Here's a hint for SUSE: https://www.suse.com/support/kb/doc?id=7016880
and for Ubuntu: https://github.com/miniwark/miniwark-howtos/wiki/Hardware-Watchdog-Timer-setup-on-Ubuntu-12.04 and http://odroid.com/dokuwiki/doku.php?id=en:odroid_linux_watchdog