On our KVM virtualisation server, the CPU cores are being disabled and re-enabled in a loop after 10 minutes (every disable results in a 15-second hang for all virtual machines).
It has been happening since a thunderstorm a week ago, when all virtual servers hung because of a data disk error (the system disk was OK). So we replaced the data disk. Next, we tried upgrading the host system from Ubuntu Natty (kernel 2.6) to Ubuntu Precise (kernel 3.2), with no change.
I found only one forum thread about it, without a solution: http://ubuntuforums.org/showthread.php?p=12071553
I tried switching on KVM debug tracing via
/sys/kernel/debug/tracing/trace_pipe
and finding the exact place by kernel time in syslog, but I don't understand the trace and didn't see any important difference.
I think it could be some bad signal from the motherboard. The disk error may have damaged something on the motherboard, but I don't know how to check that.
Here is the part of syslog covering one disable/enable loop:
Jul 14 15:36:44 node-01 kernel: [56713.568733] kvm: disabling virtualization on CPU1
Jul 14 15:36:44 node-01 kernel: [56713.668842] CPU 1 is now offline
Jul 14 15:36:44 node-01 kernel: [56713.670835] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5
Jul 14 15:36:44 node-01 kernel: [56713.673771] kvm: disabling virtualization on CPU2
Jul 14 15:36:44 node-01 kernel: [56713.674492] CPU 2 is now offline
Jul 14 15:36:44 node-01 kernel: [56713.680172] kvm: disabling virtualization on CPU3
Jul 14 15:36:44 node-01 kernel: [56713.681114] CPU 3 is now offline
Jul 14 15:36:44 node-01 kernel: [56713.681119] SMP alternatives: switching to UP code
Jul 14 15:36:44 node-01 kernel: [56713.701971] init: anacron main process (3613) killed by TERM signal
Jul 14 15:36:44 node-01 kernel: [56713.709803] r8169 0000:01:00.0: eth0: link down
Jul 14 15:36:44 node-01 kernel: [56713.710421] br0: port 1(eth0) entering forwarding state
Jul 14 15:36:47 node-01 kernel: [56716.675313] r8169 0000:01:00.0: eth0: link up
Jul 14 15:36:47 node-01 kernel: [56716.676438] br0: port 1(eth0) entering forwarding state
Jul 14 15:36:47 node-01 kernel: [56716.676454] br0: port 1(eth0) entering forwarding state
Jul 14 15:36:56 node-01 kernel: [56725.666787] br0: port 1(eth0) entering forwarding state
Jul 14 15:37:02 node-01 kernel: [56730.815937] SMP alternatives: switching to SMP code
Jul 14 15:37:02 node-01 kernel: [56730.825021] Booting Node 0 Processor 1 APIC 0x4
Jul 14 15:37:02 node-01 kernel: [56730.825025] smpboot cpu 1: start_ip = 9a000
Jul 14 15:37:02 node-01 kernel: [56730.836033] Calibrating delay loop (skipped) already calibrated this CPU
Jul 14 15:37:02 node-01 kernel: [56730.837012] kvm: enabling virtualization on CPU1
Jul 14 15:37:02 node-01 kernel: [56730.858555] NMI watchdog enabled, takes one hw-pmu counter.
Jul 14 15:37:02 node-01 kernel: [56730.862547] Booting Node 0 Processor 2 APIC 0x1
Jul 14 15:37:02 node-01 kernel: [56730.862551] smpboot cpu 2: start_ip = 9a000
Jul 14 15:37:02 node-01 kernel: [56730.873460] Calibrating delay loop (skipped) already calibrated this CPU
Jul 14 15:37:02 node-01 kernel: [56730.874453] kvm: enabling virtualization on CPU2
Jul 14 15:37:02 node-01 kernel: [56730.896371] NMI watchdog enabled, takes one hw-pmu counter.
Jul 14 15:37:02 node-01 kernel: [56730.898581] Booting Node 0 Processor 3 APIC 0x5
Jul 14 15:37:02 node-01 kernel: [56730.898586] smpboot cpu 3: start_ip = 9a000
Jul 14 15:37:02 node-01 kernel: [56730.909496] Calibrating delay loop (skipped) already calibrated this CPU
Jul 14 15:37:02 node-01 kernel: [56730.910227] kvm: enabling virtualization on CPU3
Jul 14 15:37:02 node-01 kernel: [56730.930644] NMI watchdog enabled, takes one hw-pmu counter.
Jul 14 15:37:02 node-01 kernel: [56730.963737] r8169 0000:01:00.0: eth0: link down
Jul 14 15:37:02 node-01 kernel: [56730.964069] br0: port 1(eth0) entering forwarding state
Jul 14 15:37:04 node-01 kernel: [56733.432535] r8169 0000:01:00.0: eth0: link up
Jul 14 15:37:04 node-01 kernel: [56733.433808] br0: port 1(eth0) entering forwarding state
Jul 14 15:37:04 node-01 kernel: [56733.433823] br0: port 1(eth0) entering forwarding state
Jul 14 15:37:13 node-01 kernel: [56742.424751] br0: port 1(eth0) entering forwarding state
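To put a number on the hang, the kernel timestamps in an excerpt like the one above can be compared with a small script. This is only a rough sketch: the function names `cpu_events` and `outage_seconds` are made up for illustration, and it assumes the lines passed in cover a single disable/enable loop in the same syslog format as shown here.

```python
import re

# Kernel timestamp and message from a syslog line such as
# "Jul 14 15:36:44 node-01 kernel: [56713.568733] kvm: disabling virtualization on CPU1"
LINE_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)$")
EVENT_RE = re.compile(r"kvm: (disabling|enabling) virtualization on CPU(\d+)")


def cpu_events(lines):
    """Yield (kernel_time, action, cpu) for each KVM enable/disable message."""
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a kernel line in the expected format
        ev = EVENT_RE.match(m.group(2))
        if ev:
            yield float(m.group(1)), ev.group(1), int(ev.group(2))


def outage_seconds(lines):
    """Seconds between the first 'disabling' and the last 'enabling' message.

    Assumes the lines contain exactly one disable/enable loop.
    Returns None if either kind of message is missing.
    """
    events = list(cpu_events(lines))
    down = [t for t, action, _ in events if action == "disabling"]
    up = [t for t, action, _ in events if action == "enabling"]
    if not down or not up:
        return None
    return max(up) - min(down)
```

For the excerpt above this gives roughly 17.3 seconds between the first "disabling virtualization" and the last "enabling virtualization" message. Feeding it a whole log file (for example `outage_seconds(open("/var/log/syslog"))`, assuming that is where the kernel messages end up) would only make sense after splitting the file into individual loops.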
Thank you for any tips on how to track down the error.
In our case, this behaviour started after a disk error (and a preceding thunderstorm, maybe with an electrical surge). So I don't know whether there was some bad signal from the motherboard about frequency/power/sleep etc., or whether it was a bad pm-utils configuration.
Uninstalling the pm-utils package resolved this issue.
Before that, we tried upgrading the distro from Ubuntu Natty (kernel 2.6) to Ubuntu Precise (kernel 3.2), but without success.
Another thing I tried was removing the possibility of enabling/disabling CPU cores (via the /sys/devices/system/cpu/cpu*/online file).
There is a kernel option nr_cpus= which can be set to the number of processors (cores) in use. Setting this should disable CPU hotplug. But in my case, adding it to the GRUB boot parameters had no effect (apart from the /sys/devices/system/cpu/cpu*/online file going missing).
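To check which cores are currently online and whether the online file exists at all (for example after booting with nr_cpus=), something like the following can be used. It is a minimal sketch: the function name `cpu_online_state` is just an illustration, and it only reads the sysfs files mentioned above.

```python
import glob
import os


def cpu_online_state(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return {cpu_name: state} for every cpuN directory.

    state is True (online), False (offline), or None when the
    'online' file does not exist for that CPU.
    """
    states = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq or cpuidle
    for cpu_dir in sorted(glob.glob(os.path.join(sysfs_cpu_dir, "cpu[0-9]*"))):
        name = os.path.basename(cpu_dir)
        online_path = os.path.join(cpu_dir, "online")
        if os.path.exists(online_path):
            with open(online_path) as f:
                states[name] = f.read().strip() == "1"
        else:
            states[name] = None
    return states
```

Calling `cpu_online_state()` periodically while the loop is happening should show the cores flipping to False. A value of None for every core would match what I saw after setting nr_cpus=, where the online file was missing.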