On one of our OpenSUSE (11.3) virtualization servers we have had some stability problems recently. The server is pretty heavily loaded, with up to 100 virtual domains, all running on LVM snapshots. The server has repeatedly hung without warning (it still responds to ping, but it is impossible to log in).
LVM writes a lot of messages to the log, but I am not sure whether these are normal. Here is a sample from the log right before the server hung:
Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:27:15 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:27:15 xen lvm[3384]: dm_task_run failed, errno = 22, Invalid argument
Oct 28 13:28:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:30:45 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:30:55 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:30:55 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:15 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:21 xen kernel: [76866.784142] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 28 13:31:21 xen kernel: [76866.784158] sr 2:0:0:0: [sr0] CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
Oct 28 13:31:21 xen kernel: [76866.784183] ata3.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
Oct 28 13:31:21 xen kernel: [76866.784186] res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Oct 28 13:31:21 xen kernel: [76866.784192] ata3.00: status: { DRDY }
Oct 28 13:31:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:26 xen kernel: [76871.824061] ata3: link is slow to respond, please be patient (ready=0)
Oct 28 13:31:31 xen kernel: [76876.808057] ata3: device not ready (errno=-16), forcing hardreset
Oct 28 13:31:31 xen kernel: [76876.808080] ata3: soft resetting link
Oct 28 13:31:31 xen kernel: [76876.972455] ata3: nv_mode_filter: 0x739f&0x739f->0x739f, BIOS=0x7000 (0xc0000000) ACPI=0x701f (60:900:0x11)
Oct 28 13:31:31 xen kernel: [76876.988343] ata3.00: configured for UDMA/33
Oct 28 13:31:31 xen kernel: [76876.988971] ata3: EH complete
Oct 28 13:31:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:45 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:32:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:32:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:35:55 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:36:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:36:25 xen lvm[3384]: Another thread is handling an event. Waiting...
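For context, a quick way to see how these messages cluster over time is to bucket them per minute. This is just a sketch; the inline sample stands in for the real syslog, and the `/var/log/messages` path mentioned in the comment is an assumption about the syslog configuration:

```shell
#!/bin/sh
# Bucket the dmeventd "Another thread is handling an event" messages per
# minute to see whether they cluster just before a hang. The inline sample
# below stands in for the real syslog; on the server, point LOG at
# /var/log/messages (path is an assumption) instead of the temp file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:27:15 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
EOF
# Fields 1-3 are "Mon DD HH:MM:SS"; keep only HH:MM for per-minute buckets.
buckets=$(grep 'Another thread is handling an event' "$LOG" \
  | awk '{ print $1, $2, substr($3, 1, 5) }' \
  | sort | uniq -c)
echo "$buckets"
rm -f "$LOG"
```

With the sample above this prints one line per minute with a leading count, so a burst right before a hang stands out immediately.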
At this point I have four questions:
- Are these log messages normal for LVM?
- Why does dm_task_run print errors about invalid arguments in the log?
- What might cause the SATA link reset?
- Does any of this hint at what might be causing the server to hang?
Edit: On the next hang I got some more information:
Oct 30 01:59:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 30 02:01:15 xen kernel: [208260.160037] BUG: soft lockup - CPU#3 stuck for 67s! [qemu-dm:12987]
Oct 30 02:01:15 xen kernel: [208260.160053] BUG: soft lockup - CPU#4 stuck for 67s! [qemu-dm:26429]
Oct 30 02:01:15 xen kernel: [208260.160070] Modules linked in:
Oct 30 02:01:15 xen kernel: [208260.160074] Modules linked in: bridge stp llc bridge usbbk stp llc gntdev usbbk gntdev netbk netbk blkbk blkbk blkback hcd usbcore sd_mod dm_snapshot xenblk ehci_hcd cdrom xennet usbcore edd dm_mod dm_snapshot fan pata_amd xenblk sata_nv libata cdrom scsi_mod thermal xennet processor thermal_sys edd hwmon
Oct 30 02:01:15 xen kernel: [208260.160276] dm_modCPU 4
Oct 30 02:01:15 xen kernel: [208260.160283] Modules linked in: fan bridge stp pata_amd llc usbbk gntdev netbk sata_nv blkbk blkback_pagemap libata blktap domctl xenbus_be scsi_mod evtchn nfs lockd thermal fscache nfs_acl auth_rpcgss processor sunrpc fuse loop thermal_sys tg3 pcspkr i2c_nforce2 hwmon i2c_core ghes 8250_pnp
Oct 30 02:01:15 xen kernel: [208260.160342] CPU 3 8250
Oct 30 02:01:15 xen kernel: [208260.160346] Modules linked in: shpchp bridge stp llc k10temp usbbk gntdev netbk hed blkbk blkback_pagemap blktap domctl sr_mod xenbus_be evtchn nfs forcedeth pci_hotplug lockd serial_core serio_raw sg fscache nfs_acl button auth_rpcgss sunrpc fuse usbhid hid linear ohci_hcd loop sd_mod ehci_hcd tg3 pcspkr i2c_nforce2 usbcore i2c_core ghes 8250_pnp dm_snapshot 8250 shpchp k10temp xenblk hed sr_mod cdrom forcedeth xennet edd dm_mod pci_hotplug serial_core fan serio_raw sg pata_amd button usbhid hid sata_nv libata linear scsi_mod ohci_hcd sd_mod thermal ehci_hcd processor thermal_sys hwmon usbcore dm_snapshot
Oct 30 02:01:15 xen kernel: [208260.160471] xenblk
Oct 30 02:01:15 xen kernel: [208260.160477] cdrom
The only question I have a solid answer for is the first one: that message is not normal for my LVM setup.
Googling the message doesn't turn up much beyond this thread, where someone's LVM hangs on a drive failure because LVM apparently can't detect that the drive has failed (later posts in the thread go into the reason, but I'm not familiar enough with multipath or SANs to follow what's going on). If you're not using multipath, it probably won't help you much.
In my experience drive resets are a sign of a failing drive, but a failing drive usually logs a number of timeout errors before the kernel decides to reset it. Try reading the SMART data for the drive(s) involved and see whether the drive itself reports a problem. It could also be a drive or controller firmware issue, since the first log seems to imply that the drive just suddenly stopped responding; then again, the timeout errors may simply have been lost somewhere before the stream of LVM messages. If ata3 isn't a new drive, a firmware problem is less likely. What is interesting is that the drive comes back registered in UDMA/33 mode (an old PATA speed) after the reset, which suggests the drive did not reset correctly and could explain why LVM stayed hung.
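As a sketch of what to look for in the SMART data: the counters that typically rise on a failing drive or bad cabling are reallocated/pending sectors and UDMA CRC errors. The device name and the sample attribute values below are made up for illustration; on the real machine you would first capture the table with `smartctl -A /dev/sdX` (from the smartmontools package):

```shell
#!/bin/sh
# Scan a captured "smartctl -A" attribute table and flag the counters that
# usually rise on failing drives or bad cabling. The sample table below is
# made up; on the real server, run "smartctl -A /dev/sdX > smart.txt" first
# (the device name is an assumption -- match it to ata3 via dmesg).
SMART=$(mktemp)
cat > "$SMART" <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
EOF
# Print any watched attribute whose raw value (last field) is nonzero.
flags=$(awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error_Count/ \
             && $NF+0 > 0 { print $2, "raw =", $NF }' "$SMART")
echo "$flags"
rm -f "$SMART"
```

A nonzero `UDMA_CRC_Error_Count` in particular points at the cable or controller rather than the platters, which would fit the link-reset symptoms better than a dying disk.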