Currently setting up a small KVM host to run a few VMs for a small business. The server has two drives in software md RAID 1, and the array is set up as a PV in an LVM setup. Guests and host are all CentOS 6.4 64-bit.
The / partitions of the KVM guests are disk images, but for one particular guest with higher I/O requirements I've added a second disk to the VM, backed by a logical volume from the host storage pool.
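For reference, the stack was put together roughly like this (device names, the 250G size, and the attach step are illustrative rather than my exact commands):

# Mirror the two drives with md
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# Put LVM on top of the mirror
pvcreate /dev/md1
vgcreate VolGroup00 /dev/md1
# Carve out a volume for the busy guest's second disk
lvcreate -L 250G -n lv_mail VolGroup00
# Attach it to the guest as a virtio disk (appears as /dev/vdb inside)
virsh attach-disk mail /dev/VolGroup00/lv_mail vdb --persistent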
I was running some pretty intense I/O on this LV this evening within the guest, extracting a 60GB multi-volume 7z archive of data. 7z barfed about 1/5th of the way through with E_FAIL. I tried to move some files around on this LV-backed disk and was greeted with "cannot move ... read-only file system", even though all devices are mounted rw. I looked in /var/log/messages and saw the following errors:
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307631
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307632
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307633
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378473448
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378474456
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378475464
Nov 22 21:47:55 mail kernel: JBD: Detected IO errors while flushing file data on vdb1
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 255779688
Nov 22 21:47:55 mail kernel: Aborting journal on device vdb1.
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 255596560
Nov 22 21:47:55 mail kernel: JBD: I/O error detected when updating journal superblock for vdb1.
Nov 22 21:48:06 mail kernel: __ratelimit: 20 callbacks suppressed
Nov 22 21:48:06 mail kernel: __ratelimit: 2295 callbacks suppressed
Nov 22 21:48:06 mail kernel: Buffer I/O error on device vdb1, logical block 47270479
Nov 22 21:48:06 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:48:06 mail kernel: Buffer I/O error on device vdb1, logical block 47271504
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378116680
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378157680
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378432440
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: ext3_journal_start_sb: Detected aborted journal
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: remounting filesystem read-only
Nov 22 21:51:55 mail kernel: __ratelimit: 35 callbacks suppressed
Nov 22 21:51:55 mail kernel: __ratelimit: 35 callbacks suppressed
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64003824
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64003839
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 256
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 32
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64
Nov 22 21:51:55 mail kernel: end_request: I/O error, dev vdb, sector 6144
Nov 22 21:55:06 mail yum[19139]: Installed: lsof-4.82-4.el6.x86_64
Nov 22 21:59:47 mail kernel: __ratelimit: 1 callbacks suppressed
Nov 22 22:00:01 mail kernel: __ratelimit: 1 callbacks suppressed
Nov 22 22:00:01 mail kernel: Buffer I/O error on device vdb1, logical block 64003824
Nov 22 22:00:01 mail kernel: Buffer I/O error on device vdb1, logical block 512
There were plenty more errors than that; the full excerpt is here: http://pastebin.com/vH8SDrCg
Note the point where there's an I/O error when "updating journal superblock", and later the volume is remounted read-only because of the aborted journal.
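Presumably once the underlying problem is fixed, the guest side will need something like this before I trust the volume again (/data is just an example mount point, assuming vdb1 can be unmounted):

umount /data                 # wherever vdb1 is mounted; example path
e2fsck -f /dev/vdb1          # full check after the journal abort
mount /dev/vdb1 /data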
So time to look at the host.
cat /proc/mdstat returns UU for both RAID 1 arrays (boot and main PV). mdadm --detail shows state: clean and state: active respectively. The typical LVM commands pvs, vgs and lvs all return the following errors:
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 262160711680: Input/output error
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 262160769024: Input/output error
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 0: Input/output error
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 4096: Input/output error
VG #PV #LV #SN Attr VSize VFree
VolGroup00 1 4 1 wz--n- 930.75g 656.38g
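To tie the dm-N names in the host logs (below) back to LV names, the device-mapper minor number can be matched up like this:

dmsetup info -c            # columns include Name and Maj:Min
ls -l /dev/mapper/         # symlinks point to ../dm-N
lvs -a -o +devices         # -a also lists hidden/internal volumes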
/var/log/messages on the host shows this:
Nov 22 21:47:53 localhost kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 1
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 2
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 3
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 64004095
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 64004095
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
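That "Unable to allocate exception" line suggests an LVM snapshot (vgs does show #SN 1) filled its COW space and was invalidated. This is roughly how I'd check it; the snapshot name placeholder is whatever lvs actually reports:

lvs -a -o lv_name,origin,snap_percent,lv_size   # a snapshot at/near 100% is full
dmsetup status                                   # snapshot targets report used/total sectors
# If the snapshot is confirmed invalid it could be dropped:
# lvremove /dev/VolGroup00/<snapshot_name>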
A short self-test with smartctl revealed nothing on either physical disk, and there were no worrying error counters in the SMART data either; almost all attributes are at 0 apart from power-on hours, spin-up time and temperature. Even power-on hours are relatively low, about 150 days or so. I currently have long self-tests in progress.
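For reference, these are the checks I'm running (sda stands in for each member disk):

smartctl -t short /dev/sda       # quick self-test (what I ran first)
smartctl -t long /dev/sda        # the long test now in progress
smartctl -l selftest /dev/sda    # results once a test finishes
smartctl -A /dev/sda             # the attribute/error-counter table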
So based on all that, what's the likelihood of this being the start of a drive failure?
Is it worth running fsck or badblocks on the host? I don't want to cause a full kernel panic at this stage. I would have thought mdstat would have shown a failed array member by now, about an hour after the event.
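In case it matters, these are the non-destructive checks I'm considering (md1 and sda are placeholders for the real array and disks):

# Ask md to verify the mirror in the background, then look for mismatches
echo check > /sys/block/md1/md/sync_action
cat /sys/block/md1/md/mismatch_cnt

# Read-only surface scan of a member disk (no -w, so nothing is written)
badblocks -sv /dev/sda

# fsck on the host won't give reliable results for lv_mail while the guest
# holds it, but -n at least guarantees no writes:
# fsck -n /dev/VolGroup00/lv_mail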
This machine is a dedicated server so I don't have physical access. I'll check the console through the DRAC shortly but I'm expecting to see a bunch of i/o errors on the console. I don't have virtual media access so can't load systemrescuecd to do repairs, so I'm a bit wary of rebooting at this stage.