Currently setting up a small KVM host to run a few VMs for a small business. The server has two drives in software md RAID 1, and the array is set up as a PV in an LVM setup. Guests and host are all CentOS 6.4 64-bit.
The / partitions of the KVM guests are disk images, but for one particular guest with higher I/O requirements I've added a second disk to the VM, backed by a logical volume from the host storage pool.
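For reference, the stack was put together roughly like this (device names, the 250G size, and the attach step are illustrative rather than my exact commands):

# Mirror the two drives with md
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# Put LVM on top of the mirror
pvcreate /dev/md1
vgcreate VolGroup00 /dev/md1
# Carve out a volume for the busy guest's second disk
lvcreate -L 250G -n lv_mail VolGroup00
# Attach it to the guest as a virtio disk (appears as /dev/vdb inside)
virsh attach-disk mail /dev/VolGroup00/lv_mail vdb --persistent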
I was running some pretty intense I/O on this LV this evening within the guest, extracting a 60GB multi-volume 7z archive of data. 7z barfed about 1/5th of the way through with E_FAIL. I tried to move some files around on this LV-backed disk and was greeted with "cannot move ... read-only file system", even though all devices are mounted rw. I looked in /var/log/messages and saw the following errors:
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307631
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307632
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307633
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378473448
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378474456
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378475464
Nov 22 21:47:55 mail kernel: JBD: Detected IO errors while flushing file data on vdb1
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 255779688
Nov 22 21:47:55 mail kernel: Aborting journal on device vdb1.
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 255596560
Nov 22 21:47:55 mail kernel: JBD: I/O error detected when updating journal superblock for vdb1.
Nov 22 21:48:06 mail kernel: __ratelimit: 20 callbacks suppressed
Nov 22 21:48:06 mail kernel: __ratelimit: 2295 callbacks suppressed
Nov 22 21:48:06 mail kernel: Buffer I/O error on device vdb1, logical block 47270479
Nov 22 21:48:06 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:48:06 mail kernel: Buffer I/O error on device vdb1, logical block 47271504
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378116680
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378157680
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378432440
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: ext3_journal_start_sb: Detected aborted journal
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: remounting filesystem read-only
Nov 22 21:51:55 mail kernel: __ratelimit: 35 callbacks suppressed
Nov 22 21:51:55 mail kernel: __ratelimit: 35 callbacks suppressed
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64003824
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64003839
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 256
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 32
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64
Nov 22 21:51:55 mail kernel: end_request: I/O error, dev vdb, sector 6144
Nov 22 21:55:06 mail yum[19139]: Installed: lsof-4.82-4.el6.x86_64
Nov 22 21:59:47 mail kernel: __ratelimit: 1 callbacks suppressed
Nov 22 22:00:01 mail kernel: __ratelimit: 1 callbacks suppressed
Nov 22 22:00:01 mail kernel: Buffer I/O error on device vdb1, logical block 64003824
Nov 22 22:00:01 mail kernel: Buffer I/O error on device vdb1, logical block 512
There were plenty more errors than that; the full excerpt is here: http://pastebin.com/vH8SDrCg
Note the point where there's an I/O error when "updating journal superblock", and later the volume is remounted read-only because of the aborted journal.
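Presumably once the underlying problem is fixed, the guest side will need something like this before I trust the volume again (/data is just an example mount point, assuming vdb1 can be unmounted):

umount /data                 # wherever vdb1 is mounted; example path
e2fsck -f /dev/vdb1          # full check after the journal abort
mount /dev/vdb1 /data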
So time to look at the host.
cat /proc/mdstat returns UU for both RAID 1 arrays (boot and main PV). mdadm --detail shows state: clean and state: active respectively. The typical LVM commands pvs, vgs and lvs all return the following errors:
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 262160711680: Input/output error
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 262160769024: Input/output error
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 0: Input/output error
/dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 4096: Input/output error
VG #PV #LV #SN Attr VSize VFree
VolGroup00 1 4 1 wz--n- 930.75g 656.38g
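To tie the dm-N names in the host logs (below) back to LV names, the device-mapper minor number can be matched up like this:

dmsetup info -c            # columns include Name and Maj:Min
ls -l /dev/mapper/         # symlinks point to ../dm-N
lvs -a -o +devices         # -a also lists hidden/internal volumes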
/var/log/messages on the host shows this:
Nov 22 21:47:53 localhost kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 1
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 2
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 3
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 64004095
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 64004095
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
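That "Unable to allocate exception" line suggests an LVM snapshot (vgs does show #SN 1) filled its COW space and was invalidated. This is roughly how I'd check it; the snapshot name placeholder is whatever lvs actually reports:

lvs -a -o lv_name,origin,snap_percent,lv_size   # a snapshot at/near 100% is full
dmsetup status                                   # snapshot targets report used/total sectors
# If the snapshot is confirmed invalid it could be dropped:
# lvremove /dev/VolGroup00/<snapshot_name>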
A short self-test with smartctl revealed nothing on either physical disk, and there were no worrying error counters in the SMART data either; almost all attributes are at 0 apart from power-on hours, spin-up time and temperature. Even power-on hours are relatively low, about 150 days or so. I currently have long self-tests in progress.
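For reference, these are the checks I'm running (sda stands in for each member disk):

smartctl -t short /dev/sda       # quick self-test (what I ran first)
smartctl -t long /dev/sda        # the long test now in progress
smartctl -l selftest /dev/sda    # results once a test finishes
smartctl -A /dev/sda             # the attribute/error-counter table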
So based on all that, what's the likelihood of this being the start of a drive failure?
Is it worth running fsck or badblocks on the host? I don't want to cause a full kernel panic at this stage. I would have thought mdstat would have shown a failed array member by now, about an hour after the event.
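In case it matters, these are the non-destructive checks I'm considering (md1 and sda are placeholders for the real array and disks):

# Ask md to verify the mirror in the background, then look for mismatches
echo check > /sys/block/md1/md/sync_action
cat /sys/block/md1/md/mismatch_cnt

# Read-only surface scan of a member disk (no -w, so nothing is written)
badblocks -sv /dev/sda

# fsck on the host won't give reliable results for lv_mail while the guest
# holds it, but -n at least guarantees no writes:
# fsck -n /dev/VolGroup00/lv_mail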
This machine is a dedicated server so I don't have physical access. I'll check the console through the DRAC shortly but I'm expecting to see a bunch of i/o errors on the console. I don't have virtual media access so can't load systemrescuecd to do repairs, so I'm a bit wary of rebooting at this stage.