I've started seeing errors reported by LVM on certain Logical Volumes (and by Xen when attempting to create virtual machines on these LVs). But I've run tests on the disk, and can't see any hardware problems.
We're running a Xen/Linux (Debian Lenny) box here, running off a single SATA disk managed with LVM2. It's been up and running for more than a year, with the only major change being a recent apt-get upgrade of the kernel.
# uname -a
Linux hostname 2.6.26-2-xen-amd64 #1 SMP Thu Sep 16 16:32:15 UTC 2010 x86_64 GNU/Linux
The errors appear like this:
# vgck
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
And then when I try to start the VM which uses that LV for its C-drive (it's a Windows virtual machine), the VM refuses to start and I see this at the end of the /var/log/xen/qemu-dm-*.log
log file:
...
Register xen platform.
Done register platform.
raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x7fff02bca520, 512) [20971520] read failed -1 : 5 = Input/output error
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x12dfff0, 512) [20971520] read failed -1 : 5 = Input/output error
This first happened on 2 VMs whose disk was based on a snapshot of a third, original VM. I nuked the 2 LVs and recreated them (again by snapshotting the same, original VM's LV), and they've been fine since.
However, today I attempted to create a new VM. I snapshotted the same original VM's LV (lvcreate -L500M --snapshot --name newvm-cdrive /dev/vgroup/original-cdrive) and created the new VM. It initially worked, but after shutting down the VM once, it refuses to start up again, with the errors shown above.
My obvious first guess would be physical problems with the drive, but smartctl does not report anything:
# smartctl -t long /dev/sda
# [later]
# smartctl -l selftest /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1 -
# 2 Short offline Completed without error 00% 0 -
Also, I'm not getting any errors from badblocks.
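For reference, a sketch of the kind of scan meant here (the exact invocation used isn't given in the question; this assumes badblocks' default, non-destructive mode):

```shell
# Read-only surface scan of the whole disk (badblocks' default mode is
# non-destructive); -s shows progress, -v reports any bad blocks found.
# Run as root; /dev/sda is the disk from the question.
badblocks -sv /dev/sda
```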
I've tried running vgck and pvck:
# vgck vgroup -v
Using volume group(s) on command line
Finding volume group "vgroup"
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
# pvck /dev/sda2
Found label on /dev/sda2, sector 1, type=LVM2 001
Found text metadata area: offset=4096, size=192512
I've found a few references to this error message ("read failed after 0 of 4096 at...") on the Interwebs, but nothing which seems to apply to my situation.
Any ideas?
Update: As requested, below is output of lvdisplay and ls -l. Running out of COW space is plausible. How do I tell?
# lvdisplay /dev/vgroup/newvm-cdrive
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
--- Logical volume ---
LV Name /dev/vgroup/newvm-cdrive
VG Name vgroup
LV UUID jiarxt-q2NO-SyIf-5FrW-I9iq-mNEQ-iwS4EH
LV Write Access read/write
LV snapshot status INACTIVE destination for /dev/vgroup/original-cdrive
LV Status available
# open 0
LV Size 10.00 GB
Current LE 2560
COW-table size 200.00 MB
COW-table LE 50
Snapshot chunk size 4.00 KB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:20
# ls -l /dev/dm-20
brw-rw---- 1 root disk 254, 20 2010-10-11 15:02 /dev/dm-20
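For completeness, a sketch of mapping a bare /dev/dm-N node back to its LV name (assumes the dmsetup tool from the device-mapper package is installed, as it normally is alongside LVM2):

```shell
# Each /dev/dm-N node is a device-mapper target; "dmsetup ls" prints every
# mapped name together with its (major, minor) pair, e.g. (254, 20) here.
dmsetup ls
# Alternatively, the /dev/mapper symlinks point at the dm-N nodes:
ls -l /dev/mapper/
```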
And here's fdisk -l.
# fdisk -l /dev/sda
Disk /dev/sda: 160.0 GB, 160000000000 bytes
255 heads, 63 sectors/track, 19452 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000080
Device Boot Start End Blocks Id System
/dev/sda1 * 1 31 248976 83 Linux
/dev/sda2 32 19452 155999182+ 8e Linux LVM
Okay, I think the answer is that the COW space for the logical volume is full.
Using the command 'lvs' (which I just discovered), I see...
That capital 'S' at the start of the 'Attr' column means 'invalid snapshot'. (A lower-case 's' would mean a valid snapshot.) And as you can see, Snap% is 100, i.e., it has used all of its COW space.
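The check can be sketched like this (a sketch, assuming the lvs binary from the same LVM2 package; the field names are from its -o reporting option, and "vgroup" is the volume group from the question):

```shell
# Show each LV's attribute string and snapshot fill level (run as root).
# A leading 's' in Attr marks a valid snapshot, a capital 'S' an invalid one;
# snap_percent (the Snap% column) at 100 means the COW area is exhausted.
lvs -o lv_name,attr,origin,snap_percent vgroup
```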
Annoyingly, lvdisplay doesn't provide this information, and it doesn't tell you that your snapshot logical volume is invalid. (All it says is that the snapshot status is 'INACTIVE', which I took to mean 'not currently in use'.) And the lvs command is not very widely advertised. And the error message ("Input/output error") isn't very helpful--in fact there were no log or error messages which suggested 'snapshot is full'. (Later versions of LVM2 write messages to /var/log/messages when the space is starting to fill up, but the version in Debian Lenny doesn't. Boo.) And to compound the problem, there's no discussion of this on the internet (or at least, none that I could find)!
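For reference, LVM2 releases well after Lenny's can not only warn but also grow a filling snapshot automatically. A sketch of the relevant lvm.conf settings (option names as in later upstream lvm.conf; not available on Lenny):

```
# /etc/lvm/lvm.conf (activation section) -- later LVM2 releases only
activation {
    # When a snapshot's COW usage crosses this percentage...
    snapshot_autoextend_threshold = 70
    # ...automatically grow it by this percentage of its current size.
    snapshot_autoextend_percent = 20
}
```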
I did wonder why a full COW snapshot can't be fixed by just adding more space to the LV (using lvextend), but, actually, the COW space is consumed not only when you write to the snapshot destination, but also when you write to the snapshot source. So once your COW area fills up, any write to the source LV must necessarily invalidate the snapshot LV, and it is not easily recoverable.

(Not a direct answer, but I hope it's of use to others battling 100%-full snapshots which cause input/output errors.)
This happened to me: my snapshot became 100% full, but the file-system within it thought it had loads of space, resulting in input/output errors whenever I ran lvs or any other LVM2 command.

In my case the only option was to delete the snapshot with lvremove, but I couldn't, because I had lazily unmounted the snapshot using umount -l. This made it very difficult to track down which processes were still using the until-recently-mounted file-system.

I found success by obtaining the logical volume's major and minor device numbers, e.g. 252:10 in the following:

If you run lsof as root, without arguments, you will get a full listing of the open files on the system. Filter for your major and minor block device numbers, separated by a comma (not a colon as above), and you may find the process using it:

Note that the NAME is /; because the file-system has been lazily unmounted, lsof cannot resolve its original path name.

Kill this process, 2055 in this example, and try lvremove et al. again.
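The filtering step can be sketched like this (the device pair 252,10 and the qemu-dm/2055 values are just illustrative; in lsof's default output the DEVICE column is field 6, and for block devices it holds "major,minor"):

```shell
# Filter lsof's default output for a block device's "major,minor" pair and
# print the PID (field 2) of each process holding it open.
# A sample line stands in for real lsof output here:
sample='qemu-dm 2055 root 6u BLK 252,10 0t0 1234 /'
printf '%s\n' "$sample" | awk '$6 == "252,10" { print $2 }'
# On the real system:  lsof | awk '$6 == "252,10"'
# ...then kill the PID it reports and retry lvremove.
```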