I've started seeing errors reported by LVM on certain Logical Volumes (and by Xen when attempting to create virtual machines on these LVs). But I've run tests on the disk, and can't see any hardware problems.
We're running a Xen/Linux (Debian Lenny) box here, running off a single SATA disk managed with LVM2. It's been up and running for more than a year, with the only major change being a recent apt-get upgrade of the kernel.
# uname -a
Linux hostname 2.6.26-2-xen-amd64 #1 SMP Thu Sep 16 16:32:15 UTC 2010 x86_64 GNU/Linux
The errors appear like this:
# vgck
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
And then when I try to start the VM which uses that LV for its C-drive (it's a Windows virtual machine), the VM refuses to start and I see this at the end of the /var/log/xen/qemu-dm-*.log
log file:
...
Register xen platform.
Done register platform.
raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x7fff02bca520, 512) [20971520] read failed -1 : 5 = Input/output error
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x12dfff0, 512) [20971520] read failed -1 : 5 = Input/output error
This first happened on 2 VMs whose disk was based on a snapshot of a third, original VM. I nuked the 2 LVs and recreated them (again by snapshotting the same, original VM's LV), and they've been fine since.
However, today I attempted to create a new VM. I snapshotted the same original VM's LV (lvcreate -L500M --snapshot --name newvm-cdrive /dev/vgroup/original-cdrive) and created the new VM. It initially worked, but after shutting down the VM once, it refuses to start up again, with the errors shown above.
My obvious first guess would be physical problems with the drive, but smartctl does not report anything:
# smartctl -t long /dev/sda
# [later]
# smartctl -l selftest /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1 -
# 2 Short offline Completed without error 00% 0 -
Also, I'm not getting any errors from badblocks.
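For reference, a sketch of the kind of scan meant here (the exact invocation used isn't given in the question; this assumes badblocks' default, non-destructive mode):

```shell
# Read-only surface scan of the whole disk (badblocks' default mode is
# non-destructive); -s shows progress, -v reports any bad blocks found.
# Run as root; /dev/sda is the disk from the question.
badblocks -sv /dev/sda
```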
I've tried running vgck and pvck:
# vgck vgroup -v
Using volume group(s) on command line
Finding volume group "vgroup"
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
# pvck /dev/sda2
Found label on /dev/sda2, sector 1, type=LVM2 001
Found text metadata area: offset=4096, size=192512
I've found a few references to this error message ("read failed after 0 of 4096 at...") on the Interwebs, but nothing which seems to apply to my situation.
Any ideas?
Update: As requested, below is output of lvdisplay and ls -l. Running out of COW space is plausible. How do I tell?
# lvdisplay /dev/vgroup/newvm-cdrive
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
--- Logical volume ---
LV Name /dev/vgroup/newvm-cdrive
VG Name vgroup
LV UUID jiarxt-q2NO-SyIf-5FrW-I9iq-mNEQ-iwS4EH
LV Write Access read/write
LV snapshot status INACTIVE destination for /dev/vgroup/original-cdrive
LV Status available
# open 0
LV Size 10.00 GB
Current LE 2560
COW-table size 200.00 MB
COW-table LE 50
Snapshot chunk size 4.00 KB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:20
# ls -l /dev/dm-20
brw-rw---- 1 root disk 254, 20 2010-10-11 15:02 /dev/dm-20
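For completeness, a sketch of mapping a bare /dev/dm-N node back to its LV name (assumes the dmsetup tool from the device-mapper package is installed, as it normally is alongside LVM2):

```shell
# Each /dev/dm-N node is a device-mapper target; "dmsetup ls" prints every
# mapped name together with its (major, minor) pair, e.g. (254, 20) here.
dmsetup ls
# Alternatively, the /dev/mapper symlinks point at the dm-N nodes:
ls -l /dev/mapper/
```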
And here's fdisk -l.
# fdisk -l /dev/sda
Disk /dev/sda: 160.0 GB, 160000000000 bytes
255 heads, 63 sectors/track, 19452 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000080
Device Boot Start End Blocks Id System
/dev/sda1 * 1 31 248976 83 Linux
/dev/sda2 32 19452 155999182+ 8e Linux LVM
Okay, I think the answer is that the COW space for the logical volume is full.
Using the command 'lvs' (which I just discovered), I see...
That capital 'S' at the start of the 'Attr' column means 'invalid snapshot'. (A lower-case 's' would mean a valid snapshot.) And as you can see, Snap% is 100, i.e., it has used all of its COW space.
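The check can be sketched like this (a sketch, assuming the lvs binary from the same LVM2 package; the field names are from its -o reporting option, and "vgroup" is the volume group from the question):

```shell
# Show each LV's attribute string and snapshot fill level (run as root).
# A leading 's' in Attr marks a valid snapshot, a capital 'S' an invalid one;
# snap_percent (the Snap% column) at 100 means the COW area is exhausted.
lvs -o lv_name,attr,origin,snap_percent vgroup
```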
Annoyingly, lvdisplay doesn't provide this information, and it doesn't tell you that your snapshot logical volume is invalid. (All it says is that the snapshot status is 'INACTIVE', which I took to mean 'not currently in use'.) And the lvs command is not very widely advertised. And the error message ("Input/output error") isn't very helpful--in fact there were no log or error messages which suggested 'snapshot is full'. (Later versions of LVM2 write messages to /var/log/messages when the space is starting to fill up, but the version in Debian Lenny doesn't. Boo.) And to compound the problem, there's no discussion of this on the internet (or at least, none that I could find)!
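For reference, LVM2 releases well after Lenny's can not only warn but also grow a filling snapshot automatically. A sketch of the relevant lvm.conf settings (option names as in later upstream lvm.conf; not available on Lenny):

```
# /etc/lvm/lvm.conf (activation section) -- later LVM2 releases only
activation {
    # When a snapshot's COW usage crosses this percentage...
    snapshot_autoextend_threshold = 70
    # ...automatically grow it by this percentage of its current size.
    snapshot_autoextend_percent = 20
}
```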
I did wonder why a full COW snapshot can't be fixed by just adding more space to the LV (using lvextend), but, actually, the COW space is consumed not only when you write to the snapshot destination, but also when you write to the snapshot source. So once your COW area fills up, any write to the source LV must necessarily invalidate the snapshot LV, and it is not easily recoverable.

(Not a direct answer, but I hope it's of use to others battling 100%-full snapshots which cause input/output errors.)
This happened to me: my snapshot became 100% full, but the file-system within it thought it had loads of space, resulting in input/output errors whenever I ran lvs or any other LVM2 command.

In my case the only option was to delete the snapshot with lvremove, but I couldn't, because I had lazily unmounted the snapshot using umount -l. This made it very difficult to track down which processes were still using the until-recently-mounted file-system.

I found success by obtaining the logical volume's major and minor device numbers, e.g. 252:10 in the following:

If you run lsof as root, without arguments, you will get a full listing of the open files on the system. Filter for your major and minor block device numbers, separated by a comma (not a colon as above), and you may find the process using it:

Note that the NAME is /; because the file-system has been lazily unmounted, lsof cannot resolve its original path name.

Kill this process, 2055 in this example, and try lvremove et al. again.
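The filtering step can be sketched like this (the device pair 252,10 and the qemu-dm/2055 values are just illustrative; in lsof's default output the DEVICE column is field 6, and for block devices it holds "major,minor"):

```shell
# Filter lsof's default output for a block device's "major,minor" pair and
# print the PID (field 2) of each process holding it open.
# A sample line stands in for real lsof output here:
sample='qemu-dm 2055 root 6u BLK 252,10 0t0 1234 /'
printf '%s\n' "$sample" | awk '$6 == "252,10" { print $2 }'
# On the real system:  lsof | awk '$6 == "252,10"'
# ...then kill the PID it reports and retry lvremove.
```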