I have a Xen PV guest, running Ubuntu 10.04. I do not run the underlying host. Kernel is the stock one provided by Ubuntu:
Linux nephos 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
The machine serves as a LAMP web/DB server running a number of Perl web applications we've developed in-house. Since we went live and let our users loose on it on Monday morning, reliably once a day it has gone into a state where we can't reboot it from the command line: CGI scripts become unresponsive, ping times shoot up, and even commands like ls fail in certain directories (possibly ones to which writes are pending). top shows a number of processes in state D, mostly named fleet.cgi or doc.pl, which are our applications. Attempts to kill or kill -9 these processes silently fail. sudo reboot returns, claiming the machine is about to go down, but it never sends the broadcast message warning other shell users, nor does it actually reboot the machine.
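For reference, this is roughly how I've been inspecting the stuck processes while the machine is wedged (a minimal sketch; the SysRq dump assumes kernel.sysrq=1 and needs root):

    # List uninterruptible (D-state) tasks and the kernel symbol they're blocked in
    ps -eo pid,stat,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'

    # Ask the kernel to dump stack traces of all blocked tasks to dmesg/syslog
    sudo sh -c 'echo w > /proc/sysrq-trigger'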
When the machine begins to lock up, lines like the following appear in syslog:
Dec 14 12:05:45 nephos kernel: [71040.150212] INFO: task mysqld:2708 blocked for more than 120 seconds.
Dec 14 12:05:45 nephos kernel: [71040.150234] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 12:05:45 nephos kernel: [71040.150247] mysqld D ffff880002d4dbc0 0 2708 1 0x00000000
Dec 14 12:05:45 nephos kernel: [71040.150256] ffff8800fa5e9918 0000000000000286 0000000000015bc0 0000000000015bc0
Dec 14 12:05:45 nephos kernel: [71040.150264] ffff8800ec5883c0 ffff8800fa5e9fd8 0000000000015bc0 ffff8800ec588000
Dec 14 12:05:45 nephos kernel: [71040.150272] 0000000000015bc0 ffff8800fa5e9fd8 0000000000015bc0 ffff8800ec5883c0
Dec 14 12:05:45 nephos kernel: [71040.150280] Call Trace:
Dec 14 12:05:45 nephos kernel: [71040.150309] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
Dec 14 12:05:45 nephos kernel: [71040.150320] [<ffffffff815555c7>] io_schedule+0x47/0x70
Dec 14 12:05:45 nephos kernel: [71040.150325] [<ffffffff8116d215>] sync_buffer+0x45/0x50
Dec 14 12:05:45 nephos kernel: [71040.150330] [<ffffffff81555a9a>] __wait_on_bit_lock+0x5a/0xc0
Dec 14 12:05:45 nephos kernel: [71040.150334] [<ffffffff8116d1d0>] ? sync_buffer+0x0/0x50
Dec 14 12:05:45 nephos kernel: [71040.150339] [<ffffffff81555b78>] out_of_line_wait_on_bit_lock+0x78/0x90
Dec 14 12:05:45 nephos kernel: [71040.150347] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40
Dec 14 12:05:45 nephos kernel: [71040.150353] [<ffffffff8116c1e7>] ? __find_get_block_slow+0xb7/0x130
Dec 14 12:05:45 nephos kernel: [71040.150357] [<ffffffff8116d396>] __lock_buffer+0x36/0x40
Dec 14 12:05:45 nephos kernel: [71040.150365] [<ffffffff81212164>] do_get_write_access+0x554/0x5d0
Dec 14 12:05:45 nephos kernel: [71040.150369] [<ffffffff8116cb57>] ? __getblk+0x27/0x50
Dec 14 12:05:45 nephos kernel: [71040.150374] [<ffffffff81212371>] journal_get_write_access+0x31/0x50
Dec 14 12:05:45 nephos kernel: [71040.150381] [<ffffffff811c5f9d>] __ext3_journal_get_write_access+0x2d/0x60
Dec 14 12:05:45 nephos kernel: [71040.150386] [<ffffffff811b7c7b>] ext3_reserve_inode_write+0x7b/0xa0
Dec 14 12:05:45 nephos kernel: [71040.150392] [<ffffffff8155748e>] ? _spin_unlock_irqrestore+0x1e/0x30
Dec 14 12:05:45 nephos kernel: [71040.150396] [<ffffffff811b7ccb>] ext3_mark_inode_dirty+0x2b/0x50
Dec 14 12:05:45 nephos kernel: [71040.150401] [<ffffffff811b7e71>] ext3_dirty_inode+0x61/0xa0
Dec 14 12:05:45 nephos kernel: [71040.150406] [<ffffffff81165c22>] __mark_inode_dirty+0x42/0x1e0
Dec 14 12:05:45 nephos kernel: [71040.150412] [<ffffffff81159f8b>] file_update_time+0xfb/0x180
Dec 14 12:05:45 nephos kernel: [71040.150422] [<ffffffff810f5300>] __generic_file_aio_write+0x210/0x470
Dec 14 12:05:45 nephos kernel: [71040.150430] [<ffffffff8114f49d>] ? __link_path_walk+0xad/0xf80
Dec 14 12:05:45 nephos kernel: [71040.150435] [<ffffffff810f55cf>] generic_file_aio_write+0x6f/0xe0
Dec 14 12:05:45 nephos kernel: [71040.150441] [<ffffffff8114311a>] do_sync_write+0xfa/0x140
Dec 14 12:05:45 nephos kernel: [71040.150446] [<ffffffff81084fa0>] ? autoremove_wake_function+0x0/0x40
Dec 14 12:05:45 nephos kernel: [71040.150453] [<ffffffff8100f392>] ? check_events+0x12/0x20
Dec 14 12:05:45 nephos kernel: [71040.150461] [<ffffffff81250946>] ? security_file_permission+0x16/0x20
Dec 14 12:05:45 nephos kernel: [71040.150466] [<ffffffff81143418>] vfs_write+0xb8/0x1a0
Dec 14 12:05:45 nephos kernel: [71040.150470] [<ffffffff81143db2>] sys_pwrite64+0x82/0xa0
Dec 14 12:05:45 nephos kernel: [71040.150477] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
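As I read that trace, mysqld is parked in io_schedule() waiting for a buffer lock during an ext3 journal write -- in other words, a block I/O request underneath the filesystem never completes. To capture more than the first few of these reports next time, I believe the hung-task detector can be tuned via sysctl (a sketch; knob names as they appear in 2.6.32):

    # Keep reporting hung tasks rather than stopping after the default 10 warnings
    sudo sysctl -w kernel.hung_task_warnings=-1

    # Optionally shorten the detection window from the default 120 seconds
    sudo sysctl -w kernel.hung_task_timeout_secs=60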
I've installed the Ubuntu package linux-virtual to ensure I have a suitable kernel, but its dependencies were already satisfied by the kernel I'm running. I'm at a bit of a loss as to where else to look here, really. Under normal load nothing untoward appears in iotop, but equally I'm not aware of any sudden spike in usage that triggers the fault -- that said, the machine ran these applications for weeks with only one or two test users, and it is only failing now that a dozen people are hitting them all day.
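Since iotop only gives a live view, one thing I'm considering is leaving a crude logger running so there's some I/O history to inspect after the next lock-up (a sketch, run as root; assumes the sysstat package is installed for iostat, and /var/log/io-watch.log is just a name I've made up):

    # Snapshot disk utilisation and blocked-task counts once a minute
    nohup sh -c 'while true; do
        date
        iostat -xk 1 2    # extended per-device stats; the 2nd sample is the live one
        vmstat 1 2        # watch the "b" column (processes blocked on I/O)
        sleep 60
    done' >> /var/log/io-watch.log 2>&1 &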
Do I just need a machine with better I/O capabilities (or to reduce my apps' need for it), or is this something I can approach with a bit of tuning?
Updated 15/12/2010 23:41: If it is helpful (and I suspect it's a crucial detail), the guest is running under paravirtualisation.
From the Red Hat Bugzilla, there is further indication that turning irqbalance off may be the workaround, and that a fix is in 2.6.32.22:
https://bugzilla.redhat.com/show_bug.cgi?id=550724#c81 (comments 81 through 91)
Comment 91 links to the release notes for 2.6.32.22 (search inline for "xen").
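For anyone else who lands here before they can get onto a fixed kernel, this is how I'd apply the irqbalance workaround on 10.04 (a sketch; service and package names assumed to match the stock Ubuntu install):

    # Stop irqbalance immediately
    sudo service irqbalance stop

    # Prevent it starting at boot (sysvinit on 10.04)
    sudo update-rc.d -f irqbalance remove

    # Or simply remove the package altogether
    sudo apt-get remove irqbalance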