I have out-of-the box Debian Lenny with non-custom kernel 2.6.26-2-amd64. Brand new server that is used to 5% of it's potential, CPU and Disk-wise. Meaning it probably not crashing because of overload.
every few days it freezes with hundreds of these messages in console log:
: [284847.828428] INFO: task apache2:12473 blocked for more than 120 seconds.
: [284847.868468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
: [284847.912759] apache2 D ffff8101bc6b7ab0 0 12473 14358
: [284847.912763] ffff810160d5bc50 0000000000000082 ffff8101c0002e40 0000000000000000
: [284847.912766] ffff8101a7c42950 ffff810327d92810 ffff8101a7c42bd8 0000000400000044
: [284847.912770] ffff8101c0002e40 00000000000612d0 0000000000000000 00000040000612d0
: [284847.912773] Call Trace:
: [284847.912786] [<ffffffff80429b0d>] __mutex_lock_slowpath+0x64/0x9b
: [284847.912790] [<ffffffff80429972>] mutex_lock+0xa/0xb
: [284847.912794] [<ffffffff802a20b9>] do_lookup+0x82/0x1c1
: [284847.912800] [<ffffffff802a4271>] __link_path_walk+0x87a/0xd19
: [284847.912805] [<ffffffff80295844>] kmem_getpages+0x96/0x15f
: [284847.912808] [<ffffffff80295fb7>] ____cache_alloc_node+0x6d/0x106
: [284847.912814] [<ffffffff802a4756>] path_walk+0x46/0x8b
: [284847.912819] [<ffffffff802a4a82>] do_path_lookup+0x158/0x1cf
: [284847.912822] [<ffffffff802a3879>] getname+0x140/0x1a7
: [284847.912827] [<ffffffff802a53f1>] __user_walk_fd+0x37/0x4c
: [284847.912831] [<ffffffff8029e381>] vfs_lstat_fd+0x18/0x47
: [284847.912840] [<ffffffff8029e3c9>] sys_newlstat+0x19/0x31
: [284847.912848] [<ffffffff8020beda>] system_call_after_swapgs+0x8a/0x8f
Almost all traces has __mutex_lock_slowpath as top-level.
Only some has different trace:
: [284847.737386] INFO: task apache2:12472 blocked for more than 120 seconds.
: [284847.777551] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
: [284847.824881] apache2 D ffff8101bc6b7ab0 0 12472 14358
: [284847.824886] ffff8101b9cc1c50 0000000000000086 ffffffffa0131e0a 0000000000000002
: [284847.824889] ffff8102e7454300 ffff810324c6cad0 ffff8102e7454588 0000000000000000
: [284847.824893] 0000000000000001 0000000000000296 0000000000000003 ffff8101b9cc1c58
: [284847.824896] Call Trace:
: [284847.828403] [<ffffffffa0131e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
: [284847.828412] [<ffffffff80429b0d>] __mutex_lock_slowpath+0x64/0x9b
: [284847.828418] [<ffffffff80429972>] mutex_lock+0xa/0xb
: [284847.828421] [<ffffffff802a20b9>] do_lookup+0x82/0x1c1
: [284847.828427] [<ffffffff802a4271>] __link_path_walk+0x87a/0xd19
: [284847.828428] [<ffffffff80271296>] find_lock_page+0x1f/0x8a
: [284847.828428] [<ffffffff80273182>] filemap_fault+0x1c2/0x33c
: [284847.828428] [<ffffffff802a4756>] path_walk+0x46/0x8b
: [284847.828428] [<ffffffff802a4a82>] do_path_lookup+0x158/0x1cf
: [284847.828428] [<ffffffff802a3879>] getname+0x140/0x1a7
: [284847.828428] [<ffffffff802a53f1>] __user_walk_fd+0x37/0x4c
: [284847.828428] [<ffffffff8029e381>] vfs_lstat_fd+0x18/0x47
: [284847.828428] [<ffffffff8029e3c9>] sys_newlstat+0x19/0x31
: [284847.828428] [<ffffffff8020beda>] system_call_after_swapgs+0x8a/0x8f
kernel: [1912668.466347] INFO: task apache2:17984 blocked for more than 120 seconds.
[1912668.507035] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
: [1912668.555165] apache2 D ffff8101c5637ba0 0 17984 17282
: [1912668.596752] ffff810166a7dd30 0000000000000086 0000000000000000 ffff810166a7dcd8
: [1912668.643341] ffff8101c563c880 ffff81024505f000 0000000000000002 ffff810166a7dd68
: [1912668.699566] 0000000000000086 00000000000cb1a0 0000000000000000 ffff81017f344d60
: [1912668.744773] Call Trace:
: [1912668.761754] [<ffffffff8022a3ed>] pick_next_task_fair+0x6e/0x7a
: [1912668.829311] [<ffffffff802be0e2>] bio_alloc_bioset+0x89/0xd9
: [1912668.861930] [<ffffffff8024ac3a>] getnstimeofday+0x39/0x98
: [1912668.897005] [<ffffffff802710f6>] sync_page+0x0/0x41
: [1912668.927868] [<ffffffff80429487>] io_schedule+0x5c/0x9e
: [1912668.960286] [<ffffffff80271132>] sync_page+0x3c/0x41
: [1912668.991756] [<ffffffff804295fa>] __wait_on_bit_lock+0x36/0x66
: [1912669.031757] [<ffffffff802710e3>] __lock_page+0x5e/0x64
: [1912669.064191] [<ffffffff802461d3>] wake_bit_function+0x0/0x23
: [1912669.100100] [<ffffffff80281bc5>] handle_mm_fault+0x5e4/0x8de
: [1912669.134531] [<ffffffff802461a5>] autoremove_wake_function+0x0/0x2e
: [1912669.174623] [<ffffffff802aa108>] fcntl_setlk+0x1cf/0x291
: [1912669.210623] [<ffffffff802461a5>] autoremove_wake_function+0x0/0x2e
: [1912669.246923] [<ffffffff802a677f>] sys_fcntl+0x280/0x2f7
After googling for "mutex_lock_slowpath" I can only find the Kernel mailing list discussions that this issue was introduced in some commit. Wthout reference to verison. Discussions as recent as Jan 25, 2011. The Kernel I am using is form Debian Lenny, year ago.
What should I do? Is this bug even fixed in kernel? if it's such obvious bug why it happens so rarely?
- Should I download latest kernel from kernel.org and upgrade?
- Should I use Debian backports to install new "Approved" kernel?
Am I missing something? What to do?
Are you using an SSD drive by any chance?
I've seen these errors on my Ubuntu 10.10 system. What would happen was that there would be a SATA glitch that would completely mess up the disk subsystem. Subsequent attempts at disk I/O would result in 120-second timeouts similar to yours (the stack trace varied.)
I've documented the original problem here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/707583
The (somewhat lame) question I asked about the timeouts is here: Mystery stack traces in /var/log/messages
I was using a P55-based motherboard and Crucial SSD drives. However, I've seen this reported with other chipsets, other makes of SSD drives, and other Linux kernels.
As far as I can tell, the only thing in common was the use of SSD drives.
I saw that problem in one or two of my servers some time ago (Ubuntu 8.0.4 LTS). I would recommend the following: