We have an in-house "compute farm" with about 100 CentOS (the free re-distribution of RHEL) 5.7 and 6.5 x86_64 servers. (We are in the process of upgrading all the 5.7 boxes to 6.5.) Each of these machines has two NFSv4 mounts (with sec=krb5p) from two CentOS 6.5 servers: one NFS server holds user home directories, the other holds various data for user processes.
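For reference, the client-side mounts look roughly like this in /etc/fstab (the hostnames and export paths below are placeholders, not our real ones; sec=krb5p is the security flavor we actually use):

# /etc/fstab on a client -- hostnames and export paths are placeholders
nfsserver1:/export/home  /home  nfs4  sec=krb5p  0 0
nfsserver2:/export/data  /data  nfs4  sec=krb5p  0 0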
At random, one of the client machines gets into a bad state where any access to the NFSv4 mounts hangs ("ls", for example). That means no one except root can log in, and every user process that needs the shares gets stuck. So far this is non-deterministic and I have not been able to reproduce it on demand.
I have very verbose NFS logging enabled on both the clients and the servers, but it never shows any errors. However, when this state is triggered, I do get kernel hung-task warnings with call traces like this on the affected client machines (a note on the logging setup follows the trace):
Mar 25 00:49:48 servername kernel: INFO: task ProcessName:8230 blocked for more than 120 seconds.
Mar 25 00:49:48 servername kernel: Not tainted 2.6.32-431.el6.x86_64 #1
Mar 25 00:49:48 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 00:49:48 servername kernel: ProcessName D 0000000000000000 0 8230 8229 0x00000000
Mar 25 00:49:48 servername kernel: ffff8804792cdb68 0000000000000046 ffff8804792cdae8 ffffffffa0251940
Mar 25 00:49:48 servername kernel: ffff88010cdc8080 ffff8804792cdb18 ffff88010cdc8130 ffff88010ea5c208
Mar 25 00:49:48 servername kernel: ffff88047b011058 ffff8804792cdfd8 000000000000fbc8 ffff88047b011058
Mar 25 00:49:48 servername kernel: Call Trace:
Mar 25 00:49:48 servername kernel: [<ffffffffa0251940>] ? rpc_execute+0x50/0xa0 [sunrpc]
Mar 25 00:49:48 servername kernel: [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
Mar 25 00:49:48 servername kernel: [<ffffffff8111f930>] ? sync_page+0x0/0x50
Mar 25 00:49:48 servername kernel: [<ffffffff815280a3>] io_schedule+0x73/0xc0
Mar 25 00:49:48 servername kernel: [<ffffffff8111f96d>] sync_page+0x3d/0x50
Mar 25 00:49:48 servername kernel: [<ffffffff81528b6f>] __wait_on_bit+0x5f/0x90
Mar 25 00:49:48 servername kernel: [<ffffffff8111fba3>] wait_on_page_bit+0x73/0x80
Mar 25 00:49:48 servername kernel: [<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
Mar 25 00:49:48 servername kernel: [<ffffffff81135bf5>] ? pagevec_lookup_tag+0x25/0x40
Mar 25 00:49:48 servername kernel: [<ffffffff8111ffcb>] wait_on_page_writeback_range+0xfb/0x190
Mar 25 00:49:48 servername kernel: [<ffffffff81120198>] filemap_write_and_wait_range+0x78/0x90
Mar 25 00:49:48 servername kernel: [<ffffffff811baa3e>] vfs_fsync_range+0x7e/0x100
Mar 25 00:49:48 servername kernel: [<ffffffff811bab2d>] vfs_fsync+0x1d/0x20
Mar 25 00:49:48 servername kernel: [<ffffffffa02cf8b0>] nfs_file_flush+0x70/0xa0 [nfs]
Mar 25 00:49:48 servername kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Mar 25 00:49:48 servername kernel: [<ffffffff81074e0f>] put_files_struct+0x7f/0xf0
Mar 25 00:49:48 servername kernel: [<ffffffff81074ed3>] exit_files+0x53/0x70
Mar 25 00:49:48 servername kernel: [<ffffffff81076f4d>] do_exit+0x18d/0x870
Mar 25 00:49:48 servername kernel: [<ffffffff81077688>] do_group_exit+0x58/0xd0
Mar 25 00:49:48 servername kernel: [<ffffffff81077717>] sys_exit_group+0x17/0x20
Mar 25 00:49:48 servername kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
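(For reference, the kind of verbose NFS/RPC debugging mentioned above is normally turned on with rpcdebug on CentOS 6; the module/flag choices below are illustrative rather than a verbatim record of our setup. The output lands in the kernel log, i.e. /var/log/messages.)

# On a client: client-side NFS and RPC debugging
rpcdebug -m nfs -s all
rpcdebug -m rpc -s all
# On the NFS servers: the server-side module as well
rpcdebug -m nfsd -s all
# Turn it back off with -c, e.g.
rpcdebug -m nfs -c all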
Once a machine is in this state, the only reliable way to make it usable again is to reboot it. (And even that requires a hard power cycle, since a software reboot hangs when it tries to unmount the NFS filesystems.)
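For completeness, these are the usual escape hatches for a wedged NFS mount; I wouldn't count on them returning when the client is in this state, which is why the power cycle ends up being the answer:

# Usual escape hatches for a stuck NFS mount (shown for reference only;
# with a hard mount wedged like this they can hang as well)
umount -f /home        # force unmount
umount -l /home        # lazy unmount: detach now, clean up references later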
It seems like this problem is correlated with a process that malfunctions and starts writing data like crazy. For example, a segfault that generates a huge core file, or a bug with a tight print loop.
However, when I try to duplicate this in a lab environment with multiple "dd" processes hammering away at the NFS server, all the machines chug along happily.
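Given that the trace above hangs in nfs_file_flush on the way out of do_exit, a closer lab repro than steady-state dd traffic might be a runaway writer that gets killed mid-stream, so the flush of its dirty pages lands in the process-exit path. A rough sketch of what I'd try next (the path is a placeholder):

#!/bin/bash
# Sketch of a repro closer to the suspected trigger: a runaway writer
# that dies mid-stream, so the final flush of its dirty pages happens
# in the exit path (do_exit -> exit_files -> filp_close -> nfs_file_flush),
# matching the call trace above.
# /data/scratch/runaway.log is a placeholder path on one of the NFS mounts.
yes "filling the NFS page cache as fast as possible" > /data/scratch/runaway.log &
writer=$!
sleep 60              # let it dirty a good chunk of page cache
kill -9 "$writer"     # abrupt exit; the remaining writeback happens on close
wait "$writer" 2>/dev/null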
The problem occurred with kernel 2.6.32-431.el6, shipped with CentOS 6.5, which was already a fairly old kernel at the time this question was posted. The changelog for the RHEL/CentOS kernels showed a lot of NFS-related changes, so we upgraded to what was then the newest CentOS 6.6 kernel, 2.6.32-504.12.2.el6, and haven't experienced the problem since.
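For anyone wanting to do the same, the fix amounted to installing the newer kernel package on each box and rebooting into it; roughly (the yum invocation is reconstructed, not copied from shell history):

# Install the newer kernel from the CentOS 6.6 updates repo and reboot
# into it (the version string is the one mentioned above)
yum install kernel-2.6.32-504.12.2.el6.x86_64
reboot
# After the reboot, confirm the running kernel
uname -r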