I have an Ubuntu 11.10 VM set up in VMSphere. I'm storing some data on an NFS mount. The VM has been going down frequently. I haven't been able to pin down the reason, but I think it has to do with this error:
Jan 19 09:53:07 ws-test-moodlearchive-01 kernel: [ 384.523617] nfs4_reclaim_open_state: Lock reclaim failed!
It shows up in /var/log/syslog thousands of times, most often after cron starts running.
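To see how closely the messages line up with cron runs, it can help to count them and bucket them by minute. This is a minimal sketch, assuming the default syslog timestamp layout seen in the line quoted above; the helper names are made up for illustration.

```shell
# Pattern taken from the kernel message quoted in the question.
PATTERN='nfs4_reclaim_open_state: Lock reclaim failed'

# Total number of matching lines in a log file (hypothetical helper name).
count_reclaim_failures() {
    grep -c "$PATTERN" "$1"
}

# Matching lines grouped by "Mon DD HH:MM", each group prefixed with its count.
# Relies on the log being in chronological order, as syslog normally is.
reclaim_failures_per_minute() {
    grep "$PATTERN" "$1" |
        awk '{ print $1, $2, substr($3, 1, 5) }' |
        uniq -c
}
```

Running `count_reclaim_failures /var/log/syslog` gives the total, and the per-minute output can be compared against the crontab schedule.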
I originally was saving the output of one cron job to a text file stored on the NFS mount. Switching that to the local disk seems to have reduced the number of errors, but it's still showing up.
Google has been very unhelpful; nothing I found seemed to apply. I didn't find anything even close on this site or on StackOverflow.
So, what does that error mean? And how can I keep it from occurring?
The NFS server I was connecting to was running version 3. I was connecting with version 4. Switching to version 3 seems to have fixed the problem. I no longer see the nfs4_reclaim_open_state error in my syslog.
To make NFS use version 3 when connecting, I added nfsvers=3 to my fstab file. So an entry like this:
Changed to:
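The original entries are not reproduced here, so the following is only an illustration of the change described above. The server name, export path, mount point, and existing options are all placeholders, and `add_nfsvers3` is a made-up helper that appends the option to the fourth (options) field of an fstab line.

```shell
# Hypothetical fstab entry; every value here is a placeholder.
before='nfs-server.example.com:/export/data /mnt/data nfs defaults 0 0'

# Append nfsvers=3 to the mount options (4th field) of one fstab line.
add_nfsvers3() {
    printf '%s\n' "$1" | awk '{ $4 = $4 ",nfsvers=3"; print }'
}

after=$(add_nfsvers3 "$before")
printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

After remounting, `grep nfs /proc/mounts` should list `vers=3` among the options if the setting took effect.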
I still have not found out exactly what the error message was telling me. But at least I fixed it.
Actually, this message will not show up with NFS3 because it comes from NFS4-only code; NFS3 does not have this feature :) NFS3 has a different error recovery mechanism, so switching might just be hiding the problem.
This may happen when the NFS4 client gets an operation completed with an error and tries to recover from it. During recovery, this message shows up if the client tried to reclaim the lock and failed.
There are many reasons for the lock reclaim to fail, from bugs or races in the NFS server to network problems. If you think this is an issue, you will have to run tcpdump to capture the NFS traffic (client side preferred) and try to understand the request flow before the error shows up: first why some unknown NFS4 operation failed, and then what happened during the lock reclaim.
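A client-side capture along those lines could look roughly like this. It is a sketch only: the function name, the default output file, and the example arguments are assumptions, and it presumes NFS traffic goes over the standard port 2049.

```shell
# Capture full NFS packets between this client and the server into a pcap file.
# Arguments (all hypothetical examples): interface, server address, output file.
capture_nfs() {
    iface=$1
    server=$2
    outfile=${3:-nfs-trace.pcap}
    # -s 0 keeps whole packets; 2049 is the standard NFS port.
    tcpdump -i "$iface" -s 0 -w "$outfile" host "$server" and port 2049
}
```

For example, run `capture_nfs eth0 192.0.2.10 /tmp/nfs-trace.pcap` as root, writing to local disk rather than the NFS mount, stop it with Ctrl-C once the error has appeared in syslog, and open the file in Wireshark to follow the requests leading up to the failed reclaim.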
So the first thing to check is probably the network: check the cables, switch and port errors, duplicate IPs, bad bonding/LACP, packet loss, etc.
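A couple of these checks can be scripted from the client. This is a rough sketch, assuming the Linux `ip` and `ping` tools and the usual `N% packet loss` summary line that `ping` prints; the function names and arguments are placeholders.

```shell
# Extract the loss percentage from ping's summary line (read on stdin).
packet_loss_pct() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Show interface error/drop counters, then packet loss towards the NFS server.
# Arguments are placeholders: interface name and server address.
check_nfs_network() {
    iface=$1
    server=$2
    ip -s link show dev "$iface"      # RX/TX errors and dropped counters
    ping -c 20 "$server" | packet_loss_pct
}
```

Something like `check_nfs_network eth0 192.0.2.10` prints the counters followed by the loss percentage. If the interface is a bond, the matching file under /proc/net/bonding/ reports the state of each slave.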