I have an NFS share mounted on 30 cluster nodes. The nodes run Debian 5 and 6; the NFS server runs OpenSolaris 2009. We have good hardware and a 20 Gbit InfiniBand network.
On the cluster nodes, most filesystem operations are snappy, with these exceptions:
- Mutt
- Sqlite3
- An R library, e.g.:
Rscript <(echo "library(GOstats)")
They all get stuck for a few minutes after the following system calls:
fcntl(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741824, len=1}
or
fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}
What could be the cause? How can I diagnose and fix it?
Would switching the NFS server to OpenIndiana oi_148 fix it?
Those system calls request a lock on a one-byte range of a file. Perhaps another process currently holds a lock and your stuck processes are waiting for it to be released. There are some troubleshooting tips (written for an older version of Solaris, but they may still be helpful) in chapter 11 of O'Reilly's "Managing NFS and NIS", 2nd Edition.
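To tell a competing lock holder apart from a slow lock path, it may help to time a single lock request yourself. The following is only a minimal sketch, not something from the original setup: the function name `time_lock` and the probe file path are made up for illustration. It issues the same kind of non-blocking write-lock request seen in the strace output (one byte at offset 1073741824, which is 0x40000000, the offset SQLite uses for its pending lock byte) and reports how long the call takes.

```python
import fcntl
import os
import sys
import time

# Offset taken from the start= value in the strace output in the question.
LOCK_OFFSET = 1073741824


def time_lock(path, offset=LOCK_OFFSET, length=1):
    """Try one non-blocking write lock and return (seconds, error_or_None).

    LOCK_EX | LOCK_NB makes lockf() issue fcntl(F_SETLK) with F_WRLCK,
    i.e. the same request type as in the strace output.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        err = None
        t0 = time.monotonic()
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB,
                        length, offset, os.SEEK_SET)
        except OSError as exc:
            # A conflicting lock (or a failing lock service) surfaces here.
            err = exc
        elapsed = time.monotonic() - t0
        if err is None:
            # Release the byte range again so the probe leaves no lock behind.
            fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)
    finally:
        os.close(fd)
    return elapsed, err


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass a scratch file path, e.g. one on the NFS mount.
    seconds, error = time_lock(sys.argv[1])
    print(f"lock attempt took {seconds:.3f}s, error: {error}")
```

Run it once against a scratch file on the NFS mount and once against a file on local disk. If the local attempt returns immediately while the NFS attempt takes minutes on a file nobody else is using, the delay is more likely in the NFS locking path (client or server lock service) than in another process holding the lock.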
Check whether the NFS lock service is running on the server.
I upgraded to the latest OpenIndiana, and the problem disappeared.