Imagine, if you will, that you have a 2-node Red Hat NFS cluster; each node is RHEL 5.4 64-bit and they share a SAN LUN for the data. The primary interface on each server is an HA failover bond (bond0, eth0+eth1) and there is a standard floating cluster resource IP for NFS. The cluster configuration is set up with the standard Red Hat tools, and NFS has static ports defined in /etc/sysconfig/nfs so it can work through a firewall. So far so good, right? Very by-the-book, best practices; nothing funky or strange in the server or cluster setup.
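For reference, the static ports were just the usual /etc/sysconfig/nfs variables; the port numbers below are only illustrative, not necessarily the ones we used:

# /etc/sysconfig/nfs (example ports only; nfsd itself stays on 2049)
LOCKD_TCPPORT=32803
LOCKD_UDPPORT=32769
MOUNTD_PORT=892
STATD_PORT=662
RQUOTAD_PORT=875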
The core of the problem shows up when the clients mount the exported NFSv4 share over TCP: after a cluster service relocate to the other node, the newly passive node retains an ESTABLISHED 2049/tcp (nfs daemon) connection to the clients using the now-missing cluster IP, even though that should be technically impossible (as far as I’m aware). The “solution” was to switch the clients to UDP mounts, since we were unable to figure out what was happening (and, more importantly, how to fix it). Any clues as to why are welcome; details below.
Cluster IP: 1.1.1.10
Client IP: 2.2.2.100
Starting out, the NFS service is running on node-A; node-A has the cluster IP aliased as bond0:0 and the SAN LUN mounted. The NFS client is connected to the cluster IP via NFSv4 over TCP and things are working just fine. In netstat on node-A we see:
1.1.1.10:2049 2.2.2.100:915 ESTABLISHED
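(That output comes from nothing fancier than something along these lines on node-A; the grep pattern is only illustrative:)

ip addr show bond0              # cluster IP 1.1.1.10 present as the bond0:0 alias
netstat -tan | grep ':2049'     # shows the ESTABLISHED connection above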
Everything is as it should be. On node-A, run a standard ‘clusvcadm -r nfs-svc -m node-B‘ command to move NFS over to node-B; in the syslogs of both node-A and node-B you see the proper messages: NFS being stopped, the IP being released/moved, the SAN unmounted/mounted and so forth. On the NFS client you see a few syslog messages about the NFS server not responding, then it comes back and everything is fine. In short, the NFS relocate to node-B works fine.
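If it helps picture it, after the relocate we would sanity-check the move with something along these lines (the service name is the one from the clusvcadm command above):

clustat | grep nfs-svc      # owner should now show as node-B
ip addr show bond0          # on node-A: 1.1.1.10 should be gone
ip addr show bond0          # on node-B: 1.1.1.10 should be present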
However, back on node-A, which no longer owns the cluster IP 1.1.1.10, you still see in the netstat a connection on 2049! A quick ‘rpcinfo -p’ confirms it’s still nfsd on that port.
1.1.1.10:2049 2.2.2.100:915 ESTABLISHED
Of course on node-B you see the same thing, as that’s correct. The ten-million-dollar question is: why is it still showing up on node-A? As soon as the IP went away, that connection should have been flushed… If you simply restart nfsd, the connection state on node-A changes to FIN_WAIT1 and it eventually times out. To be clear, the cluster IP no longer shows up as an interface on node-A; it only appears in netstat.
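The nfsd restart was nothing fancier than the stock init script; roughly:

# on node-A, which no longer owns 1.1.1.10
service nfs restart
netstat -tan | grep ':2049'     # the phantom connection now shows FIN_WAIT1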
And here is where it becomes important: if this phantom TCP 2049 connection is still on node-A and you now relocate the NFS service back to node-A (so it gets that cluster IP again), all clients stall and die on the NFS mount, regardless of whether the phantom connection is in ESTABLISHED or FIN_WAIT1 state. Only when that phantom connection finally disappears from node-A can the NFS clients regain their NFS mount; that takes on the order of 5 to 15 minutes.
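If you want to see how long that takes, you can watch the phantom socket with its TCP timer until it finally goes away; something like:

# -o adds the TCP timer column, so you can see it counting down
watch -n 10 "netstat -tano | grep '1.1.1.10:2049'"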
We tested this back and forth a number of times, ensuring it wasn’t firewall related, and it was repeatable as a problem and not just some fluke. At the end of many hours the only workable solution was to move the clients to UDP and avoid the problem completely. I really want to know what’s broken and how to fix it.
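For completeness, the client-side workaround amounted to forcing UDP in the mount options; the export path and mount point below are placeholders, and whether proto=udp is honoured for an NFSv4 mount depends on the client:

mount -t nfs4 -o proto=udp 1.1.1.10:/export /mnt/data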
Use ‘netstat -p’ to figure out the PID of the process that's listening (or, well, you said it was nfsd, so find its PID with ‘ps’) and then do an strace on it and you can maybe figure out what's going on with it. Or maybe you can do the strace on it before you run the clusvcadm command and see if it gets any signals or something (it could maybe be hanging in a signal handler or waiting for some system call to return...). If worst comes to worst you could build a debug version of nfsd, run it under gdb, do the clusvcadm, and see exactly what it's doing...
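In other words, roughly the following (bearing in mind that on RHEL 5 nfsd runs as kernel threads, so strace may not have much to show):

netstat -tanp | grep ':2049'    # PID/program name of whatever owns the socket, if any
ps ax | grep '[n]fsd'           # find the nfsd PIDs
strace -f -p <PID>              # attach to one of them during the clusvcadm relocate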
I was under the impression that with NFS over TCP you cannot go from A->B->A in a short amount of time. See, e.g., http://www.spinics.net/lists/cluster/msg08758.html