Here's a problem which I am at a loss to diagnose:
Our user home directories are served via NFS from an Apple XServe running Mac OS X 10.5.7. Normally they are exported to our default office subnet, "lan". Recently I have been building a new subnet, "farm". The computers on "farm" run the same OSes (openSUSE 11.1 and Gentoo) as the ones on "lan", and the software versions are the same.
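For context, the export itself is nothing unusual; a sketch of the relevant /etc/exports line in the BSD format 10.5 uses, with made-up paths and subnets:

    # export home directories to one subnet only
    /Volumes/Data/Homes -network 192.168.1.0 -mask 255.255.255.0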
The problem is that when my users have been using a machine on "farm" for some time (5 minutes, sometimes 30, sometimes a full hour), the NFS mount seems to just hang. Attempting an ls on the directory, or anything else that tries to access the user home directory (such as a login), just gets stuck. Mounts to other NFS servers from the "hung" machine seem to work as expected.
There is nothing in the logs of either the client or the server that indicates any problem. The same types of clients work just fine from the default "lan" subnet.
I've tried all sorts of different configurations of the NFS server and client (disabling/enabling Kerberos, different mount options), but nothing appears to make any difference.
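A sketch of the kinds of mount variations one can cycle through here, with hypothetical paths; protocol, timeouts, and hard versus soft behaviour are the usual knobs:

    # TCP, hard mount, interruptible, 60-second timeout (timeo is in tenths of a second)
    mount -t nfs -o tcp,hard,intr,timeo=600 barstar.lan.foo.com:/Users/jdoe /home/jdoe
    # UDP, soft mount with more retransmissions
    mount -t nfs -o udp,soft,retrans=5 barstar.lan.foo.com:/Users/jdoe /home/jdoe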
I strongly suspect some network-level problem between these two subnets, perhaps some mangling by the firewall/router (OpenBSD with pf as the packet filter). The connection between the two sets of machines is fairly simple:
XServe --> switch --> router --> switch --> clients
I'm pretty much at a loss as to what debugging methods to try next, or what the possible solution may be. Any ideas on how to approach this problem from this point?
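For reference, the captures shown in the update below were gathered by tcpdumping both sides of the router at once; the interface names here are placeholders:

    # farm side of the router
    tcpdump -n -i vlan_farm host barstar.lan.foo.com and port 2049
    # lan side of the router
    tcpdump -n -i vlan_lan host node001.farm.foo.com and port 2049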
Update:
Still haven't been able to resolve this. I thought I had nipped it in the bud when I disabled scrub on the internal interfaces, but the problem has manifested itself again. What's strange is that pf still seems to be modifying some packets: in the traces below, the same server replies show up with different acknowledgement numbers on the two sides of the router (ack 43 on the farm side versus ack 1 on the lan side).
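The scrub change itself was minimal; a sketch, assuming a typical catch-all scrub rule and a hypothetical $ext_if macro, that restricts normalization to the external interface so internal traffic is left alone:

    # before: scrub in all
    # after: only normalize traffic on the external interface
    scrub in on $ext_if all fragment reassemble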
An example conversation, on the farm vlan side:
09:17:39.165860 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472382:2887472382(0) win 5840 <mss 1460,sackOK,timestamp 236992843 0,nop,wscale 6> (DF)
09:17:39.166124 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 43 win 65535 <nop,nop,timestamp 316702204 236992843> (DF)
09:17:54.164490 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472385:2887472385(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
09:17:54.164760 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 1441270809:1441270809(0) ack 43 win 65535 (DF)
09:17:54.164776 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 4243886205:4243886205(0) ack 46 win 0 (DF)
09:17:54.164989 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
09:17:57.164066 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236997343 0,nop,wscale 6> (DF)
09:17:57.164330 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 49 win 65535 <nop,nop,timestamp 316702384 236997343> (DF)
09:18:03.163468 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236998843 0,nop,wscale 6> (DF)
09:18:03.163732 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 49 win 65535 <nop,nop,timestamp 316702444 236998843> (DF)
and the same on the lan vlan:
09:17:39.165876 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472382:2887472382(0) win 5840 <mss 1460,sackOK,timestamp 236992843 0,nop,wscale 6> (DF)
09:17:39.166110 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 1 win 65535 <nop,nop,timestamp 316702204 236992843> (DF)
09:17:54.164505 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472385:2887472385(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
09:17:54.164740 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 1:1(0) ack 1 win 65535 (DF)
09:17:54.164745 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: R 2802615397:2802615397(0) ack 4 win 0 (DF)
09:17:54.165003 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236996593 0,nop,wscale 6> (DF)
09:17:54.165239 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702354 236996593,sackOK,eol> (DF)
09:17:55.123665 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702363 236996593,sackOK,eol> (DF)
09:17:57.124839 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702383 236996593,sackOK,eol> (DF)
09:17:57.164082 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236997343 0,nop,wscale 6> (DF)
09:17:57.164316 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 1 win 65535 <nop,nop,timestamp 316702384 236997343> (DF)
09:18:01.126690 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: S 449458819:449458819(0) ack 2887472389 win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 316702423 236997343,sackOK,eol> (DF)
09:18:03.163483 node001.farm.foo.com.769 > barstar.lan.foo.com.nfsd: S 2887472388:2887472388(0) win 5840 <mss 1460,sackOK,timestamp 236998843 0,nop,wscale 6> (DF)
09:18:03.163717 barstar.lan.foo.com.nfsd > node001.farm.foo.com.769: . ack 1 win 65535 <nop,nop,timestamp 316702444 236998843> (DF)
I should also mention that we have other NFS traffic going through this same machine, but from a different NFS server. We've been using that for years and have not had any problems there. Similarly, these XServes have been serving NFS to Linux clients on their own subnet for a long while as well and continue to do so.
Just wanted to update on this in case anyone runs into the same problem.
Essentially it comes down to the state rules in pf. By default pf keeps state and uses S/SA as the flag mask. However, it seems that the NFS server implementation on OS X attempts to start a conversation back to the client using a non-standard set of flags, and pf was simply dropping those packets, causing the failure. I gathered this by tcpdumping both the lan and farm interfaces. After tweaking the state flags for the rules between the subnets (see the sketch below), the connection was established correctly.
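A sketch of the kind of rule change this involved, with hypothetical macros and the standard NFS port; depending on the pf version, an explicit flag mask may be needed instead of flags any:

    # flags any: create state regardless of which TCP flags open the conversation
    pass quick proto tcp from $farm_net to $nfs_srv port 2049 flags any keep state
    pass quick proto tcp from $nfs_srv to $farm_net flags any keep state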
However, it seems that pf continued to filter out some packets due to some other form of internal normalization, and no amount of tweaking the options I tried managed to get it to work.
In the end, I ended up creating another interface on the file server and placing it directly on the farm vlan, bypassing the router altogether.
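For completeness, the workaround amounts to a second interface on the server addressed directly inside the farm subnet (the addresses and interface name here are made up), so client traffic never crosses the router:

    # on the XServe: second NIC placed directly on the farm vlan
    ifconfig en1 inet 10.1.2.5 netmask 255.255.255.0 up

Clients on farm then mount their home directories from that address.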
I haven't used pf, but I think it was one of the first stateful filters. Maybe it's keeping account of the 'connections' and dropping them? I'd look for any state-dependent filter rule. In Linux's iptables, the filter usually starts with a rule that accepts packets belonging to already-established connections (a minimal sketch follows this answer), because that way it won't have to recheck all the relevant rules for each packet after the first one. But since NFS is UDP-based and doesn't care about long (even hours-long) periods of silence, maybe the router is losing the ESTABLISHED state, and the new packets aren't valid as the start of a connection. Check if there's any 'keepalive' parameter to make the client send heartbeat packets after a minute or so of silence; if not, try NFS over TCP (which does have heartbeat packets).
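The rule alluded to above is the conventional first line of a Linux filter chain; a minimal sketch:

    # accept packets belonging to an already-tracked connection, so only the
    # first packet of each new connection traverses the rest of the chain
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT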
The first thing to do is ensure that the firewall is actually the culprit.
To do this, set your default block rules to log. On my firewalls, that's two lines at the top of the filter rules:
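A minimal sketch of such default-deny logging rules in pf.conf:

    block in log all
    block out log all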
Wait for the NFS mount to hang again and check your log interface:
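Assuming pf logs to the standard pflog0 interface, the blocked packets can be watched live with tcpdump (the host filter is optional):

    tcpdump -n -e -ttt -i pflog0 host node001.farm.foo.com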
If you're seeing these packets blocked at the firewall, please post your pf.conf. If not, we need to start looking beyond the firewall.