We have a virtual server with an NFS share that is written to by other servers. Today the NFS share became inaccessible. The syslog was full of messages like this:
RPC: fragment too large: 311176
I've searched on Google but can't find much information about this. Could someone explain what this means?
An RPC message (and NFS is an RPC-based service) can be split into multiple fragments (chunks). Any RPC server has a limit on fragment size as well as a limit on total message size. "RPC: fragment too large" indicates that the NFS server received an RPC fragment bigger than the maximum allowed size. This message can point to a bug in the client code, a bug in the server code, or network issues. Port scans can trigger this situation as well.
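To make the fragment check concrete, here is a minimal sketch of how the 4-byte record marker that precedes each RPC fragment on TCP is decoded (per the ONC RPC record marking standard: the high bit flags the last fragment, the low 31 bits give the length). The function names and the `max_fragment` parameter are made up for illustration; this is not the kernel's code. It also shows why a port scan can trigger the message: whatever bytes arrive first get interpreted as a length.

```python
import struct
from typing import Tuple

LAST_FRAGMENT_FLAG = 0x80000000  # high bit of the 4-byte record marker
LENGTH_MASK = 0x7FFFFFFF         # low 31 bits: fragment length in bytes


def decode_record_marker(header: bytes) -> Tuple[bool, int]:
    """Decode the big-endian 4-byte marker sent before each RPC fragment."""
    (word,) = struct.unpack(">I", header)
    return bool(word & LAST_FRAGMENT_FLAG), word & LENGTH_MASK


def fragment_too_large(header: bytes, max_fragment: int) -> bool:
    """True if the announced length exceeds the (hypothetical) server limit."""
    _, length = decode_record_marker(header)
    return length > max_fragment


# A genuine marker: last fragment, 311176 bytes (the size from the log line).
real = struct.pack(">I", LAST_FRAGMENT_FLAG | 311176)
print(decode_record_marker(real))     # (True, 311176)

# A scanner sending "GET / HTTP/1.0": the first 4 bytes are read as a marker.
print(decode_record_marker(b"GET "))  # (False, 1195725856)
```

So a server whose limit is a little over 256 KiB would reject the 311176-byte fragment above, and stray non-RPC traffic tends to show up with absurdly large sizes.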
From Red Hat (https://access.redhat.com/solutions/2065483):
A problem can occur if the NFS server chooses to lower this value while NFS clients have a share mounted. The way this can occur is if the NFS server is rebooted with less memory. Depending on the change in the amount of memory, the NFS server may select a smaller rsize/wsize maximum.
But NFS clients which already had the NFS share mounted before the NFS server restarted will not know that the maximum allowed rsize/wsize has changed. The allowed maximum is communicated to an NFS client when it mounts the NFS share. The NFS client will still think it is permitted to send RPCs using the old, larger size. But the large RPCs get rejected and dropped by the NFS server and generate the "RPC: fragment too large" message due to the new maximum.
In NFS source code
I saw this with an Ubuntu Xenial NFS server, and it was triggered after I had resized it from 8 GB of RAM to 4 GB of RAM.
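Halving the RAM is enough to halve the server's default, if the Linux nfsd heuristic works the way I understand it: aim for roughly 1/4096 of usable memory, starting from 1 MiB and halving until it fits, with a floor of 8 KiB. The sketch below is only a rough model of that idea. The constants and the function name are assumptions, not a copy of the kernel code.

```python
MAX_BLKSIZE = 1024 * 1024  # assumed upper bound for the default (1 MiB)
MIN_BLKSIZE = 8 * 1024     # assumed floor (8 KiB)
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_max_blksize(usable_ram_bytes: int) -> int:
    """Model: halve from 1 MiB until the size is <= RAM/4096 or hits the floor."""
    target = usable_ram_bytes // 4096
    size = MAX_BLKSIZE
    while size > target and size >= 2 * MIN_BLKSIZE:
        size //= 2
    return size


# 8 GiB of usable RAM leaves room for the full 1 MiB.
print(default_max_blksize(8 * GIB))             # 1048576

# A "4 GB" VM exposes a bit less than 4 GiB to the kernel (64 MiB less here,
# an arbitrary figure), which drops the default one step.
print(default_max_blksize(4 * GIB - 64 * MIB))  # 524288
```

Under this model the threshold sits right around 4 GiB, which would explain why going from 8 GB to 4 GB was enough to change the limit.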
Clients that had originally mounted an NFS share from the server had (presumably) auto-negotiated a mount like the following (output from running mount):
nfs-server:/backups/something on /backup type nfs4 (rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.2,local_lock=none,addr=10.1.1.1)
The fix was to reboot the client, or just unmount and remount the filesystem, at which point the negotiated rsize/wsize dropped to something like:
nfs-server:/backups/something on /backup type nfs4 (rw,noatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.2,local_lock=none,addr=10.1.1.1)
So I'd assume the problem could also be triggered by setting too large an rsize/wsize on the NFS mount.
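To spot affected clients without waiting for errors, something like the following can pull rsize/wsize out of a line of mount output and compare them against the server's current maximum. This is a minimal sketch: the helper names are made up, and `server_max` is a value you supply yourself (on a Linux NFS server I believe it can be read from /proc/fs/nfsd/max_block_size, but treat that as an assumption).

```python
import re
from typing import Dict


def mount_io_sizes(mount_line: str) -> Dict[str, int]:
    """Extract rsize/wsize from the parenthesised option list of a mount line."""
    match = re.search(r"\(([^)]*)\)\s*$", mount_line)
    if not match:
        return {}
    sizes = {}
    for opt in match.group(1).split(","):
        key, _, value = opt.partition("=")
        if key in ("rsize", "wsize") and value.isdigit():
            sizes[key] = int(value)
    return sizes


def needs_remount(mount_line: str, server_max: int) -> bool:
    """True if the client still uses a size above the server's current limit."""
    return any(size > server_max for size in mount_io_sizes(mount_line).values())


old = ("nfs-server:/backups/something on /backup type nfs4 "
       "(rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp)")
new = ("nfs-server:/backups/something on /backup type nfs4 "
       "(rw,noatime,vers=4.2,rsize=524288,wsize=524288,hard,proto=tcp)")

print(needs_remount(old, 524288))  # True
print(needs_remount(new, 524288))  # False
```

Any mount flagged this way needs an unmount/mount (or a client reboot) to pick up the new limit.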