I have the following NFS-based storage setup:
The compute nodes run Linux; the NFS servers run Solaris.
A not-so-important user runs a bunch of read-intensive jobs on a subset of the compute nodes. As a result, the whole group of compute nodes becomes very slow (ls
blocks for 30 seconds). I was able to track the problem down to the dedicated NFS server hitting the limit of the SAN's read throughput.
How can I implement quality of service (QoS) that limits NFS bandwidth per node, process, or user?
I'm not sure NFS can be "hardened" against what amounts to a DDoS from your own cluster. If you really need that level of control, it will be easier to use something other than NFS for persistent storage.
Given your setup, I would suggest doing the "QoS" at the cluster engine level.
Configure a resource "io_heavy" limited to, say, 10 and make your users request 1 unit of it for I/O-heavy jobs. That way, no more than 10 I/O-bound jobs will run concurrently, your NFS won't collapse, and the rest of the cluster remains free for CPU-bound tasks.
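As a rough sketch, assuming your cluster engine is something like (Sun/Open) Grid Engine, this would be a consumable complex; the name io_heavy and the limit of 10 are just placeholders:

    # Define a consumable integer complex (add a line via "qconf -mc"):
    #   name      shortcut  type  relop  requestable  consumable  default  urgency
    io_heavy      ioh       INT   <=     YES          YES         0        0

    # Cap it cluster-wide at 10 concurrent I/O-heavy jobs (via "qconf -me global"):
    #   complex_values   io_heavy=10

    # Users then request one unit per I/O-heavy job:
    qsub -l io_heavy=1 heavy_read_job.sh

Other schedulers (Slurm, PBS, LSF) have equivalent "consumable resource" or license-counting mechanisms.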
You should also add scratch disks to the nodes. These can hold temporary data that doesn't really need to go over NFS, and they are a good place to keep commonly used "reference data" locally.
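For example, a hypothetical job-script fragment (the /scratch mount point is an assumption for your nodes) that keeps intermediate files on local disk instead of NFS:

    #!/bin/sh
    # use node-local scratch for intermediate files instead of NFS
    WORKDIR=/scratch/$USER/$$          # assumed local mount point
    mkdir -p "$WORKDIR"
    export TMPDIR="$WORKDIR"
    my_job --tmp "$WORKDIR" input.dat  # placeholder for the actual job
    rm -rf "$WORKDIR"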
I assume your Solaris NFS servers use ZFS. Fill the servers with as much RAM as they will take (a larger ARC), and add SSDs to be used as ZFS cache devices (L2ARC). Both reduce the read traffic hitting your SAN.
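A minimal sketch of adding an SSD as an L2ARC cache device; the pool name and device path are placeholders for your setup:

    # add an SSD as a read cache (L2ARC) to the existing pool
    zpool add tank cache c2t5d0
    # verify it shows up under the "cache" section
    zpool status tank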
QoS is normally used to give priority to certain types of network streams. Can't you isolate and rate-limit the switch ports of the nodes that user runs on? Or put those nodes in a separate VLAN? Or cap the ports' data rate at, say, 100 MB/s?
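If switch-level limiting isn't available, a rough equivalent on the Linux compute nodes themselves is Linux traffic control (tc). This sketch polices incoming traffic from the NFS server (i.e. the read data) to roughly 100 Mbit/s; the interface name and server address are assumptions:

    # assumed names: eth0 is the node's interface, 192.0.2.10 the NFS server
    DEV=eth0
    NFS=192.0.2.10
    # police traffic coming *from* the NFS server to ~100 Mbit/s
    tc qdisc add dev $DEV handle ffff: ingress
    tc filter add dev $DEV parent ffff: protocol ip u32 \
        match ip src $NFS/32 \
        police rate 100mbit burst 1m drop flowid :1

Dropping the excess makes the clients' TCP connections back off, which indirectly caps the read load they can place on the server.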
Other than that, I am not aware of any way to limit NFS bandwidth by username or MAC address. Maybe your NFS server has options to spread its handling of file requests more evenly across clients?
Thinking outside the box: move the read-intensive files closer to the user (onto local disks) and run a backup/rsync process to write the updated data back to the NAS?
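A rough sketch of that staging approach (all hostnames and paths below are made up):

    # stage the read-heavy dataset onto node-local disk once
    rsync -a nfsserver:/export/data/refset/ /scratch/refset/
    # ... jobs read from /scratch/refset/ and write to /scratch/out/ ...
    # periodically (e.g. from cron) push results back to central storage
    rsync -a /scratch/out/ nfsserver:/export/data/out/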
What kind of read-intensive jobs are these anyway?