We have a load-balanced web app that serves some images off an NFS-mounted volume.
When the NFS server goes down, it ends up bringing all of the web instances down.
Currently the volume is mounted with:
ip:/path/to/images /docroot/images nfs soft,intr,rw,rsize=32768,wsize=32768 0 0
I ran a siege test against a selection of images that live on this volume, and when the NFS server went down, requests only timed out after the Apache Timeout value (set to 600 seconds for this test).
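The test looked roughly like this (the concurrency, duration, and URL file here are just illustrative; the file contained a sample of image URLs served off the NFS volume):

siege -c 25 -t 1M -f image-urls.txt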
I changed the mount options to:
bg,soft,intr,rw,rsize=32768,wsize=32768,timeo=5,retrans=2,actimeo=60,retry=15
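For context, as a full fstab entry that is roughly (same placeholder export path as above):

ip:/path/to/images /docroot/images nfs bg,soft,intr,rw,rsize=32768,wsize=32768,timeo=5,retrans=2,actimeo=60,retry=15 0 0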
This was better, but it still took too long to fail: the first set of requests timed out in about 30 seconds, but the next set took anywhere from 180 to 300 seconds.
I know the long-term solution is to move these images to S3, but in the meantime, is it possible to reduce the failure time to under 5-10 seconds without affecting performance?
Soft rw mounts can "cause silent data corruption in certain cases." Consider using a ro mount. Assuming Linux is the OS, the NFS man page lists the mount options you can change. Without testing, given a TCP soft mount (read-only),

timeo=1 and retrans=3

would cause the operation to fail in 6 seconds. ("The NFS client performs linear backoff: After each retransmission the timeout is increased by timeo ...")
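As an untested sketch, combining those suggestions (ro, tcp, soft, timeo=1, retrans=3) with the read-side options already in use from the question (rsize and actimeo carried over; wsize dropped since the mount would be read-only):

ip:/path/to/images /docroot/images nfs ro,soft,intr,tcp,timeo=1,retrans=3,rsize=32768,actimeo=60 0 0

Re-run the siege test against this mount to confirm the actual failure time before rolling it out.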