I'm working on several Linux VMs whose partitions are mounted on a NetApp NAS. This NAS periodically experiences a very high iowait which causes the VM disks to switch to read-only mode, crash, or be corrupted.
The VMware KB suggests increasing the timeout value as a palliative measure:
echo 180 > /sys/block/sda/device/timeout
What could be the negative effects of setting a very high timeout (1800 seconds or more)? The way I see it, the risk is that delayed writes accumulate and fill up the I/O write buffer, crashing the system. If so, this solution might be worse than the original issue.
Most writes, being cached in the OS dirty page cache, are already completed asynchronously. In other words, they often have nothing to do with the device timeout.
However, reads and synchronous writes require immediate attention from the underlying block device, and this is the very reason your filesystem switches to read-only mode (it cannot write its journal to disk).
Increasing the I/O wait time should have no bad impact, but it is not a silver bullet. For example, a database can go into read-only mode even if the underlying filesystem remains in read-write mode.
Note the default SCSI timeout is already 30 seconds. That is already a fairly long time in computer terms :-P.
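If you want to check what your disks are currently using, and apply the value suggested in the VMware KB, something along these lines works. It is only a sketch (the sd* pattern and the 180-second value are assumptions about your setup), and the sysfs write does not survive a reboot, so a udev rule or the VMware tools package is normally what makes it persistent:

# print the current SCSI command timeout for every sd* disk
for f in /sys/block/sd*/device/timeout; do
    printf '%s: %s seconds\n' "$f" "$(cat "$f")"
done

# raise it to 180 seconds (run as root); this is lost on reboot
for f in /sys/block/sd*/device/timeout; do
    echo 180 > "$f"
done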
IO requests (e.g. async writes) are bounded by /sys/class/block/$DEV/queue/nr_requests and /sys/class/block/$DEV/queue/max_sectors_kb. In the old single-queue block layer, total memory usage is said to be 2 * nr_requests * max_sectors_kb. The factor of 2 is because reads and writes are counted separately. Though you also need to account for requests in the hardware queue, see e.g. cat /sys/class/block/sda/device/queue_depth. You are generally expected to make sure the maximum hardware queue depth is no larger than half of nr_requests.
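As a rough sanity check, you can read those values and plug them into the bound above; a small sketch (sda is only an example device name):

DEV=sda   # example device
nr=$(cat /sys/class/block/$DEV/queue/nr_requests)
kb=$(cat /sys/class/block/$DEV/queue/max_sectors_kb)
qd=$(cat /sys/class/block/$DEV/device/queue_depth)
echo "nr_requests=$nr  max_sectors_kb=$kb  queue_depth=$qd"
# old single-queue layer: reads and writes are counted separately, hence the factor of 2
echo "rough memory bound: $((2 * nr * kb)) KiB"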
1) It is written that if your IO requests need too much space, you will get out of memory errors. So you could have a look at the above values on your specific system. Usually they are not a problem. nr_requests defaults to 128. The default value of max_sectors_kb depends on your kernel version.

If you use the new multi-queue block layer (blk-mq), reads and writes are not counted separately. So the "multiply by two" part of the equation goes away, and nr_requests defaults to 256 instead. I am not certain how the hardware queue (or queues) is treated in blk-mq.

When the request queue is full, async writes can build up in the page cache until they hit the "dirty limit". Historically the default dirty limit is described as 20% of RAM, although the exact determination is slightly more complex nowadays.
When you hit the dirty limit, you just have to wait. The kernel does not have another hard timeout beyond the SCSI timeout. In that sense, the common documents on this topic, including the VMware KB, are quite sufficient, although you should search for the specific documentation that applies to your NAS :-P. Different vintages of NAS have been engineered to provide different worst-case timings.
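If you want to see where your dirty limit actually sits, the relevant knobs are plain sysctls; a read-only check:

# percentage-based limits; the *_bytes variants override them when non-zero
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes

# how much dirty / writeback data is outstanding right now
grep -E 'Dirty|Writeback' /proc/meminfo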
2) That said, if a process has been waiting for disk IO for more than 120 seconds, the kernel will print a "hung task" warning. (Probably. That is the usual default, except on my version of Fedora Linux, where the kernel seems to have been built without CONFIG_DETECT_HUNG_TASK; Fedora appears to be a weird outlier here.)
The hung task message is not a crash, and it does not set the kernel "tainted" flag.
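Assuming the hung-task detector is compiled into your kernel, the current threshold and behaviour can be read with the corresponding sysctls; the commented line shows the adjustment discussed in the next paragraph:

# current hung-task detector settings (requires CONFIG_DETECT_HUNG_TASK)
sysctl kernel.hung_task_timeout_secs kernel.hung_task_warnings kernel.hung_task_panic

# example: raise the threshold above the SCSI timeout, as suggested below
# sysctl -w kernel.hung_task_timeout_secs=480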
After 10 hung task warnings (or whatever you set kernel.hung_task_warnings to), the kernel stops printing them. Thinking about this, in my opinion you should also increase the sysctl kernel.hung_task_timeout_secs so that it is above your SCSI timeout, e.g. 480 seconds.

3) Individual applications may have their own timeouts. You probably prefer to see an application timeout, rather than have the kernel return an IO error! Filesystem IO errors are commonly considered fatal. The filesystem itself may remount read-only after an IO error, depending on configuration. IO errors on swap devices or in memory-mapped files will send the SIGBUS signal to the affected process, which will usually terminate the process.
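The filesystem's reaction to IO errors is itself configurable. On ext4, for example, you can inspect it roughly like this (a sketch; /dev/sda1 is just a placeholder for your actual device):

# show the configured error behaviour (continue, remount-ro or panic)
tune2fs -l /dev/sda1 | grep -i 'errors behavior'

# equivalent mount-time option, e.g. in /etc/fstab:
#   errors=remount-ro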
4) If using systemd, services which have a watchdog timer configured could be forcibly restarted. In current versions of systemd, you can see e.g. a timeout of 3 minutes if you run systemctl show -p WatchdogUSec systemd-udevd. This was increased four years ago for a different reason; it appears to be just a coincidence that this matches VMware's suggested SCSI timeout :-). These restarts could generate alarming log noise. systemd kills the process with SIGABRT, with the idea of getting a core dump to show where the process got stuck. However, stuff like udev and even journald is supposed to be quite happy to be restarted nowadays.

The main concern would be to make sure that you have not configured a too-short userspace reboot watchdog, e.g. RuntimeWatchdogSec= in /etc/systemd/system.conf. Even if you do not use swap, it would be possible for systemd to become blocked by disk IO, by a memory allocation that enters kernel "direct reclaim".
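To see whether a userspace reboot watchdog is armed at all, and with what interval, a quick check (the systemctl property query assumes a reasonably recent systemd; the config file is the ultimate source either way):

# watchdog settings in the manager configuration
grep -i watchdog /etc/systemd/system.conf

# recent systemd also exposes the effective value as a manager property
systemctl show -p RuntimeWatchdogUSec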