Background flushing on Linux happens when either too much written data is pending (adjustable via /proc/sys/vm/dirty_background_ratio) or a timeout for pending writes is reached (/proc/sys/vm/dirty_expire_centisecs). Until a further limit (/proc/sys/vm/dirty_ratio) is hit, more written data may be cached; beyond that limit, further writes block.
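These knobs can be inspected directly, for example:

```shell
# Current writeback thresholds (the ratio values are percentages of memory,
# the expire time is in hundredths of a second):
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_expire_centisecs
cat /proc/sys/vm/dirty_ratio
```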
In theory, this should create a background process that writes out dirty pages without disturbing other processes. In practice, it badly disturbs any process doing uncached reads or synchronous writes. This is because the background flush actually writes at 100% device speed, and any other device request issued at that time is delayed (because all queues and write-caches along the way are filled).
Is there a way to limit the number of requests per second the flushing process performs, or to otherwise effectively prioritize other device I/O?
After lots of benchmarking with sysbench, I came to this conclusion:
To survive (performance-wise) a situation where heavy sequential writes starve other I/O on a controller with a hardware write-cache, just dump all elevators, queues and dirty page caches. The correct place for dirty pages is the RAM of that hardware write-cache.
Adjust dirty_ratio (or the newer dirty_bytes) as low as possible, but keep an eye on sequential throughput. In my particular case, 15 MB was the optimum (echo 15000000 > /proc/sys/vm/dirty_bytes). This is more a hack than a solution, because gigabytes of RAM are now used for read caching only instead of dirty cache. For dirty cache to work out well in this situation, the Linux kernel background flusher would need to average the speed at which the underlying device accepts requests and adjust background flushing accordingly. Not easy.
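Put together, the approach sketches out roughly as follows (the device name sdc, the noop choice and the queue depth are example assumptions, not my exact settings):

```shell
# Sketch of the tuning described above; device name and exact values are assumptions.
echo noop > /sys/block/sdc/queue/scheduler   # no elevator: let the controller's cache reorder
echo 4 > /sys/block/sdc/queue/nr_requests    # shrink the OS-side request queue to the minimum
echo 15000000 > /proc/sys/vm/dirty_bytes     # cap the dirty page cache at ~15 MB
```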
Specifications and benchmarks for comparison:
Tested while dd'ing zeros to disk, sysbench showed huge success, boosting 10-thread fsync writes at 16 kB from 33 to 700 IOPS (idle limit: 1500 IOPS) and a single thread from 8 to 400 IOPS. Without load, IOPS were unaffected (~1500) and throughput was slightly reduced (from 251 MB/s to 216 MB/s).
dd call:
For sysbench, test_file.0 was prepared to be unsparse with:
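The exact command lines did not survive here; hypothetical reconstructions consistent with the numbers above (16 kB blocks, a multi-gigabyte test file) could look like:

```shell
# Hypothetical reconstructions -- the original parameters were not preserved.
# Sequential background load, streaming zeros at full device speed:
dd if=/dev/zero of=/mnt/data/zerofile bs=1M
# Make sysbench's test_file.0 unsparse so every write hits allocated blocks:
dd if=/dev/zero of=test_file.0 bs=16k count=655360 conv=notrunc
```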
sysbench call for 10 threads:
sysbench call for one thread:
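The sysbench invocations were likewise lost; for sysbench 0.4.12 (the version reported below), fsync-per-write random-write runs of this shape would look roughly like this (file size and runtime are assumptions):

```shell
# Hypothetical reconstruction for sysbench 0.4.12 -- parameters are assumptions.
# 10 threads, 16 kB random writes, fsync after every write:
sysbench --test=fileio --file-test-mode=rndwr --file-block-size=16384 \
         --file-total-size=10G --file-fsync-all=on --num-threads=10 \
         --max-time=60 --max-requests=0 run
# The single-thread run is the same call with --num-threads=1.
```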
Smaller block sizes showed even more drastic numbers.
--file-block-size=4096 with 1 GB dirty_bytes:
--file-block-size=4096 with 15 MB dirty_bytes:
--file-block-size=4096 with 15 MB dirty_bytes on idle system:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Test-System:
In summary, I am now sure this configuration will perform well in idle, high-load and even full-load situations for database traffic that would otherwise have been starved by sequential traffic. Sequential throughput is higher than two gigabit links can deliver anyway, so reducing it a bit is no problem.
Even though tuning kernel parameters stopped the problem, it's actually possible your performance issues were the result of a bug on the Adaptec 5405Z controller that was fixed in a Feb 1, 2012 firmware update. The release notes say "Fixed an issue where the firmware could hang during high I/O stress." Perhaps spreading out the I/O as you did was enough to prevent this bug from being triggered, but that's just a guess.
Here are the release notes: http://download.adaptec.com/pdfs/readme/relnotes_arc_fw-b18937_asm-18837.pdf
Even if this wasn't the case for your particular situation, I figured this could benefit users who come across this post in the future. We saw some messages like the following in our dmesg output which eventually led us to the firmware update:
Here are the model numbers of the Adaptec RAID controllers which are listed in the release notes for the firmware that has the high I/O hang fix: 2045, 2405, 2405Q, 2805, 5085, 5405, 5405Z, 5445, 5445Z, 5805, 5805Q, 5805Z, 5805ZQ, 51245, 51645, 52445.
A kernel which includes writeback throttling ("WBT") helps here:
WBT does not require switching to the new blk-mq block layer. That said, it does not work with the CFQ or BFQ I/O schedulers. You can use WBT with the deadline / mq-deadline / noop / none schedulers. I believe it also works with the new "kyber" I/O scheduler.
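To see which scheduler a device is currently using (the active one is shown in brackets; sdc is an example name):

```shell
# Print available I/O schedulers; the active one appears in [brackets].
cat /sys/block/sdc/queue/scheduler
```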
As well as scaling the queue size to control latency, the WBT code limits the number of background writeback requests as a proportion of the calculated queue limit.
The runtime configuration is in /sys/class/block/*/queue/wbt_lat_usec. The build configuration options to look for are CONFIG_BLK_WBT, CONFIG_BLK_WBT_SQ and CONFIG_BLK_WBT_MQ.
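As a sketch (sdc and the 75 ms target are example values, not recommendations): writing a latency target in microseconds enables WBT on a device, and writing 0 disables it.

```shell
# Example only: enable WBT with a 75 ms read-latency target on sdc.
echo 75000 > /sys/class/block/sdc/queue/wbt_lat_usec
cat /sys/class/block/sdc/queue/wbt_lat_usec   # verify the new target
```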
Your problem statement is confirmed 100% by the author of WBT - well done :-).
What is your average for Dirty in /proc/meminfo? It should not normally exceed your /proc/sys/vm/dirty_ratio. On a dedicated file server I have dirty_ratio set to a very high percentage of memory (90), as I will never exceed it. Your dirty_ratio is too low; when you hit it, everything craps out, so raise it.
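To see how much dirty data is actually outstanding, check /proc/meminfo:

```shell
# Current dirty and writeback page totals, in kB:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```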