My software RAID can write 800 MB/s sustained. I see that happening when cat /proc/meminfo | grep Writeback:
returns > 2 GB. However, most of the time Writeback is around 0.5 GB, which gives performance of around 200 MB/s.
There is plenty of data to be written. cat /proc/meminfo | grep Dirty:
says the dirty cache is 90 GB.
As I understand it, Dirty is what needs to be written, whereas Writeback is what is actively being written to disk. So there may be blocks in Dirty that are located on disk right next to blocks in Writeback, and these will not be written in the same go.
This can explain why I get much worse performance when Writeback is small, as the time spent seeking is much longer than the time spent writing a few extra MB.
So my question is: can I somehow tell the kernel to move data from Dirty to Writeback more aggressively, and thus increase Writeback?
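For example, both values can be watched during a transfer with something like:
$ watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'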
-- Edit --
This is during low performance:
$ cat /proc/meminfo
MemTotal: 264656352 kB
MemFree: 897080 kB
Buffers: 72 kB
Cached: 233751012 kB
SwapCached: 0 kB
Active: 3825364 kB
Inactive: 230327200 kB
Active(anon): 358120 kB
Inactive(anon): 47536 kB
Active(file): 3467244 kB
Inactive(file): 230279664 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 204799996 kB
SwapFree: 204799996 kB
Dirty: 109921912 kB
Writeback: 391452 kB
AnonPages: 404748 kB
Mapped: 12428 kB
Shmem: 956 kB
Slab: 21974168 kB
SReclaimable: 21206844 kB
SUnreclaim: 767324 kB
KernelStack: 5248 kB
PageTables: 7152 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 337128172 kB
Committed_AS: 555272 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 544436 kB
VmallocChunk: 34124336300 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 149988 kB
DirectMap2M: 17649664 kB
DirectMap1G: 250609664 kB
cat /proc/sys/vm/dirty_background_ratio
1
Lowering dirty_writeback_centisecs only chops Dirty up into even smaller bits.
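For example, lowering it from the default of 500 to 100 (just an example value) looks like this:
# echo 100 > /proc/sys/vm/dirty_writeback_centisecs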
You didn't give the entire /proc/meminfo output, so I don't know whether you had done any tuning beforehand.
Two immediate tunables that you can use are these:
/proc/sys/vm/dirty_background_ratio
The default is 10. Increase it to 30 or 40 and test.
/proc/sys/vm/dirty_writeback_centisecs
The default is 500. Set it to 300 and test.
Please remember these are not absolute values. You have to go through trial and error to find out what suits your environment best.
I just worked these values out based on the description you provided, assuming it is correct.
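A minimal sketch of how to apply the suggested values at runtime for a test, assuming the sysctl utility is available (add them to /etc/sysctl.conf only once you are happy with the result):
# sysctl -w vm.dirty_background_ratio=40
# sysctl -w vm.dirty_writeback_centisecs=300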
If you have the kernel-doc package installed, go to the sysctl documentation and open vm.txt to read about these settings.
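The exact path varies by distribution; one way to locate the file:
$ find /usr/share/doc -name vm.txt 2>/dev/null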
The real problem is that the Linux kernel's dirty page flush algorithm does not scale to large memory sizes, so whenever the Dirty value in /proc/meminfo exceeds around 1 GB, writeback slows down progressively until the /proc/sys/vm/dirty_ratio or /proc/sys/vm/dirty_bytes limit is exceeded and the kernel starts throttling all writes to keep the Dirty pages from growing any further.
To maintain a high write speed (up to 800 MB/s in the OP's case; it can easily be 2 GB/s for a hardware RAID controller with cache) you need to, counter-intuitively, lower /proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_background_bytes to 256M and 64M respectively.
Make sure you do a sync first, otherwise the system will freeze on writes for several hours until the Dirty value in /proc/meminfo drops below the new value in /proc/sys/vm/dirty_bytes. The sync will also take several hours, but at least the system will not be frozen during that time.
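A minimal sketch of that sequence; the 256M/64M figures are the values suggested above, expressed in bytes:
# sync    # flush the existing Dirty pages first; with ~90 GB pending this can take hours
# echo 268435456 > /proc/sys/vm/dirty_bytes             # 256 MiB
# echo 67108864 > /proc/sys/vm/dirty_background_bytes   # 64 MiB
Note that writing dirty_bytes automatically zeroes dirty_ratio (and likewise for the background pair), so only one of each pair is in effect at a time.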
Writeback represents the size of the IO queue.
The maximum size of the IO queue can be increased by increasing nr_requests (and potentially max_sectors_kb). Given the amount of Dirty memory you have, I suspect you are hitting this limit.
https://www.google.com/search?q=linux+block+queue+nr_requests+OR+max_sectors_kb
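For example, to inspect and raise the limit on one member disk (sda is only a placeholder for your actual device, and 1024 is just an example value):
$ cat /sys/block/sda/queue/nr_requests
$ cat /sys/block/sda/queue/max_sectors_kb
# echo 1024 > /sys/block/sda/queue/nr_requests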
In recent kernels, you should also watch out for the effect of wbt_lat_usec. You can disable this by writing 0 to it, and reset it to the default value by writing -1.
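For example (sda again being a placeholder):
# echo 0 > /sys/block/sda/queue/wbt_lat_usec    # disable writeback throttling
# echo -1 > /sys/block/sda/queue/wbt_lat_usec   # restore the default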
There is also the question of the I/O scheduler. A lot of server advice says to use the deadline scheduler, not CFQ. CFQ (and to an extent, BFQ) deliberately "idle" the disk, in an attempt to solicit contiguous sequential I/O from one process at a time.
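Checking and switching the scheduler per device looks like this; the available names depend on the kernel, and on blk-mq kernels the deadline equivalent is called mq-deadline:
$ cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler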
I do not know how you should tune the md RAID device vs. the individual disk devices, sorry.
(You could also try measuring the number of queued IO requests with atopsar -d 1, or sar -d 1, or iostat -dx 1. However, the "average queue size" statistic is derived from utilization ("io_ticks"), and this has been reported incorrectly since kernel version 5.0. The instantaneous queue size is still accurate, but existing tools tend to show only the average queue size, because that was the more useful value.)
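If you want the instantaneous figure rather than the derived average, recent kernels expose per-device in-flight counters directly; the two columns are in-flight reads and writes:
$ cat /sys/block/sda/inflight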