I am trying to find the bottleneck in the rebuild of a software RAID 6.
## Pause rebuilding when measuring raw I/O performance
# echo 1 > /proc/sys/dev/raid/speed_limit_min
# echo 1 > /proc/sys/dev/raid/speed_limit_max
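For reference, the same knobs are also reachable through sysctl; it is worth noting the current values first so they can be restored after the measurement (the kernel defaults are typically 1000 and 200000):
## Record the current limits so they can be restored afterwards
# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max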
## Drop caches so caching does not interfere with the measurement
# sync ; echo 3 | tee /proc/sys/vm/drop_caches >/dev/null
# time parallel -j0 "dd if=/dev/{} bs=256k count=4000 | cat >/dev/null" ::: sdbd sdbc sdbf sdbm sdbl sdbk sdbe sdbj sdbh sdbg
4000+0 records in
4000+0 records out
1048576000 bytes (1.0 GB) copied, 7.30336 s, 144 MB/s
[... similar for each disk ...]
# time parallel -j0 "dd if=/dev/{} skip=15000000 bs=256k count=4000 | cat >/dev/null" ::: sdbd sdbc sdbf sdbm sdbl sdbk sdbe sdbj sdbh sdbg
4000+0 records in
4000+0 records out
1048576000 bytes (1.0 GB) copied, 12.7991 s, 81.9 MB/s
[... similar for each disk ...]
So we can read sequentially at 140 MB/s in the outer tracks and 82 MB/s in the inner tracks on all the drives simultaneously. Sequential write performance is similar.
This would lead me to expect a rebuild speed of 82 MB/s or more.
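For a sense of scale: a full pass over one 3907017344 KiB member (the per-device size reported by /proc/mdstat below) at 82 MB/s would take roughly 13 hours. Rough shell arithmetic, treating MB and MiB interchangeably:
## Seconds for one full pass over a member at 82 MB/s
# echo $(( 3907017344 / (82 * 1024) )) seconds
46529 seconds
At the observed 40 MB/s the same pass takes roughly twice as long, which matches the finish estimate in the /proc/mdstat output below.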
# echo 800000 > /proc/sys/dev/raid/speed_limit_min
# echo 800000 > /proc/sys/dev/raid/speed_limit_max
# cat /proc/mdstat
md2 : active raid6 sdbd[10](S) sdbc[9] sdbf[0] sdbm[8] sdbl[7] sdbk[6] sdbe[11] sdbj[4] sdbi[3](F) sdbh[2] sdbg[1]
27349121408 blocks super 1.2 level 6, 128k chunk, algorithm 2 [9/8] [UUU_UUUUU]
[=========>...........] recovery = 47.3% (1849905884/3907017344) finish=855.9min speed=40054K/sec
But we only get 40 MB/s. And often this drops to 30 MB/s.
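One cheap way to see when the speed drops (and correlate it with other load on the box) is to log the recovery line periodically; the log path below is just an example:
## In another shell: log rebuild progress once a minute
# while sleep 60; do date; grep -A 2 md2 /proc/mdstat; done >> /root/md2-resync.log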
# iostat -dkx 1
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdbc 0.00 8023.00 0.00 329.00 0.00 33408.00 203.09 0.70 2.12 1.06 34.80
sdbd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdbe 13.00 0.00 8334.00 0.00 33388.00 0.00 8.01 0.65 0.08 0.06 47.20
sdbf 0.00 0.00 8348.00 0.00 33388.00 0.00 8.00 0.58 0.07 0.06 48.00
sdbg 16.00 0.00 8331.00 0.00 33388.00 0.00 8.02 0.71 0.09 0.06 48.80
sdbh 961.00 0.00 8314.00 0.00 37100.00 0.00 8.92 0.93 0.11 0.07 54.80
sdbj 70.00 0.00 8276.00 0.00 33384.00 0.00 8.07 0.78 0.10 0.06 48.40
sdbk 124.00 0.00 8221.00 0.00 33380.00 0.00 8.12 0.88 0.11 0.06 47.20
sdbl 83.00 0.00 8262.00 0.00 33380.00 0.00 8.08 0.96 0.12 0.06 47.60
sdbm 0.00 0.00 8344.00 0.00 33376.00 0.00 8.00 0.56 0.07 0.06 47.60
iostat says the disks are not 100% busy (only 40-50%). This fits with the hypothesis that the per-disk maximum is around 80 MB/s: reading ~33 MB/s per disk at ~48% utilisation is consistent with a per-disk ceiling of roughly 70-80 MB/s.
Since this is software RAID the limiting factor could be CPU. top says:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
38520 root 20 0 0 0 0 R 64 0.0 2947:50 md2_raid6
6117 root 20 0 0 0 0 D 53 0.0 473:25.96 md2_resync
So md2_raid6 and md2_resync are clearly busy, taking up 64% and 53% of a CPU respectively, but neither is near 100%.
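The stripe handling for an md array typically runs in a single kernel thread (md2_raid6 here), so one saturated core can be the limit even when total CPU usage looks modest; per-core figures make that visible. mpstat is part of the same sysstat package as iostat:
## Check whether any single core is pegged while the rebuild runs
# mpstat -P ALL 1 5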
The chunk size (128k) of the RAID was chosen after measuring which chunk size gave the smallest CPU penalty.
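Related to the CPU question: the kernel benchmarks its RAID 6 parity implementations when the raid6 module loads and logs which one it picked together with its throughput; the exact wording varies by kernel version:
## Show which RAID 6 parity / XOR implementation the kernel selected
# dmesg | grep -iE 'raid6|xor'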
If this speed is normal: What is the limiting factor? Can I measure that?
If this speed is not normal: How can I find the limiting factor? Can I change that?
I don't remember exactly what speeds I had when I migrated from a 4-disk RAID 5 to a 6-disk RAID 6, but they were similar (4 TB usable array, 24 h rebuild, so around 45 MB/s).
You have to remember that even speed_limit_min will give some priority to applications that try to use the array. As such, the mechanism used to detect activity may require a 50% load on the disks in order to detect it and still be able to serve the I/O requests. Did you try unmounting the partition?

To check for bottlenecks you'll have to trace the kernel (for example, using the Linux Trace Toolkit lttng, or SystemTap). It's not easy and will take a lot of time, so unless you have to rebuild arrays on more than a few computers, it's probably not worth it. As for changing it: I'm sure such patches to the Linux kernel would be welcome :)
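A lighter-weight first step than full kernel tracing is system-wide sampling with perf while the rebuild runs; it shows whether the kernel time goes into parity computation, memory copies or the block layer. This is just a sketch and assumes perf is installed for the running kernel:
## Sample all CPUs for 30 seconds during the rebuild, then look at where kernel time goes
# perf record -a -g -- sleep 30
# perf report --sort comm,symbol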
I would not expect a RAID 6 recovery operation to be sequential in nature, since it usually needs to read checksum and data blocks from n-1 drives, and those are interleaved with the data blocks on each drive.

In addition to this I would expect a somewhat sequential (i.e. not fully parallel) chain of operations per stripe: the blocks are read from the surviving drives, the missing block is reconstructed, and the result is written to the spare. At least the final write is a synchronisation point, so each pass takes at least as long as its slowest read. How well it performs is determined by the level of parallelisation in every involved layer (md, driver, controller (NCQ etc.)).

I would never expect the rebuild rate of a RAID 6 to be anywhere near the sequential read/write speed of the single disks.
For comparison: our PS6000 Equallogic arrays (16x1TB) take around 32 hours under moderate load to rebuild a failed disk.