I have a machine with an 8-channel LSI SAS3008 controller chip, and individual drive testing shows I can write to any disk, or to all disks at once, at a sustained 174 MB/sec to 193 MB/sec.
This is the output from the command dd if=/dev/zero of=/dev/mapper/mpath?p1 bs=1G count=100 oflag=direct run in parallel against all 12 disks:
107374182400 bytes (107 GB) copied, 556.306 s, 193 MB/s
107374182400 bytes (107 GB) copied, 566.816 s, 189 MB/s
107374182400 bytes (107 GB) copied, 568.681 s, 189 MB/s
107374182400 bytes (107 GB) copied, 578.327 s, 186 MB/s
107374182400 bytes (107 GB) copied, 586.444 s, 183 MB/s
107374182400 bytes (107 GB) copied, 590.193 s, 182 MB/s
107374182400 bytes (107 GB) copied, 592.721 s, 181 MB/s
107374182400 bytes (107 GB) copied, 598.646 s, 179 MB/s
107374182400 bytes (107 GB) copied, 602.277 s, 178 MB/s
107374182400 bytes (107 GB) copied, 604.951 s, 177 MB/s
107374182400 bytes (107 GB) copied, 605.44 s, 177 MB/s
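For reference, a parallel run like the one above can be launched with a simple shell loop; the mpath?p1 device glob is the same as in the command above, the rest is just a sketch:

```
# Start one direct-I/O dd writer per multipath partition, then wait for all of them.
for dev in /dev/mapper/mpath?p1; do
    dd if=/dev/zero of="$dev" bs=1G count=100 oflag=direct &
done
wait
```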
However, when I put these disks together as a software RAID 10 device, I get around 500 MB/sec write speed. I expected about double that (writes go to six mirror pairs in parallel, so roughly 6 × ~180 MB/sec), since there is no penalty for accessing these disks at the same time.
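For context, the array was built roughly like this (the mdadm options and device names below are illustrative, not necessarily the exact command used):

```
# Assemble the 12 multipath partitions into one RAID 10 array (illustrative options).
mdadm --create /dev/md10 --level=10 --raid-devices=12 /dev/mapper/mpath?p1
mdadm --detail /dev/md10   # verify the level, layout, and that all 12 devices are active
```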
I did notice that the md10_raid10 process, which I assume performs the software RAID itself, is nearing 80% CPU, and one core is always at 100% wait time and 0% idle (which core that is changes, however).
Additionally, the performance drops even further when writing through the buffer cache to the mounted ext4 filesystem rather than using oflag=direct to bypass the cache. The disks report 25% busy (according to munin monitoring) and are clearly not running hot, but I worry the md10 device itself may be.
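To see where the time is going without relying on munin, something like the following works (iostat, pidstat and mpstat come from the sysstat package):

```
# Per-device utilisation, including the dm multipath and md devices
iostat -x 1

# CPU usage of the md10_raid10 kernel thread
pidstat -p "$(pgrep md10_raid10)" 1

# Per-core breakdown, to spot the single core sitting at 100% iowait
mpstat -P ALL 1
```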
Any suggestions on where to go next with this? I am attempting a hardware RAID 10 config to compare, although it seems I can only build a 10-disk unit -- that said, I hope to get 900 MB/sec sustained writes. I'll update this question as I discover more.
Edit 1:
If I put a dd command in a tight loop writing to an ext4 partition mounted on that device, and I do not use the buffer cache (oflag=direct), I can get upwards of 950 MB/sec at peak and 855 MB/sec sustained with some alterations to the mount flags.
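The loop itself is nothing fancy; a sketch (the target path is a placeholder):

```
# Repeated large direct-I/O writes to a file on the ext4 filesystem mounted on the md device.
while true; do
    dd if=/dev/zero of=/mnt/md10/testfile bs=1G count=100 oflag=direct
done
```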
If I also read with iflag=direct at the same time, I can get 480 MB/sec writes and 750 MB/sec reads.
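The concurrent read was simply another dd going the other way, e.g.:

```
# Direct-I/O reader running alongside the writer loop (file path is a placeholder).
dd if=/mnt/md10/testfile of=/dev/null bs=1G iflag=direct
```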
If I write without oflag=direct, thus using the buffer cache, I get 230 MB/sec writes and 1.2 MB/sec reads, but the machine seems to be very sluggish.
So, the question is: why does using the buffer cache so seriously affect performance? I have tried various disk queueing strategies, including 'noop' at the drive level with 'deadline' or 'cfq' on the appropriate multipath dm device, 'deadline' on everything, and 'none' on the dm device with 'deadline' on the backing drive. It seems like the backing drive should have 'none' and the multipath device should be the one carrying the scheduler I want, but none of this affects performance at all, at least in the buffer-cache case.
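For reference, the scheduler switching was done along these lines (sdb and dm-2 are example names; map the actual devices with lsblk or multipath -ll first):

```
# Example: 'noop' on a backing SAS drive, 'deadline' on the multipath dm device.
echo noop     > /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/dm-2/queue/scheduler
cat /sys/block/sdb/queue/scheduler /sys/block/dm-2/queue/scheduler   # verify
```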
Edit:
Your dd oflag=direct observations might be due to power management issues. Use PowerTOP to see if your CPU's C-states are switched too often above C1 under write load. If they are, try tweaking PM to ensure the CPU is not going to sleep and re-run the benchmarks. Refer to your distro's documentation on how to do that - in most cases this will be the intel_idle.max_cstate=0 kernel bootline parameter, but YMMV.
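A sketch of how to check and pin the C-states on a typical GRUB-based distro (paths and tools vary between distros):

```
# Watch C-state residency while the write load is running (PowerTOP's "Idle stats" tab).
powertop

# To keep the CPU out of deep C-states, add the parameter to the kernel command line,
# e.g. in /etc/default/grub, then regenerate the grub config and reboot:
#   GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0"
```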
The vast difference in performance between an O_DIRECT write and a buffered write might be due to:

obsoleted answer:
This looks very much like a bottleneck caused by the single thread in md.

Reasoning:
- the dd run is showing 170 MB/s+ per drive, so the path is not restricted by the connecting PCIe bandwidth
- while patches for multithreaded RAID5 checksum calculation have been committed to mdraid in 2013, I cannot find anything about similar RAID1 / RAID10 enhancements, so they might simply not be there

Things to try:
- another writer thread with dd, just to see if it changes anything (a sketch follows below)
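A sketch of that second-writer test (paths are placeholders):

```
# Two concurrent direct-I/O writers against the same array, to see whether the
# aggregate write throughput rises past the single-writer figure.
dd if=/dev/zero of=/mnt/md10/writer1 bs=1G count=50 oflag=direct &
dd if=/dev/zero of=/mnt/md10/writer2 bs=1G count=50 oflag=direct &
wait
```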
FWIW, you rarely (if ever) will see write performance peak out at the media's bandwidth with mechanical storage, especially with a non-CoW filesystem. Most of the time you will be restricted by seek times, so peak bandwidth should not be of great concern as long as it meets your minimum requirements.
¹ If you do ZFS, you should refine your testing method, as writing all-zero blocks to a ZFS dataset might be arbitrarily fast: zeros are not written to disk but just linked to the all-zero block if compression is enabled for the dataset.