I'm trying to optimize a storage setup on some Sun hardware with Linux. Any thoughts would be greatly appreciated.
We have the following hardware:
- Sun Blade X6270
- 2* LSISAS1068E SAS controllers
- 2* Sun J4400 JBODs with 1 TB disks (24 disks per JBOD)
- Fedora Core 12
- 2.6.33 release kernel from FC13 (also tried with latest 2.6.31 kernel from FC12, same results)
Here's the datasheet for the SAS hardware:
http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf
It's using PCI Express 1.0a, 8x lanes. With a bandwidth of 250 MB/sec per lane, we should be able to do 2000 MB/sec per SAS controller.
Each controller can do 3 Gb/sec per port and has two 4-port PHYs. We connect both PHYs on each controller to a JBOD, so between the JBOD and the controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth.
With write caching enabled and when doing big writes, each disk can sustain about 80 MB/sec (near the start of the disk). With 24 disks, that means we should be able to do 1920 MB/sec per JBOD.
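Putting those ceilings side by side (this just restates the numbers above in one place; the 300 MB/sec per-lane figure is 3 Gb/sec after 8b/10b encoding):

    # Theoretical ceilings per controller/JBOD pair, from the figures above
    pcie_mb=$(( 8 * 250 ))      # PCIe 1.0a x8: 2000 MB/sec
    sas_mb=$(( 2 * 4 * 300 ))   # 2 PHYs * 4 SAS ports * 300 MB/sec: 2400 MB/sec
    disk_mb=$(( 24 * 80 ))      # 24 disks * 80 MB/sec: 1920 MB/sec
    echo "per JBOD: min($pcie_mb, $sas_mb, $disk_mb) MB/sec; two JBODs: $(( 2 * disk_mb )) MB/sec"

So the disks are the tightest limit, and two JBODs should top out around 3840 MB/sec.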
We access the disks through dm-multipath; this is the stanza used for each device:

    multipath {
        rr_min_io             100
        uid                   0
        path_grouping_policy  multibus
        failback              manual
        path_selector         "round-robin 0"
        rr_weight             priorities
        alias                 somealias
        no_path_retry         queue
        mode                  0644
        gid                   0
        wwid                  somewwid
    }
I tried values of 50, 100, and 1000 for rr_min_io, but it doesn't seem to make much difference.
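One way to push a changed rr_min_io into the running maps and check the result (a sketch; multipath -F only flushes maps that aren't open, so the dd's have to be stopped first):

    multipath -F     # flush the existing maps
    multipath -v2    # rebuild them from the edited /etc/multipath.conf
    multipath -ll    # verify: one path group per device, selector "round-robin 0"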
Along with varying rr_min_io, I tried adding a short delay between starting the dd's to prevent all of them from writing over the same PHY at the same time, but this didn't make any difference, so I think the I/Os are getting spread out properly.
According to /proc/interrupts, the SAS controllers are using an "IR-IO-APIC-fasteoi" interrupt scheme. For some reason, only core #0 in the machine is handling these interrupts. I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller:
    echo 2 > /proc/irq/24/smp_affinity
    echo 4 > /proc/irq/26/smp_affinity
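(The IRQ numbers 24 and 26 are just what the controllers got on this box; a sketch of how to locate them and double-check the masks. The grep pattern is a guess, since the label in /proc/interrupts depends on the driver.)

    # Spot the controllers' IRQ lines; with the Fusion MPT driver they tend to
    # show up under a name like ioc0/ioc1 rather than the card's model number
    grep -iE 'mpt|ioc' /proc/interrupts
    # smp_affinity is a CPU bitmask: 2 = CPU1, 4 = CPU2. Verify the masks took:
    cat /proc/irq/24/smp_affinity
    cat /proc/irq/26/smp_affinity
    # irqbalance will rewrite these masks if it's left running
    service irqbalance stop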
Using dd to write to the disk generates "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off this core too.
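For what it's worth, "Function call interrupts" are the CAL line in /proc/interrupts (inter-processor function-call IPIs); watching that line shows which core they land on:

    # Watch the CAL ("Function call interrupts") counters per CPU
    watch -n 1 "grep -E 'CPU|CAL' /proc/interrupts"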
I run 48 dd's (one for each disk), assigning them to cores not dealing with interrupts like so:
taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M
oflag=direct opens the device with O_DIRECT, which keeps the page cache out of the picture.
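Spelled out as a script, the launch step looks roughly like this (a sketch: the /dev/mapper/mpath* glob and the core list stand in for the real device names and whichever cores are left free of interrupt handling):

    #!/bin/bash
    # One dd per multipath device, pinned to cores that aren't handling the SAS
    # or "function call" interrupts; glob and core list are placeholders.
    cores=(3 5 6 7 8 9 10 11 12 13 14 15)
    i=0
    for dev in /dev/mapper/mpath*; do
        core=${cores[$(( i % ${#cores[@]} ))]}
        taskset -c "$core" dd if=/dev/zero of="$dev" oflag=direct bs=128M &
        sleep 0.2    # small stagger so they don't all hit the same PHY at once
        i=$(( i + 1 ))
    done
    wait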
None of my cores seem maxed out. The cores dealing with interrupts are mostly idle and all the other cores are waiting on I/O as one would expect.
    Cpu0  :  0.0%us,  1.0%sy,  0.0%ni, 91.2%id,  7.5%wa,  0.0%hi,  0.2%si,  0.0%st
    Cpu1  :  0.0%us,  0.8%sy,  0.0%ni, 93.0%id,  0.2%wa,  0.0%hi,  6.0%si,  0.0%st
    Cpu2  :  0.0%us,  0.6%sy,  0.0%ni, 94.4%id,  0.1%wa,  0.0%hi,  4.8%si,  0.0%st
    Cpu3  :  0.0%us,  7.5%sy,  0.0%ni, 36.3%id, 56.1%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu4  :  0.0%us,  1.3%sy,  0.0%ni, 85.7%id,  4.9%wa,  0.0%hi,  8.1%si,  0.0%st
    Cpu5  :  0.1%us,  5.5%sy,  0.0%ni, 36.2%id, 58.3%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu6  :  0.0%us,  5.0%sy,  0.0%ni, 36.3%id, 58.7%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu7  :  0.0%us,  5.1%sy,  0.0%ni, 36.3%id, 58.5%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu8  :  0.1%us,  8.3%sy,  0.0%ni, 27.2%id, 64.4%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu9  :  0.1%us,  7.9%sy,  0.0%ni, 36.2%id, 55.8%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu10 :  0.0%us,  7.8%sy,  0.0%ni, 36.2%id, 56.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu11 :  0.0%us,  7.3%sy,  0.0%ni, 36.3%id, 56.4%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu12 :  0.0%us,  5.6%sy,  0.0%ni, 33.1%id, 61.2%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu13 :  0.1%us,  5.3%sy,  0.0%ni, 36.1%id, 58.5%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu14 :  0.0%us,  4.9%sy,  0.0%ni, 36.4%id, 58.7%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu15 :  0.1%us,  5.4%sy,  0.0%ni, 36.5%id, 58.1%wa,  0.0%hi,  0.0%si,  0.0%st
Given all this, the throughput reported by running "dstat 10" is in the range of 2200-2300 MB/sec.
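(dstat only shows the aggregate; a per-device view, for example with iostat from sysstat, shows whether one controller or JBOD is lagging the other. A sketch:)

    dstat 10          # aggregate throughput (the 2200-2300 MB/sec figure)
    iostat -xm 10     # per-device MB/s, split by mpath device and underlying sd path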
Given the math above, I would expect something in the range of 2 * 1920 = 3840 MB/sec.
Does anybody have any idea where my missing bandwidth went?
Thanks!
Nice, well-prepared question :)
I'm a speeds'n'feeds man myself, and honestly I think you're on the money. I was half expecting your throughput to come in lower than it did; what I think you've got there is a build-up of minor, and expected, inefficiencies. For instance, it's very hard for a PCIe bus to stay at 100% utilisation all the time, so it's better to assume a low-90s percentage overall. Given the jitter this causes, the PHYs won't be 100% 'fed' all the time either, so you lose a bit there too, and the same goes for the cache, the disks, non-coalesced interrupts, I/O scheduling and so on. Basically it's minor inefficiency times minor inefficiency times... and so on, and it ends up being more than the 5-10% the expected inefficiencies would cost on their own. I've seen this kind of thing with HP DL servers talking to their MSA SAS boxes under W2K3 and then being NLB'ed over multiple NICs - frustrating but understandable, I guess. That's my 2c anyway; sorry it's not more positive.
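To put rough numbers on the compounding idea (these efficiency factors are purely illustrative, not measurements):

    # Five made-up "small" hits: PCIe, PHY feeding, cache/disks, interrupts, scheduling
    awk 'BEGIN { e = 0.92 * 0.95 * 0.95 * 0.95 * 0.97;
                 printf "they compound to ~%.0f%% overall efficiency\n", e * 100 }'

A handful of individually harmless 3-8% losses multiply out to a combined hit well past the 5-10% any one of them would suggest.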