I am running into an issue while performance-tuning a SAN. I am using SQLIO to test 24 mount points that are RAID-5 LUNs on an EMC DMX. The host I am testing from has 256 GB of RAM and 32 cores.
I am using a param file in my command line, which looks like this:
M:\ASRS\ASRS_SQLData01A\testfile.dat 8 0x0 6000
M:\ASRS\ASRS_SQLData02\testfile.dat 8 0x0 6000
M:\ASRS\ASRS_SQLData03\testfile.dat 8 0x0 6000
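(For anyone unfamiliar with the SQLIO param-file format: as I understand it, each line is, in order, the test file path, the number of threads driving that file, a CPU affinity mask — 0x0 meaning no affinity — and the file size in MB. So each line above is a 6000 MB test file driven by 8 threads. Annotated:)

```text
<path>                                 <threads>  <affinity mask>  <size MB>
M:\ASRS\ASRS_SQLData01A\testfile.dat   8          0x0              6000
```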
A sample command line looks like this:
call sqlio -kR -s60 -fsequential -o8 -b64 -LS -Fparam.txt
My question is this:
When I test just one mount point I see 850 MB/s and 14K IOPS, but when I test multiple files, 850 MB/s is still the most I ever see, so I believe I am hitting a bottleneck somewhere. The host has eight 4-gigabit Fibre Channel cards in it, so I have a hard time believing the host connectivity is the limit, which leaves me guessing it's the HBAs, the storage processors, or SQLIO.
Is there something I am missing that could be the bottleneck? Is this normal behavior, or should SQLIO be aggregating the throughput across all of the mount points?
As a side note, in an attempt to prove that SQLIO was not the problem and that it wasn't just averaging the bandwidth across files, I ran two instances of SQLIO at the same time against different mount points and saw roughly 400 MB/s on each. To me that proved the limit is not in SQLIO.
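As a rough sanity check (the per-port figure is an approximation, not a measurement: 4 Gb FC moves on the order of 400 MB/s of payload per direction after 8b/10b encoding), the 850 MB/s ceiling is suspiciously close to two ports' worth of 4 Gb FC:

```python
# Approximate payload bandwidth of one 4 Gb FC port, per direction,
# after 8b/10b encoding overhead -- an assumption, not a measured value.
USABLE_MB_PER_4GFC_PORT = 400

observed_mb = 850
ports_worth = observed_mb / USABLE_MB_PER_4GFC_PORT
print(f"The {observed_mb} MB/s cap is ~{ports_worth:.1f} ports' worth of 4GFC")
print(f"Eight ports could in theory move ~{8 * USABLE_MB_PER_4GFC_PORT} MB/s")
```

If only two of the eight paths were actually carrying I/O, this is exactly the kind of ceiling you would expect to see.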
Is PowerPath (or your multipathing equivalent) set up to load-balance across the HBAs properly? Are all of the HBAs actually working? You should be able to pop onto the server and look at the PowerPath configuration to get those answers.
It's always worth having a look in the Windows event log to see whether any messages are popping up from the HBAs or PowerPath.
I can't remember whether the DMX uses storage pools or not, but some good, basic questions when looking at SAN performance are: How many disks is that storage spread over? More is usually better; if it is just a few disks, question it. While you are asking about disks, you might as well ask about spindle speed, too: faster is better, and 15K RPM is best if you can't get SSD (which you probably can't). Do all of those mount points reference different areas of the same disk(s)? Is the SQL Server sharing those disks with other applications? How much write cache is available on the DMX, and are the test files large enough that they don't all fit in the cache?
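To put rough numbers on the spindle-count question (the per-disk IOPS figure and the write penalty below are generic assumptions, not measurements from this array):

```python
import math

# Rough spindle math. Assumptions: a 15K RPM disk sustains ~180 random
# IOPS, and RAID-5 costs 4 back-end I/Os per front-end write
# (read data, read parity, write data, write parity).
DISK_IOPS = 180
RAID5_WRITE_PENALTY = 4

def spindles_needed(front_end_iops, write_fraction):
    reads = front_end_iops * (1 - write_fraction)
    writes = front_end_iops * write_fraction
    backend_iops = reads + writes * RAID5_WRITE_PENALTY
    return math.ceil(backend_iops / DISK_IOPS)

print(spindles_needed(14000, 0.0))  # pure-read workload: 78 spindles
print(spindles_needed(14000, 0.3))  # 30% writes: 148 spindles
```

If the LUNs sit on far fewer spindles than numbers like these suggest, that alone can explain a throughput ceiling once the cache is out of the picture.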
(History lesson: IIRC, really old DMXes used SCSI drives on parallel buses to connect the service processor(s) to the disks. A SCSI-3 bus could hold up to 15 disks, but could be saturated by the I/O of just 3 or 4 15K RPM disks; it simply couldn't keep up with 15 (or even 7) disks. Which is why, more or less, we have SAS.)
SAN admins may tell you that there is so much write cache in the DMX that you can't overwhelm it. This is not necessarily true (I had exactly such an incident with a DMX 8 years ago, with a new, fancy Itanium SQL Server pushing data into it). They are often correct, but they hold this opinion because they usually worry more about storage space and utilization than about storage performance. Many SAN admins do not realize how fast SQL Server can generate data (for testing, do a couple of cross joins between some system tables, stick the resulting data into a temporary table with SELECT INTO, and watch the I/O on the log file).
SAN admins may also tell you that there are plenty of disks underneath your LUNs, which is also debatable. For reference, go to tpc.org and look at how the storage systems behind the benchmark submissions are configured. Remember: once the DMX (or anything else) runs out of write cache, the system has to rely on the abilities of the underlying disks.
The SAN admins should be able to tell if tests are running out of write cache or if the disks that your server's data is on are overloaded.
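One way to frame the write-cache question: if the array destages to disk more slowly than the host writes, the cache only buys a fixed number of seconds. All three numbers below are hypothetical; plug in the real figures for your array:

```python
# Hypothetical numbers -- substitute your array's real figures.
cache_gb = 64        # usable write cache on the array (assumed)
ingest_mb_s = 850    # rate at which the host pushes writes
destage_mb_s = 300   # rate at which the array flushes cache to disk (assumed)

net_fill_mb_s = ingest_mb_s - destage_mb_s
seconds_until_full = cache_gb * 1024 / net_fill_mb_s
print(f"Write cache absorbs the burst for ~{seconds_until_full:.0f} seconds")
```

This is why a 60-second SQLIO run against small test files can look far better than sustained production load: the whole test may fit inside the cache window.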
That's a good number of HBAs; I've never had more than four 4 Gb/s HBAs myself. Are you sure that you aren't hitting some sort of contention or bottleneck on your PCIe backplane? Different PCIe generations and slot widths have very different data rates.
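For reference, per-direction PCIe bandwidth is roughly 250 MB/s per lane for gen 1 and 500 MB/s per lane for gen 2, before protocol overhead, and a dual-port 4 Gb FC HBA can move on the order of 800 MB/s. A quick sketch of the slot math:

```python
# Approximate per-direction PCIe bandwidth in MB/s per lane,
# before protocol overhead: gen 1 ~250, gen 2 ~500.
LANE_MB = {1: 250, 2: 500}

def slot_mb(gen, lanes):
    return LANE_MB[gen] * lanes

# Can the slot keep up with a dual-port 4GFC HBA (~800 MB/s)?
print(slot_mb(1, 4))  # gen 1 x4 slot: 1000 MB/s -- barely enough
print(slot_mb(2, 8))  # gen 2 x8 slot: 4000 MB/s -- plenty of headroom
```

If several of those HBAs share lanes behind a bridge, or sit in narrow gen 1 slots, the backplane can cap you well below what eight FC ports suggest.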
Are you sure that all those cores are loading up evenly when you run sqlio, and that none of them are hitting 100%? A quick look at Task Manager will tell you.
Beyond that, I think that you would want a SAN admin to look at the SAN side, including whatever fabric switches are between your server and the DMX.