I'm not a storage guy. I know how to spell SAN and a few basics beyond that, but not much further.
Are standard disk counters reliable when measuring against SAN storage? We have 2 MS SQL (2005) servers, both attached to the same SAN, that started experiencing problems yesterday. We're not in control of the hardware, so I don't have much information on how the storage is configured other than what I see down to the LUN via Veritas Enterprise Admin (i.e. just basic volume configuration). I don't have any access to the tools to monitor throughput on the controllers or switches.
In lieu of that I was running perfmon counters (% disk time for physical and logical, disk queue length for both physical and logical). The numbers for % disk time for Physical Disk seem just whack - up to 32000% (yes, 32K).
Is that right, or am I correct in thinking that something is being aggregated up from below the LUN level to produce that metric, and that this counter isn't something I should use against SAN storage?
EDIT:
Should add that we've recently discovered that one of the 32 cache modules is having issues and was taken out of the mix. I know it's a Hitachi, but I don't know any specifics as to model.
UPDATE:
Hitachi just finished swapping out the faulty memory module and reinitializing the fiber port card, and now things appear to be back to normal. Thanks for the info guys!
The apparently mad numbers for %Disk Time do indicate something, but the way %Disk Time is derived by Perfmon means that numbers > 100% are not impossible.
%Disk Time is actually a calculated counter, and it comes from:
Avg. Disk sec/Transfer * Disk Transfers/sec * 100
Avg. Disk sec/Transfer takes the sum of the completion times for all IOs in the current interval and divides by the number of IOs, giving an average end-to-end completion time. Disk Transfers/sec is simply the total number of completed IOs divided by the interval.
Many of those IOs may have been initiated outside of the current interval, so the product of those two counters can exceed 100%. This can happen on any system, but it will exceed 100% more often on complex disk arrays like a SAN.
Because of the way it's calculated, %Disk Time doesn't really tell you much, although in this case it is telling you something is wrong. Calculating utilization as (100 - %Idle Time) is a better idea, as %Idle Time is actually directly measured.
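To make that concrete, here's a minimal sketch in Python of the derivation described above; the counter values are made-up sample numbers chosen to reproduce the 32000% figure from the question, not real measurements.

```python
# Sketch of how Perfmon derives %Disk Time from the two counters above:
# %Disk Time = Avg. Disk sec/Transfer * Disk Transfers/sec * 100

def pct_disk_time(avg_sec_per_transfer, transfers_per_sec):
    """Derived counter: average end-to-end IO time * completed IOs/sec * 100."""
    return avg_sec_per_transfer * transfers_per_sec * 100.0

def utilization_from_idle(pct_idle_time):
    """Preferred utilization estimate: %Idle Time is directly measured."""
    return 100.0 - pct_idle_time

# A deeply queued SAN LUN completing 8000 IO/s with a 40 ms average
# completion time (many of those IOs were issued before this interval started).
print(pct_disk_time(0.040, 8000))   # 32000.0  -> the "whack" 32000% reading
print(utilization_from_idle(2.5))   # 97.5     -> a more meaningful number
```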
Disk Queue Lengths can be much larger than they would be on a simple local storage setup, but in general if the Queue Length is >> the number of spindles backing the LUN then things are backing up, especially if the Queue Length rises steadily for any significant period of time. A value of 10 or even 20 on a LUN with 10-15 disks wouldn't be a problem at all, but 350 is definitely saying something is screwed up. A faulty or poorly configured cache could certainly cause problems like that, but there could be other reasons too.
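If you want to automate that rule of thumb, a tiny sketch like this works; the spindle count and the "well beyond" factor are assumptions you'd have to get from, or agree with, whoever manages the SAN.

```python
# Queue-length heuristic from the paragraph above: a sustained average
# queue length well beyond the number of spindles backing the LUN means
# IO is backing up somewhere below the host.

def queue_backed_up(avg_queue_length, spindles, factor=2.0):
    """Flag a LUN whose sustained queue length is >> the spindles behind it."""
    return avg_queue_length > spindles * factor

print(queue_backed_up(15, 12))    # False - fine for a 10-15 disk LUN
print(queue_backed_up(350, 12))   # True  - something is screwed up
```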
That said, if you want to know what's actually going on you really have to look at performance monitoring at the SAN level itself, and you will have to get that from your SAN folks. The problem may be with the disks on the LUN (maybe a disk has failed and a RAID rebuild is going on, possibly cache is disabled for some reason, maybe other LUNs striped off the same disks have a higher priority and are busy), possibly the cache is disabled/failed on that particular array, or maybe the SAN fabric or switches are experiencing issues.
There's an old but very good article on Disk Counters in Windows here.
What are your 'Avg. Disk Read Queue Length' and 'Avg. Disk Write Queue Length' perfmon values for those LUNs, and how do the two servers compare to each other?
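If it helps, here's a rough sketch of pulling those two counters programmatically with Python via pywin32's win32pdh module; I'm assuming you can install pywin32, and the LogicalDisk instance name "C:" and the 5 second sample interval are placeholders you'd need to swap for your actual LUNs.

```python
import time
import win32pdh

INSTANCE = "C:"   # placeholder - substitute the LogicalDisk instance for your LUN
PATHS = [
    r"\LogicalDisk(%s)\Avg. Disk Read Queue Length" % INSTANCE,
    r"\LogicalDisk(%s)\Avg. Disk Write Queue Length" % INSTANCE,
]

query = win32pdh.OpenQuery()
counters = [win32pdh.AddCounter(query, path) for path in PATHS]

# These are interval counters, so two collections are needed to get a value.
win32pdh.CollectQueryData(query)
time.sleep(5)
win32pdh.CollectQueryData(query)

for path, counter in zip(PATHS, counters):
    _, value = win32pdh.GetFormattedCounterValue(counter, win32pdh.PDH_FMT_DOUBLE)
    print(path, value)

win32pdh.CloseQuery(query)
```

Running that on both servers at roughly the same time would give you a like-for-like comparison against the same SAN.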
If you can negotiate some quiet time with your SAN guys then you could run IOZone on both machines and compare results.
Some counters are useful to you and some aren't. Things like current disk queue will tell you the queueing that the Windows host sees between when it sends the read/write command and when that command is processed against the cache in the SAN. Even if the disks are running fine, you can still see queueing on the host because of cache issues, switch issues or fibre issues.
Things like seconds per read and seconds per write work the same way; they tell you how long it took to read from or write to the cache.
Numbers like IO writes per second are a little more useful. Again, this is IO to the SAN cache, but that IO has to make it to the disk at some point. The same goes for IO reads per second: this is reads from both disk and cache, but if the data is in the read cache it came off the disk at some point.
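One cheap sanity check on those host-side counters: by Little's Law the average queue length should be roughly IO per second times seconds per IO, so you can cross-check that the numbers you're collecting are at least internally consistent. A small sketch with made-up sample numbers:

```python
# Little's Law applied to the host-side counters discussed above:
# outstanding IOs ~= (IO per second) * (seconds per IO).

def expected_queue_length(io_per_sec, sec_per_io):
    """Outstanding IOs the host should see for a given IO rate and latency."""
    return io_per_sec * sec_per_io

# 2000 reads/sec at 3 ms each: modest queueing, even though many of those
# reads may be satisfied from the SAN's cache rather than the spindles.
print(expected_queue_length(2000, 0.003))   # 6.0
# The same rate at 40 ms each points at trouble somewhere below the host.
print(expected_queue_length(2000, 0.040))   # 80.0
```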