I have a server with ESXi 5 and iSCSI-attached network storage (4x1TB RAID-Z on FreeNAS). The two machines are connected to each other over Gigabit Ethernet, with a ProCurve switch in between.
After a while, if I have many (4-5 or more) VMs running, they start to become unresponsive (long delays before anything happens). We are trying to find the reason behind this.
Today we looked at esxtop and found that the DAVG of that iSCSI LUN stays at 70-80. I read that anything above 30 is critical!
What could be causing those high response times?
As you probably already know, DAVG refers to disk latency, and yes, anything greater than 30 ms is usually going to give you a noticeable drop in performance and responsiveness. Latency can be caused by many different issues, but first and foremost your disks must be able to handle the I/O load you are throwing at them.
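If you want to keep an eye on that counter over time rather than just watching it live, esxtop can also log to a file in batch mode. The interval and iteration values below are just examples:

```shell
# Interactive: run esxtop, then press `d` (disk adapter) or `u` (disk device)
# and watch the DAVG/cmd column for your iSCSI LUN.
esxtop

# Batch mode: sample every 5 seconds for 60 iterations (~5 minutes),
# then inspect the CSV offline (e.g. in perfmon or a spreadsheet).
esxtop -b -d 5 -n 60 > esxtop-results.csv
```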
I/O load refers not only to the number of I/Os per second (IOPS), but also to the pattern. Random I/O is pretty much what you should expect from virtualized servers, so your disk configuration needs to do well from a random-I/O perspective. Unfortunately, RAID-Z doesn't fit the bill: according to Oracle, a RAID-Z set can handle about the same number of random IOPS as a single disk in the set. A single 7.2k disk can do about 80 IOPS (and that may be a generous number, depending on who you ask), which means your entire RAID-Z array can only do about 80 random IOPS. Running 5-7 servers on that few IOPS is a recipe for terrible performance.
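To put numbers on that, here is the back-of-the-envelope math. The 80-IOPS-per-disk figure is the rough estimate from above, and the VM count is illustrative, not measured:

```python
# Rough random-IOPS budget for a RAID-Z vdev (illustrative numbers).
DISK_RANDOM_IOPS = 80   # typical 7.2k disk, per the estimate above
VMS = 6                 # somewhere in the 5-7 range from the question

# A RAID-Z set delivers roughly the random IOPS of a single member disk,
# no matter how many disks are in the set.
vdev_iops = DISK_RANDOM_IOPS

per_vm_iops = vdev_iops / VMS
print(f"Whole array: ~{vdev_iops} random IOPS")
print(f"Per VM:      ~{per_vm_iops:.0f} random IOPS")
```

A dozen-ish random IOPS per VM is nowhere near enough for a responsive guest OS, which lines up with the symptoms you're seeing.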
You would see far better performance if you configured your four drives as a RAID-10 set. If you need more than the 2 TB of raw capacity you'd get from RAID-10, do RAID-5 instead. Either will give you better random-I/O performance than RAID-Z in this case.
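To make that trade-off concrete, here is a rule-of-thumb comparison of the same four 1 TB disks in each layout. The per-disk IOPS figure and the standard write-penalty factors (2x for mirroring, 4x for single parity) are rough textbook assumptions, not measurements:

```python
# Back-of-the-envelope comparison of 4 x 1 TB disks in different RAID levels.
DISK_IOPS = 80   # assumed random IOPS of one 7.2k disk
DISKS = 4

layouts = {
    # usable capacity in TB, approximate random read / write IOPS
    "RAID-10": {"usable_tb": 2, "read_iops": DISKS * DISK_IOPS,
                "write_iops": DISKS * DISK_IOPS // 2},  # mirror write penalty
    "RAID-5":  {"usable_tb": 3, "read_iops": DISKS * DISK_IOPS,
                "write_iops": DISKS * DISK_IOPS // 4},  # parity write penalty
    "RAID-Z":  {"usable_tb": 3, "read_iops": DISK_IOPS,
                "write_iops": DISK_IOPS},               # ~one disk's IOPS total
}

for name, v in layouts.items():
    print(f"{name}: {v['usable_tb']} TB usable, "
          f"~{v['read_iops']} read / ~{v['write_iops']} write random IOPS")
```

Even with the write penalty, RAID-10 comes out roughly 2-4x ahead of RAID-Z on random I/O with the same spindles, at the cost of 1 TB of capacity.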