My question closely relates to my last question here on serverfault.
I was copying about 5 GB from a 10-year-old desktop computer to the server, using Windows Explorer. I would have expected such a data flow to barely register on the server.
But as usual with this server, it really slowed down. I could at least keep working in my remote session, even though there was some serious latency, and the copy took its time (20 min?). During that time I went to a colleague, and he tried to log in to the same server via Remote Desktop (for some other reason). It took about a minute to get to the login screen, a minute to open the Control Panel, a minute to open the Performance Monitor, and so on; icons were loading maybe one per second. We saw the following (from memory):
- CPU: 2%
- Avg. Queue Length: 50
- Pages/sec: 115 (?)
There was no other considerable activity on the server. The server only occasionally serves some ASP.NET pages, and these also became very slow during this time.
The relevant configuration is as follows:
- Windows 2003
- SEAGATE ST3500631NS (7200 rpm, 500 GB)
- LSI MegaRAID based RAID 5
- 4 disks, 1 hot spare
- Write through
- No read-ahead
- Direct cache mode
- Hard disk cache mode: off
Is this normal behaviour for such a configuration? What measurements could give further clues?
Is it reasonable to reduce the priority of such copy I/O and favour other processes like the remote desktop? How would you do that?
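For illustration, the kind of throttling I have in mind would be something like the following chunked copy with short pauses (a rough Python sketch with made-up paths, not what Windows Explorer actually does):

```python
# Rough illustration of a rate-limited copy: read and write in small chunks
# and sleep in between, so the disk queue on the server keeps some headroom.
import time

SRC = r"\\old-desktop\share\data.bin"   # placeholder source on the old desktop
DST = r"D:\incoming\data.bin"           # placeholder destination on the server
CHUNK = 1024 * 1024                     # copy 1 MB at a time
PAUSE = 0.05                            # 50 ms per chunk caps the rate at roughly 20 MB/s

with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)
        time.sleep(PAUSE)               # crude throttle between chunks
```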
Many thanks!
Disc overload. That simple. An average queue length of 50 - also check "Avg. Disk sec/Read" and "Avg. Disk sec/Write"; those will be far too high as well.
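If you want to grab those numbers without sitting in perfmon, a rough sketch like this gives you the same information (assuming Python with the psutil package is available on the box; the PhysicalDisk counters "Avg. Disk sec/Read" and "Avg. Disk sec/Write" in perfmon are the equivalent):

```python
# Sample the cumulative disk counters twice and compute the average
# time per read and per write over the interval.
import time
import psutil  # assumption: psutil is installed (pip install psutil)

INTERVAL = 10  # seconds between the two samples

before = psutil.disk_io_counters()
time.sleep(INTERVAL)
after = psutil.disk_io_counters()

reads = after.read_count - before.read_count
writes = after.write_count - before.write_count

# read_time / write_time are reported in milliseconds
sec_per_read = (after.read_time - before.read_time) / 1000.0 / max(reads, 1)
sec_per_write = (after.write_time - before.write_time) / 1000.0 / max(writes, 1)

print("avg sec/read:  %.3f" % sec_per_read)
print("avg sec/write: %.3f" % sec_per_write)
```

Anything that stays well above a few hundredths of a second per I/O for minutes at a time means the array is saturated.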
It looks a lot like you are simply overloading the discs, and having the hard disc cache mode off does not help either (a bad setting - at least turn on the read cache there, better yet write-back caching plus a UPS - without caching, SATA NCQ cannot work, which kills your performance).
The main problem is your RAID 5 - it has everything on it, file area AND operating system, so overloading the file area overloads the whole system.
For real servers I use WD Scorpio Black drives in a RAID 10 (4 discs) for the operating system and the virtualization root (I only do virtual) - the RAID 10 gives me better performance. For a high-performance file server I would / do add a SECOND array (it can be RAID 5) for the files. The trick is that the file area and the operating system area are never, ever allowed to overlap (share the same discs). In your case: get two small hard discs (80 GB or so), put a mirror on them and move the operating system onto that. Then the server stays usable when I/O is piling up.
Pages/sec on its own says little - it just means there is some virtual memory activity. If that hits the discs during your file copy (likely - there is another performance counter that shows the physical disc activity caused by page faults), then it naturally ends up in the same queue.
And please turn caching on. Can LSI sell you a BBU (battery backup unit)? I use Adaptec RAID controllers myself, and ever since I have had a BBU on them I run the cache in write-back (NOT write-through) mode - the performance gain from those optimisations is significant.
The problem has been characterized well by the other answers but in short:
Your RAID array, with 3 (active) 7200 RPM disks in RAID 5, has a write performance for extended copies of roughly 3/4 the speed of a single 7200 RPM drive. Given that you have disabled caching, read-ahead, etc., the performance will be even worse than that. From a write perspective, the performance of your server is going to be pretty poor with this config.
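To show where that 3/4 figure comes from: every small RAID 5 write costs four disk operations (read old data, read old parity, write new data, write new parity), so with three active spindles the array's write capacity works out to roughly three quarters of one drive. A back-of-the-envelope sketch (the per-disk IOPS figure is just a typical guess for a 7200 RPM SATA drive):

```python
# RAID 5 write penalty: each small write = read data + read parity
#                                        + write data + write parity = 4 I/Os
SINGLE_DISK_IOPS = 80      # rough guess for a 7200 RPM SATA drive
ACTIVE_DISKS = 3           # 4 disks, one of them the hot spare
WRITE_PENALTY = 4

array_write_iops = ACTIVE_DISKS * SINGLE_DISK_IOPS / WRITE_PENALTY
print(array_write_iops)    # ~60 IOPS, i.e. about 3/4 of a single drive
```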
If your 5 GB is a single large file (or a couple of fairly large files), and if the network copy is being sent faster than about 30 MB/sec (easy enough over a Gigabit connection), then your server's disks won't be able to keep up. The network copy buffering on the server will grow until it consumes all available memory, which will then force the OS to start paging excessively, further worsening your performance problem. Depending on what else is actually happening on the server, the copy speed needed to kill your system may be even lower than this; if there is any other sustained read/write activity, even at very low rates, an inbound copy over a 100 Mbit connection might be enough to trigger this sort of problem.
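As a rough illustration of how quickly that buffering can eat the server's memory (all numbers below are assumptions for illustration, not measurements from this box):

```python
# If data arrives faster than the cache-less RAID 5 can commit it,
# the difference piles up in the system cache until the OS starts paging.
inflow_mb_s = 30.0      # assumed arrival rate over the network
disk_mb_s = 20.0        # assumed sustained write rate of the array
free_ram_mb = 1024.0    # assumed free memory on the server

backlog_mb_s = inflow_mb_s - disk_mb_s        # backlog grows by 10 MB every second
seconds_until_paging = free_ram_mb / backlog_mb_s
print("free RAM gone after ~%d seconds" % seconds_until_paging)   # ~102 s
```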
Are you sure the RAID array wasn't being rebuilt? I've seen a rebuild/verify bring a box to its knees. You might even have a drive that is marginal and can't keep up with the others, but isn't throwing error codes (yet).
A 'RAID' drive should immediately tell the controller that it has a problem; 'consumer' drives (the same hardware, just different firmware) will keep retrying a failed request instead of failing fast. I've had a few that eventually got dropped from an array due to timeouts under load. They'd check out fine and (usually) rebuild without incident, only to start timing out again as soon as the box was under load. The constant rebuilds and stalling drives would bring the box to a standstill after a few rebuild cycles.