Some of our application owners are reporting that several processes are taking twice as long to run as they should.
This one has us scratching our heads.
We cannot figure out why some operations take twice as long on Server 1 as they do on Server 2.
Server 1: IBM x3850 M2 (RHEL 4 Nahant Update 8)
Server 1 is mostly idle from an I/O standpoint. Both servers use SAS drives in RAID 5, four drives each. iostat output from Server 1:
Linux [hostname-removed] 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:34:33 EDT 2009 i686 i686 i386 GNU/Linux
Output of /proc/cpuinfo
Output of /proc/meminfo
Server 2: IBM x3650 (RHEL 4 Nahant Update 8)
Server 2 is the more active of the two servers. The iostat output makes it look like there are a ton of devices attached because of SAN multipathing, but the dd and tar operations were done on local storage. iostat output from Server 2:
Linux [hostname-removed] 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan 7 17:52:47 EST 2009 i686 i686 i386 GNU/Linux
Output of /proc/cpuinfo
Output of /proc/meminfo
As expected, writing a 1GB file is quicker on Server 1:
[server1]$ time dd if=/dev/zero of=bigfile bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
real 0m15.032s
user 0m0.961s
sys 0m11.389s
Compared to Server 2, this seems to check out:
[server2]$ time dd if=/dev/zero of=bigfile bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
real 0m27.519s
user 0m0.531s
sys 0m8.612s
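For what it's worth, without a sync a dd run like this mostly measures writes into the page cache. Assuming a GNU dd that supports conv=fdatasync, a variant like the following would force the data to disk before the time is reported (we have not rerun it this way yet):

[server1]$ time dd if=/dev/zero of=bigfile bs=1024 count=1048576 conv=fdatasync
[server1]$ # or wrap an explicit sync into the timed command
[server1]$ time sh -c "dd if=/dev/zero of=bigfile bs=1024 count=1048576; sync"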
However, tarballing that same file on Server 1 takes twice as long in 'user' time and a bit longer in real time.
[server1]$ time tar -czf server1.tgz bigfile
real 0m27.696s
user 0m20.977s
sys 0m5.294s
[server2]$ time tar -czf server2.tgz bigfile
real 0m23.300s
user 0m10.378s
sys 0m3.603s
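Since the gap shows up mostly in 'user' time, one way to narrow it down (a sketch we have not run yet, assuming stock GNU tar and gzip) would be to time the compression and the archiving steps separately:

[server1]$ time gzip -c bigfile > /dev/null   # compression only, CPU-bound
[server1]$ time tar -cf - bigfile > /dev/null # archiving/read path only, no compression

If the gzip-only timings show the same 2x difference, that would point at the CPU side rather than the disks.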
The performance of heavy I/O operations depends much more on disk speed and the current I/O load than on the CPU.
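If that is the suspicion here, it can be checked directly by watching per-disk utilization and I/O wait while the test runs, for example with iostat from sysstat (extended stats, 5-second intervals) and vmstat:

[server1]$ iostat -x 5
[server1]$ vmstat 5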
These are exactly the kinds of problems a tool like collectl is ideal for addressing. Measuring the time it takes for dd or tar to run is a good start, but what is happening in between? Are your I/O rates steady, or are they hitting valleys and stalls? There are all kinds of things that can go wrong from start to finish.
Since you have a system with a known 'good' performance profile, you're in the best position to actually solve this problem. Run your tests along with collectl and watch your CPU, memory, network and disks (all on the same line, making it really easy to see trends over time). You can also look at things like NFS, TCP, sockets, and several other subsystems, but I suspect they don't apply in this case.
Now repeat the test on the box known to have poor performance and see what is different. The answer WILL be there. It could be starved memory, too many interrupts on the CPU (collectl can show you this too), or large I/O wait times. Whatever it is, collectl can identify it for you, but then you have to figure out the root cause. It could be a highly fragmented or even bad disk. Maybe there's something wrong with a controller. That part is for you to figure out.
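As a rough sketch of what such a run could look like (going from memory on collectl's subsystem flags, so check your version's man page), in one terminal:

[server1]$ collectl -scdmn -oT -i 1   # cpu, disk, memory and network, one timestamped line per second

and in another terminal rerun the test:

[server1]$ time tar -czf server1.tgz bigfile

Then do the same on Server 2 and compare the two collectl streams side by side.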
Hope this helps...
-mark