I have two Dell R730 systems with identical hardware configurations, purchased at the same time. Both are running RHEL6.9 and were imaged from the same image in January. I update the packages from the repository once a month, so in general everything on the two systems should be "nearly" identical (i.e. any software or setting I change on one system gets changed on the other, but since it is a manual process, something could have been missed).
I have noticed that performance on one system is 2.5X slower than on the other. The jobs I am testing are single-threaded and CPU-intensive. They read some data files, but disk I/O utilization is very low according to iostat. top shows the process constantly pegged at 100%, but the system has 88 threads and the load average is only approx. 1. Very little memory utilization. No network utilization (all files the jobs use are local). One job is a complex Python script, another is a proprietary software program; both run much slower on one system than on the other.
/proc/cpuinfo is identical. BIOS settings are identical. Only one user on the system. The faster system is connected to the internet, the slower one is on a standalone network.
In my investigations I've only found two differences:
1. The faster system is running BIOS version 2.25; the slower system is running BIOS version 2.43.
2. The slower system has auditd running. However, there is zero activity in the audit log during the process.
I realize this is difficult to debug, but I am running out of ideas of what to look for. Are there some built-in software tools I can use to get more insight into what might be going on?
My recommendations today with EL6 systems on enterprise hardware are the following:
Set the tuned profile to enterprise-storage or latency-performance.
Use sosreport to try to get a summary of both systems' configs.
Of course, you could also profile the processes with top, perf top, pidstat, or strace.
Or look at the servers in real time with Netdata and correlate all of the system metrics to see where the bottleneck(s) exist.
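Since both machines are supposed to be configured identically, one cheap check is to dump the same information on each host (for example the output of rpm -qa or sysctl -a saved to a text file) and compare the two dumps. The following is only a rough sketch of that comparison; the function name and the file arguments are made up for illustration, and it treats each line as an opaque entry rather than parsing it.

```python
import sys


def compare_snapshots(fast_text, slow_text):
    """Compare two text dumps line by line, ignoring order and blank lines.

    Returns (only_fast, only_slow): sorted lists of the entries that appear
    in just one of the two dumps.
    """
    fast = set(line.strip() for line in fast_text.splitlines() if line.strip())
    slow = set(line.strip() for line in slow_text.splitlines() if line.strip())
    return sorted(fast - slow), sorted(slow - fast)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare.py fast_host.txt slow_host.txt
    with open(sys.argv[1]) as handle:
        fast_dump = handle.read()
    with open(sys.argv[2]) as handle:
        slow_dump = handle.read()
    only_fast, only_slow = compare_snapshots(fast_dump, slow_dump)
    for entry in only_fast:
        print("only on fast host: " + entry)
    for entry in only_slow:
        print("only on slow host: " + entry)
```

A changed sysctl value or package version shows up once in each list, so the two lists are best read side by side.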
I also do the following in /etc/profile.d/tzfix.sh for good reason:
Just some ideas to start.
This is probably related to power management. Try putting both servers in high performance mode (power management disabled) and redo your performance tests.
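To see whether the clocks really differ while a job is running, and to have a repeatable number for the before/after comparison, something along these lines could be run on both servers. It is a minimal sketch, not a proper benchmark: it assumes the "cpu MHz" lines in /proc/cpuinfo reflect the current core clock (true on many x86 Linux systems, but worth confirming on yours), and the function names and loop size are arbitrary choices for illustration.

```python
import re
import time


def parse_cpu_mhz(cpuinfo_text):
    """Return every 'cpu MHz' value found in /proc/cpuinfo-style text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text, re.MULTILINE)
    return [float(value) for value in matches]


def time_busy_loop(iterations=2000000):
    """Time a fixed single-threaded integer loop; returns elapsed seconds."""
    start = time.time()
    total = 0
    i = 0
    while i < iterations:
        total += (i * i) % 7
        i += 1
    return time.time() - start


if __name__ == "__main__":
    elapsed = time_busy_loop()
    # Read the clocks right after the loop, while the core is still busy-ish.
    try:
        with open("/proc/cpuinfo") as handle:
            mhz = parse_cpu_mhz(handle.read())
    except IOError:
        mhz = []  # not Linux, or /proc is unavailable
    print("busy loop: %.2f s" % elapsed)
    if mhz:
        print("cpu MHz min/max: %.0f / %.0f" % (min(mhz), max(mhz)))
```

If the slow server reports a much longer loop time and its maximum clock stays well below the other machine's, that points toward power management rather than the software stack.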