We're running into an interesting conundrum that I'd appreciate some help troubleshooting. We have a service made up of several process types, and to distribute load we can start up n instances of most of them. For example, if we expect 200,000 connections and know that one instance of a given process type can handle around 5,000 connections before pegging at 100% CPU, we know we need at least 40 instances of that type running to handle the load.
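For concreteness, the sizing math is nothing fancier than this (a trivial sketch using the numbers from the example above):

import math

expected_connections = 200_000   # peak load we plan for
conns_per_process = 5_000        # one instance pegs ~100% CPU around here

# minimum number of instances of this process type needed to absorb the load
min_processes = math.ceil(expected_connections / conns_per_process)
print(min_processes)             # -> 40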
Recently, we've started consolidating our services to make better use of our hardware. During load testing, though, we've seen that changing nothing other than the number of instances of a given process type on a single box doubles the CPU% of each of those processes.
Here's a screenshot of the process CPU%:
Here's a screenshot of the host CPU%:
The earlier test had about 12 instances of this process on the box; the later test doubled that count. This would make sense if the box simply couldn't handle the load, but from what I can see that doesn't appear to be the case.
top - 14:55:08 up 54 days, 18:30, 1 user, load average: 22.26, 22.39, 22.03
Tasks: 581 total, 1 running, 580 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.8 us, 3.1 sy, 0.0 ni, 62.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 26385841+total, 16612808+free, 20537016 used, 77193320 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 24167782+avail Mem
Load average is within range (this is a 28-core server with 256GB of memory), disk I/O wait (wa) is 0.0, and the %Cpu(s) line is over 60% idle, so only around 10 of the 28 cores' worth of CPU is actually in use. I'm not sure what's causing the increased CPU%. Any ideas on what else to look for? Why does doubling the number of processes also double the amount of CPU time each process needs, if the CPU (according to top) is actually underutilized?
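In case it matters, the next thing I'm planning to capture is per-process CPU% alongside per-core utilization, to see whether a handful of cores are actually saturated even though the aggregate looks idle. A rough psutil sketch of that sampling (the process name 'myservice' is a placeholder for our actual process):

import time
import psutil  # assumption: psutil is installed; 'myservice' is a placeholder

procs = [p for p in psutil.process_iter(['name']) if p.info['name'] == 'myservice']

for p in procs:
    p.cpu_percent(None)              # prime the per-process counters

psutil.cpu_percent(percpu=True)      # prime the per-core counters
time.sleep(10)                       # sample window

per_core = psutil.cpu_percent(percpu=True)
print('per-core CPU%:', per_core)    # are any individual cores pegged near 100?

for p in procs:
    print(p.pid, p.cpu_percent(None))  # % of one core, same scale as top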