We are running a Ruby on Rails web app under Unicorn. Our app is not strictly CPU bound (we have a dual Xeon E5645 system with 12 cores, and the peak load average is around 6). We started with 40 Unicorn workers, but the application's memory footprint increased over time, so now we have to lower the number of worker processes. I thought the standard (number of CPU cores + 1) formula applied to Unicorn too, but my colleague tried to convince me that we should run more Unicorn instances per CPU and provided this link. Yet I am not exactly sure why we need to spend so much memory on idle Unicorn processes.
My question is: what is the reason to have more than one Unicorn instance per CPU core? Is it due to some architectural peculiarity of Unicorn? I am aware that busy Unicorn processes can't accept new connections (we are using UNIX domain sockets to communicate with the Unicorn instances, BTW), but I thought the backlog was introduced exactly to address this. Is there any way around this rule of 2 to 8 Unicorn instances per CPU?
Okay, I have finally found the answer. The optimal number of Unicorn workers is not directly connected to the number of CPU cores; it depends on your load and on the internal structure and responsiveness of your app. Basically, we use a sampling profiler to determine the workers' state, and we try to keep workers 70% idle and 30% doing actual work. So 70% of the samples should be "waiting on the select() call to get a request from the frontend server". Our research has shown that there are only 3 effective states for workers: 0-30% of samples are idle, 30-50% of samples are idle, and 50-70% of samples are idle (yes, we can get more idle samples, but there is no real point in it because application responsiveness does not change significantly). We consider the 0-30% situation a "red zone" and the 30-50% situation a "yellow zone".
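To make the zones concrete, here is a minimal Ruby sketch of how the idle share of profiler samples could be mapped to a zone. The method name `idle_zone` is hypothetical, so is the `:ok` label for the 50%-and-above case, and treating exactly 30% and 50% as belonging to the higher zone is my assumption. How you collect the samples depends on your profiler and is not shown.

```ruby
# Hypothetical helper: classify a worker by the share of profiler samples
# in which it was idle (waiting in select() for a request).
def idle_zone(idle_samples, total_samples)
  raise ArgumentError, "total_samples must be positive" unless total_samples.positive?

  idle_pct = 100.0 * idle_samples / total_samples

  if idle_pct < 30
    :red      # 0-30% idle: workers are overloaded
  elsif idle_pct < 50
    :yellow   # 30-50% idle: getting close to the limit
  else
    :ok       # 50% idle or more (label assumed, not from the answer)
  end
end

puts idle_zone(20, 100)
puts idle_zone(40, 100)
puts idle_zone(70, 100)
```

If workers land in the red or yellow zone under normal load, that would be the signal to add workers (or make the app faster) rather than the core count.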
You're right about N+1 for CPU-bound jobs.
On the other hand, Unicorn does not use threads, so every I/O operation blocks the whole worker process. While one worker is blocked, another process can be scheduled to parse HTTP headers, concatenate strings, and do whatever other CPU-intensive work is needed to serve a user (doing that work earlier reduces request latency).
So you may want more threads/processes than cores. Imagine the following situation: request A takes ten times longer than request B, you have several concurrent A requests, and a fast B request is just sitting in the queue waiting for an A request to complete. So if you can predict the number of heavy requests, you can use that number as another guideline for tuning the system.
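A toy Ruby simulation of that situation, as a rough sketch only: `queue_waits` is a hypothetical helper, all requests are assumed to arrive at the same moment and to be served first-come-first-served, and durations are in arbitrary time units. It is not how Unicorn schedules anything internally; it only shows the queueing effect.

```ruby
# Returns how long each request waits before a worker picks it up.
# Assumptions: every request arrives at time 0 and is served in FIFO order.
def queue_waits(durations, workers)
  free_at = Array.new(workers, 0)          # time at which each worker becomes free
  durations.map do |duration|
    i = free_at.index(free_at.min)         # pick the worker that frees up first
    start = free_at[i]
    free_at[i] = start + duration          # worker is busy until this time
    start                                  # waiting time, since arrival is at 0
  end
end

requests = [10, 10, 10, 1]                 # three slow A requests, then one fast B

p queue_waits(requests, 3)                 # B waits behind an A request
p queue_waits(requests, 4)                 # one spare worker: B starts at once
```

With three workers, the fast request waits 10 time units for a 1-unit job; with a fourth worker, it does not wait at all. That is the sense in which the expected number of concurrent heavy requests, rather than the number of cores, can guide the worker count.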