I'm not sure whether serverfault is the right place to ask this, but I wonder what choice you would make if you had to select a new CPU type for your Java Web Application:
a) a CPU with 32 cores and a clock speed of 2.5 GHz
or
b) a CPU with 8 cores but a clock speed of 3.8 GHz
Given the fact that each of the web application's incoming HTTP requests is served by a free Java thread, it might make sense to choose a), because you could process four times as many HTTP requests at the same time. However, on the other hand, CPU b) can finish the processing of a single HTTP request much faster...
What do you think?
Sidenotes:
- it has to be a physical machine, VMs or cloud solutions are not an option in this case
- RAM is not a concern; the server will have 512 GB of RAM in the end
- Caching: the Java web application features an extensive caching framework, so the choice is really on the CPUs.
tldr; The real answer is probably "more RAM", but as you've asked your question the answer is, of course, it depends. Then again, 32 cores @ 2.5 GHz will almost certainly beat 8 cores @ 3.8 GHz - it's 4 times the cores vs. roughly 1.5 times the clock speed. Not a very fair fight.
A few factors you should consider are transaction response time, concurrent users and application architecture.
Transaction response time: If your Java application responds to most requests in a few milliseconds, then having more cores to handle more concurrent requests is probably the way to go. But if your application mostly handles longer-running, more complex transactions, it might benefit from faster cores. (Or it might not - see below.)
Concurrent users and requests: If your Java application receives a large number of concurrent requests, then more cores will probably help. If you don't have that many concurrent requests, then you might just be paying for a bunch of extra idle cores.
Application architecture: Those long-running requests I mentioned won't benefit much from faster cores if the app server spends most of the transaction time waiting for responses from web services, databases, Kafka/MQ/etc. I've seen plenty of applications with 20-30 second transactions that only spend a small portion of their response time processing in the application itself, and the rest of the time waiting for responses from databases and web services.
You also have to make sure the different parts of your application fit together well. It doesn't do you much good to have 32 or 64 threads, each handling a request, all queuing up waiting for one of 10 connections in the JDBC pool - the "pig in a python" problem. A bit of planning and design now will save you a lot of performance troubleshooting later.
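To make that queueing concrete, here is a small, self-contained simulation (the 64 threads, 10 connections and 100 ms of "database work" are invented numbers for illustration): a Semaphore stands in for the JDBC pool, and throughput ends up capped by the pool size, not by the number of request threads or cores.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolMismatchDemo {
    public static void main(String[] args) throws Exception {
        int requestThreads = 64;          // threads accepted by the HTTP layer
        int dbConnections  = 10;          // "JDBC pool" size - the bottleneck
        int requests       = 640;
        long dbTimeMillis  = 100;         // time each request holds a connection

        ExecutorService workers = Executors.newFixedThreadPool(requestThreads);
        Semaphore pool = new Semaphore(dbConnections);   // stands in for the JDBC pool
        AtomicInteger done = new AtomicInteger();

        long t0 = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            workers.submit(() -> {
                try {
                    pool.acquire();                      // most of the 64 threads queue here
                    Thread.sleep(dbTimeMillis);          // simulated database work
                    pool.release();
                    done.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.MINUTES);

        double seconds = (System.nanoTime() - t0) / 1e9;
        // Throughput tops out near dbConnections / dbTime (~100 req/s here),
        // no matter how many request threads or CPU cores you add.
        System.out.printf("%d requests in %.1f s (~%.0f req/s)%n",
                done.get(), seconds, done.get() / seconds);
    }
}
```

Adding cores or request threads to this setup changes nothing until the pool grows or the time spent holding a connection shrinks.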
One last thing - what CPUs could you possibly be comparing? The cheapest 32-core 2.5 GHz CPU I can find costs at least 3 or 4 times more than any 8-core 3.8 GHz CPU.
Assuming your Java web server is appropriately configured, you should go for more cores.
There are still dependencies - semaphores, concurrent accesses - that will leave some threads waiting, whatever the number of cores or their speed. But that is better handled by the CPU (cores) than by the OS (multi-threading).
And anyway, 32 cores @ 2.5 GHz will handle more threads, and handle them better, than 8 cores @ 3.8 GHz.
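As a minimal sketch of what "appropriately configured" can mean (the fixed pool and the 2x headroom factor are assumptions, not a tuning recommendation), the worker pool can at least be derived from the core count the JVM actually sees rather than hard-coded:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPool {
    public static void main(String[] args) {
        // The JVM reports logical processors (hardware threads), e.g. 64 on a
        // 32-core CPU with SMT, 16 on an 8-core CPU with SMT.
        int logicalCores = Runtime.getRuntime().availableProcessors();

        // Hypothetical sizing: a bit of headroom over the core count so that
        // threads blocked on short waits don't leave cores idle.
        ExecutorService requestWorkers = Executors.newFixedThreadPool(logicalCores * 2);

        System.out.println("Logical cores: " + logicalCores);
        requestWorkers.shutdown();
    }
}
```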
Also, the heat produced by the CPU depends on the frequency (among other things), and the relationship is not linear: the 3.8 GHz part will generate more than 3.8/2.5 times the heat of the 2.5 GHz part (to be confirmed against your exact CPU models/brands - many sites offer detailed figures).
You tell us that a request takes about 100-200 ms to execute, and that it's mostly processing time (though it's difficult to separate actual CPU execution from what is in reality memory access), with very little I/O, waiting for databases, etc.
You would have to benchmark how long it actually takes on each of the two CPUs, but let's suppose it takes 150 ms on the slower CPU (with 32 cores) and 100 ms on the faster one (with only 8 cores).
Then the first CPU would be able to handle up to 32/0.15 = 213 requests per second.
The second CPU would be able to handle up to 8/0.1 = 80 requests per second.
So the big question is: how many requests per second do you expect? If you are nowhere near dozens of requests per second, then you don't need the first CPU, and the second one will give you faster execution time on each request. If you do need over 100 requests per second, then the first one makes sense (or it probably makes even more sense to have more than one server).
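Written out, the back-of-the-envelope arithmetic is just cores divided by per-request service time; the 150 ms and 100 ms figures below are the assumed values from the example above, not measurements:

```java
public class ThroughputEstimate {
    // Theoretical upper bound: cores / per-request service time (seconds),
    // assuming perfectly CPU-bound requests and no other bottlenecks.
    static double maxRequestsPerSecond(int cores, double serviceTimeSeconds) {
        return cores / serviceTimeSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("32 cores @ 150 ms: ~%.0f req/s%n", maxRequestsPerSecond(32, 0.150)); // ~213
        System.out.printf(" 8 cores @ 100 ms: ~%.0f req/s%n", maxRequestsPerSecond(8, 0.100));  // ~80
    }
}
```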
Note that these are very much back-of-the-envelope estimates. The only way to know for sure is to benchmark each of the servers with a real-life load. As stated above, fast CPUs or CPUs with lots of cores can quickly become starved for memory access. The size of the various CPU caches is very important here, as is the "working set" of each request. And that's considering truly CPU-bound work, with no system calls, no shared resources, no I/O...
Faster cores are generally better than more cores. I.e., if two processors have the same price, memory bandwidth, and multi-threaded benchmark scores, prefer the one with fewer, faster cores.
More cores only help if you have enough concurrent requests.
Faster cores improve both total throughput and the response time for each request.
Preliminary note
I'd like to second @PossiblyUsefulProbablyNot's definitely useful answer.
Especially this point.
Caveat
Not so much of an admin per se.
More of a software engineering perspective, maybe.
No alternative to measurement
What we know
So, the machine is a dedicated physical box with 512 GB of RAM, running a Java web application with an extensive caching layer, where a single request takes roughly 100-200 ms of mostly processing time.
Not all that vague a picture the OP is painting, but at the same time far from adequate data to give an answer pertaining to the OP's individual situation.
Sure, 32 cores at 2/3 the clock speed are likely to perform better than 1/4 of the cores at a comparatively small speed advantage. Sure, heat generated doesn't scale well with clock speeds above the 4 GHz threshold. And sure, if I had to blindly put my eggs in one basket, I'd pick the 32 cores any day of the week.
What we don't know
Way too much, still.
However, beyond these simple truths, I'd be very skeptical of any attempt at a more concrete and objective answer. If it is at all possible (and you have ample reason to remain convinced that operations per unit of time are a valid concern), get your hands on the hardware you intend to run the system on, and measure and test it, end to end.
An informed decision involves relevant and believable data.
In the vast majority of cases, memory is the bottleneck.
Granted, the OP is primarily asking about CPU cores vs. clock speed and thus memory appears on the fringes of being off-topic.
I don't think it is, though. To me, it appears much more likely that the question is based on a false premise. Now, don't get me wrong, @OP, your question is on-topic, well phrased, and your concern obviously real. I am simply not convinced that the answer to which CPU would perform "better" in your use case is at all relevant (to you).
Why memory matters (to the CPU)
Main memory is excruciatingly slow.
Historically, as compared to the hard drive, we tend to think of RAM as "the fast type of storage". In the context of that comparison, it still holds true. However, over the course of the recent decades, processor speeds have consistently grown at significantly more rapid a rate than has the performance of DRAM. This development over time has led to what is commonly known as the "Processor-Memory-Gap".
Fetching a cache line from main memory takes on the order of 100 clock cycles. During this time, your operating system will report one of the two hardware threads on one of the cores of your x86 CPU as busy.
As far as the availability of this hardware thread is concerned, your OS ain't lying, it is busy waiting. However, the processing unit itself, disregarding the cache line that is crawling towards it, is de facto idle.
No instructions / operations / calculations performed during this time.
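If you want to see that stall from Java, here is a rough, self-contained sketch (the array size and the use of System.nanoTime around a loop are assumptions; a real measurement would use a benchmark harness such as JMH): a sequential scan that the prefetcher can hide latency for, versus a dependent pointer-chase that pays the full main-memory latency on every step.

```java
import java.util.Random;

public class MemoryStallDemo {
    public static void main(String[] args) {
        int n = 1 << 25;                 // ~32M ints (~128 MB), far larger than any CPU cache
        int[] next = new int[n];
        for (int i = 0; i < n; i++) next[i] = i;

        // Sattolo's algorithm: shuffle into a single cycle, so every load in the
        // chase depends on the previous one and cannot be prefetched or overlapped.
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i);
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        long t0 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) sum += next[i];   // sequential: prefetcher-friendly
        long seq = System.nanoTime() - t0;

        t0 = System.nanoTime();
        int p = 0;
        for (int i = 0; i < n; i++) p = next[p];      // dependent loads: full memory latency each step
        long chase = System.nanoTime() - t0;

        System.out.printf("sequential: %d ms, pointer-chase: %d ms (sum=%d, p=%d)%n",
                seq / 1_000_000, chase / 1_000_000, sum, p);
    }
}
```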
Bottom line: If proper measurement is not an option, then rather than debating cores vs. clock speed, the safest investment for excess hardware budget is CPU cache size.
So, if memory is regularly keeping individual hardware threads idle, surely more ~cow bell~ cores is the solution?
In theory, if software was ready, multi/hyper-threading could be fast
Suppose you are looking at your tax returns of the last few years, say 8 years of data in total. You are holding 12 monthly values (columns) per year (row).
Now, a byte can hold 256 distinct values (its 8 binary digits may assume 2 states each, which results in 2^8 = 256 permutations of distinct state). Regardless of the currency, 256 feels a little low as an upper bound for salary figures. Further, for the sake of argument, let's assume the smallest denomination ("cents") does not matter (everybody earns whole integer values of the main denomination). Lastly, suppose the employer is aware of the salary gap between upper management and the regular workforce and hence keeps those selected few in an entirely different accounting system altogether.
So, in this simplified scenario, let's assume that twice the aforementioned amount of memory, i.e. 2 bytes (a "halfword"), used in unsigned form, i.e. representing the range [0, 2^16 = 65536), suffices to express all employees' monthly salary values. In the language / RDBMS / OS of your choice, you are now holding a matrix (some 2-dimensional data structure, a "list of lists") with values of uniform data size (2 bytes / 16 bits).
In, say, C++, that would be a std::vector<std::vector<uint16_t>>. I am guessing you'd use a vector of vectors of short in Java as well.
Now, here's the prize question:
Say you want to adjust the values for those 8 years for inflation (or some other arbitrary reason to write to the address space). We are looking at a uniform distribution of 16 Bit values. You will need to visit every value in the matrix once, read it, modify it, and then write it to the address space.
Does it matter how you go about traversing the data?
The answer is: yes, very much so. If you iterate over the rows first (the inner data structure), you will get near-perfect scalability in a concurrent execution environment. Here, an extra thread - and hence half the data processed in one and the other half in the other - will run your job twice as fast. 4 threads? 4 times the performance gain.
If, however, you choose to do the columns first, two threads will run your task significantly slower. You will need approximately 10 parallel threads of execution just to mitigate (!) the negative effect that the choice of major traversal direction had. And as long as your code ran in a single thread of execution, you couldn't have measured a difference.
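Here is a minimal, self-contained sketch of the two ways of splitting that work across threads (the matrix dimensions, the 1.02 "inflation" factor and the use of parallel streams are illustrative assumptions; a serious measurement would use a proper harness such as JMH):

```java
import java.util.stream.IntStream;

public class TraversalOrder {
    // Scaled up from the 8x12 tax example so cache effects become visible
    // (~240 MB of 16-bit values; may need a larger heap, e.g. -Xmx1g).
    static final int ROWS = 50_000, COLS = 2_400;
    static final short[][] salaries = new short[ROWS][COLS];

    public static void main(String[] args) {
        // Split the work by rows: each worker streams through whole, contiguous row arrays.
        long byRows = time(() -> IntStream.range(0, ROWS).parallel().forEach(r -> {
            for (int c = 0; c < COLS; c++)
                salaries[r][c] = (short) (salaries[r][c] * 1.02);   // "inflation adjustment"
        }));

        // Split the work by columns: for each of its columns, every worker strides
        // across all 50,000 row arrays, so most accesses miss the cache.
        long byCols = time(() -> IntStream.range(0, COLS).parallel().forEach(c -> {
            for (int r = 0; r < ROWS; r++)
                salaries[r][c] = (short) (salaries[r][c] * 1.02);
        }));

        System.out.printf("split by rows: %d ms, split by columns: %d ms%n",
                byRows / 1_000_000, byCols / 1_000_000);
    }

    static long time(Runnable task) {
        long t0 = System.nanoTime();
        task.run();
        return System.nanoTime() - t0;
    }
}
```

The point is not the exact numbers but that the split by rows keeps each worker on contiguous memory, while the split by columns does not.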
All else being equal:
--> Consider cache size, memory size, the hardware's speculative pre-fetching capabilities, and running software that can actually leverage parallelisation all to be more important than clock speed.
--> Even without reliance on 3rd-party distributed systems, make sure you truly aren't I/O bound under production conditions. If you must have the hardware in-house and can't let AWS / GCloud / Azure / Heroku / Whatever-XaaS-IsHipNow deal with that pain, spend on the SSDs you put your DB on. While you do not want the database to live on the same physical machine as your application, make sure the network distance (measure latency here too) is as short as possible.
--> The choice of a renowned, vetted, top-of-the-line, "Enterprise-level" HTTP server library that is beyond the shadow of a doubt built for concurrency does not alone suffice. Make sure any 3rd-party libraries you run in your routes are as well. Make sure your in-house code is, too.
This I get.
Various valid reasons exist.
But this not so much.
Neither AWS nor Azure invented distributed systems, micro-clustering or load balancing. It's more painful to set up on bare-metal hardware and without MegaCorp-style resources, but you can run a distributed mesh of K8s clusters right in your own living room. And tooling for recurring health checks and automatic provisioning on peak load exists for self-hosted projects too.
Here's a ~hypothetical~ reproducible scenario: enable zram as your swap space, because RAM is cheap and not important and all that. Now run a steady, memory-intensive task - not one that obviously thrashes, exactly. Once you reach the point of serious LRU inversion, your fan will get loud and your CPU cores hot, because the CPU is busy doing memory management (moving data in and out of swap).
In case I haven't expressed myself clearly enough: I think you should reconsider this opinion.
TL;DR?
32 cores.
More is better.