On a machine (AWS m5.large) that only runs nice'd background processing jobs (i.e. no web/DB/etc. servers present), are there disadvantages to consistently running the CPU at 100%?
I understand that running the system such that it consumes 100% of the available memory is not a good idea. Without swap, the kernel's OOM killer will simply kill processes when the system runs out of memory. Even with swap, the system will start swapping out pages, which slows the entire system down dramatically.
However, my understanding is that a system with nice'd processes running at 100% CPU usage will function without dramatic slowdowns. Is this correct?
Or, would it be better to try to configure the background processes such that the system stays within the range of 60% - 90% CPU usage?
So long as the system is doing what you need and is responsive to logins and changes, running at 100% CPU is no problem; that's what it's for. Nice only changes the relative priority of processes.
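For example, here is a minimal sketch (my own illustration, with a made-up ./crunch_job command) of launching such a job at the lowest priority, so logins and anything else interactive still win the CPU the moment they need it:
```python
# Hypothetical launcher: run a CPU-heavy batch job nice'd to 19 so interactive
# work (SSH sessions, monitoring agents) is scheduled ahead of it on demand.
import os
import subprocess

job = subprocess.Popen(
    ["./crunch_job", "--input", "batch-001"],   # placeholder command and args
    preexec_fn=lambda: os.nice(19),             # raise niceness in the child before exec
)
job.wait()
```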
On AWS, avoid the T-series instances if you're going to use 100% of a CPU, as they only give you a baseline fraction of each core plus burst credits. Beyond that allocation it's cheaper to get an M (general purpose), C (compute optimized), or other series instance with dedicated CPU than to pay for "T2 / T3 Unlimited".
To address a comment: AWS (and, I assume, other leading cloud providers) does not have a "fair use" policy for CPU; that tends to come from low-end providers or shared hosts. If you pay for a core, you can use the core at 100%. If your instances are underutilized, the AWS Trusted Advisor service recommends smaller instances to help you save money.
On premises you can obviously do whatever you like. This answer covers the common case and applies to the cloud, AWS specifically.
Whether you nice or not, running at 100% CPU means you are not processing your jobs as quickly as you could be if you had more CPU available. The entire system does indeed slow down. The only thing nice does for you is let you indicate which processes have higher or lower priority and should get more or less of your already limited CPU.
If your jobs are slower than you expect, the only thing that will make a significant difference is giving them more CPU. If you take it from other jobs, then those jobs will slow down. If you upgrade your CPU, then everything will run faster. Of course, since it's EC2, you could also just add more instances.
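To see what "more or less of your already limited CPU" means in practice, here is a small self-contained experiment (my own illustration, not part of the answer): two identical CPU-bound workers, one at the default niceness and one at 19, each counting how much work it completes. When there are more runnable workers than cores, the niced one falls behind; on an idle multi-core box both run flat out, because nice only matters under contention.
```python
import os
import time
from multiprocessing import Process, Value

def burn(niceness, counter):
    """CPU-bound loop that lowers its own priority, then tallies work done."""
    os.nice(niceness)                       # 0 = default, 19 = lowest priority
    while True:
        for _ in range(10_000):
            pass
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    normal, niced = Value("i", 0), Value("i", 0)
    workers = [Process(target=burn, args=(0, normal), daemon=True),
               Process(target=burn, args=(19, niced), daemon=True)]
    for w in workers:
        w.start()
    time.sleep(10)
    # On a box with spare cores the two counts come out similar; pin both to one
    # core (e.g. with taskset) and the nice-19 worker only gets the leftovers.
    print(f"nice 0: {normal.value}   nice 19: {niced.value}")
```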
There is no problem in running a CPU at 100%.
Even in the unlikely case that your specific hardware had a cooling problem leading to overheating under sustained load, that would be Amazon's issue, not yours, as this is an AWS server (rest assured, they took that into account in their pricing model).
If it weren't doing that job, it would be sitting idle, so if you need $job done, better have the machine doing it. You don't want to artificially restrict it.
The main disadvantage is that running the CPU continuously at 100% will draw more power. But you wanted that task done, right?¹
(¹ Do note that in some cases like bitcoin mining, the cost of electricity is higher than the value of the mined bitcoins)
Second, if the CPU is fully used at 100% on some not-too-important task (like crunching SETI packets), something more important may arrive (such as an interactive request by the owner), and the computer won't pay attention to it very promptly because it is busy processing those packets. This is solved by nicing the less-important task. Then the system knows how to prioritise the two and you avoid this problem.
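If the less-important task is already running, you don't even have to restart it under nice; you can lower its priority in place, which is what renice does. A minimal sketch, with a made-up PID:
```python
# Demote an already-running, not-too-important process so interactive requests
# get the CPU first. Equivalent to `renice -n 15 -p 12345`; 12345 is hypothetical.
import os

pid = 12345
os.setpriority(os.PRIO_PROCESS, pid, 15)   # higher niceness = lower priority
```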
In some places you may read that it is bad to have a server working at 100%. A server with its CPU at 100% shows a bottleneck in the process. You could produce more with more CPUs or quicker ones, but as long as you are happy enough with the throughput, it's ok. You can think of it as a shop where all the clerks are always busy. That is probably bad, since new customers can't shop there because nobody is free to serve them.
However, if you have a warehouse with items to sort, no special deadline, and enough work for the next 5 years, you would want everyone working full time on it, not keeping someone idle.
If the warehouse is near the shop, you can combine the two: the clerks serve customers, and when there are no customers left they keep sorting the warehouse until the next client arrives.
Traditionally, you have certain dedicated hardware and it's up to you to use it more or less. In a model like AWS, though, you have more options. (Note: I am assuming your task is made up of many small, easily parallelizable chunks.)
In some cases you could use several smaller instances for the cost of a big one and get more results (while for other task sets that wouldn't help).
Plus, the costs aren't fixed. You can probably benefit from launching extra instances off-hours, when they are cheaper, and shrinking the fleet when it'd be more expensive. Suppose you were able to borrow clerks from nearby stores (at a variable rate). The open-24-hours shop could happily let the employee on the night shift sort some of your warehouse items quite cheaply, since only a handful of customers will pass by. However, if you wanted an extra pair of hands on Black Friday, that would be much more expensive (in fact, better not to have anyone left sorting the warehouse that day).
AWS lets you scale your load dynamically, and when you don't need the results within a fixed time, you can optimize your costs noticeably. However, it has "too many options", and they are complex to understand. You also need to understand your workload pretty well in order to make the right decisions.
Yes, running at 100% CPU is fine, and there's no need to use nice here, which only demotes the processes' priority relative to normal processes, which you don't have.
If these are calculation-only processes with a definite end and no interactivity is expected, I'd go a step further and use SCHED_BATCH, which increases time slices to over a second and, in a low-memory situation, prioritizes making progress over fairness in scheduling, under the assumption that the processes will eventually terminate and free all their memory if you give them more CPU time.
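A minimal sketch (Linux only) of opting a worker into SCHED_BATCH from inside the process, via Python's os module; from a shell you can get the same effect at launch time with chrt -b 0 <command>.
```python
import os

def enter_batch_scheduling():
    # SCHED_BATCH uses a static priority of 0; the scheduler then treats the
    # process as a non-interactive CPU hog rather than something to wake quickly.
    os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))

if __name__ == "__main__":
    enter_batch_scheduling()
    # ... run the long, calculation-only job here ...
```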
It Depends
Some workloads, such as machine learning, 3-D rendering, media transcoding, and cryptocurrency mining, are designed to run at 100% CPU (*). These types of workloads are often optimized to divide their tasks into equal-sized blocks and to use 100% of every instruction pipeline of every CPU on the box. If you make a stink about 100% CPU utilization in these cases, your co-workers will think you're an idiot. Your question does not mention any of these specialized workloads, so read on.
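To make that concrete, here is a minimal generic sketch (not any of those specific products) of the pattern such workloads use: split the input into equal-sized blocks and keep one worker per core busy until the work is done, which pins the box at 100% entirely on purpose.
```python
from multiprocessing import Pool, cpu_count

N = 50_000_000                       # total amount of (toy) work

def work(chunk):
    # Placeholder for the real number crunching on one block of the input.
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    n = cpu_count()                                  # one worker per core/vCPU
    chunks = [range(i, N, n) for i in range(n)]      # equal-sized blocks of work
    with Pool(processes=n) as pool:
        results = pool.map(work, chunks)             # every core at ~100% until done
    print(sum(results))
```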
For general business workloads, on the other hand, you're often dealing with over-complicated and poorly written software that has to process tasks in irregularly sized blocks arriving at unpredictable intervals. For this type of workload, CPU starvation can lead to system instability and death spirals due to "co-morbidities": unpredictable memory utilization, database connections, database locking, and time-out configurations.
Example: Suppose you have a process that takes two minutes to complete when it has 100% of the CPU to itself, but the time increases to 10 minutes when it has to share the CPU with four other processes. Now suppose each process holds an external database connection the entire time while it's running, and the connection pool recycles connections older than 10 minutes. And then...
That's the sound of your pager going off in the middle of the night due to the mysterious failure of a batch job that hasn't been modified in months, because either the number of database connections got maxed out or the connection duration started to creep up past the configured 10-minute maximum. A death spiral kicks off as processes go into retry mode and new tasks keep arriving, and very soon you can't even get any telemetry or log in to the instance.
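A back-of-the-envelope sketch of that failure mode, using the numbers from the example above (all values are illustrative and assume purely CPU-bound jobs sharing cores evenly):
```python
SOLO_MINUTES = 2            # runtime with a whole core to itself
CORES = 1                   # cores this batch workload actually gets
POOL_RECYCLE_MINUTES = 10   # pool drops connections older than this

for jobs in range(1, 9):
    # More runnable jobs than cores: each job gets CORES/jobs of a CPU.
    slowdown = max(1.0, jobs / CORES)
    wall_minutes = SOLO_MINUTES * slowdown
    warning = "  <-- at or beyond the pool's recycle window" \
        if wall_minutes >= POOL_RECYCLE_MINUTES else ""
    print(f"{jobs} concurrent jobs -> ~{wall_minutes:.0f} min each{warning}")
```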
(*) Let's ignore GPU-bound workloads for now; that would be a whole new question.