On a machine (AWS m5.large) that only runs nice'd background processing jobs (i.e. no web/DB/etc. servers present), are there disadvantages to consistently running the CPU at 100%?
I understand that running the system such that it consumes 100% of the available memory is not a good idea. Without swap, the kernel's OOM killer will simply kill processes when the system runs out of memory. Even with swap, the system will start swapping out pages, which slows the entire system down dramatically.
However, my understanding is that a system with nice'd processes running at 100% CPU usage will function without dramatic slowdowns. Is this correct?
Or, would it be better to try to configure the background processes such that the system stays within the range of 60% - 90% CPU usage?
So long as the system is doing what you need and is responsive to logins and changes, running at 100% CPU is no problem; that's what it's for. Nice only changes the relative priority of processes.
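For example, here is a minimal sketch (my own illustration, with a made-up ./crunch_job command) of launching such a job at the lowest priority, so logins and anything else interactive still win the CPU the moment they need it:
```python
# Hypothetical launcher: run a CPU-heavy batch job nice'd to 19 so interactive
# work (SSH sessions, monitoring agents) is scheduled ahead of it on demand.
import os
import subprocess

job = subprocess.Popen(
    ["./crunch_job", "--input", "batch-001"],   # placeholder command and args
    preexec_fn=lambda: os.nice(19),             # raise niceness in the child before exec
)
job.wait()
```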
On AWS, avoid the T-series instances if you're going to use 100% of a CPU, as they only give you a baseline fraction of each core plus burst credits. Beyond that allocation it's cheaper to get an M (general purpose), C (compute optimized), or other series instance with dedicated CPU than to pay for "T2 / T3 Unlimited".
To address a comment: AWS (and, I assume, other leading cloud providers) does not have a "fair use" policy for CPU; that tends to come from low-end providers or shared hosts. If you pay for a core, you can use the core at 100%. If your instances are underutilized, the AWS Trusted Advisor service recommends smaller instances to help you save money.
On premises you can obviously do whatever you like. This answer covers the common case and applies to the cloud, AWS specifically.
Whether you nice or not, running at 100% CPU means you are not processing your jobs as quickly as you could be if you had more CPU available. The entire system does indeed slow down. The only thing nice does for you is let you indicate which processes have higher or lower priority and should get more or less of your already limited CPU.
If your jobs are slower than you expect, the only thing that will make a significant difference is giving them more CPU. If you take it from other jobs, then those jobs will slow down. If you upgrade your CPU, then everything will run faster. Of course, since it's EC2, you could also just add more instances.
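To see what "more or less of your already limited CPU" means in practice, here is a small self-contained experiment (my own illustration, not part of the answer): two identical CPU-bound workers, one at the default niceness and one at 19, each counting how much work it completes. When there are more runnable workers than cores, the niced one falls behind; on an idle multi-core box both run flat out, because nice only matters under contention.
```python
import os
import time
from multiprocessing import Process, Value

def burn(niceness, counter):
    """CPU-bound loop that lowers its own priority, then tallies work done."""
    os.nice(niceness)                       # 0 = default, 19 = lowest priority
    while True:
        for _ in range(10_000):
            pass
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    normal, niced = Value("i", 0), Value("i", 0)
    workers = [Process(target=burn, args=(0, normal), daemon=True),
               Process(target=burn, args=(19, niced), daemon=True)]
    for w in workers:
        w.start()
    time.sleep(10)
    # On a box with spare cores the two counts come out similar; pin both to one
    # core (e.g. with taskset) and the nice-19 worker only gets the leftovers.
    print(f"nice 0: {normal.value}   nice 19: {niced.value}")
```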
There is no problem in running a CPU at 100%.
Even in the unlikely case that your specific hardware had a cooling problem leading to overheating under sustained load, that would be Amazon's issue, not yours, as this is an AWS server (rest assured, they took that into account in their pricing model).
If it weren't doing that job, it would be sitting idle, so if you need $job done, better have the machine doing it. You don't want to artificially restrict it.
The main disadvantage is that running the CPU continuously at 100% will draw more power. But you wanted that task done, right?¹
(¹ Do note that in some cases like bitcoin mining, the cost of electricity is higher than the value of the mined bitcoins)
Second, if the CPU is fully used at 100% on some not-too-important task (like crunching SETI packets), something more important may arrive (such as an interactive request by the owner), and the computer won't pay attention to it very promptly because it is busy processing those packets. This is solved by nicing the less-important task. Then the system knows how to prioritise the two and you avoid this problem.
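If the less-important task is already running, you don't even have to restart it under nice; you can lower its priority in place, which is what renice does. A minimal sketch, with a made-up PID:
```python
# Demote an already-running, not-too-important process so interactive requests
# get the CPU first. Equivalent to `renice -n 15 -p 12345`; 12345 is hypothetical.
import os

pid = 12345
os.setpriority(os.PRIO_PROCESS, pid, 15)   # higher niceness = lower priority
```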
In some places you may read that it is bad to have a server working at 100%. A server with its CPU at 100% shows a bottleneck in the process. You could produce more with more CPUs or quicker ones, but as long as you are happy enough with the throughput, it's ok. You can think of it as a shop where all the clerks are always busy. That is probably bad, since new customers can't shop there because nobody is free to serve them.
However, if you have a warehouse with items to sort, no special deadline, and enough work for the next 5 years, you would want everyone working full time on it, not keeping someone idle.
If the warehouse is near the shop, you can combine the two: the clerks serve customers, and when there are no customers left they keep sorting the warehouse until the next client arrives.
Traditionally, you have certain dedicated hardware and it's up to you to use it more or less. In a model like AWS, though, you have more options. (Note: I am assuming your task is made up of many small, easily parallelizable chunks.)
In some cases you could use several smaller instances for the cost of a big one and get more results (while for other task sets that wouldn't help).
Plus, the costs aren't fixed. You can probably benefit from launching extra instances off-hours, when they are cheaper, and shrinking the fleet when it'd be more expensive. Suppose you were able to borrow clerks from nearby stores (at a variable rate). The open-24-hours shop could happily let the employee on the night shift sort some of your warehouse items quite cheaply, since only a handful of customers will pass by. However, if you wanted an extra pair of hands on Black Friday, that would be much more expensive (in fact, better not to have anyone left sorting the warehouse that day).
AWS lets you scale your load dynamically, and when you don't need the results within a fixed time, you can optimize your costs noticeably. However, it has "too many options", and they are complex to understand. You also need to understand your workload pretty well in order to make the right decisions.
Yes, running at 100% CPU is fine, and there's no need to use nice here, which only demotes the processes' priority relative to normal processes, which you don't have.
If these are calculation-only processes with a definite end and no interactivity is expected, I'd go a step further and use SCHED_BATCH, which increases time slices to over a second and, in a low-memory situation, prioritizes making progress over fairness in scheduling, under the assumption that the processes will eventually terminate and free all their memory if you give them more CPU time.
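A minimal sketch (Linux only) of opting a worker into SCHED_BATCH from inside the process, via Python's os module; from a shell you can get the same effect at launch time with chrt -b 0 <command>.
```python
import os

def enter_batch_scheduling():
    # SCHED_BATCH uses a static priority of 0; the scheduler then treats the
    # process as a non-interactive CPU hog rather than something to wake quickly.
    os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))

if __name__ == "__main__":
    enter_batch_scheduling()
    # ... run the long, calculation-only job here ...
```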
It Depends
Some workloads, such as machine learning, 3-D rendering, media transcoding, and cryptocurrency mining, are designed to run at 100% CPU (*). These types of workloads are often optimized to divide their tasks into equal-sized blocks and to use 100% of every instruction pipeline of every CPU on the box. If you make a stink about 100% CPU utilization in these cases, your co-workers will think you're an idiot. Your question does not mention any of these specialized workloads, so read on.
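To make that concrete, here is a minimal generic sketch (not any of those specific products) of the pattern such workloads use: split the input into equal-sized blocks and keep one worker per core busy until the work is done, which pins the box at 100% entirely on purpose.
```python
from multiprocessing import Pool, cpu_count

N = 50_000_000                       # total amount of (toy) work

def work(chunk):
    # Placeholder for the real number crunching on one block of the input.
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    n = cpu_count()                                  # one worker per core/vCPU
    chunks = [range(i, N, n) for i in range(n)]      # equal-sized blocks of work
    with Pool(processes=n) as pool:
        results = pool.map(work, chunks)             # every core at ~100% until done
    print(sum(results))
```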
For general business workloads, on the other hand, you're often dealing with over-complicated and poorly written software that has to process tasks in irregularly sized blocks arriving at unpredictable intervals. For this type of workload, CPU starvation can lead to system instability and death spirals due to "co-morbidities": unpredictable memory utilization, database connections, database locking, and time-out configurations.
Example: Suppose you have a process that takes two minutes to complete when it has 100% of the CPU to itself, but the time increases to 10 minutes when it has to share the CPU with four other processes. Now suppose each process holds an external database connection the entire time while it's running, and the connection pool recycles connections older than 10 minutes. And then...
That's the sound of your pager going off in the middle of the night due to the mysterious failure of a batch job that hasn't been modified in months, because either the number of database connections got maxed out or the connection duration started to creep up past the configured 10-minute maximum. A death spiral kicks off as processes go into retry mode and new tasks keep arriving, and very soon you can't even get any telemetry or log in to the instance.
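A back-of-the-envelope sketch of that failure mode, using the numbers from the example above (all values are illustrative and assume purely CPU-bound jobs sharing cores evenly):
```python
SOLO_MINUTES = 2            # runtime with a whole core to itself
CORES = 1                   # cores this batch workload actually gets
POOL_RECYCLE_MINUTES = 10   # pool drops connections older than this

for jobs in range(1, 9):
    # More runnable jobs than cores: each job gets CORES/jobs of a CPU.
    slowdown = max(1.0, jobs / CORES)
    wall_minutes = SOLO_MINUTES * slowdown
    warning = "  <-- at or beyond the pool's recycle window" \
        if wall_minutes >= POOL_RECYCLE_MINUTES else ""
    print(f"{jobs} concurrent jobs -> ~{wall_minutes:.0f} min each{warning}")
```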
(*) Let's ignore GPU-bound workloads for now; that would be a whole new question.