Ping a Specific Port

Question

Vladimir Panteleev

Asked: 2010-12-11 20:56:21 +0800 CST2010-12-11 20:56:21 +0800 CST 2010-12-11 20:56:21 +0800 CST

How to stop Apache from crashing my entire server?

772

I maintain a Gentoo server with a few services, including Apache. It's fairly low-end (2GB of RAM and a low-end CPU with 2 cores). My problem is that, despite my best efforts, an over-loaded Apache crashes the entire server. In fact, at this point I'm close to being convinced that Linux is a horrible operating system that isn't worth anyone's time looking for stability under load.

Things I tried:

Adjusting oom_adj for the root Apache process (and thus all its children). That had close to no effect. When Apache was overloaded it would bring the system to a grind, as the system paged out everything else before it got to kill anything.
Turning off swap. Didn't help, it would unload memory paged to binaries of processes and other files on /, thus causing the same effect.
Putting it in a memory-limited cgroup (limited to 512 MB of RAM, 1/4th of the total). This "worked", at least in my own stress tests - except the server keeps crashing under load (basically stalling all other processes, inaccessible via SSH, etc.)
Running it with idle I/O priority. This wasn't a very good idea in the end, because it just caused the system load to climb indefinitely (into the thousands) with almost no visible effect - until you tried to access an unbuffered part of the disk. This caused the task to freeze. (So much for good I/O scheduling, eh?)
Limiting the number of concurrent connections to Apache. Setting the number too low caused web sites to become unresponsive due to most slots being occupied with long requests (file downloads).
I tried various Apache MPMs without much success (prefork, event, itk).
Switching from prefork/event+php-cgi+suphp to itk+mod_php. This improved performance, but didn't solve the actual problem.
Switching I/O schedulers (cfq to deadline).

Just to stress this out: I don't care if Apache itself goes down under load, I just want the rest of my system to remain stable. Of course, having Apache recover quickly after a brief period of intensive load would be great to have, but one step at a time.

Right now I am mostly dumbfounded by how can humanity, in this day and age, design an operating system where such a seemingly simple task (don't allow one system component to crash the entire system) seems practically impossible - or at least, very hard to do.

Please don't suggest things like VMs or "BUY MORE RAM".

Some more information gathered with a friend's help: The processes hang when the cgroup oom killer is invoked. Here's the call trace:

[<ffffffff8104b94b>] ? prepare_to_wait+0x70/0x7b
[<ffffffff810a9c73>] mem_cgroup_handle_oom+0xdf/0x180
[<ffffffff810a9559>] ? memcg_oom_wake_function+0x0/0x6d
[<ffffffff810aa041>] __mem_cgroup_try_charge+0x32d/0x478
[<ffffffff810aac67>] mem_cgroup_charge_common+0x48/0x73
[<ffffffff81081c98>] ? __lru_cache_add+0x60/0x62
[<ffffffff810aadc3>] mem_cgroup_newpage_charge+0x3b/0x4a
[<ffffffff8108ec38>] handle_mm_fault+0x305/0x8cf
[<ffffffff813c6276>] ? schedule+0x6ae/0x6fb
[<ffffffff8101f568>] do_page_fault+0x214/0x22b
[<ffffffff813c7e1f>] page_fault+0x1f/0x30

At this point, the apache memory cgroup is practically deadlocked, and burning CPU in syscalls (all with the above call trace). This seems like a problem in the cgroup implementation...

6 Answers

Voted

Alister Bulman · Answer 1 · 2010-12-13T10:07:15+08:00

I hate to say it, but you appear to be asking the wrong question.

It's not about stopping Apache from bringing down your server, it's about having your webserver serve more queries per second - enough so that you don't have a problem. A part of the answer to the reframed question is then limiting Apache so that it does not crash at high loads.

For the second part of that, Apache has some limits you can set - MaxClients being an important configuration. This limits how many children it's allowed to run. If you can take load off Apache for long-running processes (large files being downloaded for example), that's another slot in Apache to be able to serve PHP. If the file downloads have to be verified by the PHP layer, they can still do that, and pass back out to a more optimised webserver for the static content, such as with NginX sendfile

Meanwhile, forking Apache every on every single request for the slowest way to run PHP - as a CGI (whatever apache MPM you may be using) - is also having the machine spend large amounts of time not running your code. mod_php is significantly more optimised.

PHP can do huge amounts of traffic when Apache and the PHP layer are appropriately optimised. Yesterday, 11th Dec 2010, for example, the pair of PHP servers that I run did almost 19 Million hits in the 24hr period, and most of that in the 7am-8pm time-period.

There are plenty of other questions here, and articles elsewhere about optimising Apache and PHP, I think you need to read them first, before blaming Linux/Apache & PHP.

cyraxjoe · Answer 2 · 2010-12-14T12:09:58+08:00

When you are dealing with a production apache server, you MUST have an average process size, especially with php, I'll recommend you to:

Check your process averages memory consumption
Adjust MaxClients to AVERAGE_MEMORY / RAM_DEDICATED_TO_APACHE

Where RAM_DEDICATED_TO_APACHE it must be another estimation of the TOTAL_RAM minus the ram that needs the rest of the machine (and be generous with the database if you are running one in the same machine).

I really recommend you to use Varnish, you can easily run 2 servers on different ports on the save machine, and route the static files to an specialized file (media) server (lighthttpd, nginx) or an apache instance with worker and no extra modules. And of course catch the static content with varnish.

Split the load is important because you will be using the same amount of ram to deliver any static file (which needs less than 1MB) if you don't do it.

If you really need to make sure to never consume all the ram, you can install a new cronjob running each 2 minutes (less or more as you consider) with the following line, adjusting the 50 to any amount of the lowest ram, and keep this number above 30 at least; you'll need some ram to stop the server.

vmstat -S M | tail -n 1 | awk 'BEGIN{ "date" | getline date }{if($4 + $6 < 50){ system("/etc/init.d/httpd stop"); system("/etc/init.d/httpd start"); print "Rebooting apache  on " date >> "/var/log/apache-reboots.log"}}'

This is a very hakish (dirty) way of limit you ram, but it can be very helpful when you are not really sure about your average memory per apache process, and if you see several reboots in you log file ("/var/log/apache-reboots.log"), the you should tune your apache MaxClients, MaxRequestsPerChild, ThreadsPerChild to avoid futures hard-reboots, with the time and tunning, you will have the exact configuration for your server.

uesp · Answer 3 · 2010-12-13T07:51:29+08:00

A few general things you can try:

It is hard to tell from your description whether Apache/Linux is actually crashing or just becoming severely overloaded. I would suspect that you just have a server that has such a high load that the only effective course of action is cycling the power. I would approach the problem as an overloaded server unless there is specific evidence of an actual crash. If you optimize the server's performance but it still actually crashes then you can work to find and address that problem.
You generally never want your server to get in a condition of it regularly hitting the swap, especially any Apache instances. You can quickly get into a run away load situation where the server is running fine but when the traffic increases a few percent it starts to use the swap and the load sky rockets causing the site to be slow or inaccessible. To prevent Apache from use the swap reduce the number of max clients/connections and/or reduce the memory use by disabling any modules that are not needed. See also the next point.
You mention that connections in Apache are being used by long requests like file downloads. To help reduce this problem you can use a second web server (like lighttp) setup to serve just static content which Apache forwarding/redirecting requests to it. This frees up connections in Apache to do the heavy work and let you reduce the number of max clients/connections.
If you need to prevent a DoS, whether on purpose or accidental, there are various Apache modules you can install and setup. I've used mod_evasive and mod_limitipconn for example which worked well enough to prevent the less malicious types of DoS.
Don't dismiss optimizing Apache or other parts of the OS or application. Computers are very good at doing exactly what you tell them so if your Apache settings say "use more resources than this server has" then it will do exactly that. Like a lot of software, Apache is meant to work well with a huge range of hardware and applications but needs to be correctly setup for both. The default configuration only works well for a simple, low traffic web site.
With a little tuning you should be able to find a balance where the server gets to a high load but is still responsive enough to log in and check on it. At this point your options are either to profile and optimize the application, including adding caching layers, or get better hardware. This step should be after getting Apache setup correctly.

ollybee · Answer 4 · 2010-12-13T15:02:38+08:00

ollybee

2010-12-13T15:02:38+08:002010-12-13T15:02:38+08:00

Have you tried changing /proc/sys/vm/overcommit_memory to 2 ? This means the kenel will not allocate morememory than swap + a configurable percentage (proc/sys/vm/overcommit_ratio) of available ram.

In the case Apache will just fail as it't can't allocate the ram but services already loaded such as openSSH will continue to function.

I should add I have never tried this and just discovered this setting now. I would love to hear from any one who knows more. Otherwise I will test this tomorrow as I have exactly the same problem as described in the question.

1

Vladimir Panteleev · Answer 5 · 2011-01-13T22:36:03+08:00

Vladimir Panteleev

2011-01-13T22:36:03+08:002011-01-13T22:36:03+08:00

I found the problem...

Setting oom_adj to 15 for the whole memory-limited cgroup turned out to be very stupid. The adjusted score of all processes in the cgroup ended up all being 1000 - so when the cgroup ran out of memory, the system killed random processes and generally misbehaved.

I haven't had any system crashes after simply removing the line that set oom_adj.

1

RapidWebs · Answer 6 · 2014-07-08T02:50:58+08:00

this may be a little bit late, but I can say that blaming the OS is simply not the way to go. the OS is designed to meet the expectations of several different use case scenarios, therefore, you MUST configure it to meet your requirements.

not only this, but if you are having so much load that the system is crashing, then you have to optimize your system, or expand your network.

while over-optimizing too early can make things painful later on, not optimizing anything at all from the very beginning can have the very same consequences. it's all about balance.

however, you claim your goal is to prevent the system from crashing.. but then go on to say your solutions did not work. but some of them did work, you just were not happy with the results.

when you run out of memory, you swap. or things crash. end of story. if you don't want to swap, you have to:

a) limit your incoming connections. this has the effect of either turning people away

b) send them to the backlog. which has the effect of causing the site to slow, or die.

c) buy more memory. which you did not want to do

d) scale your network. which you also did not want to do

e) load balance. which is much like 'D'

without careful optimization, fine tuning, and expansion.. you cannot prevent all these things from happening.

in my experience, I learned that by using a granular mix of all the above generally caused things to work out in the end.

first off, I use apache2 + mpm_event + mod_fcgid. i'd carefully configure just about every possible option apache has to configure. this might take one evening to do, and another to get right. but it will be worth it.

I'd ensure that there is always one pool of workers ready to handle incoming connections, and let it grow, but cap this pool at some reasonable limit. this may sacrifice some speed, but results in stability.

second, I use both CGroups and IO Priority / CPU Priotiy to schedule different groups of services for different priorities.

anything that is 100% critical, which I always need access to, they are reserved a block of memory, and is set a higher IO and CPU priority. i'd whip up a script that sets these priorities every hour or so, so that children will inherit these priorities if their parent changes.

next is DNS, then Web, then Mail. in this order. this way, if something is misbehaving, more critical elements are favoured.

using monitor software, check if things are online, and if not, restart them. if anything has been using more than X MB of memory, for X Cycles... and you cannot connect to the service (i.e. on http://...:80) kill the service, and restart it. if it restarts more than X times in X cycles, time out (and notify for manual inspection). you might drop a few users occasionally, but atleast your system remains stable!

third, if you have a dedicated server, id put all website services on a separate disk. keep IO operations mainly over a different controller.

fourth, check out apache modules like mod_bw and mod_qos. mod_bw can do more than just limit bandwidth per virtualhost, and mod_qos... this is a quality of service module that can help mitigate some issues.

besides what you would expect from a full fledges QoS module, it can help with things like DoS preventing slowdos, limit NULL connections, and it can even turn off keepalive when the server reach a certain threshold of concurrent connections.

finally, I would set-up some intelligent caching front ends, or a load balancer. for example: using a few VM Instances, maybe use Varnish or NGinx, cache static files upstream. this will offload all the open slots Apache requires for serving that static content.

I'm really not sure what you expect to happen when you get alot of traffic. you want it to both remain stable, but you don't want to loose any functionality under stress, and you don't want to optimize anything, and you don't want to upgrade or extend your network?

well, if you don't want to CHANGE anything, how do you expect the problem to go away?

How to stop Apache from crashing my entire server?

Ping a Specific Port

How do I tell Git for Windows where to find my private RSA key?

How do you restart php-fpm?

What's the default superuser username/password for postgres after a new install?

What port does SFTP use?

Resolve host name from IP address

How can I sort du -h output by size

Command line to list users in a Windows Active Directory group?

What is a Pem file and how does it differ from other OpenSSL Generated Key File Formats?

How to determine if a bash variable is empty?