I maintain a Gentoo server with a few services, including Apache. It's fairly low-end (2GB of RAM and a low-end CPU with 2 cores). My problem is that, despite my best efforts, an overloaded Apache crashes the entire server. In fact, at this point I'm close to being convinced that Linux is a horrible operating system that isn't worth anyone's time if they're looking for stability under load.
Things I tried:
- Adjusting oom_adj for the root Apache process (and thus all its children). That had close to no effect. When Apache was overloaded it would bring the system to a grind, as the system paged out everything else before it got to kill anything.
- Turning off swap. Didn't help; the kernel would instead evict pages backed by process binaries and other files on /, causing the same effect.
- Putting it in a memory-limited cgroup (limited to 512 MB of RAM, 1/4th of the total). This "worked", at least in my own stress tests - except that the server kept crashing under load (basically stalling all other processes, becoming inaccessible via SSH, etc.)
- Running it with idle I/O priority. This wasn't a very good idea in the end, because it just caused the system load to climb indefinitely (into the thousands) with almost no visible effect - until you tried to access an unbuffered part of the disk. This caused the task to freeze. (So much for good I/O scheduling, eh?)
- Limiting the number of concurrent connections to Apache. Setting the number too low caused web sites to become unresponsive due to most slots being occupied with long requests (file downloads).
- I tried various Apache MPMs without much success (prefork, event, itk).
- Switching from prefork/event+php-cgi+suphp to itk+mod_php. This improved performance, but didn't solve the actual problem.
- Switching I/O schedulers (cfq to deadline).
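For reference, the memory-cgroup attempt above can be sketched like this (cgroup v1, as was current at the time; the paths and pidfile location are assumptions, and the `run()` wrapper only prints the commands rather than executing them):

```shell
#!/bin/sh
# Dry-run sketch: run() only prints each command. Drop the wrapper to
# execute for real (needs root and a kernel with the cgroup v1 memory
# controller).
run() { echo "+ $*"; }

LIMIT_MB=512                            # 1/4 of the 2 GB total, as above
LIMIT_BYTES=$((LIMIT_MB * 1024 * 1024))

run mkdir -p /sys/fs/cgroup/memory/apache
# cap RAM for the group:
run sh -c "echo $LIMIT_BYTES > /sys/fs/cgroup/memory/apache/memory.limit_in_bytes"
# also cap RAM+swap, or the group just pushes its overflow into swap:
run sh -c "echo $LIMIT_BYTES > /sys/fs/cgroup/memory/apache/memory.memsw.limit_in_bytes"
# move the root Apache process in; children inherit the group
# (the pidfile path is an assumption, adjust for your distro):
run sh -c "echo \$(cat /var/run/apache2.pid) > /sys/fs/cgroup/memory/apache/tasks"
```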
Just to stress this: I don't care if Apache itself goes down under load, I just want the rest of my system to remain stable. Of course, having Apache recover quickly after a brief period of intensive load would be great to have, but one step at a time.
Right now I am mostly dumbfounded by how humanity, in this day and age, can design an operating system where such a seemingly simple requirement (don't allow one system component to crash the entire system) seems practically impossible - or at least, very hard - to achieve.
Please don't suggest things like VMs or "BUY MORE RAM".
Some more information gathered with a friend's help: The processes hang when the cgroup oom killer is invoked. Here's the call trace:
```
[<ffffffff8104b94b>] ? prepare_to_wait+0x70/0x7b
[<ffffffff810a9c73>] mem_cgroup_handle_oom+0xdf/0x180
[<ffffffff810a9559>] ? memcg_oom_wake_function+0x0/0x6d
[<ffffffff810aa041>] __mem_cgroup_try_charge+0x32d/0x478
[<ffffffff810aac67>] mem_cgroup_charge_common+0x48/0x73
[<ffffffff81081c98>] ? __lru_cache_add+0x60/0x62
[<ffffffff810aadc3>] mem_cgroup_newpage_charge+0x3b/0x4a
[<ffffffff8108ec38>] handle_mm_fault+0x305/0x8cf
[<ffffffff813c6276>] ? schedule+0x6ae/0x6fb
[<ffffffff8101f568>] do_page_fault+0x214/0x22b
[<ffffffff813c7e1f>] page_fault+0x1f/0x30
```
At this point, the apache memory cgroup is practically deadlocked, and burning CPU in syscalls (all with the above call trace). This seems like a problem in the cgroup implementation...
I hate to say it, but you appear to be asking the wrong question.
It's not about stopping Apache from bringing down your server; it's about having your webserver serve more queries per second - enough so that you don't have a problem. Part of the answer to the reframed question is then limiting Apache so that it does not crash at high loads.
For the second part of that, Apache has some limits you can set, `MaxClients` being an important one. It limits how many children Apache is allowed to run. If you can take long-running requests (large file downloads, for example) off Apache, that frees up slots to serve PHP. If the downloads have to be verified by the PHP layer, they can still do that, then hand the file back out to a more optimised webserver for the static content, such as with nginx sendfile.
Meanwhile, forking a new PHP process on every single request - the slowest way to run PHP, as a CGI, whatever Apache MPM you may be using - also has the machine spending large amounts of time not running your code. mod_php is significantly more optimised.
PHP can handle huge amounts of traffic when Apache and the PHP layer are appropriately optimised. Yesterday (11th Dec 2010), for example, the pair of PHP servers that I run did almost 19 million hits in the 24hr period, most of that in the 7am-8pm window.
There are plenty of other questions here, and articles elsewhere, about optimising Apache and PHP. I think you need to read them first, before blaming Linux, Apache and PHP.
When you are dealing with a production Apache server, you MUST know your average process size, especially with PHP. I recommend you set `MaxClients` to RAM_DEDICATED_TO_APACHE / AVERAGE_MEMORY, where RAM_DEDICATED_TO_APACHE is itself an estimate: TOTAL_RAM minus the RAM the rest of the machine needs (and be generous with the database if you are running one on the same machine).

I really recommend you use Varnish. You can easily run two servers on different ports on the same machine, and route the static files to a specialized file (media) server (lighttpd, nginx) or an Apache instance with the worker MPM and no extra modules. And of course cache the static content with Varnish.

Splitting the load is important because otherwise you will be using the same amount of RAM to deliver any static file (which needs less than 1 MB).
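As a sketch of that formula (every number here is an assumption for a 2 GB machine; measure your own average child size, e.g. with ps or top, before using the result):

```shell
# Back-of-the-envelope MaxClients estimate. All values are assumptions,
# not measurements.
TOTAL_RAM_MB=2048
OTHER_SERVICES_MB=768      # OS, database, mail, ... (be generous)
AVG_APACHE_CHILD_MB=32     # typical-ish mod_php child; measure yours!

RAM_DEDICATED_TO_APACHE=$((TOTAL_RAM_MB - OTHER_SERVICES_MB))
MAX_CLIENTS=$((RAM_DEDICATED_TO_APACHE / AVG_APACHE_CHILD_MB))
echo "MaxClients $MAX_CLIENTS"   # prints "MaxClients 40"
```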
If you really need to make sure you never consume all the RAM, you can install a cron job running every 2 minutes (more or less, as you see fit) that checks free memory and restarts Apache when it drops below a threshold, logging each restart to "/var/log/apache-reboots.log". Adjust the threshold (50 MB, say) to taste, but keep it above 30 at least; you'll need some RAM to stop the server.

This is a very hackish (dirty) way of limiting your RAM, but it can be very helpful when you are not really sure about your average memory per Apache process. If you see several restarts in your log file ("/var/log/apache-reboots.log"), you should tune your Apache `MaxClients`, `MaxRequestsPerChild` and `ThreadsPerChild` to avoid future hard restarts. With time and tuning, you will have the exact configuration for your server.

A few general things you can try:
Have you tried changing /proc/sys/vm/overcommit_memory to 2? This means the kernel will not allocate more memory than swap plus a configurable percentage (/proc/sys/vm/overcommit_ratio) of available RAM.

In that case Apache will simply fail when it can't allocate RAM, but services already loaded, such as OpenSSH, will continue to function.
I should add that I have never tried this and only just discovered the setting. I would love to hear from anyone who knows more. Otherwise I will test this tomorrow, as I have exactly the same problem as described in the question.
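For reference, a sketch of how that would look in /etc/sysctl.conf (the values are illustrative; as noted above, I haven't tested this):

```
# /etc/sysctl.conf -- strict overcommit (illustrative values)
vm.overcommit_memory = 2   # commit limit = swap + overcommit_ratio% of RAM
vm.overcommit_ratio = 80   # percent of physical RAM counted toward the limit
```

Apply it without a reboot with `sysctl -p`.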
I found the problem...

Setting `oom_adj` to 15 for the whole memory-limited cgroup turned out to be very stupid: the adjusted score of every process in the cgroup ended up at 1000, so when the cgroup ran out of memory, the system killed random processes and generally misbehaved. I haven't had any system crashes since simply removing the line that set `oom_adj`.

this may be a little bit late, but I can say that blaming the OS is simply not the way to go. the OS is designed to meet the expectations of several different use-case scenarios; therefore, you MUST configure it to meet your requirements.
not only this, but if you are having so much load that the system is crashing, then you have to optimize your system, or expand your network.
while over-optimizing too early can make things painful later on, not optimizing anything at all from the very beginning can have the very same consequences. it's all about balance.
however, you claim your goal is to prevent the system from crashing... but then go on to say your solutions did not work. yet some of them did work; you just were not happy with the results.
when you run out of memory, you swap, or things crash. end of story. if you don't want that, you have to make sure you never run out of memory in the first place:

- optimize and fine-tune each service so it uses less memory
- limit how much each service is allowed to consume
- expand your hardware or your network

without careful optimization, fine tuning, and expansion... you cannot prevent all these things from happening.
in my experience, using a granular mix of all the above generally causes things to work out in the end.
first off, I use `apache2 + mpm_event + mod_fcgid`. i'd carefully configure just about every possible option apache has to configure. this might take one evening to do, and another to get right. but it will be worth it. I'd ensure that there is always one pool of workers ready to handle incoming connections, and let it grow, but cap this pool at some reasonable limit. this may sacrifice some speed, but results in stability.
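a sketch of the "one pool ready, capped growth" idea (directive names are for the 2.2-era event MPM; the values are illustrative, not tuned for a 2GB box):

```apache
# mpm_event sketch; values illustrative, measure and tune for your box
<IfModule mpm_event_module>
    StartServers          2
    MinSpareThreads      25     # keep a pool of workers ready
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxClients          100     # hard cap: stability over peak throughput
    MaxRequestsPerChild 1000    # recycle children to contain leaks
</IfModule>
```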
second, I use both cgroups and IO/CPU priority to schedule different groups of services at different priorities. anything that is 100% critical, which I always need access to, gets a reserved block of memory and a higher IO and CPU priority. i'd whip up a script that re-sets these priorities every hour or so, so that children inherit them even if their parent changes.

next is DNS, then web, then mail, in this order. this way, if something is misbehaving, more critical elements are favoured.
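a minimal sketch of such a priority script (service names and nice values are assumptions; adjust to your setup):

```shell
#!/bin/sh
# Re-assert CPU and I/O priorities for service groups. The service
# names and nice values below are assumptions, not recommendations.

prio_for() {
    case "$1" in
        named)   echo -10 ;;   # DNS: most critical
        apache2) echo -5 ;;    # web
        postfix) echo 0 ;;     # mail
        *)       echo 5 ;;     # everything else
    esac
}

for svc in named apache2 postfix; do
    nice=$(prio_for "$svc")
    for pid in $(pgrep -x "$svc" 2>/dev/null); do
        renice -n "$nice" -p "$pid" >/dev/null
        # rough mapping of the nice value onto a best-effort I/O level (0-7)
        ionice -c 2 -n "$((nice / 3 + 4))" -p "$pid"
    done
done
```

run it from cron (e.g. drop it in /etc/cron.hourly/) so re-parented or restarted children get re-tagged.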
using monitoring software, check whether things are online, and if not, restart them. if anything has been using more than X MB of memory for X cycles, and you cannot connect to the service (i.e. on http://...:80), kill the service and restart it. if it restarts more than X times in X cycles, time out (and notify for manual inspection). you might drop a few users occasionally, but at least your system remains stable!

third, if you have a dedicated server, i'd put all website services on a separate disk, and keep their IO operations mainly over a different controller.

fourth, check out apache modules like `mod_bw` and `mod_qos`. mod_bw can do more than just limit bandwidth per virtualhost, and mod_qos is a quality-of-service module that can help mitigate some issues. besides what you would expect from a full-fledged QoS module, it can help with things like preventing slow DoS attacks, limiting NULL connections, and it can even turn off keepalive when the server reaches a certain threshold of concurrent connections.
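a sketch of the mod_qos side of that (directive names are from mod_qos; the values are illustrative, check the module docs before using them):

```apache
# mod_qos sketch; values illustrative
<IfModule qos_module>
    QS_SrvMaxConnPerIP   30         # max concurrent connections per IP
    QS_SrvMaxConnClose  150         # disable keep-alive above 150 connections
    QS_SrvMinDataRate   120 1200    # minimum per-connection transfer rate
                                    # (bytes/sec), against slow-DoS clients
</IfModule>
```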
finally, I would set up some intelligent caching front ends, or a load balancer. for example: using a few VM instances, maybe use Varnish or nginx to cache static files upstream. this will offload all the open slots Apache requires for serving that static content.

I'm really not sure what you expect to happen when you get a lot of traffic. you want it to remain stable, but you don't want to lose any functionality under stress, you don't want to optimize anything, and you don't want to upgrade or extend your network?
well, if you don't want to CHANGE anything, how do you expect the problem to go away?