The OOM killer on Linux wreaks havoc with various applications every so often, and it appears that not much is really done on the kernel development side to improve this. Would it not be better, as a best practice when setting up a new server, to reverse the default on memory overcommitting, that is, turn it off (vm.overcommit_memory=2) unless you know you want it on for your particular use? And what would those use cases be where you know you want overcommitting on?
As a bonus, since the behaviour in the case of vm.overcommit_memory=2 depends on vm.overcommit_ratio and on swap space, what would be a good rule of thumb for sizing the latter two so that this whole setup keeps working reasonably?
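(For context: with vm.overcommit_memory=2 the kernel caps committed address space at roughly swap + RAM * vm.overcommit_ratio / 100, reported as CommitLimit in /proc/meminfo. A minimal sketch of that arithmetic, with made-up machine sizes, just to make the relationship concrete:)

```c
/* Rough sketch of how the kernel derives CommitLimit when
 * vm.overcommit_memory=2 is set (see Documentation/vm/overcommit-accounting).
 * The machine sizes below are made up; compare the result against the
 * CommitLimit line in /proc/meminfo on a real host. */
#include <stdio.h>

int main(void)
{
    unsigned long ram_kib   = 16UL * 1024 * 1024; /* 16 GiB physical RAM */
    unsigned long swap_kib  =  8UL * 1024 * 1024; /*  8 GiB swap */
    unsigned long ratio_pct = 50;                 /* default vm.overcommit_ratio */

    /* CommitLimit = swap + RAM * overcommit_ratio / 100 */
    unsigned long commit_limit_kib = swap_kib + ram_kib * ratio_pct / 100;

    printf("CommitLimit: %lu kB (%.1f GiB)\n",
           commit_limit_kib, commit_limit_kib / (1024.0 * 1024.0));
    return 0;
}
```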
An interesting analogy can be found at http://lwn.net/Articles/104179/.
The OOM killer only wreaks havoc if you have overloaded your system. Give it enough swap, and don't run applications that suddenly decide to eat massive amounts of RAM, and you won't have a problem.
To specifically answer your questions: turning off overcommit moves the failure to allocation time, with brk(2) (and the wrappers that use it, such as malloc(3)) returning an error. When I experimented with this at my previous job, it was deemed to be more of a hassle to get everything capable of handling out-of-memory errors than it was just to deal with the consequences of an OOM (which, in our case, was far worse than having to restart the occasional service if an OOM occurred -- we had to reboot an entire cluster, because GFS is a steaming pile of faeces).

Basically, my experience is that turning off overcommit is a nice experiment that rarely works as well in practice as it sounds in theory. This nicely corresponds with my experiences with other tunables in the kernel -- the Linux kernel developers are almost always smarter than you, and the defaults work best for the vast, vast majority of cases. Leave them alone, and instead go find which process has the leak and fix it.
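To make "handling out-of-memory errors" concrete, this is roughly what every allocation site has to be prepared for once allocations can legitimately fail; xmalloc() here is just a hypothetical wrapper, not part of libc:

```c
/* Sketch of what handling allocation failure looks like under strict
 * accounting: every call site needs a recovery path. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        /* With vm.overcommit_memory=2 this branch becomes reachable well
         * before the machine is truly out of RAM; the hard part is deciding
         * what a sensible recovery is for each caller. */
        fprintf(stderr, "allocating %zu bytes failed: %s\n", n, strerror(errno));
        exit(EXIT_FAILURE);   /* or: drop caches, shed load, retry later... */
    }
    return p;
}

int main(void)
{
    char *buf = xmalloc(1 << 20);   /* 1 MiB */
    buf[0] = '\0';
    free(buf);
    return 0;
}
```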
Hmm, I'm not fully convinced by the arguments in favour of overcommit and the OOM killer... When womble writes,
"The OOM killer only wreaks havoc if you have overloaded your system. Give it enough swap, and don't run applications that suddenly decide to eat massive amounts of RAM, and you won't have a problem."
he's describing a scenario where overcommit and the OOM killer are not enforced, or don't 'really' act (if all applications allocated memory as needed, and there were enough virtual memory to allocate, memory writes would closely follow memory allocations without errors, so we couldn't really speak of an overcommitted system even if an overcommit strategy were enabled). That is close to an implicit admission that overcommit and the OOM killer work best when their intervention is not needed, which seems to be shared by most supporters of this strategy, as far as I can tell (and I admit I cannot tell much...). Moreover, referring to applications with specific behaviours when preallocating memory makes me think that specific handling could be tuned at the distribution level, instead of having a default, system-wide approach based on heuristics (personally, I believe heuristics are not a very good approach for kernel stuff).
As far as the JVM is concerned: well, it's a virtual machine, and to some extent it needs to allocate all the resources it needs on startup, so it can create its 'fake' environment for its applications and keep its available resources separated from the host environment as far as possible. Thus, it may be preferable to have it fail on startup, instead of after a while as a consequence of an 'external' OOM condition (caused by overcommit, the OOM killer, or whatever), or anyway suffering from such a condition interfering with its own internal OOM-handling strategies (in general, a VM should get any required resources from the beginning, and the host system should 'ignore' them until the end, the same way any amount of physical RAM shared with a graphics card is never -- and cannot be -- touched by the OS).
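As a rough sketch of that 'fail at startup' idea (the size and flags here are purely illustrative): reserve and pre-touch the whole working set up front, so that a shortfall surfaces immediately rather than mid-run:

```c
/* Illustrative only: reserve and pre-fault a fixed arena at startup, so a
 * memory shortfall shows up here rather than minutes later inside the
 * workload. The 2 GiB size is arbitrary. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t arena_size = (size_t)2 * 1024 * 1024 * 1024;

    /* MAP_POPULATE asks the kernel to fault the pages in immediately. */
    void *arena = mmap(NULL, arena_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");           /* fail fast, JVM-style, at startup */
        return EXIT_FAILURE;
    }

    memset(arena, 0, arena_size); /* dirty every page so nothing is deferred */
    /* ... run the real work out of this arena ... */
    munmap(arena, arena_size);
    return 0;
}
```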
About Apache, I doubt that having the whole server occasionally killed and restarted is better than letting a single child, along with a single connection, fail from its (= the child's / the connection's) beginning (as if it were a whole new instance of the JVM, created after another instance had run for a while). I guess the best 'solution' might depend on the specific context. For instance, considering an e-commerce service, it might be far preferable to have a few shopping-cart connections occasionally fail at random than to lose the whole service, with the risk, for instance, of interrupting an ongoing order finalisation, or (maybe worse) a payment process, with all the consequences of the case (maybe harmless, but maybe harmful -- and for sure, when problems arise, they would be worse than an unreproducible error condition for debugging purposes).
In the same way, on a workstation the process which consumes the most resources, and is thus likely to be the first choice of the OOM killer, could be a memory-intensive application, such as a video transcoder or rendering software, and likely the only application the user wants to be left untouched. These considerations suggest to me that the OOM killer's default policy is too aggressive. It uses a 'worst fit' approach which is somehow similar to that of some filesystems (the OOMK tries to free as much memory as it can, while reducing the number of killed processes, in order to prevent any further intervention in the short term, just as a filesystem can allocate more disk space than actually needed for a certain file, to prevent further allocation if the file grows and thus, to some extent, to prevent fragmentation).
However, I think that the opposite policy, such as a 'best fit' approach, could be preferable: free exactly the memory needed at a certain point, and don't bother with 'big' processes, which might well be wasting memory, but also might not -- and the kernel cannot know that (hmm, I can imagine that keeping track of page access counts and times could hint at whether a process is allocating memory it doesn't need any more, so as to guess whether a process is wasting memory or just using a lot of it; but access delays should be weighted against CPU cycles to distinguish a memory-wasting application from a memory- and CPU-intensive one, and, besides being potentially inaccurate, such a heuristic could have excessive overhead).
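As an aside, the ranking the OOM killer currently uses is visible from userspace: each process exposes its 'badness' value in /proc/<pid>/oom_score, and the highest score is killed first. A minimal reader, just to show where that number lives (the PID argument is whatever you want to inspect):

```c
/* Sketch: read the kernel's current "badness" ranking for one process.
 * /proc/<pid>/oom_score is what the OOM killer compares; the highest
 * score is killed first. Defaults to PID 1 if no argument is given. */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *pid = (argc > 1) ? argv[1] : "1";
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/oom_score", pid);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return 1;
    }

    long score;
    if (fscanf(f, "%ld", &score) == 1)
        printf("pid %s oom_score = %ld\n", pid, score);
    fclose(f);
    return 0;
}
```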
Moreover, it might not be true that killing the fewest possible processes is always a good choice. On a desktop system (think of a nettop or a netbook with limited resources, for example), a user might be running a browser with several tabs (thus memory-consuming -- let's assume this is the first choice for the OOMK), plus a few other applications (a word processor with unsaved data, a mail client, a PDF reader, a media player, ...), plus a few (system) daemons, plus a few file manager instances. Now, an OOM error happens, and the OOMK chooses to kill the browser while the user is doing something deemed 'important' over the net... the user would be disappointed. On the other hand, closing the few idle file manager instances could free the exact amount of memory needed, while keeping the system not only working, but working more reliably.
Anyway, I think the user should be able to make his own decision about what to do. On a desktop (= interactive) system, that should be relatively easy, provided enough resources are reserved to ask the user to close an application (even closing a few tabs could be enough) and to handle his choice (one option could be to create an additional swap file, if there is enough space). For services (and in general), I'd also consider two further possible enhancements. One is logging OOM killer interventions, as well as process start/fork failures, in such a way that the failure can easily be debugged (for instance, an API could inform the process issuing the process creation or fork -- thus a server like Apache, with a proper patch, could provide better logging of certain errors); this could be done independently of overcommit/the OOMK being in effect. The second, but not second in importance, is a mechanism to fine-tune the OOMK algorithm -- I know it is possible, to some extent, to define a specific policy on a process-by-process basis, but I'd aim for a 'centralised' configuration mechanism, based on one or more lists of application names (or IDs) to identify the relevant processes and give them a certain degree of importance (as per the listed attributes); such a mechanism should (or at least could) also be layered, so that there could be a top-level user-defined list, a system- (distribution-) defined list, and (at the bottom level) application-defined entries (so, for instance, a DE file manager could instruct the OOMK to safely kill any of its instances, since the user can safely reopen one to access the lost file view -- whereas any important operation, such as moving/copying/creating data, could be delegated to a more 'privileged' process).
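For reference, the per-process knob that exists today is /proc/<pid>/oom_score_adj. A minimal sketch of a process volunteering itself as a cheap kill -- the value 500 is arbitrary, and the file-manager scenario is just the example from above:

```c
/* Sketch of the existing per-process knob: writing a value in the range
 * -1000..+1000 to /proc/self/oom_score_adj biases the OOM killer's choice
 * (-1000 effectively exempts the process, +1000 makes it the preferred
 * victim; lowering it below 0 requires CAP_SYS_RESOURCE). */
#include <stdio.h>
#include <stdlib.h>

static int set_oom_score_adj(int adj)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (f == NULL)
        return -1;
    int rc = (fprintf(f, "%d\n", adj) > 0) ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

int main(void)
{
    /* Example policy: an easily restartable helper (say, an idle file
     * manager view) marks itself as a cheap kill. 500 is arbitrary. */
    if (set_oom_score_adj(500) != 0) {
        perror("oom_score_adj");
        return EXIT_FAILURE;
    }
    /* ... do the disposable work ... */
    return 0;
}
```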
Moreover, an API could be provided to allow applications to raise or lower their 'importance' level at run-time (with respect to memory management, and regardless of execution priority), so that, for instance, a word processor could start with a low 'importance' and raise it while some data is held before being flushed to a file, or while a write operation is being performed, then lower it again once that operation ends (analogously, a file manager could change its level when it passes from just listing files to dealing with data, and vice versa, instead of using separate processes, and Apache could give different levels of importance to different children, or change a child's state according to some policy decided by sysadmins and exposed through Apache's -- or any other kind of server's -- settings). Of course, such an API could and would be abused/misused, but I think that's a minor concern compared to the kernel arbitrarily killing processes to free memory without any relevant information on what's going on in the system (memory consumption, time of creation and the like aren't relevant or 'validating' enough for me) -- only users, admins and program writers can really determine whether a process is 'still needed' for some reason, what that reason is, and/or whether the application is in a state that would lead to data loss or other damage/trouble if killed. However, some assumptions could still be made: for instance, looking for resources of a certain kind (file descriptors, network sockets, etc.) acquired by a process and with pending operations could tell whether a process should be in a higher 'state' than the one set, or whether its 'self-established' one is higher than needed and can be lowered (an aggressive approach, unless superseded by the user's choices, such as forcing a certain state, or asking -- through the lists I mentioned above -- to respect the application's choices).
Or, just avoid overcommitting and let the kernel do exactly what a kernel must do: allocating resources (but not rescinding them arbitrarily as the OOM killer does), scheduling processes, preventing starvation and deadlocks (or rescuing from them), ensuring full preemption and memory-space separation, and so on...
I'd also spend a few more words on overcommit approaches. From other discussions I've got the idea that one of the main concerns about overcommit (both as a reason to want it and as a source of possible trouble) is the handling of forks: honestly, I don't know exactly how the copy-on-write strategy is implemented, but I think any aggressive (or optimistic) policy might be mitigated by a swap-like locality strategy. That is, instead of just cloning (and adjusting) a forked process's code pages and scheduling structures, a few other data pages could be copied before an actual write occurs, chosen among the pages the parent process has accessed for writing most frequently (that is, using a counter of write operations).
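To make the fork concern concrete, here is a small illustration (sizes are arbitrary) of why fork() is the classic argument for overcommit: the child shares the parent's pages copy-on-write and usually dirties only a handful of them, yet strict accounting has to reserve a full copy up front:

```c
/* Illustration of fork() and copy-on-write: the child shares the parent's
 * pages and dirties only one of them, yet strict accounting (mode 2) must
 * reserve a full copy up front and may fail this fork() with ENOMEM. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t big = (size_t)512 * 1024 * 1024;   /* 512 MiB working set */
    char *heap = malloc(big);
    if (heap == NULL) { perror("malloc"); return 1; }
    memset(heap, 1, big);                     /* make the pages real */

    pid_t child = fork();                     /* no pages copied yet (CoW) */
    if (child < 0) {
        perror("fork");                       /* with mode 2, ENOMEM can appear here */
        return 1;
    }
    if (child == 0) {
        heap[0] = 2;                          /* child dirties a single page ... */
        execlp("true", "true", (char *)NULL); /* ... and typically exec()s anyway */
        _exit(127);
    }
    waitpid(child, NULL, 0);
    free(heap);
    return 0;
}
```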
Everything, of course, IMHO.
Credit: Linux kernel is starting the OOM killer