We've been running into the OOM killer despite nearly half of our memory being used for the FS cache. We've been logging memory stats once per minute (as reported by top), but there seems to be plenty of availability.
...
Mem: 15339640k total, 15268304k used, 71336k free, 3152k buffers
Swap: 0k total, 0k used, 0k free, 6608384k cached
Mem: 15339640k total, 14855280k used, 484360k free, 13748k buffers
Swap: 0k total, 0k used, 0k free, 6481852k cached
[OOM killer: postgres killed]
Mem: 15339640k total, 8212200k used, 7127440k free, 32776k buffers
Swap: 0k total, 0k used, 0k free, 2394444k cached
...
OOM details from syslog:
...
Jun 10 05:45:25 db kernel: [11209156.840462] wal-e invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jun 10 05:45:25 db kernel: [11209156.840469] wal-e cpuset=/ mems_allowed=0
Jun 10 05:45:25 db kernel: [11209156.840474] Pid: 7963, comm: wal-e Not tainted 3.2.0-43-virtual #68-Ubuntu
Jun 10 05:45:25 db kernel: [11209156.840477] Call Trace:
Jun 10 05:45:25 db kernel: [11209156.840498] [<ffffffff81119711>] dump_header+0x91/0xe0
Jun 10 05:45:25 db kernel: [11209156.840502] [<ffffffff81119a95>] oom_kill_process+0x85/0xb0
Jun 10 05:45:25 db kernel: [11209156.840506] [<ffffffff81119e3a>] out_of_memory+0xfa/0x220
Jun 10 05:45:25 db kernel: [11209156.840511] [<ffffffff8111f823>] __alloc_pages_nodemask+0x8c3/0x8e0
Jun 10 05:45:25 db kernel: [11209156.840520] [<ffffffff81216e00>] ? noalloc_get_block_write+0x30/0x30
Jun 10 05:45:25 db kernel: [11209156.840528] [<ffffffff811566c6>] alloc_pages_current+0xb6/0x120
Jun 10 05:45:25 db kernel: [11209156.840534] [<ffffffff81116637>] __page_cache_alloc+0xb7/0xd0
Jun 10 05:45:25 db kernel: [11209156.840539] [<ffffffff81118602>] filemap_fault+0x212/0x3c0
Jun 10 05:45:25 db kernel: [11209156.840553] [<ffffffff81138c32>] __do_fault+0x72/0x550
Jun 10 05:45:25 db kernel: [11209156.840557] [<ffffffff8113c2ea>] handle_pte_fault+0xfa/0x200
Jun 10 05:45:25 db kernel: [11209156.840562] [<ffffffff8100638e>] ? xen_pmd_val+0xe/0x10
Jun 10 05:45:25 db kernel: [11209156.840567] [<ffffffff81005309>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jun 10 05:45:25 db kernel: [11209156.840571] [<ffffffff8113d559>] handle_mm_fault+0x269/0x370
Jun 10 05:45:25 db kernel: [11209156.840576] [<ffffffff8100a56d>] ? xen_force_evtchn_callback+0xd/0x10
Jun 10 05:45:25 db kernel: [11209156.840581] [<ffffffff8100ad42>] ? check_events+0x12/0x20
Jun 10 05:45:25 db kernel: [11209156.840589] [<ffffffff8165b3cb>] do_page_fault+0x14b/0x520
Jun 10 05:45:25 db kernel: [11209156.840594] [<ffffffff81160d64>] ? kmem_cache_free+0x104/0x110
Jun 10 05:45:25 db kernel: [11209156.840600] [<ffffffff811ba2c8>] ? ep_remove+0xa8/0xc0
Jun 10 05:45:25 db kernel: [11209156.840604] [<ffffffff811bb133>] ? sys_epoll_ctl+0xb3/0x3d0
Jun 10 05:45:25 db kernel: [11209156.840614] [<ffffffff81658035>] page_fault+0x25/0x30
Jun 10 05:45:25 db kernel: [11209156.840617] Mem-Info:
Jun 10 05:45:25 db kernel: [11209156.840618] Node 0 DMA per-cpu:
Jun 10 05:45:25 db kernel: [11209156.840622] CPU 0: hi: 0, btch: 1 usd: 0
Jun 10 05:45:25 db kernel: [11209156.840624] CPU 1: hi: 0, btch: 1 usd: 0
Jun 10 05:45:25 db kernel: [11209156.840627] CPU 2: hi: 0, btch: 1 usd: 0
Jun 10 05:45:25 db kernel: [11209156.840629] CPU 3: hi: 0, btch: 1 usd: 0
Jun 10 05:45:25 db kernel: [11209156.840631] Node 0 DMA32 per-cpu:
Jun 10 05:45:25 db kernel: [11209156.840634] CPU 0: hi: 186, btch: 31 usd: 30
Jun 10 05:45:25 db kernel: [11209156.840637] CPU 1: hi: 186, btch: 31 usd: 47
Jun 10 05:45:25 db kernel: [11209156.840639] CPU 2: hi: 186, btch: 31 usd: 15
Jun 10 05:45:25 db kernel: [11209156.840641] CPU 3: hi: 186, btch: 31 usd: 2
Jun 10 05:45:25 db kernel: [11209156.840643] Node 0 Normal per-cpu:
Jun 10 05:45:25 db kernel: [11209156.840646] CPU 0: hi: 186, btch: 31 usd: 0
Jun 10 05:45:25 db kernel: [11209156.840648] CPU 1: hi: 186, btch: 31 usd: 14
Jun 10 05:45:25 db kernel: [11209156.840650] CPU 2: hi: 186, btch: 31 usd: 0
Jun 10 05:45:25 db kernel: [11209156.840653] CPU 3: hi: 186, btch: 31 usd: 1
Jun 10 05:45:25 db kernel: [11209156.840658] active_anon:3616567 inactive_anon:4798 isolated_anon:0
Jun 10 05:45:25 db kernel: [11209156.840660] active_file:98 inactive_file:168 isolated_file:20
Jun 10 05:45:25 db kernel: [11209156.840661] unevictable:1597 dirty:73 writeback:0 unstable:0
Jun 10 05:45:25 db kernel: [11209156.840662] free:16921 slab_reclaimable:17631 slab_unreclaimable:7534
Jun 10 05:45:25 db kernel: [11209156.840663] mapped:1614529 shmem:1613928 pagetables:124012 bounce:0
Jun 10 05:45:25 db kernel: [11209156.840666] Node 0 DMA free:7888kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7632kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 10 05:45:25 db kernel: [11209156.840681] lowmem_reserve[]: 0 4016 15112 15112
Jun 10 05:45:25 db kernel: [11209156.840686] Node 0 DMA32 free:48368kB min:4176kB low:5220kB high:6264kB active_anon:3776804kB inactive_anon:28kB active_file:0kB inactive_file:20kB unevictable:932kB isolated(anon):0kB isolated(file):0kB present:4112640kB mlocked:932kB dirty:0kB writeback:0kB mapped:1458536kB shmem:1458632kB slab_reclaimable:17604kB slab_unreclaimable:8088kB kernel_stack:1872kB pagetables:190616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:437 all_unreclaimable? yes
Jun 10 05:45:25 db kernel: [11209156.840698] lowmem_reserve[]: 0 0 11095 11095
Jun 10 05:45:25 db kernel: [11209156.840703] Node 0 Normal free:11428kB min:11548kB low:14432kB high:17320kB active_anon:10689464kB inactive_anon:19164kB active_file:528kB inactive_file:652kB unevictable:5456kB isolated(anon):0kB isolated(file):80kB present:11362176kB mlocked:5456kB dirty:292kB writeback:0kB mapped:4999580kB shmem:4997080kB slab_reclaimable:52920kB slab_unreclaimable:22048kB kernel_stack:2584kB pagetables:305432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1974 all_unreclaimable? yes
Jun 10 05:45:25 db kernel: [11209156.840715] lowmem_reserve[]: 0 0 0 0
Jun 10 05:45:25 db kernel: [11209156.840720] Node 0 DMA: 2*4kB 3*8kB 1*16kB 3*32kB 3*64kB 3*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 7888kB
Jun 10 05:45:25 db kernel: [11209156.840752] Node 0 DMA32: 5813*4kB 2636*8kB 114*16kB 15*32kB 5*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 48372kB
Jun 10 05:45:25 db kernel: [11209156.840776] Node 0 Normal: 1888*4kB 10*8kB 46*16kB 4*32kB 3*64kB 2*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 11760kB
Jun 10 05:45:25 db kernel: [11209156.840788] 1615243 total pagecache pages
Jun 10 05:45:25 db kernel: [11209156.840790] 0 pages in swap cache
Jun 10 05:45:25 db kernel: [11209156.840801] Swap cache stats: add 0, delete 0, find 0/0
Jun 10 05:45:25 db kernel: [11209156.840803] Free swap = 0kB
Jun 10 05:45:25 db kernel: [11209156.840805] Total swap = 0kB
Jun 10 05:45:25 db kernel: [11209156.909794] 3934192 pages RAM
Jun 10 05:45:25 db kernel: [11209156.909804] 99282 pages reserved
Jun 10 05:45:25 db kernel: [11209156.909809] 18899146 pages shared
Jun 10 05:45:25 db kernel: [11209156.909811] 2198511 pages non-shared
Jun 10 05:45:25 db kernel: [11209156.909817] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jun 10 05:45:25 db kernel: [11209156.909835] [ 332] 0 332 4308 109 1 0 0 upstart-udev-br
Jun 10 05:45:25 db kernel: [11209156.909845] [ 346] 0 346 5384 271 2 -17 -1000 udevd
Jun 10 05:45:25 db kernel: [11209156.909851] [ 408] 0 408 5364 174 2 -17 -1000 udevd
...
Jun 10 05:45:25 db kernel: [11209156.910703] [ 7963] 111 7963 17456 2966 0 0 0 wal-e
Jun 10 05:45:25 db kernel: [11209156.910707] [ 7968] 111 7968 1639372 2351 3 0 0 postgres
Jun 10 05:45:25 db kernel: [11209156.910711] [ 7969] 111 7969 1639371 1934 2 0 0 postgres
Jun 10 05:45:25 db kernel: [11209156.910716] Out of memory: Kill process 12443 (postgres) score 418 or sacrifice child
Jun 10 05:45:25 db kernel: [11209156.910733] Killed process 12443 (postgres) total-vm:6555152kB, anon-rss:4600kB, file-rss:6396572kB
Jun 10 05:45:30 db kernel: [11209159.293083] postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jun 10 05:45:31 db kernel: [11209159.293091] postgres cpuset=/ mems_allowed=0
Jun 10 05:45:31 db kernel: [11209159.293095] Pid: 6508, comm: postgres Not tainted 3.2.0-43-virtual #68-Ubuntu
Jun 10 05:45:31 db kernel: [11209159.293098] Call Trace:
Jun 10 05:45:31 db kernel: [11209159.293111] [<ffffffff81119711>] dump_header+0x91/0xe0
Jun 10 05:45:31 db kernel: [11209159.293115] [<ffffffff81119a95>] oom_kill_process+0x85/0xb0
Jun 10 05:45:31 db kernel: [11209159.293119] [<ffffffff81119e3a>] out_of_memory+0xfa/0x220
...
We can try increasing the resolution of these to roughly once per second, but would there be any reason for an OOM here? (We've seen http://bl0rg.krunch.be/oom-frag.html but we're working with much larger absolute amounts of memory, most of which is taken by the kernel's FS cache.)
Also including relevant parts of our postgresql.conf below:
shared_buffers = 6GB
effective_cache_size = 8GB
For the love of everything good in the world, configure swap space on your servers.
You really need swap space. I'm not the only one who says so, it's pretty much a universal truth around here. (<-- Those are three links )
You should of course have enough RAM that your database server isn't swapping regularly -- if you don't, the solution is money (which you take to your vendor and use to acquire more RAM).
Since you now have adequate RAM, and swap to use if something goes wrong, you can disable the OOM killer (by disabling memory overcommit), like the Postgres people tell you to.
(You can also apply their alternate solution and tell the OOM-Killer to never kill Postgres - but then you're just playing Russian Roulette with the rest of your system's processes...)
(optional) Write an answer on Server Fault detailing why the default behavior in most Linux distributions is Bad, Wrong, and violates the POSIX specification for how malloc() is supposed to behave. Repeat it until everyone is sick of hearing about it.
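To make the swap-plus-no-overcommit advice concrete, here is a rough sketch of how the commit limit works out in strict mode (vm.overcommit_memory=2). It assumes the formula from the kernel's overcommit documentation (swap plus vm.overcommit_ratio percent of RAM, with a default ratio of 50). The RAM figure is taken from the top output in the question; the 4 GiB swap size is purely hypothetical.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Approximate CommitLimit (kB) when vm.overcommit_memory=2.

    Assumed formula: swap + (RAM - huge pages) * overcommit_ratio / 100.
    """
    return swap_kb + (ram_kb - hugetlb_kb) * overcommit_ratio // 100


ram_kb = 15339640  # "Mem: 15339640k total" from the question

# No swap, default ratio: only half of RAM can be committed.
no_swap = commit_limit_kb(ram_kb, swap_kb=0)

# Hypothetical 4 GiB of swap added, same ratio.
with_swap = commit_limit_kb(ram_kb, swap_kb=4 * 1024 * 1024)

print(no_swap, with_swap)
```

Under these assumptions, turning off overcommit on a box without swap would cap allocations at roughly 7.3 GiB, barely above the 6GB of shared_buffers, which is one more reason to sort out swap (or raise vm.overcommit_ratio) before flipping that switch.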
Also note that the kernel's cached memory is available to postgres (or any other application) to use - you should factor it as free/available memory in your calculations.
If I had to hazard a guess at what's happening here I'd say you've got a complex query, Postgres is requesting RAM to execute it, and rather than saying "I don't have that RAM" Linux tells Postgres "Sure, you can have it."
Then when Postgres actually tries to use the RAM it was (allegedly) given Linux realizes it doesn't HAVE the RAM it promised Postgres (because it's overcommitted) - the OOM killer is told to free up the RAM, and dutifully kills the program using the most memory -- your database server.
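As a rough illustration of how the victim gets picked, the sketch below approximates the score the OOM killer printed. It assumes (this is my reading of kernels of that era, not something the log states) that the score is about the process's resident pages per 1000 usable pages, ignoring page tables and any oom_score_adj; the helper name is made up for this example. The inputs come from the "Killed process 12443" line and the "pages RAM" / "pages reserved" lines in the question.

```python
PAGE_KB = 4  # assumption: 4 kB pages


def approx_oom_score(rss_kb, ram_pages, reserved_pages=0, swap_pages=0, oom_score_adj=0):
    """Very rough OOM badness estimate: resident pages per 1000 usable pages."""
    rss_pages = rss_kb // PAGE_KB
    usable_pages = ram_pages - reserved_pages + swap_pages
    return rss_pages * 1000 // usable_pages + oom_score_adj


# anon-rss:4600kB + file-rss:6396572kB for the killed postgres process
score = approx_oom_score(
    rss_kb=4600 + 6396572,
    ram_pages=3934192,     # "3934192 pages RAM"
    reserved_pages=99282,  # "99282 pages reserved"
)
print(score)
```

That lands within a point of the logged "score 418"; the remainder is presumably page-table pages. Note that almost all of the counted RSS is file-rss, which fits the shared memory segment being charged to whichever backend has touched most of it.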
Postgres is a well-designed program. If it's told it can't have the RAM it's requesting it will handle that gracefully (either by making do with less, or aborting with a message to the user).
It appears you (and I, in a case with very similar symptoms) have truly run out of memory and have been confused by the cached number.
There apparently are cases where Linux does not free a large disk cache when memory demand goes up.
In particular (I don't really understand why), postgres' shared_buffers may be reported under "Cached" (the page cache). In your case the 6481852k cached in top matches this line in the OOM-killer's log: "1615243 total pagecache pages" (1615243*4KB ~= 6481852k), meaning the page cache indeed was not dropped before invoking the OOM-killer.
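The arithmetic can be spelled out with the page counters from the Mem-Info block of the OOM log. This is only a sketch and assumes 4 kB pages; all the counts are copied from the log in the question.

```python
PAGE_KB = 4  # assumption: 4 kB pages


def pages_to_kb(pages):
    return pages * PAGE_KB


total_pagecache_kb = pages_to_kb(1615243)    # "1615243 total pagecache pages"
shmem_kb = pages_to_kb(1613928)              # "shmem:1613928"
file_lru_kb = pages_to_kb(98 + 168 + 20)     # active_file + inactive_file + isolated_file

# Fraction of the "page cache" that is really shared memory.
shmem_share = shmem_kb / total_pagecache_kb

print(total_pagecache_kb, shmem_kb, file_lru_kb, round(shmem_share, 4))
```

So roughly 6.46 GB of "page cache" is almost entirely shmem (in the same ballpark as shared_buffers = 6GB), while the ordinary file pages on the LRU lists add up to only about 1 MB.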
Yet there are few file-backed pages (I assume active_file:98 inactive_file:168 is similar to /proc/meminfo's Active(file)/Inactive(file)), so it's not the discardable pages we know and love.
The post at https://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/ demonstrates an example session where shutting down postgres leads to a reduction of "cached" by the size of shared_buffers (scroll to "And most of it came off disk cache – as expected, because it was used for shared_buffers."); unfortunately it doesn't indicate the version of postgres or the kernel that was used for the experiment.
I'm using 3.13.0-67 x86_64 with PG 9.3. In 9.3 they switched from using Sys V shared memory (shmget) to anonymous mmap(...R+W, MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE...)+fork() (in 9.4 this became configurable via dynamic_shared_memory_type). But I couldn't find any explanation of why these mmap()s are supposed to show up in "cached", only https://access.redhat.com/solutions/406773, which says "Cached: Memory in the pagecache (Diskcache and Shared Memory)".
Given that there are many kinds of shared memory, I'm both enlightened and confused...
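If "Cached" really does include shared memory, a monitoring script can subtract the Shmem field of /proc/meminfo to get a better estimate of the cache that can actually be dropped. Below is a minimal sketch; the helper names are made up for this example, and the sample text is illustrative only (the Cached/MemFree/Buffers values are from the top output in the question, Shmem is the log's shmem page count times an assumed 4 kB page size).

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(\S+):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def file_cache_kb(fields):
    """Estimate of droppable file cache: Cached minus shared memory."""
    return fields["Cached"] - fields.get("Shmem", 0)


# Illustrative sample, not a real capture from the affected machine.
SAMPLE = """\
MemTotal:       15339640 kB
MemFree:          484360 kB
Buffers:           13748 kB
Cached:          6481852 kB
Shmem:           6455712 kB
"""

info = parse_meminfo(SAMPLE)
print(file_cache_kb(info))
```

On a live system you would feed it open("/proc/meminfo").read() instead of SAMPLE. With numbers like these, the "6.4 GB cached" shrinks to about 26 MB of cache the kernel could actually reclaim.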