Is there a way to detect memory fragmentation on Linux? I ask because on some long-running servers I have noticed performance degradation, and only after I restart the process do I see better performance. I noticed it more when using Linux huge page support -- are huge pages in Linux more prone to fragmentation?
I have looked at /proc/buddyinfo in particular. I want to know whether there are any better ways (not just CLI commands per se; any program or theoretical background would do) to look at it.
I am answering for the linux tag; my answer is specific to Linux only.
Yes, huge pages are more prone to fragmentation. There are two views of memory, the one your process gets (virtual) and the one the kernel manages (real). The larger any page, the more difficult it's going to be to group (and keep it with) its neighbors, especially when your service is running on a system that also has to support others that by default allocate and write to way more memory than they actually end up using.
The kernel's mapping of (real) granted addresses is private. There's a very good reason why userspace sees them as the kernel presents them, because the kernel needs to be able to overcommit without confusing userspace. Your process gets a nice, contiguous "Disneyfied" address space in which to work, oblivious of what the kernel is actually doing with that memory behind the scenes.
The reason you see degraded performance on long-running servers is most likely because allocated blocks that have not been explicitly locked (e.g. mlock()/mlockall() or posix_madvise()) and not modified in a while have been paged out, which means your service skids to disk when it has to read them. Modifying this behavior makes your process a bad neighbor, which is why many people put their RDBMS on a completely different server than web/php/python/ruby/whatever. The only way to fix that, sanely, is to reduce the competition for contiguous blocks.
Fragmentation is only really noticeable (in most cases) when page A is in memory and page B has moved to swap. Naturally, restarting your service would seem to 'cure' this, but only because the kernel has not yet had an opportunity to page out the process' (now) newly allocated blocks within the confines of its overcommit ratio.
In fact, re-starting (let's say) 'apache' under a high load is likely going to send blocks owned by other services straight to disk. So yes, 'apache' would improve for a short time, but 'mysql' might suffer, at least until the kernel makes them suffer equally when there is simply a lack of ample physical memory.
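To see whether that is what is happening, check how much of a process has actually gone to swap; a quick sketch (mysqld is just an example process name, and pidof assumes a single-process service):
grep VmSwap /proc/$(pidof mysqld)/status   # kB of this process' memory currently sitting in swap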
Add more memory, or split up demanding malloc() consumers :) It's not just fragmentation that you need to be looking at. Try vmstat to get an overview of what's actually being stored where.
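For example (a sketch; the 5 is just a sampling interval in seconds):
vmstat -s   # one-shot summary of memory, swap and paging counters
vmstat 5    # rolling view; the si/so columns show swap-in/swap-out traffic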
Kernel
To get the current fragmentation index, use:
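A sketch, assuming debugfs is mounted at /sys/kernel/debug. Roughly, per zone and order, values tending towards 1 mean allocation failures would be due to fragmentation, values towards 0 mean lack of memory, and -1 means an allocation of that order would succeed:
sudo cat /sys/kernel/debug/extfrag/extfrag_index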
To defragment kernel memory, try executing:
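For example (requires a kernel built with CONFIG_COMPACTION; writing to this file asks the kernel to compact all zones):
echo 1 | sudo tee /proc/sys/vm/compact_memory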
Also you can try turning off Transparent Huge Pages (aka THP) and/or disabling swap (or decreasing swappiness).
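For example (a sketch; the value 10 for swappiness is arbitrary, and sysfs paths can differ between distributions and kernel versions):
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # switch THP off
sudo sysctl -w vm.swappiness=10                                     # make the kernel less keen to swap
sudo swapoff -a                                                     # or disable swap entirely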
Userspace
To reduce userspace fragmentation you may want to try a different allocator, e.g. jemalloc (it has great introspection capabilities, which will give you insight into allocator internal fragmentation). You can switch to a custom malloc by recompiling your program with it, or just by running your program with LD_PRELOAD:
LD_PRELOAD=${JEMALLOC_PATH}/lib/libjemalloc.so.1 app
(Beware of interactions between THP and memory allocators.)
Although slightly unrelated to memory fragmentation (but connected to memory compaction/migration), you probably want to run multiple instances of your service, one for each NUMA node, and bind them using numactl.
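A sketch of that binding (app stands in for your service binary, as in the LD_PRELOAD example above, and node 0 is just an example):
numactl --cpunodebind=0 --membind=0 app   # run this instance confined to NUMA node 0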
Using huge pages should not cause extra memory fragmentation on Linux; Linux support for huge pages is only for shared memory (via shmget or mmap), and any huge pages used must be specifically requested and preallocated by a system admin. Once in memory, they are pinned there, and are not swapped out. The challenge of swapping in huge pages in the face of memory fragmentation is exactly why they remain pinned in memory (when allocating a 2MB huge page, the kernel must find 512 contiguous free 4KB pages, which may not even exist).
Linux documentation on huge pages: http://lwn.net/Articles/375098/
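For illustration, the preallocation mentioned above looks roughly like this (128 is an arbitrary count):
sudo sysctl -w vm.nr_hugepages=128   # reserve 128 huge pages in the kernel's pool
grep Huge /proc/meminfo              # HugePages_Total / HugePages_Free show what was actually reserved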
There is one circumstance where memory fragmentation could cause huge page allocation to be slow (but not where huge pages cause memory fragmentation), and that's if your system is configured to grow the pool of huge pages if requested by an application. If /proc/sys/vm/nr_overcommit_hugepages is greater than /proc/sys/vm/nr_hugepages, this might happen.
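You can check how your system is configured with, for example:
sysctl vm.nr_hugepages vm.nr_overcommit_hugepages   # static pool size vs. allowed on-demand growth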
There is /proc/buddyinfo, which is very useful. It's more useful with a nice output format, like this Python script can do: https://gist.github.com/labeneator/9574294
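Even the raw file is readable once you know that each numeric column N counts the free blocks of order N, i.e. blocks of 2^N contiguous pages (with 4 KiB pages, order 9 is 2 MiB):
cat /proc/buddyinfo   # one line per zone; the rightmost columns are the largest contiguous free blocks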
For huge pages you want some free fragments of the 2097152-byte (2 MiB) size or bigger. For transparent huge pages, the kernel will compact memory automatically when asked for some, but if you want to see how many you can get, then as root run:
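echo 1 > /proc/sys/vm/compact_memory   # trigger compaction (needs CONFIG_COMPACTION)
cat /proc/buddyinfo                    # then re-check how many large free blocks are available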
Also yes, huge pages cause big problems for fragmentation. Either you cannot get any huge pages, or their presence causes the kernel to spend a lot of extra time trying to get some.
I have a solution that works for me. I use it on a couple of servers and my laptop. It works great for virtual machines.
Add the kernelcore=4G option to your Linux kernel command line. On my server I use 8G. Be careful with the number, because it will prevent your kernel from allocating anything outside of that memory. Servers that need a lot of socket buffers or that stream disk writes to hundreds of drives will not like being limited like this. Any memory allocation that has to be "pinned" for slab or DMA is in this category.
All of your other memory then becomes "movable", which means it can be compacted into nice chunks for huge page allocation. Now transparent huge pages can really take off and work as they are supposed to. Whenever the kernel needs more 2M pages it can simply remap 4K pages to somewhere else.
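A sketch of where the option goes on a GRUB 2 system (file locations and the regeneration command vary by distribution; update-grub is the Debian/Ubuntu spelling):
# /etc/default/grub -- keep your existing options in place of the ...
GRUB_CMDLINE_LINUX="... kernelcore=4G"
sudo update-grub      # on RHEL-like systems: grub2-mkconfig -o /boot/grub2/grub.cfg
cat /proc/cmdline     # after a reboot, confirm the option is active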
And, I'm not totally sure how this interacts with zero-copy direct IO. Memory in the "movable zone" is not supposed to be pinned, but a direct IO request would do exactly that for DMA. It might copy it. It might pin it in the movable zone anyway. In either case it probably isn't exactly what you wanted.