I'm trying to understand how memory requests and limits work with cgroup v2. In a Kubernetes manifest we can configure a memory request and a memory limit. Those values are then used to configure the cgroup interface (sketched in code after the list):
- memory.min is set to memory request
- memory.max is set to memory limit
- memory.high is set to memory limit * 0.8, unless memory request == limit, in which case memory.high remains unset
- memory.low is always unset
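To make the mapping concrete, here is a minimal Go sketch of what a runtime would conceptually do with these values, based purely on the list above. The cgroup path, the `applyMemoryCgroup` helper and the 0.8 factor are assumptions for illustration; the real kubelet logic is more involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const highFactor = 0.8 // assumed throttling factor from the list above

// applyMemoryCgroup writes the cgroup v2 knobs for a container with the
// given memory request and limit (in bytes), following the mapping above.
func applyMemoryCgroup(cgroupDir string, request, limit int64) error {
	write := func(name, value string) error {
		return os.WriteFile(filepath.Join(cgroupDir, name), []byte(value), 0o644)
	}
	// memory.min <- request, memory.max <- limit
	if err := write("memory.min", fmt.Sprintf("%d", request)); err != nil {
		return err
	}
	if err := write("memory.max", fmt.Sprintf("%d", limit)); err != nil {
		return err
	}
	// memory.high <- limit * 0.8, but left unset when request == limit
	if request != limit {
		if err := write("memory.high", fmt.Sprintf("%d", int64(float64(limit)*highFactor))); err != nil {
			return err
		}
	}
	// memory.low is never written, i.e. it stays at its default of 0
	return nil
}

func main() {
	// Hypothetical cgroup path; on a real node this would be the pod/container cgroup.
	err := applyMemoryCgroup("/sys/fs/cgroup/kubepods.slice/demo", 256<<20, 512<<20)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```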
memory.max is pretty self-explanatory: when a process in the cgroup tries to allocate a page that would put memory usage over memory.max, and not enough pages can be reclaimed from the cgroup to satisfy the request within the limit, the OOM killer is invoked to terminate a process inside the cgroup. memory.high is more difficult to understand: the kernel documentation says that the cgroup is put under "high reclaim pressure" when the high watermark is reached, but what exactly does this mean?
Later on it says:
When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer.
Am I correct to assume this means that when the cgroup tries to allocate a page beyond the memory.high watermark, it will synchronously walk the lruvecs and try to reclaim pages from the end of the lists until it is back under the high watermark? Or is the "reclaim pressure" something that happens asynchronously (through kswapd)?
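As a side note, the cgroup does keep counters for how often each of these limits was hit, which at least helps to tell high-watermark throttling apart from hard-limit OOM kills. A minimal Go sketch for reading memory.events; the cgroup path and helper name are made up for illustration:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMemoryEvents parses a cgroup v2 memory.events file into a map of
// counter name -> value ("low", "high", "max", "oom", "oom_kill").
func readMemoryEvents(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	events := make(map[string]uint64)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		events[fields[0]] = n
	}
	return events, sc.Err()
}

func main() {
	// Hypothetical cgroup path for illustration.
	ev, err := readMemoryEvents("/sys/fs/cgroup/kubepods.slice/demo/memory.events")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("high watermark breaches: %d, max breaches: %d, OOM kills: %d\n",
		ev["high"], ev["max"], ev["oom_kill"])
}
```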
Question 2: What is even the point of using memory.high on Kubernetes? As far as I know, Kubernetes nodes typically run without swap space. The only pages that are reclaimable are anonymous pages (if there is enough swap available) and page cache. Since there is no swap, this only leaves page cache. The thing is that page cache would also be reclaimed when hitting memory.max, with the OOM killer invoked only as a last resort if nothing can be reclaimed. That would make memory.high essentially useless:
- As long as page cache is used, it can always be reclaimed, and memory.max would do so, too. With memory.high we are just throttling the application earlier than we have to. Might as well set memory.max lower in the first place.
- If no significant page cache is used (which is probably the case for the majority of applications running on Kubernetes today), then nothing can be reclaimed, so there is no throttling (no paging out of unused anonymous memory, no thrashing visible in the pressure stall information that would warn us; see the sketch after this list), and we run into memory.max none the wiser. Using memory.high has no effect.
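For what it's worth, whether throttling actually shows up can be checked via the cgroup's pressure stall information file. A minimal Go sketch that reads the "some" avg10 value from memory.pressure; the cgroup path is hypothetical:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// someAvg10 returns the "some" avg10 value from a cgroup v2 memory.pressure
// file, i.e. the share of recent wall-clock time in which at least one task
// in the cgroup was stalled waiting on memory.
func someAvg10(path string) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "some ") {
			continue
		}
		for _, field := range strings.Fields(line)[1:] {
			if strings.HasPrefix(field, "avg10=") {
				return strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			}
		}
	}
	return 0, fmt.Errorf("no 'some' line found in %s", path)
}

func main() {
	// Hypothetical container cgroup path, for illustration only.
	v, err := someAvg10("/sys/fs/cgroup/kubepods.slice/demo/memory.pressure")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("memory stall (some, 10s avg): %.2f%%\n", v)
}
```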
I don't think it will immediately go into direct reclaim (synchronous, as you call it) at that point, but I'm not sure. In my experience it will eventually hit direct reclaim when memory.high is stretched too far. It will certainly push up the memory pressure regardless.
Running without swap space is generally stupid and has been for a long time. Regardless, the only pages that are reclaimable are indeed mostly page cache. There are other reclaim strategies that might come into play, though.
But it's slim pickings.
In general, your observations match my experience too when there is no swap to evict anonymous pages to: MemoryHigh makes thrashing a lot worse, as you keep your page cache to an absolute minimum and end up doing a lot of IO.
We also turn it off on LXD/LXC instances, as it causes unnecessary thrashing (it's a hard limit in the code that we have to go back later to 'fix').
MemoryLow, however, can be useful as a soft reservation mechanism that tells the kernel "don't rob pages from this control group while it is below this memory range, choose another victim".
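If you manage the cgroup directly (outside Kubernetes or systemd), that soft reservation is just a write to memory.low; with systemd, MemoryLow= in the unit ends up setting the same file. A minimal sketch, with a hypothetical cgroup path and helper name:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// reserveMemoryLow writes memory.low for a cgroup, asking the kernel to
// prefer reclaiming from other cgroups while this one stays below the
// given number of bytes.
func reserveMemoryLow(cgroupDir string, bytes int64) error {
	path := filepath.Join(cgroupDir, "memory.low")
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", bytes)), 0o644)
}

func main() {
	// Hypothetical cgroup path; with systemd, MemoryLow=512M in the unit
	// would result in the same value landing in this file.
	if err := reserveMemoryLow("/sys/fs/cgroup/demo.slice", 512<<20); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```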