I'm trying to understand how memory requests and limits work with cgroup v2. In a Kubernetes manifest we can configure a memory request and a memory limit. These values are then used to configure the cgroup v2 interface files (the mapping is sketched in code below):
- memory.min is set to memory request
- memory.max is set to memory limit
- memory.high is set to memory limit * 0.8, unless memory request == limit, in which case memory.high remains unset
- memory.low is always unset
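To make the mapping concrete, here is a minimal sketch in Python of the translation as I've described it above; the 0.8 factor is the value stated in the list (not read from Kubernetes), and "max" is how an unset limit is represented in cgroup v2:

```python
# Sketch of the request/limit -> cgroup v2 translation described above.
# The 0.8 throttling factor is the value stated in the list, not something
# read from any Kubernetes API here.

def cgroup_memory_settings(request_bytes: int, limit_bytes: int) -> dict:
    """Return the memory.* interface values for one container."""
    settings = {
        "memory.min": request_bytes,   # guaranteed minimum, set to the request
        "memory.max": limit_bytes,     # hard cap, set to the limit
        "memory.high": "max",          # "max" == unset in cgroup v2
        "memory.low": 0,               # 0 == unset; never configured
    }
    if request_bytes != limit_bytes:
        settings["memory.high"] = int(limit_bytes * 0.8)
    return settings

# Example: 256 MiB request, 512 MiB limit -> memory.high at ~410 MiB.
print(cgroup_memory_settings(256 * 1024**2, 512 * 1024**2))
```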
memory.max is pretty self-explanatory: when a process in the cgroup tries to allocate a page that would put the memory usage over memory.max, and not enough pages can be reclaimed from the cgroup to stay within the limit, the OOM killer is invoked to terminate a process inside the cgroup. memory.high is more difficult to understand: the kernel documentation says that the cgroup is put under "heavy reclaim pressure" when the high watermark is exceeded, but what exactly does this mean?
Later on it says:
When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer.
Am I correct to assume this means that when a process in the cgroup tries to allocate a page beyond the memory.high watermark, it will synchronously walk the lruvecs and reclaim pages from the tail of the lists until usage is back under the high watermark? Or is the "reclaim pressure" something that happens asynchronously (through kswapd)?
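One way I've tried to probe this from user space: allocate in fixed chunks inside a cgroup whose memory.high is set well below what the loop allocates, time each allocation, and check memory.events afterwards. If the quote means synchronous direct reclaim, per-chunk latency should jump once usage crosses the high boundary while the "high" event counter increases. This is only a sketch under my assumptions (a pure cgroup v2 host, memory.max unset or much higher than memory.high):

```python
# Crude probe: time each allocation step and read memory.events to see how
# often the high boundary was hit. Run inside a cgroup with memory.high set
# well below CHUNK * STEPS (e.g. via systemd-run or by writing the PID into
# cgroup.procs).
import time
from pathlib import Path

# Locate our own cgroup directory. On a pure cgroup v2 host,
# /proc/self/cgroup is a single line of the form "0::/<path>".
rel = Path("/proc/self/cgroup").read_text().splitlines()[0].split("::", 1)[1]
CGROUP = Path("/sys/fs/cgroup") / rel.lstrip("/")

CHUNK = 16 * 1024 * 1024   # 16 MiB per step
STEPS = 64                 # ~1 GiB total; adjust to the cgroup's limits

def memory_events(cgroup: Path) -> dict:
    """Parse memory.events into {event: count} (low, high, max, oom, oom_kill)."""
    return {
        key: int(value)
        for key, value in (
            line.split() for line in (cgroup / "memory.events").read_text().splitlines()
        )
    }

before = memory_events(CGROUP)
chunks = []
for i in range(STEPS):
    t0 = time.monotonic()
    chunks.append(bytearray(CHUNK))  # zero-filled, so the pages are actually touched
    print(f"chunk {i:3d}: {(time.monotonic() - t0) * 1000:7.2f} ms")
after = memory_events(CGROUP)

# If allocations are forced into direct reclaim, the latency jump should
# coincide with the 'high' counter increasing, with no 'oom_kill' events.
print(f"high events: {before['high']} -> {after['high']}, "
      f"oom_kill: {before['oom_kill']} -> {after['oom_kill']}")
```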
Question 2: What is even the point of using memory.high on Kubernetes? As far as I know, Kubernetes nodes typically run without swap space. The only reclaimable pages are anonymous pages (if there is enough swap available) and page cache; without swap, that only leaves page cache. The thing is that page cache would also be reclaimed when hitting memory.max, before the OOM killer is invoked as a last resort if nothing can be reclaimed. That makes memory.high essentially useless either way:
- As long as page cache is in use, it can always be reclaimed, and memory.max would do so, too. With memory.high we are just throttling the application earlier than we have to; we might as well set memory.max lower in the first place.
- If no significant page cache is in use (which is probably the case for the majority of applications running on Kubernetes today), then nothing can be reclaimed, so there is no throttling: no paging out of unused anonymous memory, no thrashing visible in the pressure stall information that would warn us. We run into memory.max none the wiser, and memory.high has no effect. (A sketch for checking the anon/file split of a concrete cgroup follows below.)
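For reference, this is how I check the split for a concrete workload, assuming a hypothetical cgroup path: memory.stat's "anon" vs "file" counters show what could actually be reclaimed, and memory.pressure exposes the per-cgroup PSI numbers mentioned above.

```python
# Inspect what a cgroup could actually reclaim: memory.stat's 'file' is
# page cache (reclaimable without swap), 'anon' needs swap to be reclaimed.
# The path below is hypothetical; substitute the pod's real cgroup directory.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/kubepods.slice/example-pod")  # hypothetical path

stat = {}
for line in (CGROUP / "memory.stat").read_text().splitlines():
    key, value = line.split(maxsplit=1)
    stat[key] = int(value)

mib = 1024 ** 2
print(f"anon (needs swap): {stat['anon'] / mib:.1f} MiB")
print(f"file (page cache): {stat['file'] / mib:.1f} MiB")

# PSI lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
print((CGROUP / "memory.pressure").read_text(), end="")
```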