Since Kubernetes 1.8, it seems I need to disable swap on my nodes (or set --fail-swap-on to false).
I cannot find the technical reason why Kubernetes insists on swap being disabled. Is this for performance reasons? Security reasons? Why isn't the reason documented?
The idea of Kubernetes is to pack instances as tightly as possible, as close to 100% utilization as it can get. All deployments should be pinned with CPU/memory limits, so if the scheduler sends a pod to a machine, that pod should never need swap at all. You don't want to swap, since it will slow things down.
It's mainly for performance.
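To make the packing argument concrete, here is a minimal sketch of the bookkeeping it implies. This is not the real scheduler's algorithm; `Pod`, `fits_on_node` and the sizes are hypothetical names and numbers chosen for illustration. The point is only that if every pod declares a memory limit and the limits on a node never add up to more than the node's memory, nothing should ever need to spill into swap.

```python
from dataclasses import dataclass
from typing import List

MIB = 1024 * 1024


@dataclass
class Pod:
    # Hypothetical, simplified stand-in for a pod with a declared memory limit.
    name: str
    memory_limit_bytes: int


def fits_on_node(node_memory_bytes: int, running: List[Pod], candidate: Pod) -> bool:
    """Return True if the candidate's limit fits in the memory not yet committed."""
    committed = sum(p.memory_limit_bytes for p in running)
    return committed + candidate.memory_limit_bytes <= node_memory_bytes


running = [Pod("web", 512 * MIB), Pod("cache", 1024 * MIB)]

# 1536 MiB committed + 256 MiB requested = 1792 MiB, which fits in 2048 MiB.
print(fits_on_node(2048 * MIB, running, Pod("worker", 256 * MIB)))  # True

# 1536 MiB committed + 1024 MiB requested = 2560 MiB, which does not fit.
print(fits_on_node(2048 * MIB, running, Pod("batch", 1024 * MIB)))  # False
```

Under this model a pod that does not fit is simply placed elsewhere, rather than being squeezed in and left to swap.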
TL;DR: Not using swap properly is just a lazy hack that demonstrates a poor understanding of the memory subsystem and a lack of fundamental systems administration skills. Designing infrastructure services without understanding these systems is bound to end in failure.
So, I've got some commentary on this. It seems more like laziness to me than a feature or a requirement. It's absolutely possible to handle swap properly: analyze memory usage and work out how to use the memory subsystem without hitting swap. There is a litany of tools built around this, and you can quite easily guarantee that a process will not use swap, so the performance argument is incorrect. It's simply lazy coding not to put this instrumentation in, and removing swap completely will be to the detriment of overall system performance. The key is using it properly. I agree that swapping pods out to disk will hurt performance, but there are a number of things that should be swapped out to disk.
Additionally, the Linux kernel is designed to use swap, and disabling it completely is going to have negative consequences. A better way to handle this would be to pin the pods into main memory and not allow them to swap to disk, and to reduce the VFS cache pressure so that the system does not swap unless it is absolutely necessary. Even then, you could make pinned processes fail malloc() in the event that main memory is exhausted.
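As one possible illustration of "not allowing pods to swap", here is a hedged sketch using the cgroup v2 `memory.swap.max` file, which caps how much swap a cgroup may use; writing `0` keeps that group's anonymous memory out of swap while the rest of the system can still use it. This is an assumed mechanism picked for the example, not something Kubernetes or the kubelet is described as doing here, and `forbid_swap` and the directory names are hypothetical. The demo runs against a temporary directory so it is safe to execute; on a real host you would need root and a cgroup v2 hierarchy with the memory controller enabled.

```python
import tempfile
from pathlib import Path


def forbid_swap(cgroup_dir: str) -> str:
    """Write "0" to memory.swap.max under cgroup_dir and return the value read back.

    On a real cgroup v2 hierarchy this stops the group's anonymous memory
    from being swapped out; here it is just a file write.
    """
    swap_max = Path(cgroup_dir) / "memory.swap.max"
    swap_max.write_text("0\n")
    return swap_max.read_text().strip()


# Dry run against a scratch directory standing in for a cgroup such as
# /sys/fs/cgroup/<some-pod-group> (hypothetical path).
with tempfile.TemporaryDirectory() as fake_cgroup:
    result = forbid_swap(fake_cgroup)

print(result)  # 0
```

With a limit like this in place, the pinned group runs out of memory (and hits allocation failure or the OOM killer) instead of swapping, which is the trade-off described above.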
Depending on the processes in the containers, a hard failure of the container, or having it killed by the OOM killer, could result in some pretty disastrous outcomes. I understand that the processes run in these containers should ideally be stateless and ephemeral, but in 20 years of running systems I have not once seen everyone follow the intended design to the letter 100% of the time.
Furthermore, this doesn't take into account future technologies such as non-volatile memory and newer memory systems like Intel 3D XPoint, which can be used to extend main memory significantly using hybrid disk/memory systems. These types of systems can be used directly as supplemental main memory, or can host swap files that extend main memory with negligible performance impact.
The reason for this, as I understand it, is that the kubelet isn't designed to handle swap situations and the Kubernetes team aren't planning to implement this as the goal is that pods should fit within the memory of the host.
From GitHub issue #53533.
There is a ticket to enable it again; you'll get more insight there:
https://github.com/kubernetes/kubernetes/issues/53533