I have a bare-metal (kubeadm) Kubernetes cluster that is very unstable, and I have traced the problem back to an etcd issue.
From the etcd pod's description I get:
Image: k8s.gcr.io/etcd:3.4.13-0
Liveness: ... #success=1 #failure=8
Startup: ... #success=1 #failure=24
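(For reference, I'm pulling that from the static pod with kubectl describe; etcd-<node-name> is just a placeholder for the pod name on my control-plane node.)

    # Probe configuration, restart count and recent events for the etcd static pod
    kubectl -n kube-system describe pod etcd-<node-name>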
In the logs, the startup sequence looks fine (compared with another cluster), but then I get a lot of warnings like:
etcdserver: [...] request ... took too long to execute
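This is how I'm collecting the logs, including the previous instance after a restart (again, etcd-<node-name> is a placeholder):

    # Logs of the currently running etcd container
    kubectl -n kube-system logs etcd-<node-name>

    # Logs of the instance that was killed before the last restart
    kubectl -n kube-system logs etcd-<node-name> --previous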
I don't think it is hardware (disk) related, though: the 99th percentile of etcd_disk_backend_commit_duration_seconds is around 16 ms, which is within the range the etcd FAQ considers healthy.
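In case it matters, this is roughly how I'm reading that metric, run directly on the control-plane node. I'm assuming the kubeadm default of exposing etcd metrics on http://127.0.0.1:2381; adjust the port if your --listen-metrics-urls differs:

    # Backend commit latency histogram (the metric quoted above)
    curl -s http://127.0.0.1:2381/metrics | grep etcd_disk_backend_commit_duration_seconds

    # WAL fsync latency, for comparison (the other disk metric the FAQ mentions)
    curl -s http://127.0.0.1:2381/metrics | grep etcd_disk_wal_fsync_duration_seconds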
Anyway, this goes on for a few minutes, and then I assume this is what causes the restart:
etcdserver/api/etcdhttp: /health error; QGET failed etcdserver: request timed out (status code 503)
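I can also query the member directly with etcdctl from inside the pod, which at least shows whether the endpoint itself reports as slow or unhealthy (the certificate paths are the kubeadm defaults on my node, so treat them as an example):

    # Exec into the etcd static pod and check endpoint status
    kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint status -w table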
Any ideas on what further steps I can take to diagnose the issue and fix etcd?