Imagine the following scenario:
- You run a Kubernetes cluster in your datacenter, deployed with kubeadm.
- It consists of one master node (running etcd as a static pod, as deployed by kubeadm) and three worker nodes.
- The nodes are virtual machines running on VMware.
Today, you open your e-mail and learn that the datacenter will move to a new location. The physical servers will be turned off, moved to the new location and powered on again.
What is the correct shutdown procedure for your kubernetes cluster (without messing up your etcd data)?
This is what I did (a rough sketch of the sequence follows the lists below):
- stopped the master server first (this includes etcd, of course), to prevent pods from being rescheduled to other nodes when I turned off the worker nodes
- stopped each worker node
After the migration:
- powered on the worker nodes first
- powered on the master node next
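For illustration, that sequence roughly amounts to the following (hypothetical hostnames, assuming systemd-based VMs reachable over SSH; the power-on steps went through the VMware console, not SSH):

```bash
# Before the migration: master first, then the workers.
ssh master-001 'sudo shutdown -h now'
for node in worker-001 worker-002 worker-003; do
  ssh "$node" 'sudo shutdown -h now'
done

# After the migration: workers were powered on first, then the master
# (done from the VMware console rather than over SSH).
```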
After doing this, I ended up with one of two scenarios:
- etcd data is corrupt and the etcd pod exits with an error
- my logs are flooded with errors like this: "Operation cannot be fulfilled on nodes "worker-002": the object has been modified; please apply your changes to the latest version and try again"
How could this have been prevented? I don't think running etcd in HA mode would help here, as all etcd nodes would have to be shut down at once too, so you end up in a situation similar to the single-node scenario. I get the impression that etcd is quite... fragile, compared to other K/V stores like Consul.
On the master, you will need to stop the control-plane services.
If you have federation, also stop the federation-apiserver.
Run a backup (snapshot) of etcd, and stop etcd when the backup is done.
Then stop each worker node.
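For the snapshot step, here is a minimal sketch using etcdctl, assuming the kubeadm default certificate paths and that etcdctl (v3 API) is available on the master (otherwise run the same command inside the etcd static pod); the backup path is just an example:

```bash
# Take an etcd snapshot before shutting anything down.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Sanity-check the snapshot before stopping etcd.
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db
```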
etcd is as robust as Consul; what do you mean by "unstable"?! When restoring, though, having the etcd data alone is not immediately valid... you should read up on how backups work in Kubernetes.
In fact, etcd is rather resilient with its journal-based approach, but, as always, you should have a backup taken just prior to the migration/shutdown, just to be on the safe side. If there is an issue with etcd, just recover the backup and you're good to go.
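As a minimal sketch of that recovery on a kubeadm-style single-master cluster, assuming the snapshot file from above and the default /var/lib/etcd data directory (both are assumptions, adjust to your setup):

```bash
# Stop the kubelet so it does not keep recreating the etcd static pod mid-restore.
sudo systemctl stop kubelet

# Move the broken data directory aside and restore the snapshot into a fresh one.
sudo mv /var/lib/etcd /var/lib/etcd.broken
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir=/var/lib/etcd
# If your etcd static-pod manifest sets --name / --initial-cluster /
# --initial-advertise-peer-urls, pass matching values to the restore command.

# Start the kubelet again; it brings etcd and the control-plane pods back up.
sudo systemctl start kubelet
```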
As you will restart your whole cluster, the order in which you do it is not really that important; all the containers will have to start again anyway, and each kubelet will have to connect to a working API server.
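To confirm everything came back after powering on, a quick sanity check could look like this (kubeadm default etcd certificate paths assumed):

```bash
# From the master (or any host with a valid kubeconfig):
kubectl get nodes
kubectl get pods -n kube-system

# Check etcd itself.
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```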
Where you got the impression that etcd is unstable, I have no idea.