We have an autoscaling group that spawns worker servers, each running Celery processes. We monitor the Celery queue length with CloudWatch and, depending on that length, add or remove instances from the group. Our setup follows this answer: Is there a way to use length of a RabbitMQ queue used by Celery to start instance in an autoscale group?
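For context, the metric side of this can be sketched roughly as below, assuming the RabbitMQ management plugin and boto3. The vhost, queue name, credentials, and CloudWatch namespace/metric names are illustrative, not our exact values:

```python
def queue_length_from_api(payload):
    # The management API reports the total message count under "messages".
    return payload["messages"]

def publish_queue_length():
    import boto3     # third-party clients imported lazily inside the function
    import requests
    # Assumed default vhost "/" (URL-encoded as %2F) and queue name "celery".
    resp = requests.get(
        "http://localhost:15672/api/queues/%2F/celery",
        auth=("guest", "guest"),
    )
    resp.raise_for_status()
    # Push the queue depth to CloudWatch so a scaling alarm can watch it.
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Celery",
        MetricData=[{
            "MetricName": "QueueLength",
            "Value": queue_length_from_api(resp.json()),
            "Unit": "Count",
        }],
    )
```

A cron job or small daemon on one of the servers runs this every minute, and the scaling alarms are defined against the resulting custom metric.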
Our termination policy kills the oldest server first. Scale-down triggers when the queue length stays at zero for 300 consecutive seconds.
The baseline setup keeps 3 servers always available. The autoscaling group scales up only when the queue length exceeds a threshold, say 10 jobs sitting in the queue for 30 consecutive seconds.
I have not set up any routing or priorities in my Celery config.
Here is the problem: when scale-down occurs, I am not entirely sure the host being killed is actually free, because all workers are treated equally. Tasks sometimes take 5-10 minutes, and I do not want a server killed while it is in the middle of executing a task.
I have not seen this happen yet, but I am afraid some of our customers might be affected by it.
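Concretely, the uncertainty is whether the workers on a given host are idle. In principle that can be checked through Celery's inspect API; a minimal sketch, assuming `app` is the Celery application and worker nodenames follow the default `celery@hostname` pattern:

```python
import socket

def local_workers_are_idle(app):
    # inspect().active() returns {worker_name: [task, ...]} for live workers,
    # or None if no workers reply.
    active = app.control.inspect().active() or {}
    hostname = socket.gethostname()
    for worker, tasks in active.items():
        # Worker names look like "celery@hostname"; match this machine only.
        if worker.split("@", 1)[-1] == hostname and tasks:
            return False
    return True
```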
You can use a lifecycle hook to run custom actions while the instance is in the "terminating:wait" state.
Create a lifecycle hook following the steps on that page. While the instance is in this state, a script or Lambda function can hold it open until all of its jobs are done. The linked page also has additional information on cooldown periods.
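The "hold the instance open" step might look something like this with boto3. The hook and group names, the `is_idle` callback, and the polling interval are all placeholders you would replace with your own; the heartbeat call keeps the lifecycle action alive past its timeout while tasks drain:

```python
import time

def drain_then_continue(instance_id, is_idle, autoscaling, poll_seconds=30):
    """Wait until is_idle() is True, then let the ASG terminate the instance.

    autoscaling is a boto3 autoscaling client (injected so it can be faked
    in tests); the hook/group names below are assumptions for illustration.
    """
    while not is_idle():
        # Extend the terminating:wait state while tasks are still running.
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName="drain-celery-hook",   # assumed hook name
            AutoScalingGroupName="worker-asg",       # assumed group name
            InstanceId=instance_id,
        )
        time.sleep(poll_seconds)
    # All jobs finished: tell the group to proceed with termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName="drain-celery-hook",
        AutoScalingGroupName="worker-asg",
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```

In practice `is_idle` would be a check against your Celery workers (e.g. via the inspect API), and this would run either on the instance itself or in a Lambda triggered by the lifecycle hook's notification.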