From what I read here, ECS capacity providers should (generally) prevent tasks from immediately failing on resource limits by putting them in a "Provisioning" state and spinning up a new EC2 instance:
This means, for example, if you call the RunTask API and the tasks don’t get placed on an instance because of insufficient resources (meaning no active instances had sufficient memory, vCPUs, ports, ENIs, and/or GPUs to run the tasks), instead of failing immediately, the task will go into the provisioning state (note, however, that the transition to provisioning only happens if you have enabled managed scaling for the capacity provider; otherwise, tasks that can’t find capacity will fail immediately, as they did previously).
I've set up an ECS cluster with an autoscaling group and an ECS capacity provider in Terraform. The autoscaling group is set with min_size = 1 and it immediately spins up a single instance, so I'm confident my launch configuration is fine.
However, when I repeatedly call "RunTask" through the API (tasks with memory=128), the tasks fail immediately with reason RESOURCE:MEMORY, and no new instances start up.
I can't figure out what I've misconfigured.
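The calls were being made along these lines (cluster and task definition names here are illustrative, not from the original post; note that launchType is being specified):

```shell
# Repeatedly launch a small task (memory=128 in the task definition).
# Cluster and task definition names are illustrative.
aws ecs run-task \
  --cluster my-ecs-cluster \
  --task-definition my-task:1 \
  --launch-type EC2 \
  --count 1
```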
This was all set up in Terraform:
resource "aws_ecs_cluster" "ecs_cluster" {
  name = local.cluster_name

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags               = var.tags
  capacity_providers = [aws_ecs_capacity_provider.capacity_provider.name]

  # I added this in an attempt to make it spin up a new instance
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.capacity_provider.name
  }
}

resource "aws_ecs_capacity_provider" "capacity_provider" {
  name = "${var.tags.PlatformName}-stack-${var.tags.Environment}"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.autoscaling_group.arn
    managed_termination_protection = "DISABLED"

    managed_scaling {
      maximum_scaling_step_size = 4
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 100
    }
  }

  tags = var.tags
}

# Compute
resource "aws_autoscaling_group" "autoscaling_group" {
  name = "${var.tags.PlatformName}-${var.tags.Environment}"

  # If we're not using it, let's not pay for it
  min_size                  = 1
  max_size                  = var.ecs_max_size
  launch_configuration      = aws_launch_configuration.launch_config.name
  health_check_grace_period = 60
  default_cooldown          = 30
  termination_policies      = ["OldestInstance"]
  vpc_zone_identifier       = local.subnets
  protect_from_scale_in     = false

  tag {
    key                 = "Name"
    value               = "${var.tags.PlatformName}-${var.tags.Environment}"
    propagate_at_launch = true
  }

  tag {
    key                 = "AmazonECSManaged"
    value               = ""
    propagate_at_launch = true
  }

  dynamic "tag" {
    for_each = var.tags
    content {
      key                 = tag.key
      value               = tag.value
      propagate_at_launch = true
    }
  }

  enabled_metrics = [
    "GroupDesiredCapacity",
    "GroupInServiceInstances",
    "GroupMaxSize",
    "GroupMinSize",
    "GroupPendingInstances",
    "GroupStandbyInstances",
    "GroupTerminatingInstances",
    "GroupTotalInstances",
  ]
}
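As a sanity check of the wiring above (these commands are not from the original post; cluster and provider names are illustrative and assume a configured AWS CLI), you can confirm the capacity provider is attached to the cluster and that managed scaling is enabled:

```shell
# Confirm the cluster lists the capacity provider and has a
# default capacity provider strategy.
aws ecs describe-clusters \
  --clusters my-ecs-cluster \
  --include ATTACHMENTS \
  --query 'clusters[0].{providers: capacityProviders, default: defaultCapacityProviderStrategy}'

# Confirm managed scaling is ENABLED on the provider itself.
aws ecs describe-capacity-providers \
  --capacity-providers my-stack-dev \
  --query 'capacityProviders[0].autoScalingGroupProvider.managedScaling'
```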
It looks like this was down to a mistake I made while executing "RunTask" through the API (documented here): I had specified launchType and not capacityProviderStrategy. From the RunTask documentation, it seems the result of this is that tasks will start if there is capacity, but will fail immediately if there is insufficient capacity, without giving auto-scaling a chance to respond.
I was able to get it to work simply by deleting launchType, because default_capacity_provider_strategy was set on the cluster.