I have 3 instances (node-0, node-1, node-2) running 2 services: one is a websocket and the other one an API (both services run on each instance).
Target Group Setup:
Target Group | Instance | Health Check Path |
---|---|---|
api-node-0 | node-0 | /some-path/api/v1/ping |
api-node-1 | node-1 | /some-path/api/v1/ping |
api-node-2 | node-2 | /some-path/api/v1/ping |
websocket-node-0 | node-0 | /some-path/websocket/v1/ping |
websocket-node-1 | node-1 | /some-path/websocket/v1/ping |
websocket-node-2 | node-2 | /some-path/websocket/v1/ping |
Listener and Rules:
HTTPS:443 Listener
Rules:
api
- Condition: Path /some-path/api/*
- Action: Forward to target group:
- api-node-0 (33.33%)
- api-node-1 (33.33%)
- api-node-2 (33.33%)
- Stickiness: Off
websocket
- Condition: Path /some-path/websocket/*
- Action: Forward to target group:
- websocket-node-0 (33.33%)
- websocket-node-1 (33.33%)
- websocket-node-2 (33.33%)
- Stickiness: Off
default
- Condition: No other rule applies
- Action: Forward to target group:
- api-node-0 (100%)
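For reference, the weighted forward actions above could be expressed roughly as follows. This is a minimal sketch of the `Actions` structure accepted by the ELBv2 API (for example by boto3's `create_rule` / `modify_rule`); the helper name and the ARN strings are hypothetical placeholders, not values from my setup. Equal integer weights give the even 33.33% split.

```python
def weighted_forward_action(target_group_arns, stickiness_enabled=False):
    """Build a 'forward' action that splits traffic evenly across target groups.

    Hypothetical helper: only the dict layout mirrors the ELBv2 API.
    """
    return {
        "Type": "forward",
        "ForwardConfig": {
            # Equal weights (1:1:1) correspond to an even percentage split.
            "TargetGroups": [
                {"TargetGroupArn": arn, "Weight": 1} for arn in target_group_arns
            ],
            # Target-group stickiness is off in both rules.
            "TargetGroupStickinessConfig": {"Enabled": stickiness_enabled},
        },
    }


# Placeholder ARNs (assumptions for illustration only).
API_GROUP_ARNS = [
    "arn:placeholder:api-node-0",
    "arn:placeholder:api-node-1",
    "arn:placeholder:api-node-2",
]

api_action = weighted_forward_action(API_GROUP_ARNS)
```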
Health Check attributes:
- Interval: 30 seconds
- Timeout: 5 seconds
- Healthy threshold: 2 consecutive health check successes
- Unhealthy threshold: 2 consecutive health check failures
- Success codes: 200
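As a rough sanity check on timing, the settings above imply how long a target must keep failing (or passing) before its state flips. The sketch below is only an approximation under the assumption that the state changes after `threshold` consecutive checks spaced `interval` seconds apart; the real delay also depends on when the first check lands and on the 5-second timeout.

```python
def seconds_to_state_change(interval_s, threshold):
    """Approximate time for `threshold` consecutive checks at `interval_s` spacing."""
    return interval_s * threshold


INTERVAL_S = 30          # Interval: 30 seconds
UNHEALTHY_THRESHOLD = 2  # 2 consecutive failures
HEALTHY_THRESHOLD = 2    # 2 consecutive successes

print(seconds_to_state_change(INTERVAL_S, UNHEALTHY_THRESHOLD))  # 60
print(seconds_to_state_change(INTERVAL_S, HEALTHY_THRESHOLD))    # 60
```

So any test traffic should be sent at least about a minute after breaking the health check, otherwise the target may still be counted as healthy.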
Load Balancer attributes:
- HTTP client keepalive duration: 3600 seconds
- Connection idle timeout: 60 seconds
- X-Forwarded-For header: Append
- Cross-zone load balancing: On
P.S. If you need any more information regarding the setup please let me know.
During normal testing, where all target groups are healthy, the ALB seems to operate as expected.
The issue arises when I simulate one of the services on a node becoming unhealthy. I changed the health check path of one target group, e.g. api-node-1; it shows up as unhealthy (Error 404), but traffic is still being sent to it. I confirmed this via both access logs and CloudWatch metrics (RequestCountPerTarget).
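The health state can also be double-checked programmatically rather than in the console. The sketch below assumes a boto3 `elbv2` client is passed in and uses its `describe_target_health` call; the function name `unhealthy_targets` is hypothetical, and the target group ARN has to be supplied by the caller.

```python
def unhealthy_targets(elbv2_client, target_group_arn):
    """Return (target_id, state, reason) for every target not reported as healthy.

    `elbv2_client` is assumed to be boto3.client("elbv2") or a compatible object.
    """
    response = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
    result = []
    for description in response["TargetHealthDescriptions"]:
        health = description["TargetHealth"]
        if health["State"] != "healthy":
            # "Reason" is only present for non-healthy states, hence .get().
            result.append(
                (description["Target"]["Id"], health["State"], health.get("Reason"))
            )
    return result


# Usage (requires AWS credentials; the ARN below is a placeholder):
#   import boto3
#   client = boto3.client("elbv2")
#   print(unhealthy_targets(client, "arn:placeholder:api-node-1"))
```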
I also tried simulating an unhealthy target group by blocking the ALB's access to the instance, removing the relevant security group from it (Error 400).
Testing methods (with an unhealthy target group): I sent requests with curl (10-20 times) or ran a Grafana k6 load test, and monitored traffic in both the access logs and CloudWatch. Traffic was still being routed to all the instances, even though one of them was shown as unhealthy.
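The access-log check can be reproduced with a small tally script. This is a sketch under the assumption that the logs follow the standard ALB access log layout, where (0-based, after quote-aware splitting) field 4 is `target:port` and field 16 is the target group ARN; the function name is hypothetical.

```python
import shlex
from collections import Counter

TARGET_FIELD = 4             # target:port (assumed position in the ALB log format)
TARGET_GROUP_ARN_FIELD = 16  # target_group_arn (assumed position)


def count_requests_per_target(lines):
    """Count log entries per (target group ARN, target:port) pair."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # shlex keeps quoted fields ("request", "user_agent") as single tokens.
        fields = shlex.split(line)
        if len(fields) <= TARGET_GROUP_ARN_FIELD:
            continue  # skip lines that do not look like ALB entries
        counts[(fields[TARGET_GROUP_ARN_FIELD], fields[TARGET_FIELD])] += 1
    return counts
```

ALB log files are delivered gzip-compressed, so in practice the lines would come from something like `gzip.open(path, "rt")`. A non-zero count for the unhealthy group's ARN is what shows the behaviour described above.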
You can find another question that discussed this issue linked here.