The problem:
30-40 minutes after pointing the DNS from our old server to the new servers, all available memory gets used up and our three load-balanced EC2 instances crash.
To make matters worse, it doesn't appear that Elastic Beanstalk is terminating instances which have crashed. I think that is because we can only select a single auto scaling trigger and memory usage is not one of the available triggers.
According to Chartbeat, our website seems to get 200-400 concurrent users (Google Analytics Real-Time shows 60-80 users).
I should also point out that I have "solved" the issue by installing Varnish on the EC2 instances. With Varnish in place, the servers do not crash and the frontend is very fast. However, I wanted to know whether or not this is normal behavior for 200 users on 3 load-balanced servers. I'm worried that there is something very wrong, or something that could be tweaked.
Spec overview:
On AWS we are using:
- 3 to 6 load-balanced and autoscaled t2.large EC2 servers (2 vCPUs and 8 GB of memory)
- managed by Elastic Beanstalk (for the Github integration)
- a classic load balancer
- Cloudflare for DNS and SSL termination
- Apache 2.4
- PHP 5.6
- PHP-FPM
- 64bit Amazon Linux/2.7.1 AMI
The configs for PHP and FPM are below:
What I have found
I switch the DNS over around 10 PM EST, when traffic is low (200 users according to Chartbeat, 60 according to GA), to test things and gather info.
After about 30-40 minutes, all memory gets used up. Unfortunately, I wasn't monitoring closely enough to tell whether memory use climbed steadily or just spiked. You can see from the image that latency exploded as well.
At this point I checked the logs and saw that the server had reached its max_children setting:
[19-Sep-2018 22:50:40] NOTICE: fpm is running, pid 6842
[19-Sep-2018 22:50:40] NOTICE: ready to handle connections
[19-Sep-2018 23:03:21] NOTICE: Reloading in progress ...
[19-Sep-2018 23:03:21] NOTICE: reloading: execvp("php-fpm-5.6", {"php-fpm-5.6"})
[19-Sep-2018 23:03:21] NOTICE: using inherited socket fd=9, "/var/run/php-fpm/php5-fpm.sock"
[19-Sep-2018 23:03:21] NOTICE: using inherited socket fd=9, "/var/run/php-fpm/php5-fpm.sock"
[19-Sep-2018 23:03:21] NOTICE: fpm is running, pid 8293
[19-Sep-2018 23:03:21] NOTICE: ready to handle connections
[19-Sep-2018 23:33:01] WARNING: [pool www] server reached max_children setting (200), consider raising it
I should probably increase max_children back up to 420 from 200. But I guess I don't fully understand what max_children does (it handles each request, right? And each page view could trigger requests for multiple images, CSS, a PHP file, JS calls, etc.?).
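Before just raising it back to 420, a memory-based sizing sketch for one t2.large might look like the following; the ~60 MB per worker is an assumption I would still need to confirm by measuring the actual resident size of the php-fpm processes:

; hypothetical pool sizing for one t2.large (8 GB RAM)
; reserve roughly 2 GB for the OS, Apache and everything else (assumption)
; assume ~60 MB resident memory per PHP-FPM worker (must be measured)
; (8192 MB - 2048 MB) / 60 MB ≈ 100 workers
pm = dynamic
pm.max_children = 100
pm.start_servers = 20
pm.min_spare_servers = 10
pm.max_spare_servers = 30

Sized this way, hitting max_children would make requests queue on the FPM listen socket instead of spawning more workers than the instance can hold in memory.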
But I was hoping that 3 EC2 servers would be able to handle this load, especially considering that the current, older infrastructure (Rackspace) is basically 2 servers: 1 Varnish cache and 1 server that serves the frontend of the site. Neither of those servers seems beefier than the new AWS servers; they only have 4 GB of memory. The PHP-FPM settings are also much lower on that server:
pm = dynamic
pm.max_children = 20
pm.start_servers = 8
pm.min_spare_servers = 5
pm.max_spare_servers = 10
And that's what's crazy to me. How can the old server (plus Varnish cache) with lower specs and lower FPM settings handle all this traffic, but my 3 to 6 load-balanced EC2 servers can't?
Next Steps
- Maybe EC2 servers just really suck compared to the old Rackspace server and I need to choose larger instances?
- The RDS database is a big bottleneck; until I adjusted its settings it wouldn't allow more than 40 connections. Maybe I need to use an EC2 server running MySQL? (I have another, separate but related question open about this.) A rough my.cnf sketch is after this list.
- Memcached or Redis via ElastiCache might help, so long as I can ensure it doesn't interfere with the admin section.
- OPcache is enabled by default in PHP 5.6, but is there anything else I need to do to use it? (See the php.ini sketch after this list.)
- Add memory monitoring and additional autoscale triggers to Elastic Beanstalk.
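For the RDS point, the 40-connection ceiling presumably maps to MySQL's max_connections; on RDS it is changed through a DB parameter group, while on a self-managed MySQL box on EC2 it would live in my.cnf. The value below is only an illustration of the relationship to the FPM pools, not a recommendation:

[mysqld]
# each busy PHP-FPM worker can hold one DB connection, so this has to
# cover max_children summed across the web instances (illustrative value)
max_connections = 350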
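For the OPcache point, this is the kind of php.ini block I would verify and tune; the directive names are the standard OPcache ones, but the values are guesses rather than measured recommendations:

; confirm the extension is loaded, then tune (values below are guesses)
opcache.enable=1
opcache.memory_consumption=128        ; MB of shared memory for compiled scripts
opcache.interned_strings_buffer=8     ; MB for interned strings
opcache.max_accelerated_files=10000   ; raise if the codebase has more PHP files
opcache.validate_timestamps=1         ; keep on so deploys are picked up
opcache.revalidate_freq=60            ; re-stat files at most once a minute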
A cache hit is extremely fast, possibly 100x faster than generating dynamic content. Hits remove unnecessary duplicate work from the backend.
To compare the hosting providers, you need to compare similar designs. One with a cache and one without will have very different performance characteristics.
That health monitor screenshot shows relatively low CPU use and run queue (load average), but high request latency. Look at /proc/meminfo (MemAvailable and SwapFree in particular) to confirm the instances are under memory pressure. If memory is the limiting factor, more workers will hurt rather than help.
Regarding the scaling triggers, use something other than memory to limit connections per instance, perhaps network traffic or request count.