I recently migrated a WordPress website I inherited from Rackspace over to AWS and am seeing a big drop in performance. I'm pretty new to the whole DevOps thing, so I'm not sure where to start looking.
The problem:
After I point the DNS (Cloudflare) from our old server to the new one, WordPress (specifically the admin section) is fine for a short while. But after an hour or so, admin page loads time out or take up to 30 seconds. I haven't tested every page yet, but it seems to happen on pages with editors, e.g. "new post" or editing a page. When a page times out, I see a Cloudflare timeout message.
I should note that I am making this switch at around 11pm, so no one but me should be logged in, and web traffic is at about 150 users.
I have also noticed that our RDS instance starts hitting 100% CPU usage, and the DB connections skyrocket to about 1000.
Because RDS has WELL surpassed its max_connections limit, WordPress can no longer connect to the database, and the front end of the site now shows the "unable to establish connection to database" message.
At this point, I could restart my EC2 instances, but the RDS database is still sorting out the 1000 connections.
I have also noticed that the (classic) Elastic Load Balancer seems to stop distributing traffic evenly across both instances.
I'll have another go at it this weekend, but what should I be looking for in the logs? Before the EC2 instances shut down, I tailed the logs and all I saw was:
pid 17186:tid 139743773734656] (70007)The timeout specified has expired: [client 127.0.0.1:58604] AH01075: Error dispatching request to : (polling), referer: http://m.facebook.com
Spec overview:
Server - On AWS we are using 2 to 6 load-balanced/autoscaled m3.large EC2 instances managed by Elastic Beanstalk (for the GitHub integration) behind a Classic Load Balancer. Cloudflare is used for DNS and SSL termination.
We are using Apache 2.4 with PHP 5.6 and PHP-FPM, all on a 64-bit Amazon Linux/2.7.1 AMI.
RDS - An r3.large running Aurora, part of a cluster with a read replica running. I tried using an R3 Large a couple of days ago, and admin page loads were 15-30 seconds… still very slow.
I should also mention that RDS was set up on its own, outside of Elastic Beanstalk. I don't think that should matter, though. There are, however, two other databases on that server for a couple of smaller sites that get basically no traffic and will be decommissioned soon.
I have enabled object caching via W3TC, and I have added some Cloudflare page rules to disable Performance and Apps for /wp-admin* as recommended here.
A couple of things I've read around the internet:
- Maybe the ELB timeout limit is lower than the Apache timeout and that's causing issues
- This article suggests changing my MPM from event to prefork or worker
It looks like you have 2 separate issues:
1) Connections take 30s to complete.
2) You are exceeding the 1000-connection limit of the db.r3.large RDS Aurora instance, after which new connections time out since PHP can no longer establish new sessions to RDS.
The 1st looks a lot like a DNS name resolution issue. Check how your database connections are configured (IP versus FQDN). If it's an FQDN, check your /etc/nsswitch.conf and check your Cloudflare DNS. You want to make sure that forward and reverse name resolution works as it should and that the 30-second delay is not caused by that. You can also run tcpdump port 53 to check what's going on with name resolution.
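If it helps to pin down where those 30 seconds go, here's a minimal sketch, assuming PHP 5.6 with mysqli; the endpoint and credentials are placeholders, not values from the question. It times name resolution separately from the actual MySQL handshake:

    <?php
    // dns-vs-connect.php - rough diagnostic sketch; host/user/pass/db below are
    // placeholders, copy the real values from wp-config.php.
    $host = 'your-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com';
    $user = 'wordpress';
    $pass = 'secret';
    $db   = 'wordpress';

    // Time DNS resolution on its own.
    $t0 = microtime(true);
    $ip = gethostbyname($host);
    printf("DNS lookup: %.3fs (%s -> %s)\n", microtime(true) - $t0, $host, $ip);

    // Time the MySQL connection separately.
    $t1 = microtime(true);
    $link = mysqli_connect($host, $user, $pass, $db);
    printf("DB connect: %.3fs\n", microtime(true) - $t1);

    if ($link) {
        mysqli_close($link);
    } else {
        echo 'Connect failed: ' . mysqli_connect_error() . "\n";
    }

If the first number is where the ~30 seconds goes, the problem is resolution (nsswitch/Cloudflare); if it's the second, the delay is on the MySQL side rather than DNS.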
For the 2nd, you need to figure out why the number of connections exceeds 1000.
When you were not on RDS Aurora, what was your "normal" number of connections? Depending on which DB is used, there will be different queries to check for that. If the site normally consumes more than 1000 connections, then you would have to size your RDS instance accordingly (or re-engineer your app; perhaps you are using a WordPress plugin that drives the number of connections that high).
If the number of connections on the non-RDS database is significantly lower than 1000, then you need to troubleshoot what's causing those extra connections.
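A quick way to compare the live connection count against the configured ceiling on Aurora/MySQL is something like this sketch (placeholder credentials again; the two SHOW statements are standard MySQL, so it also works on the old server if that is MySQL too):

    <?php
    // conn-count.php - sketch: current connections vs. the max_connections ceiling.
    // Host/user/pass are placeholders.
    $link = mysqli_connect('your-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com',
                           'wordpress', 'secret');

    // Currently open connections.
    $row = mysqli_fetch_row(mysqli_query($link, "SHOW STATUS LIKE 'Threads_connected'"));
    echo "Threads_connected: {$row[1]}\n";

    // The ceiling set by the RDS/Aurora parameter group.
    $row = mysqli_fetch_row(mysqli_query($link, "SHOW VARIABLES LIKE 'max_connections'"));
    echo "max_connections:   {$row[1]}\n";

    mysqli_close($link);

Run it against both the old database and the new cluster and compare the numbers.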
OK, I found the answer. I was introduced to the tool MyTop, which is basically top for MySQL queries. Thanks to that tool, I was able to see that there was a single query running thousands and thousands of times, which wound up choking everything.
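For anyone reading this later without MyTop installed, roughly the same view can be had by grouping the server's process list. This is a sketch with placeholder credentials, using the standard information_schema.processlist table:

    <?php
    // busy-queries.php - sketch: group the live process list to spot a query
    // that is running over and over. Host/user/pass are placeholders.
    $link = mysqli_connect('your-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com',
                           'wordpress', 'secret');

    $sql = "SELECT LEFT(info, 80) AS query_text, COUNT(*) AS running
              FROM information_schema.processlist
             WHERE info IS NOT NULL
          GROUP BY LEFT(info, 80)
          ORDER BY running DESC
             LIMIT 10";

    $res = mysqli_query($link, $sql);
    while ($row = mysqli_fetch_assoc($res)) {
        printf("%5d  %s\n", $row['running'], $row['query_text']);
    }
    mysqli_close($link);

A query showing up hundreds or thousands of times in that list is the same signal MyTop was giving.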
After I identified the query, I jumped into New Relic and, using their database stack trace, was able to find which PHP file was running the code that made the request, and it was there I discovered a while loop that was out of control. I am unsure why that loop wasn't a problem on the old server, but I commented that code out and now AWS runs like a dream.
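For anyone hitting something similar, the general shape of that kind of bug is sketched below. This is a made-up illustration, not the actual code from the site (which isn't shown here), and it assumes WordPress's standard global $wpdb object; the point is that the loop condition never changes, so the same query fires on every pass and each stuck request pins a database connection.

    <?php
    // Hypothetical illustration only - not the code that was actually commented out.
    global $wpdb; // WordPress's database handle

    // Buggy shape (left commented out on purpose): the same query is re-run on
    // every pass and nothing ever empties $rows, so the loop never terminates.
    //
    // $rows = $wpdb->get_results("SELECT ID FROM {$wpdb->posts} WHERE post_status = 'draft'");
    // while (!empty($rows)) {
    //     $rows = $wpdb->get_results("SELECT ID FROM {$wpdb->posts} WHERE post_status = 'draft'");
    // }

    // Safer shape: page through the results so the loop has a natural end.
    $offset = 0;
    do {
        $rows = $wpdb->get_results(
            $wpdb->prepare("SELECT ID FROM {$wpdb->posts} LIMIT 100 OFFSET %d", $offset)
        );
        $offset += 100;
    } while (!empty($rows));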