So I've got a website up and running with nginx/PHP-FPM/Ubuntu.
It works really well (and fast) and uses hardly any memory. My client started an ad campaign yesterday, and there were a couple times where for five or ten minutes at a time the website wasn't loading. I'm highly doubtful it was traffic overload, since statistics show there weren't very many visitors so far.
During these "outages" I would connect via SSH and run htop to see resource statistics. The processors (all of them) were around 0%, RAM was around 350 MB out of 1024 MB, and no swap was in use.
I looked at the access logs briefly and didn't see much there, though I did notice a couple of bots poking around. I'm doubtful they're the cause, since there was so little traffic. (On a side note, what's a good way to "consume" simple text log files?)
What are all the steps to debugging this?
The first step would be to isolate where the failure is happening. It sounds like you were able to connect to the server during the outage, so it seems unlikely to me that there was a general server failure or a server-local network problem.
The first thing I would do if my web browser was unable to bring up the page would be to establish whether port 80 is responding to connection attempts. The easiest way to do that is with `telnet` (assuming you're using something Unix-like). Try it out against servers you know are working to see what a successful connection looks like.
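For www.google.com, for example, I get output roughly like this (the IP address and exact wording will differ on your end):

```
$ telnet www.google.com 80
Trying 142.250.186.36...
Connected to www.google.com.
Escape character is '^]'.
```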
(To exit from telnet in this state, you need to hit Ctrl-], then Enter, then Ctrl-D.)
Failures you might see include DNS failure:
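(The hostname below is a placeholder, and the exact message varies by telnet implementation and resolver.)

```
$ telnet www.your-site.example 80
telnet: could not resolve www.your-site.example/80: Name or service not known
```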
In which case you would follow up by trying to connect to the IP address.
Another failure possibility is a refused or timed-out connection:
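(Again a placeholder hostname and address; a timed-out connection looks similar but ends with "Connection timed out".)

```
$ telnet www.your-site.example 80
Trying 203.0.113.10...
telnet: Unable to connect to remote host: Connection refused
```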
This generally means either the server or a load balancer in between you and the server is not listening on the correct port. You might also see:
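(Placeholder address again.)

```
$ telnet www.your-site.example 80
Trying 203.0.113.10...
telnet: Unable to connect to remote host: No route to host
```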
Which means the server doesn't exist at the address you thought it did, or there's a network routing problem in between.
You should first test this from outside the network the server is on, preferably from somewhere several ISPs removed. Then try it from the local network. Then try it from the local machine itself, using "localhost" in place of the hostname, assuming your web server is set to listen for loopback connections.
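That last check, on the local machine, might look like this (assuming nginx is also bound to the loopback interface):

```
telnet localhost 80
```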
Once you know the pattern of the failures, then you can start trying to figure out where the failure is happening. My gut instinct is that your nginx or FastCGI is the root of the problem rather than some intermittent network problem that doesn't affect SSH traffic, but it's not really possible to troubleshoot further without first addressing the network question.
Hope this gives you some ideas of what to start with next time. Good luck.
Update
I just noticed your side question about the best way to "consume" log files. If you are in the middle of troubleshooting the problem, I recommend using `tail`. Open up two ssh sessions on the server, and in one run

```
tail -f /var/log/nginx/access_log
```

and in the other

```
tail -f /var/log/nginx/error_log
```

(or whatever the paths are on your system).

If you need to dig through a dense log file after the fact, a good tool to start with is `less`. Just run `less /var/log/nginx/error_log`, then press Space to page down, `b` to page up, and `/` to start a search, after which `n` jumps to the next match and `N` to the previous one; `q` exits back to the shell.

I would guess there are better tools specific to particular types of logs, but `tail` and `less` usually get me about 90% of what I need when troubleshooting my logs.

You should test from IP addresses external to your location, such as proxies. You could also use the Tor network for this kind of testing. The first thing to check is whether the site is accessible from various places on the Internet. It may be that the DNS records were changed recently and haven't propagated yet.
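A quick way to compare what different vantage points and resolvers see might look like this (the hostname is a placeholder, and torsocks requires Tor to be installed):

```
# Placeholder hostname: substitute your site. Compare answers from two public resolvers.
dig +short www.your-site.example @8.8.8.8
dig +short www.your-site.example @1.1.1.1

# Fetch the response headers through Tor to test from an outside network path.
torsocks curl -sI http://www.your-site.example/
```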
You've not provided any information about how the server is configured or where it's hosted. There are all sorts of things which might be affecting this - e.g. network connection problems, or CPU contention issues on a virtual machine.
I assume you've got error logging configured correctly and have checked that there was no change in the pattern of errors during these outages.
There's probably not a lot you can do to analyse what happened in the previous event - but do look to see if there has been a variation in response times.
Going forward you might consider setting up iptables to log the start of every TCP handshake on port 80, and start writing the request-processing time to the logfiles (%D in Apache terms; $request_time in nginx). Then look to see whether there are slow responses or gaps between SYN packets and completed responses.
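A rough sketch of both pieces (the rule and log format are illustrative, not drop-in):

```
# Log every incoming SYN on port 80 to the kernel log (visible via dmesg/syslog).
iptables -I INPUT -p tcp --dport 80 --syn -j LOG --log-prefix "http-syn: "
```

and in the nginx http block, something like:

```
# Illustrative nginx log format adding $request_time (seconds spent serving the request).
log_format timed '$remote_addr [$time_local] "$request" $status $request_time';
access_log /var/log/nginx/access_timed.log timed;
```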
If the system shows a consistent delay between the SYN and the response, then the problem is not with the software running on the machine.
Running external (HTTP) and internal (just a daemon which writes something to a log file, then sleeps for a short interval) heartbeats against the server might be a good idea too. Again, if you see issues on the external heartbeat but not the internal one, it points to a network problem; if you see gaps in both, then there's a problem with the server hardware itself.
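A minimal sketch of both heartbeats (paths, URL, and interval are placeholders):

```
# Internal heartbeat - run on the server itself; gaps here point at the host, not the network.
while true; do
    date '+%F %T internal-ok' >> /var/log/heartbeat.log
    sleep 30
done
```

```
# External heartbeat - run from a machine on another network; logs HTTP status and total time.
while true; do
    echo "$(date '+%F %T') $(curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time 10 http://www.your-site.example/)" >> heartbeat-external.log
    sleep 30
done
```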
Consider adding a client-side performance agent such as boomerang to log page response times too.