Question
I am suspicious of an unexplained 1600% increase in traffic and massive slow-down that lasted about 10 minutes. I'm not sure if it was an attempted DoS attack, dictionary login attack, etc. Regardless, what actions should I take to monitor my server (which logs should I look at, what tools should I use, etc.) to make sure nothing nefarious happened? What steps should I take during future slowdowns such as these? Is there a standard way to have the server alert me during such a surge in traffic?
All the gory details:
One of my clients reported an unresponsive website (Ruby on Rails via Apache, Mongrel, and mongrel_cluster on a CentOS 5 box) around 1:00 today.
I was in full troubleshooting mode when I got the email at 1:15. It was indeed exceptionally slow to ssh in and to load web pages, but ping output looked fine (78 ms), and traceroute from my workstation in Denver showed slow times on a particular hop midway from Dallas to the server in Phoenix (1611.978 ms 195.539 ms). Five minutes later the website was responsive and traceroute was now routing through San Jose to Phoenix. I couldn't find anything obviously wrong on my end--the system load looked quite reasonable (0.05 0.07 0.09)--so I assumed it was just a networking problem somewhere. Just to be safe, I rebooted the machine anyway.
Several hours later, I logged in to Google Analytics to see how things looked for the day. I saw a huge spike in hits: this site usually averages 6 visits/hour, but at 1:00 I got 130 (a 1600% increase)! Nearly all of these hits appear to come from 101 different hosts spread across the world. Each visitor was on the website for 0 seconds, each visit was direct (i.e. it's not as if the page got slashdotted), and each visit was a bounce.
Ever since about 1:30, things have been running smoothly and I'm back to the average 6 visits per hour.
Disclaimer:
I am a developer (not a sysadmin) who has to maintain the web servers for the machines that run the code I write.
It's unclear what you were pinging/tracing and from where, but if that 1600 ms reading was a hop in the middle of traceroute's output, then a jump from ~195 ms to 1600 ms probably means network congestion. If this correlates with your event and with the change in routing path, it is possible that part of your provider's network was attacked, including your server.
There is no single solution to your problem. There are many tools and approaches, like Scout, Keynote, New Relic, Nagios, etc. It all depends on your needs. Whatever you decide to do, just don't forget one thing: if you monitor something on a server from that same server, and that server becomes unavailable, you lose any means of notifying yourself that it is down :)
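As an illustration of monitoring from somewhere other than the server itself, here is a bare-bones sketch of an external check you could run from a cron job on a separate machine. The URL and alert address are placeholders, and it assumes curl and a working local mail command are available on that box:

    #!/bin/sh
    # Minimal external uptime check -- run from a box that is NOT the monitored server.
    # Example crontab entry:  */5 * * * * /usr/local/bin/check_site.sh
    URL="http://www.example.com/"      # placeholder: the site to watch
    ALERT="you@example.com"            # placeholder: where to send alerts
    # -f treats HTTP errors as failures; --max-time gives up after 15 seconds
    if ! curl -sf --max-time 15 -o /dev/null "$URL"; then
        echo "$URL failed to respond at $(date)" | mail -s "Site check failed" "$ALERT"
    fi

This only tells you the site is down; the hosted services mentioned above add trending, alert escalation, and checks from multiple locations.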
I would look to see if the connections were coming from some kind of web crawler. There has been a spike in the number of connections that come from applications like http://www.majestic12.co.uk/
This particular service acts like SETI@home or Folding@home, aggregating crawled data from its distributed users back to a central location. Majestic12 uses the following user agents: http://www.majestic12.co.uk/projects/dsearch/mj12bot.php
Majestic does, however, follow the rules configured in robots.txt, so you can block it from crawling your site; there are also other crawlers that work in this distributed fashion.
To determine whether this was the case, look at your web server logs and try to identify the user agent each connection reported. While the user agent isn't always reported correctly, it should give an indication of whether the traffic did indeed come from some kind of bot.
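For example, assuming the CentOS default Apache log location (your vhost may log elsewhere), a quick tally of user agents from a combined-format access log looks something like this:

    # Count requests per user agent (field 6 when splitting a combined-format line on double quotes)
    awk -F'"' '{print $6}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head -20

Grep for the relevant time window first if you want to narrow the count to the spike itself.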
If you discover that the connections did come from some kind of web crawler, you can try to restrict its access using a robots.txt file. If they all came from a particular user agent, you can ask that it not crawl your site with something similar to the file below.
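A minimal sketch, assuming the crawler honors robots.txt (Majestic states its MJ12bot does) and using MJ12bot as the example user agent:

    # robots.txt at the web root -- blocks only the named crawler
    User-agent: MJ12bot
    Disallow: /

Replace the user agent token with whatever your logs show, or use "User-agent: *" if you want to turn away all well-behaved crawlers.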