As Stack Overflow grows, we're starting to look closely at our IIS logs to identify problem HTTP clients -- things like rogue web spiders, users who have a large page set to refresh every second, poorly written one-off web scrapers, tricksy users who try to increment page count a zillion times, and so forth.
I've come up with a few LogParser queries that help us identify most of the oddities and abnormalities when pointed at an IIS log file.
Top bandwidth usage by URL
SELECT top 50 DISTINCT
SUBSTR(TO_LOWERCASE(cs-uri-stem), 0, 55) AS Url,
Count(*) AS Hits,
AVG(sc-bytes) AS AvgBytes,
SUM(sc-bytes) as ServedBytes
FROM {filename}
GROUP BY Url
HAVING Hits >= 20
ORDER BY ServedBytes DESC
url                                               hits   avgbyte  served
------------------------------------------------- -----  -------  -------
/favicon.ico                                      16774      522  8756028
/content/img/search.png                           15342      446  6842532
Top hits by URL
SELECT TOP 100
cs-uri-stem as Url,
COUNT(cs-uri-stem) AS Hits
FROM {filename}
GROUP BY cs-uri-stem
ORDER BY COUNT(cs-uri-stem) DESC
url                                               hits
------------------------------------------------- -----
/content/img/sf/vote-arrow-down.png               14076
/content/img/sf/vote-arrow-up.png                 14018
Top bandwidth and hits by IP / User-Agent
SELECT TOP 30
c-ip as Client,
SUBSTR(cs(User-Agent), 0, 70) as Agent,
Sum(sc-bytes) AS TotalBytes,
Count(*) as Hits
FROM {filename}
group by c-ip, cs(User-Agent)
ORDER BY TotalBytes desc
client        user-agent                                    totbytes   hits
------------- --------------------------------------------- ---------  -----
66.249.68.47  Mozilla/5.0+(compatible;+Googlebot/2.1;       135131089  16640
194.90.190.41 omgilibot/0.3++omgili.com                     133805857   6447
Top bandwidth by hour by IP / User-Agent
SELECT TOP 30
TO_STRING(time, 'h') as Hour,
c-ip as Client,
SUBSTR(cs(User-Agent), 0, 70) as Agent,
Sum(sc-bytes) AS TotalBytes,
count(*) as Hits
FROM {filename}
group by c-ip, cs(User-Agent), hour
ORDER BY sum(sc-bytes) desc
hr client        user-agent                                totbytes hits
-- ------------- ----------------------------------------- -------- ----
9  194.90.190.41 omgilibot/0.3++omgili.com                 30634860 1549
10 194.90.190.41 omgilibot/0.3++omgili.com                 29070370 1503
Top hits by hour by IP / User-Agent
SELECT TOP 30
TO_STRING(time, 'h') as Hour,
c-ip as Client,
SUBSTR(cs(User-Agent), 0, 70) as Agent,
count(*) as Hits,
Sum(sc-bytes) AS TotalBytes
FROM {filename}
group by c-ip, cs(User-Agent), hour
ORDER BY Hits desc
hr client        user-agent                                hits totbytes
-- ------------- ----------------------------------------- ---- --------
10 194.90.190.41 omgilibot/0.3++omgili.com                 1503 29070370
12 66.249.68.47  Mozilla/5.0+(compatible;+Googlebot/2.1    1363 13186302
The {filename} of course would be a path to an IIS logfile, such as
c:\working\sologs\u_ex090708.log
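For example, a full command line using the "top hits by URL" query above against that file might look roughly like this (same invocation style as the logparser examples further down; untested as written):

logparser "SELECT TOP 100 cs-uri-stem AS Url, COUNT(cs-uri-stem) AS Hits FROM c:\working\sologs\u_ex090708.log GROUP BY cs-uri-stem ORDER BY COUNT(cs-uri-stem) DESC" -rtp:-1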
I did a lot of web searches for good IIS LogParser queries and found precious little. These 5, above, have helped us tremendously in identifying serious problem clients. But I'm wondering -- what are we missing?
What other ways are there to slice and dice the IIS logs (preferably with LogParser queries) to mine them for statistical anomalies? Do you have any good IIS LogParser queries you run on your servers?
A good indicator of hacking activity or other attacks is the number of errors per hour. The following script returns the dates and hours that had more than 25 error codes returned. Adjust the value depending on the amount of traffic on the site (and the quality of your web application ;-) ).
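Something along these lines should do it (a rough, untested sketch; 25 is the threshold mentioned above, and the fields assume standard W3C logging):

SELECT date AS Date, QUANTIZE(time, 3600) AS Hour,
       sc-status AS Status, COUNT(*) AS ErrorCount
FROM {filename}
WHERE sc-status >= 400
GROUP BY date, Hour, sc-status
HAVING ErrorCount > 25
ORDER BY ErrorCount DESC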
The result is a list of the dates and hours that saw an unusually large number of errors.
The next query detects an unusually high number of hits on a single URL from one IP address. In this example I chose 500, but you may have to change the query for edge cases (excluding the IP address of Google London, for example ;-) ).
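A rough, untested sketch of that query (it's essentially the same shape as the hits-per-page-per-IP example further down, with a higher threshold):

SELECT DISTINCT date, cs-uri-stem, c-ip, COUNT(*) AS Hits
FROM {filename}
GROUP BY date, c-ip, cs-uri-stem
HAVING Hits > 500
ORDER BY Hits DESC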
One thing you could consider to filter out legitimate traffic (and broaden your scope) is to enable cs(Cookie) in your IIS logs, add a bit of code that sets a small cookie using JavaScript, and add WHERE cs(Cookie)='' to your queries.
Because of your small bit of code, every user should have a cookie unless they manually disabled cookies (which a small percentage of people might do) or unless that user is actually a bot that doesn't support JavaScript (for example, wget, httpclient, etc. don't support JavaScript).
I suspect that if a user has a high volume of activity, but they accept cookies and have JavaScript enabled, they are more likely to be a legitimate user, whereas if you find a user with a high volume of activity but no cookie/JavaScript support, they are more likely to be a bot.
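As a rough, untested sketch of what that filter might look like layered onto the bandwidth-by-IP query above (note that, depending on how IIS logs empty fields, you may need cs(Cookie) IS NULL rather than cs(Cookie)='' to match requests without a cookie):

SELECT TOP 30
    c-ip AS Client,
    SUBSTR(cs(User-Agent), 0, 70) AS Agent,
    SUM(sc-bytes) AS TotalBytes,
    COUNT(*) AS Hits
FROM {filename}
WHERE cs(Cookie)=''
GROUP BY c-ip, cs(User-Agent)
ORDER BY TotalBytes DESC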
Sorry, can't comment yet so I'm forced to answer.
There's a minor bug with the 'Top bandwidth usage by URL' query. While most of the time you'd be okay taking your requests for a page and multiplying by the file size, in this case, since you're not paying attention to any query parameters, you're going to run into some slightly-to-very inaccurate numbers.
For a more accurate value, just use SUM(sc-bytes) AS ServedBytes instead of MUL(Hits, AvgBytes) AS ServedBytes.
Anders Lundström has been writing a series of blog articles about common LogParser queries.
I've been using these:
This guy has about a dozen useful queries:
http://logparserplus.com/Examples/Queries.aspx
You may want to look for your longest requests (stems and/or queries), and the ones with the most bytes received by the server. I'd also try one that groups by the bytes received and the IP, so that you can see whether a particular request format is being repeated by one IP.
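For instance, an untested sketch along these lines would group requests by client IP and request size, which should make repeated identical payloads from a single IP stand out (this assumes the cs-bytes field is actually enabled in your IIS logging):

SELECT TOP 30
    c-ip AS Client,
    cs-bytes AS BytesReceived,
    COUNT(*) AS Hits
FROM {filename}
GROUP BY c-ip, cs-bytes
ORDER BY Hits DESC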
I'd also count hits grouped either by requesting IP and the hour and minute of the day, or by requesting IP and the minute of the hour, to find any regularly recurring visits that may be scripts. This would be a small modification of the hits-by-hour script.
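A rough, untested take on the minute-of-hour variant (this assumes the 'mm' format specifier returns the minute, in the same way 'h' returns the hour in the queries above):

SELECT TOP 30
    c-ip AS Client,
    TO_STRING(time, 'mm') AS Minute,
    COUNT(*) AS Hits
FROM {filename}
GROUP BY c-ip, Minute
ORDER BY Hits DESC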
On any non-programming sites, searching your logs for SQL keywords is also a good idea, things like SELECT, UPDATE, DROP, DELETE and other oddities like FROM sys.tables; ORing those together and counting by IP would seem handy. For most sites these words would rarely, if ever, appear in the query portion of the URI, though here they might legitimately appear in the URI stem and data parts. I like reversing the IPs of any hits just to see who's running premade scripts. I tend to see .ru, .br, .cz and .cn. I don't mean to judge, but I sort of tend to block them henceforth. In their defense, those countries are generally mostly populated, though thus far I don't see much of, say, .in, .fr, .us or .au doing the same.
P.S. I can't verify that these queries would actually run correctly. Please edit them freely if they need fixing.
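An untested sketch of that keyword scan, counting hits by IP (field names assume standard W3C logging, and the keyword list is just the sample above; add or remove patterns to taste):

SELECT c-ip AS Client, COUNT(*) AS Hits
FROM {filename}
WHERE TO_LOWERCASE(cs-uri-query) LIKE '%select%'
   OR TO_LOWERCASE(cs-uri-query) LIKE '%update%'
   OR TO_LOWERCASE(cs-uri-query) LIKE '%drop%'
   OR TO_LOWERCASE(cs-uri-query) LIKE '%delete%'
   OR TO_LOWERCASE(cs-uri-query) LIKE '%sys.tables%'
GROUP BY c-ip
ORDER BY Hits DESC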
These were all found here (which is an excellent guide for parsing your IIS logfiles, btw):
20 newest files on your website
logparser -i:FS "SELECT TOP 20 Path, CreationTime from c:\inetpub\wwwroot\*.* ORDER BY CreationTime DESC" -rtp:-1
20 most recently modified files
logparser -i:FS "SELECT TOP 20 Path, LastWriteTime from c:\inetpub\wwwroot\*.* ORDER BY LastWriteTime DESC" -rtp:-1
Files that have resulted in 200 status codes (in case trojans were deleted)
logparser "SELECT DISTINCT TO_LOWERCASE(cs-uri-stem) AS URL, Count() AS Hits FROM ex.log WHERE sc-status=200 GROUP BY URL ORDER BY URL" -rtp:-1
Show any IP address that hit the same page more than 50 times in a single day
logparser "SELECT DISTINCT date, cs-uri-stem, c-ip, Count() AS Hits FROM ex.log GROUP BY date, c-ip, cs-uri-stem HAVING Hits>50 ORDER BY Hits Desc" -rtp:-1
I don't know how to do it with LogParser, but looking for strings of requests for things like "phpMyAdmin" (or other common vulnerabilities) that get 404s might be a good way to identify scripted attacks.
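An untested sketch of what that might look like as a LogParser query (the phpMyAdmin pattern is just an example; substitute whatever paths you care about):

SELECT c-ip AS Client, cs-uri-stem AS Url, COUNT(*) AS Hits
FROM {filename}
WHERE sc-status=404 AND TO_LOWERCASE(cs-uri-stem) LIKE '%phpmyadmin%'
GROUP BY c-ip, cs-uri-stem
ORDER BY Hits DESC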