I've done a bit of `site:` searching with Google on Server Fault, Super User and Stack Overflow. I also checked non-site-specific results and didn't really see a question like this, so here goes...
I did spot this question, related to grep and awk, which has some great knowledge, but I don't feel the text qualification challenge was addressed. This question also broadens the scope to any platform and any program.
I've got squid or apache logs based on the NCSA combined format. By "based on" I mean that the first n columns in the file follow the NCSA combined standard, but there may be additional columns with custom fields.
Here is an example line from a squid combined log:
```
1.1.1.1 - - [11/Dec/2010:03:41:46 -0500] "GET http://yourdomain.com:8080/en/some-page.html HTTP/1.1" 200 2142 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; C) AppleWebKit/532.4 (KHTML, like Gecko)" TCP_MEM_HIT:NONE
```
I'd like to be able to parse n logs and output specific columns, for sorting, counting, finding unique values, etc.
The main challenge, and what makes this a little tricky (and also, I feel, why this question hasn't yet been asked or answered), is the text qualification conundrum: the quoted columns (request, referrer, user agent) contain spaces, so a naive whitespace split breaks them apart.
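To illustrate with the example line above (a throwaway sketch; `access.log` is a placeholder filename):

```
# a naive whitespace split: column 12 is only the start of the quoted user agent
awk '{ print $12 }' access.log
# output for the line above: "Mozilla/5.0
```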
When I spotted asql in the grep/awk question, I was very excited, but then realised that it didn't support the combined format out of the box; that's something I'll look at extending, I guess.
Looking forward to answers and learning new stuff! Answers don't have to be limited to any platform or program/language. For the context of this question, the platforms I use the most are Linux and OSX.
Cheers
Using Perl, tested on v5.10.0 built for darwin-thread-multi-2level (OSX)
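A sketch of the kind of one-liner this gives you (the exact capture groups are illustrative; here printing `$6`, the request URL):

```
# match the NCSA combined prefix and print the request URL (group 6)
perl -n -e '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"/ && print "$6\n"' test.log
```

Because the pattern is only anchored at the start, trailing custom columns (like squid's TCP_MEM_HIT:NONE in the example) are simply ignored.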
- `-n`: loop over each line of `test.log`
- `-e`: the one-line program to run

I stole and tweaked the perlre, which I found via Google, from the PHP cookbook. I removed the `$` from the end of the re to support custom formats based on NCSA combined. The pattern can easily be extended to provide more groups. The regular expression groups `()` end up in the variables `$1` to `$n`.

Quick and dirty, and very easy to extend and script.
Some examples of piping the output:
- `| sort | uniq` for unique column values
- `| sort | uniq | wc -l` for a unique column count

Critique and improvements welcome!
Although it doesn't directly address text qualification, one factor that can be taken advantage of in the combined format is that the remaining space-delimited columns sit consistently in the same positions. You can therefore work around the problem with a loop using printf and NF (the number of columns).
In awk, $0 is the entire input line, $1 is the first column, $2 is the second, and $NF is the last.
So for a standard NCSA combined log, the user agent is columns $12 through $NF (counting the example line above: the bracketed timestamp splits into two space-delimited columns, and the quoted user agent begins at the twelfth).
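For example, to reassemble the user agent (a minimal sketch, assuming a standard combined log named `access.log` where the user agent runs to the end of the line):

```
# print columns 12..NF, rejoined with spaces
awk '{ for (i = 12; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n") }' access.log
```

For the squid example above, which has a trailing cache-status column, the loop bound would be NF - 1 instead.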
I needed to remove the first column and swap it with the last column of a modified log format (the proxied IP was added as the last column).
So what should be returned is the $NF column, followed by the second column ($2), and then the remaining columns through $(NF-1).
I was able to do that with the following:
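(A sketch; `access.log` is a placeholder filename.)

```
# emit $NF first, then $2, then everything from $3 through $(NF-1)
awk '{ printf "%s %s", $NF, $2; for (i = 3; i <= NF - 1; i++) printf " %s", $i; print "" }' access.log
```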