I've got a logfile in the standard syslog format. It looks like this, except with hundreds of lines per second:
Jan 11 07:48:46 blahblahblah...
Jan 11 07:49:00 blahblahblah...
Jan 11 07:50:13 blahblahblah...
Jan 11 07:51:22 blahblahblah...
Jan 11 07:58:04 blahblahblah...
It doesn't roll at exactly midnight, but it'll never have more than two days in it.
I often have to extract a timeslice from this file. I'd like to write a general-purpose script for this, that I can call like:
$ timegrep 22:30-02:00 /logs/something.log
...and have it pull out the lines from 22:30, onward across the midnight boundary, until 2am the next day.
There are a few caveats:
- I don't want to have to bother typing the date(s) on the command line, just the times. The program should be smart enough to figure them out.
- The log date format doesn't include the year, so it should guess based on the current year, but nonetheless do the right thing around New Year's Day.
- I want it to be fast -- it should use the fact that the lines are in order to seek around in the file and use a binary search.
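For concreteness, the date-guessing I have in mind could work roughly like this (a sketch of the desired behavior, not an existing tool): anchor the end time to the most recent occurrence of its HH:MM at or before "now", then anchor the start time to the most recent occurrence at or before the end. Midnight and New Year's boundaries then fall out naturally.

```python
from datetime import datetime, timedelta

def resolve_range(start_hhmm, end_hhmm, now=None):
    """Attach dates to bare HH:MM times: the end is the most recent
    occurrence of end_hhmm not after 'now'; the start is the most
    recent occurrence of start_hhmm not after the end.  Crossing
    midnight (and hence New Year's) falls out naturally."""
    now = now or datetime.now()

    def most_recent(hhmm, limit):
        # Most recent datetime with the given HH:MM at or before 'limit'.
        h, m = map(int, hhmm.split(":"))
        t = limit.replace(hour=h, minute=m, second=0, microsecond=0)
        if t > limit:
            t -= timedelta(days=1)
        return t

    end = most_recent(end_hhmm, now)
    start = most_recent(start_hhmm, end)
    return start, end
```

Calling `resolve_range("22:30", "02:00")` just before 3 a.m. on Jan 1 yields a start on Dec 31 of the previous year, which is the New Year's behavior I want.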
Before I spend a bunch of time writing this, does it already exist?
Update: I've replaced the original code with an updated version with numerous improvements. Let's call this (actual?) alpha-quality. This version includes `try` blocks.

Original text:

Well, what do you know? "Seek" and ye shall find! Here is a Python program that seeks around in the file and uses a more-or-less binary search. It's considerably faster than that AWK script that other guy wrote.

It's (pre?) alpha-quality. It should have `try` blocks and input validation and lots of testing, and it could no doubt be more Pythonic. But here it is for your amusement. Oh, and it's written for Python 2.6.

New code:
This will print the range of entries between a start time and an end time based on how they relate to the current time ("now").
Usage:
Example:
The `-l` (long) option causes the longest possible output. The start time will be interpreted as yesterday if the hours-and-minutes value of the start time is less than both the end time and now. The end time will be interpreted as today if both the start time and end time HH:MM values are greater than "now".

Assuming that "now" is "Jan 11 19:00", this is how various example start and end times will be interpreted (without `-l` except as noted).

Almost all of the script is setup. The last two lines do all the work.
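Taken literally, those two rules can be modeled like this (a sketch of the prose above, not the actual script, whose edge-case behavior may differ; zero-padded "HH:MM" strings compare in time order as plain text):

```python
def start_is_yesterday(start, end, now):
    # Rule 1: the start is read as yesterday if its HH:MM is less
    # than both the end time and "now".
    return start < end and start < now

def end_is_today(start, end, now):
    # Rule 2: the end is read as today if both the start and end
    # HH:MM values are greater than "now".
    return start > now and end > now
```

With "now" at 19:00, a 12:00 to 18:00 range gets a "yesterday" start, while a 22:30 to 02:00 range does not (22:30 is not less than 02:00).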
Warning: no argument validation or error checking is done. Edge cases have not been thoroughly tested. This was written using `gawk`; other versions of AWK may squawk.

I think AWK is very efficient at searching through files. I don't think anything else is necessarily going to be any faster at searching an unindexed text file.
From a quick search on the net, there are things that extract based upon keywords (like FIRE or such :) but nothing that extracts a date range from the file.
It does not seem hard to do what you propose; it seems straightforward, and I could write it for you if you don't mind Ruby. :)
A C++ program applying a binary search; it would need some simple modifications (e.g., calling `strptime`) to work with text dates.
http://gitorious.org/bs_grep/
I had a previous version with support for text dates, however it was still too slow for the scale of our log files; profiling said that over 90% of the time was spent in strptime, so, we just modified the log format to include a numeric unix timestamp as well.
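The idea described here (binary search over raw byte offsets, comparing a leading numeric Unix timestamp so no `strptime` call is needed) can be sketched in Python. This is a hypothetical illustration, not the code from the repository, and it assumes every log line begins with an integer timestamp:

```python
import os

def find_offset(path, target):
    """Return the byte offset of the first line whose leading Unix
    timestamp is >= target, by binary-searching byte positions."""
    def ts_at(f, pos):
        # Align to the start of the next complete line after pos.
        f.seek(pos)
        if pos:
            f.readline()              # discard the partial line
        line_start = f.tell()
        line = f.readline()
        if not line:
            return None, line_start   # ran off the end of the file
        return int(line.split(None, 1)[0]), line_start

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            ts, _ = ts_at(f, mid)
            if ts is None or ts >= target:
                hi = mid
            else:
                lo = mid + 1
        return ts_at(f, lo)[1]
```

Seeking to the returned offset and reading forward until the end timestamp gives the time slice; each lookup touches O(log n) positions instead of scanning the whole file, and comparing integers avoids the `strptime` cost entirely.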
Even though this answer is way too late, it might be beneficial to some.

I've converted the code from @Dennis Williamson into a Python class that can be used for other Python stuff. I've added support for multiple date formats.
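Multiple-format support typically just means trying a list of `strptime` patterns in order. A minimal sketch of that idea (hypothetical names and format list, not the actual class):

```python
from datetime import datetime

class LogTimestamp(object):
    """Parse a timestamp by trying several strptime patterns in turn."""
    FORMATS = ("%b %d %H:%M:%S",      # syslog: "Jan 11 07:48:46"
               "%Y-%m-%d %H:%M:%S",   # ISO-like
               "%d/%b/%Y:%H:%M:%S")   # Apache access-log style

    def parse(self, text):
        for fmt in self.FORMATS:
            try:
                return datetime.strptime(text, fmt)
            except ValueError:
                continue          # wrong pattern; try the next one
        raise ValueError("unrecognized timestamp: %r" % text)
```

Note that syslog timestamps carry no year, so `strptime` defaults them to 1900; the caller still has to supply the year (and the New Year's logic) itself.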