Which shell command is the fastest way to parse through millions of lines of text? Currently I'm using grep in my script, but it takes hours upon hours to finish.
Sample Input:
```
May 1 2014 00:00:00 Allow
May 1 2014 00:00:00 Allow
May 1 2014 01:00:00 Deny
May 1 2014 01:00:00 Deny
May 1 2014 02:00:00 Allow
May 1 2014 02:00:00 Deny
```
Sample Output:
(where the "2" in line one comes from `grep -c "Allow"` and the "0" from `grep -c "Deny"`)

```
May 1 2014 00:00:00,2,0
May 1 2014 01:00:00,0,2
May 1 2014 02:00:00,1,1
```
Move away from regular expressions. They're slow (in every language) and they're far more than we need here for what amounts to simple substring comparisons.
I've implemented that idea below in Python:
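Something along these lines, assuming every line has the shape of the sample (timestamp, a space, then `Allow` or `Deny`); the counter layout here is a sketch rather than the only sensible choice:

```python
#!/usr/bin/env python
# Count Allow/Deny per timestamp using field comparison, no regex.
# Assumes lines like "May 1 2014 00:00:00 Allow".
import sys

counts = {}   # timestamp -> [allow, deny]
order = []    # first-seen order of timestamps, for output

for line in sys.stdin:
    # The verdict is the last space-separated field; everything
    # before the final space is the timestamp.
    stamp, _, verdict = line.rstrip('\n').rpartition(' ')
    if stamp not in counts:
        counts[stamp] = [0, 0]
        order.append(stamp)
    if verdict == 'Allow':
        counts[stamp][0] += 1
    elif verdict == 'Deny':
        counts[stamp][1] += 1

for stamp in order:
    allow, deny = counts[stamp]
    print('%s,%d,%d' % (stamp, allow, deny))
```

Run it as `python counts.py < file`. Because the input needn't be sorted, it keeps one pair of counters per distinct timestamp in memory.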
No idea how that performs but give it a shot. If that's slower, convert it to C++ (a bit more of a PITA, which is why I'm using Python!) and that should rip through your data. It's not tough programming but it's what's required for optimal speed.
A little refactoring; harder to port unless you have an equivalent to `defaultdict`:
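For instance (same assumptions as the previous script):

```python
#!/usr/bin/env python
# Same approach, but collections.defaultdict removes the manual
# counter initialisation. Assumes only Allow/Deny verdicts occur.
import sys
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])  # timestamp -> [allow, deny]
order = []

for line in sys.stdin:
    stamp, _, verdict = line.rstrip('\n').rpartition(' ')
    if stamp not in counts:   # membership test doesn't create the entry
        order.append(stamp)
    counts[stamp][1 if verdict == 'Deny' else 0] += 1

for stamp in order:
    print('%s,%d,%d' % (stamp, counts[stamp][0], counts[stamp][1]))
```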
And a Python implementation of a hybrid of steeldriver's and my ideas. This is probably the most memory-efficient you'll get, and it's using substring comparison rather than a regex extraction, so it should be nippy. It requires sorted input, though.
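A sketch of that streaming version; it only needs identical timestamps to be adjacent, which a plain `sort` guarantees:

```python
#!/usr/bin/env python
# Streaming counter: holds a single timestamp's counters at a time,
# so memory use is constant. Requires input collated by timestamp.
import sys

current = None
allow = deny = 0

for line in sys.stdin:
    stamp, _, verdict = line.rstrip('\n').rpartition(' ')
    if stamp != current:
        if current is not None:
            print('%s,%d,%d' % (current, allow, deny))
        current, allow, deny = stamp, 0, 0
    if verdict == 'Allow':
        allow += 1
    elif verdict == 'Deny':
        deny += 1

# Flush the final group.
if current is not None:
    print('%s,%d,%d' % (current, allow, deny))
```

Feed it with something like `sort file | python hybrid.py` (the script name is just for illustration).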
Benchmarking
In the interest of getting some of this tested (for my own curiosity, as much as anything else) I decided to do a little benchmarking on a 2,400,000-record file that covers 2,400 separate dates.
I used the following Python script to generate a big file with random Allow/Deny endings:
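Something along these lines; the exact stamp values are made up (days run 1-100, so they're synthetic rather than real calendar dates), but it produces 2,400,000 unsorted lines over 2,400 distinct timestamps:

```python
#!/usr/bin/env python
# Generate 2,400,000 log lines spread randomly across 2,400 distinct
# timestamps, each ending in a random Allow/Deny verdict.
import random

with open('file', 'w') as f:
    for _ in range(2400000):
        i = random.randrange(2400)          # pick one of 2,400 stamps
        stamp = 'May %d 2014 %02d:00:00' % (1 + i // 24, i % 24)
        f.write('%s %s\n' % (stamp, random.choice(('Allow', 'Deny'))))
```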
This was about a thousand times faster than the Bash equivalent (see the revision log) and gives us a diverse log file to play with. It's unsorted, so the two benchmarks that need collated input (the hybrid and the awk versions) are using a separately sorted version (`sort file > file-sorted`, which took 0m4.253s to complete).

`defaultdict`: 0m1.413s
`awk`: 0m5.168s + 0m4.253s sorting

I repeated the generation with 2.4 million distinct dates (which should push my first two to their limits). This sort took 0m6.999s. I've also added `pypy` timings for the Python versions.
Analysis and results
`pypy` helps on larger keysets.
`awk` holds up as well as it does largely because we're not regexing.

It's hard to guess whether it might be more efficient, since you haven't posted enough detail about what you're doing now, but if the data is sorted in timestamp order I would suggest an algorithm something like:
accumulate `Allow` and `Deny` counts until the time stamp changes (or we reach the end of input), then print a line of output and reset the counts.

In awk, you could do that as something like this (a sketch; the field positions are assumed from the sample format):
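```awk
# A sketch: assumes the timestamp is fields 1-4, the verdict is field 5,
# and the input is collated so identical timestamps are adjacent.
{
    stamp = $1 " " $2 " " $3 " " $4
    if (stamp != prev) {
        if (prev != "") print prev "," allow "," deny
        prev = stamp
        allow = 0
        deny = 0
    }
    if ($5 == "Allow") allow++
    else if ($5 == "Deny") deny++
}
END {
    if (prev != "") print prev "," allow "," deny
}
```

Run it against the collated file, e.g. `awk -f counts.awk file-sorted` (the script name is just for illustration).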