Which shell command is the fastest way to parse through millions of lines of text? Currently I'm using grep in my script, but it takes hours upon hours to finish.
Sample Input:
```
May 1 2014 00:00:00 Allow
May 1 2014 00:00:00 Allow
May 1 2014 01:00:00 Deny
May 1 2014 01:00:00 Deny
May 1 2014 02:00:00 Allow
May 1 2014 02:00:00 Deny
```
Sample Output:
(where the "2" in line one comes from `grep -c "Allow"` and the "0" from `grep -c "Deny"`)

```
May 1 2014 00:00:00,2,0
May 1 2014 01:00:00,0,2
May 1 2014 02:00:00,1,1
```
Move away from regular expressions. They're slow (in every language) and they're far more than we need here for what amounts to simple substring comparisons.
I've implemented that idea below in Python:
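Something along these lines, assuming every line has the shape of the sample (timestamp, a space, then `Allow` or `Deny`); the counter layout here is a sketch rather than the only sensible choice:

```python
#!/usr/bin/env python
# Count Allow/Deny per timestamp using field comparison, no regex.
# Assumes lines like "May 1 2014 00:00:00 Allow".
import sys

counts = {}   # timestamp -> [allow, deny]
order = []    # first-seen order of timestamps, for output

for line in sys.stdin:
    # The verdict is the last space-separated field; everything
    # before the final space is the timestamp.
    stamp, _, verdict = line.rstrip('\n').rpartition(' ')
    if stamp not in counts:
        counts[stamp] = [0, 0]
        order.append(stamp)
    if verdict == 'Allow':
        counts[stamp][0] += 1
    elif verdict == 'Deny':
        counts[stamp][1] += 1

for stamp in order:
    allow, deny = counts[stamp]
    print('%s,%d,%d' % (stamp, allow, deny))
```

Run it as `python counts.py < file`. Because the input needn't be sorted, it keeps one pair of counters per distinct timestamp in memory.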
No idea how that performs but give it a shot. If that's slower, convert it to C++ (a bit more of a PITA, which is why I'm using Python!) and that should rip through your data. It's not tough programming but it's what's required for optimal speed.
A little refactoring; harder to port unless you have an equivalent to `defaultdict`:
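For instance (same assumptions as the previous script):

```python
#!/usr/bin/env python
# Same approach, but collections.defaultdict removes the manual
# counter initialisation. Assumes only Allow/Deny verdicts occur.
import sys
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])  # timestamp -> [allow, deny]
order = []

for line in sys.stdin:
    stamp, _, verdict = line.rstrip('\n').rpartition(' ')
    if stamp not in counts:   # membership test doesn't create the entry
        order.append(stamp)
    counts[stamp][1 if verdict == 'Deny' else 0] += 1

for stamp in order:
    print('%s,%d,%d' % (stamp, counts[stamp][0], counts[stamp][1]))
```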
And a Python implementation of a hybrid of steeldriver's and my ideas. This is probably the most memory-efficient you'll get, and it's using substring comparison rather than a regex extraction, so it should be nippy. It requires sorted input, though.
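A sketch of that streaming version; it only needs identical timestamps to be adjacent, which a plain `sort` guarantees:

```python
#!/usr/bin/env python
# Streaming counter: holds a single timestamp's counters at a time,
# so memory use is constant. Requires input collated by timestamp.
import sys

current = None
allow = deny = 0

for line in sys.stdin:
    stamp, _, verdict = line.rstrip('\n').rpartition(' ')
    if stamp != current:
        if current is not None:
            print('%s,%d,%d' % (current, allow, deny))
        current, allow, deny = stamp, 0, 0
    if verdict == 'Allow':
        allow += 1
    elif verdict == 'Deny':
        deny += 1

# Flush the final group.
if current is not None:
    print('%s,%d,%d' % (current, allow, deny))
```

Feed it with something like `sort file | python hybrid.py` (the script name is just for illustration).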
Benchmarking
In the interest of getting some of this tested (for my own curiosity, as much as anything else) I decided to do a little benchmarking on a 2,400,000-record file that covers 2,400 separate dates.
I used the following Python script to generate a big file with random Allow/Deny endings:
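Something along these lines; the exact stamp values are made up (days run 1-100, so they're synthetic rather than real calendar dates), but it produces 2,400,000 unsorted lines over 2,400 distinct timestamps:

```python
#!/usr/bin/env python
# Generate 2,400,000 log lines spread randomly across 2,400 distinct
# timestamps, each ending in a random Allow/Deny verdict.
import random

with open('file', 'w') as f:
    for _ in range(2400000):
        i = random.randrange(2400)          # pick one of 2,400 stamps
        stamp = 'May %d 2014 %02d:00:00' % (1 + i // 24, i % 24)
        f.write('%s %s\n' % (stamp, random.choice(('Allow', 'Deny'))))
```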
This was about a thousand times faster than the Bash equivalent (see the revision log) and gives us a diverse log file to play with. It's unsorted, so the two benchmarks that need collated input (the hybrid and the awk versions) are using a separately sorted version (`sort file > file-sorted`, which took 0m4.253s to complete).

`defaultdict`: 0m1.413s
`awk`: 0m5.168s + 0m4.253s sorting

I repeated the generation with 2.4 million distinct dates (which should push my first two to their limits). This sort took 0m6.999s. I've also added `pypy` timings for the Python versions.
Analysis and results
`pypy` helps on larger keysets.
`awk` holds up as well as it does largely because we're not regexing.

It's hard to guess whether it might be more efficient, since you haven't posted enough detail about what you're doing now, but if the data is sorted in timestamp order I would suggest an algorithm something like:
accumulate `Allow` and `Deny` counts until the time stamp changes (or we reach the end of input), then print a line of output and reset the counts.

In awk, you could do that as something like this (a sketch; the field positions are assumed from the sample format):
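```awk
# A sketch: assumes the timestamp is fields 1-4, the verdict is field 5,
# and the input is collated so identical timestamps are adjacent.
{
    stamp = $1 " " $2 " " $3 " " $4
    if (stamp != prev) {
        if (prev != "") print prev "," allow "," deny
        prev = stamp
        allow = 0
        deny = 0
    }
    if ($5 == "Allow") allow++
    else if ($5 == "Deny") deny++
}
END {
    if (prev != "") print prev "," allow "," deny
}
```

Run it against the collated file, e.g. `awk -f counts.awk file-sorted` (the script name is just for illustration).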