I have file.csv that looks like this
4,6,18,23,26
5,12,19,29,31
2,5,13,16,30
9,10,24,27,32
4,5,10,19,22
4,6,8,10,25
2,3,4,25,11
I want to find some patterns and save them in another log file file.log
and remove them from the first file. Perl or grep ideally
- for instance, if x+1 = x2 (each value is one more than the previous) over a run of 3, remove the row, and log its existence and where it existed in another file. So then
2,3,4,25,11
will be removed from file.csv, and in file.log I would find something like
row 7: 2,3,4,25,11 was removed from file.csv
I'm trying to find sequences.
I think you need a heavier programming language for this. Python is my language of choice so here's a simple script with a simple example of a test:
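A minimal sketch of what such a script could look like (my reconstruction, not the answer's exact code: `has_consecutive_run` and the `PATTERNS` list are illustrative names, and the one example test flags rows containing a run of three consecutive integers, like 2,3,4):

```python
#!/usr/bin/env python
import sys

def has_consecutive_run(fields, length=3):
    """Example test: True if the row holds `length` values in a row,
    each one more than the previous (e.g. 2,3,4)."""
    run = 1
    for prev, cur in zip(fields, fields[1:]):
        run = run + 1 if cur == prev + 1 else 1
        if run >= length:
            return True
    return False

# Register pattern tests here; each takes a list of ints and returns
# True when the row should be removed.
PATTERNS = [has_consecutive_run]

def filter_rows(lines, out=sys.stdout, err=sys.stderr):
    for lineno, line in enumerate(lines, 1):
        fields = [int(f) for f in line.strip().split(',')]
        if any(test(fields) for test in PATTERNS):
            # Matching (removed) rows go to STDERR for logging.
            err.write('row %d: %s was removed from file.csv\n'
                      % (lineno, line.strip()))
        else:
            # Non-matching rows go to STDOUT.
            out.write(line)

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        filter_rows(f)
```

Because kept rows go to STDOUT and removed rows to STDERR, a single redirection splits them: python patterns.py file.csv > newfile.csv 2> file.log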
That's obviously only a skeleton example of tests but it should be usable. Run normally it will output just the lines that don't match to STDOUT and the ones that do to STDERR. This makes it useful for redirecting into a new file.
Once you've loaded it up with patterns, you can just pass it the csv:
python patterns.py input.csv
In terms of performance, Python isn't always the fastest, but I use it because it's more than fast enough for web development and the time it takes to write is much shorter (and development time is what costs me time/money).
You can speed things up with PyPy, an alternative Python runtime that benchmarks amazingly well. You might not need the PPA version (Trusty ships PyPy 2.2; the PPA has 2.3.1), but here's how you would install it:
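The setup commands would be along these lines (a sketch, assuming the ppa:pypy/ppa archive on Ubuntu 14.04):

```shell
# Add the PyPy PPA, refresh package lists, and install pypy.
sudo add-apt-repository ppa:pypy/ppa
sudo apt-get update
sudo apt-get install pypy
```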
You'd then launch your script with
pypy script.py
or, if you're executing it directly, change the opening shebang to #!/usr/bin/env pypy.
I've done some very simple testing on a 350,000-line input file (your example repeated 50,000 times) with the above script. python2 ran it in 1.417s and pypy ran it in 0.645s. In my experience, you're likely to see an even bigger improvement with more complicated algorithms... But yeah, none of this is going to beat the C/C++ equivalent. If the time it takes to run is money, spend some time reimplementing it in a faster language.
If we interpret your requirement to mean that the value of the third field (column) should be one more than that of the second field (column), then awk can create your file.log as specified and write the remaining lines to newfile.csv. You can rename newfile.csv to file.csv afterwards to simulate deletion. The same logic also works in perl.
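The awk invocation could be sketched like this (my reconstruction under the field-comparison interpretation above; the sample rows are taken from the question, and the filenames follow the answer's text):

```shell
# Build a sample file.csv: only the second row has a third field equal
# to the second field plus one (4 = 3 + 1).
printf '4,6,18,23,26\n2,3,4,25,11\n' > file.csv

# Log matching rows to file.log; write the remaining rows to newfile.csv.
awk -F, '$3 == $2 + 1 {
             print "row " NR ": " $0 " was removed from file.csv" > "file.log"
             next
         }
         { print }' file.csv > newfile.csv
```

A final mv newfile.csv file.csv then completes the "deletion" from the original file.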