I have file.csv that looks like this
4,6,18,23,26
5,12,19,29,31
2,5,13,16,30
9,10,24,27,32
4,5,10,19,22
4,6,8,10,25
2,3,4,25,11
I want to find some patterns and save them in another log file file.log
and remove them from the first file. Perl or grep ideally
- for instance, if x+1 = x2 (each value is one more than the previous) over a run of 3, remove the row, and log its existence and where it existed in another file. So then
2,3,4,25,11
will be removed from file.csv, and in file.log I would find something like
row 7: 2,3,4,25,11 was removed from file.csv
I'm trying to find sequences.
I think you need a heavier programming language for this. Python is my language of choice so here's a simple script with a simple example of a test:
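A minimal sketch of what such a script could look like (my reconstruction, not the answer's exact code: `has_consecutive_run` and the `PATTERNS` list are illustrative names, and the one example test flags rows containing a run of three consecutive integers, like 2,3,4):

```python
#!/usr/bin/env python
import sys

def has_consecutive_run(fields, length=3):
    """Example test: True if the row holds `length` values in a row,
    each one more than the previous (e.g. 2,3,4)."""
    run = 1
    for prev, cur in zip(fields, fields[1:]):
        run = run + 1 if cur == prev + 1 else 1
        if run >= length:
            return True
    return False

# Register pattern tests here; each takes a list of ints and returns
# True when the row should be removed.
PATTERNS = [has_consecutive_run]

def filter_rows(lines, out=sys.stdout, err=sys.stderr):
    for lineno, line in enumerate(lines, 1):
        fields = [int(f) for f in line.strip().split(',')]
        if any(test(fields) for test in PATTERNS):
            # Matching (removed) rows go to STDERR for logging.
            err.write('row %d: %s was removed from file.csv\n'
                      % (lineno, line.strip()))
        else:
            # Non-matching rows go to STDOUT.
            out.write(line)

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        filter_rows(f)
```

Because kept rows go to STDOUT and removed rows to STDERR, a single redirection splits them: python patterns.py file.csv > newfile.csv 2> file.log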
That's obviously only a skeleton example of tests but it should be usable. Run normally it will output just the lines that don't match to STDOUT and the ones that do to STDERR. This makes it useful for redirecting into a new file.
Once you've loaded it up with patterns, you can just pass it the csv:
python patterns.py input.csv
In terms of performance, Python isn't always the fastest, but I use it because it's more than fast enough for web development and the time it takes to write is much shorter (and development time is what costs me time/money).
You can speed things up with PyPy, an alternative Python runtime that benchmarks amazingly well. You might not need the PPA version (Trusty ships PyPy 2.2; the PPA has 2.3.1), but here's how you would install it:
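The setup commands would be along these lines (a sketch, assuming the ppa:pypy/ppa archive on Ubuntu 14.04):

```shell
# Add the PyPy PPA, refresh package lists, and install pypy.
sudo add-apt-repository ppa:pypy/ppa
sudo apt-get update
sudo apt-get install pypy
```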
You'd then launch your script with
pypy script.py
or, if you're executing it directly, change the opening shebang to #!/usr/bin/env pypy.
I've done some very simple testing on a 350,000-line input file (your example repeated 50,000 times) with the above script. python2 ran it in 1.417s and pypy ran it in 0.645s. In my experience, you're likely to see an even bigger improvement with more complicated algorithms... But yeah, none of this is going to beat the C/C++ equivalent. If the time it takes to run is money, spend some time reimplementing it in a faster language.
If we interpret your requirement to mean that the value of the third field (column) should be one more than that of the second field (column), then awk can create your file.log as specified and write the remaining lines to newfile.csv. You can rename newfile.csv to file.csv afterwards to simulate deletion. The same logic also works in perl.
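The awk invocation could be sketched like this (my reconstruction under the field-comparison interpretation above; the sample rows are taken from the question, and the filenames follow the answer's text):

```shell
# Build a sample file.csv: only the second row has a third field equal
# to the second field plus one (4 = 3 + 1).
printf '4,6,18,23,26\n2,3,4,25,11\n' > file.csv

# Log matching rows to file.log; write the remaining rows to newfile.csv.
awk -F, '$3 == $2 + 1 {
             print "row " NR ": " $0 " was removed from file.csv" > "file.log"
             next
         }
         { print }' file.csv > newfile.csv
```

A final mv newfile.csv file.csv then completes the "deletion" from the original file.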