I have been running the script below on a Red Hat server, where it works fine and finishes the job. The file I feed it contains about half a million lines (approximately 500,000 lines), which is why (to finish faster) I added an '&' at the end of the while-loop block.
But now I have set up a desktop with 8 GB of RAM running Ubuntu 18.04, and running the same code there finishes only a few thousand lines and then hangs. I read a bit about it and raised the stack limit to unlimited as well, but it still hung after 80,000 lines or so. Any suggestions on how I can optimize the code or tune my PC parameters to always finish the job?
while read -r CID60
do
    {
        OLT=$(echo "$CID60" | cut -d"|" -f5)
        ONID=${OLT}:$(echo "$CID60" | cut -d, -f2 | sed 's/ //g ; s/).*|//')
        echo $ONID,$(echo "$CID60" | cut -d"|" -f3) >> $localpath/CID_$logfile.csv
    } &
done < $localpath/$CID7360
Input:
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN52| Unlocked|12-664-1186|Up|202-00_MSRFKH00OL6|P282012623.C2028 ( network, R1.S1.LT7.PON8.ONT75.SERV1 )|
Output:
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
The output of interest is the 5th column (the fields are separated with a pipe, |) concatenated with part of the last column, and then the third column.
A pure sed solution:
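A minimal sketch of such a sed command, assuming GNU sed (-E for extended regular expressions) and placeholder file names infile and out.csv:

# capture field 3 (\1), field 5 (\2) and the name after the comma (\3),
# then print them in the prescribed order with the prescribed separators
sed -E 's/[^|]*\|[^|]*\|([^|]*)\|[^|]*\|([^|]*)\|[^,]*, ([^ ]*) \).*/\2:\3,\1/' infile > out.csv

A single sed process reads the whole file, so there is no per-line process creation as in the original loop.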
Another approach uses GNU parallel, where a bash function does the filtering on chunks of the file:

-k : keep the order, so the first/last line of the input will also be the first/last line of the output
--pipepart : splits the file on the fly
--block -1 : into 1 chunk per CPU thread
-a input.txt : the file to split
doit : the command (or bash function) to call
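A minimal sketch of the invocation, assuming doit wraps one of the stdin-to-stdout filters from this page (here an awk filter; input.txt and out.csv are placeholder names):

doit() {
    # any filter that reads stdin and writes stdout works here
    awk -F '|' '{ split($6, a, ","); gsub(/[ )]/, "", a[2]); print $5 ":" a[2] "," $3 }'
}
export -f doit    # make the function visible to the shells that parallel spawns

parallel -k --pipepart --block -1 -a input.txt doit > out.csv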
Speedwise, the parallel version (yellow) outperforms the tr version (black) at around 200 MB on my system (seconds vs. MB):

[graph: run time in seconds against input size in MB]

Oneliners by me and other persons as well as some scripts tested
If the order of the items and the separators could be different from what you specify in the question, I thought a one-liner with tr and cut (like the ones shown later in this answer) would do it, but in a comment you wrote that you need exactly the specified format.
I added a solution with awk, which is approximately on par with PerlDuck's solution with perl. See the end of this answer.

Test of oneliners and small scripts
The test was done on my computer with Lubuntu 18.04.1 LTS, 2*2 processors and 4 GiB RAM.
I made a huge infile by 'doubling 20 times' from your demo input (1,572,864 lines), so there is some margin over your 500,000 lines.

Oneliner with cut and sed:
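A minimal sketch of such a pipeline, with infile and out.csv as placeholder names: cut narrows each line to fields 3, 5 and 6, and sed reorders them and trims the last one:

cut -d '|' -f 3,5,6 infile |
    sed -E 's/([^|]*)\|([^|]*)\|[^,]*, ([^ ]*) .*/\2:\3,\1/' > out.csv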
Timing:
We might expect that a pure sed solution would be faster, but I think that the reordering of the data slows it down, so that the cut and sed solution is faster. Both solutions work without any problem on my computer.

Oneliner with cut and sed: the same one-liner as shown above.
A pure sed oneliner by xenoid: the sed solution sketched near the top of this page.

A python script using a regex with non-greedy matches by xenoid: see the Python section at the end.

A perl oneliner by PerlDuck is faster than the previous oneliners: see the Perl solution below.

Oneliner with tr and cut with a tr -s command:
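A minimal sketch of the idea, with placeholder file names; note that the output keeps the pipe separator and a different field order, i.e. not the prescribed format:

# squeeze spaces and pipes into single pipes, then pick the three fields
tr -s ' |' '|' < infile | cut -d '|' -f 3,5,9 > out.csv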
I used tr to convert the spaces in the input file to pipe characters, and then cut could do it all without sed. As you can see, tr is much faster than sed. The tr -s command removes double pipes in the input, which is a good idea, particularly if there can be repeated spaces or pipes in the input file, and it does not cost much.

Oneliner with tr and cut without the tr -s command, fastest so far:
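A minimal sketch with placeholder names; without squeezing, the doubled separators shift the field numbers:

tr ' ' '|' < infile | cut -d '|' -f 4,6,10 > out.csv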
Oneliner with awk, fast but not the fastest:
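A minimal sketch of such an awk one-liner, producing the prescribed format (infile and out.csv are placeholder names):

# split field 6 at the comma, strip spaces and the closing parenthesis,
# then print field5:name,field3
awk -F '|' '{ split($6, a, ","); gsub(/[ )]/, "", a[2]); print $5 ":" a[2] "," $3 }' infile > out.csv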
awk with parallel, implemented according to Ole Tange, reduces the real time from 5 s to 2 s: the invocation is the one sketched in Ole Tange's answer above.

We can expect that the advantage with parallel will increase with bigger input files, as described by the diagram in Ole Tange's answer to this question.

Speed summary: the 'real' time according to time, rounded to 1 decimal.

Finally, I note that the oneliners with sed, python, perl, awk and {parallel & awk} create an output file with the prescribed format.

Perl solution
This script doesn't do anything in parallel but is quite fast regardless. Save it as filter.pl (or whatever name you prefer) and make it executable.

I copied your sample data until I got 1,572,864 lines and then ran it as follows:
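For example (the file names are assumptions):

time ./filter.pl infile > out.csv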
If you prefer one-liners, do:
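A sketch of an equivalent perl one-liner (file names are assumptions):

# -F'\|' autosplits each line on literal pipes into @F; $F[5] holds the
# parenthesised part, from which the name after the comma is extracted
perl -F'\|' -lane '($t = $F[5]) =~ s/.*, *(\S+) *\).*/$1/; print "$F[4]:$t,$F[2]"' infile > out.csv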
Python (works with both Python 2 and Python 3)

Using a regex with non-greedy matches is 4x faster (it avoids backtracking?) and puts python on par with the cut/sed method (Python 2 being a bit faster than Python 3).
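A minimal sketch of the non-greedy approach, runnable with either python2 or python3 (file names are assumptions):

python3 -c '
import re, sys
# non-greedy .*? keeps each group inside its own pipe-delimited field
pat = re.compile(r".*?\|.*?\|(.*?)\|.*?\|(.*?)\|.*?, (.*?) \)")
for line in sys.stdin:
    m = pat.match(line)
    if m:
        print("%s:%s,%s" % (m.group(2), m.group(3), m.group(1)))
' < infile > out.csv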