I have "test1.csv" and it contains
200,400,600,800
100,300,500,700
50,25,125,310
and test2.csv and it contains
100,4,2,1,7
200,400,600,800
21,22,23,24,25
50,25,125,310
50,25,700,5
now
diff test2.csv test1.csv > results.csv
is different than
diff test1.csv test2.csv > results.csv
I don't know which is the correct order, but I want something else: both of the commands above will output something like
0a1
> 100,4,2,1,7
2,3c3,5
< 100,300,500,700
< 50,25,125,310
\ No newline at end of file
---
> 21,22,23,24,25
> 50,25,125,310
> 50,25,700,5
I want to output only the difference; thus results.csv should look like this:
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
I tried diff -q and diff -s, but they didn't do the trick. Order doesn't matter; what matters is that I want to see only the difference, with no >, no <, and no blank space.
grep -Fvf did the trick on smaller files, but not on big ones.
The first file contains more than 5 million lines, the second contains 1300, so results.csv should come out to roughly 4,998,700 lines.
I also tried grep -F -x -v -f, which didn't work either.
Sounds like a job for comm. As explained in man comm, the -3 option means that only lines that are unique to one of the files will be printed. However, those lines are indented with a tab according to which file they were found in; to remove the tab, pipe the output through something like tr -d '\t'. In this case, you don't really even need to sort the files first, and you can simplify by passing them to comm directly.
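The answer's code blocks were lost from this copy; here is a sketch of the pipeline it describes, using the question's sample data (tr -d '\t' is my assumption for stripping the column indentation, and the <( ) process substitution requires bash):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# -3 suppresses the column of lines common to both files; comm expects
# sorted input, and tr strips the tab that marks the second column
comm -3 <(sort test1.csv) <(sort test2.csv) | tr -d '\t' > results.csv
```

With this sample data, results.csv ends up with exactly the four lines asked for, in sorted order.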
Using grep with bash process substitution, you can collect the lines unique to each file and save the combined output as results.csv.
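The original code block is missing here; reconstructed from the bullet points that follow, it would look like this (sample data included so the snippet is self-contained):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# Lines found only in test1.csv, then lines found only in test2.csv,
# concatenated and saved via bash process substitution
cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv) > results.csv
```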
- <() is the bash process substitution pattern
- grep -vFf test2.csv test1.csv will find the lines unique to test1.csv
- grep -vFf test1.csv test2.csv will find the lines unique to test2.csv
- Finally, we sum up the two results with cat
Or, as Oli suggested, you can use command grouping, so that both grep calls share a single redirection.
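A sketch of the grouped form (a reconstruction; the original code block was lost):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# The braces group both greps so one redirection captures their combined output
{ grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv; } > results.csv
```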
Or just run one after the other; since both write to standard output, redirecting the first and appending the second accumulates the results.
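The sequential version would look like this (reconstructed from the description):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

grep -vFf test2.csv test1.csv  > results.csv   # lines unique to test1.csv
grep -vFf test1.csv test2.csv >> results.csv   # append lines unique to test2.csv
```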
If the order of rows is not relevant, use awk or perl.
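The awk code block did not survive in this copy; one way to express the idea (my own sketch, valid only if neither file repeats a line internally) is to count occurrences across both files and print the lines seen exactly once:

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# A line seen exactly once across both files is unique to one of them
# (assumes no duplicate lines within a single file)
awk '{count[$0]++} END {for (line in count) if (count[line] == 1) print line}' \
    test1.csv test2.csv > results.csv
```

The output order is arbitrary (awk array traversal order), which is fine here since order doesn't matter.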
Use grep to get the common lines and filter those out. The inner grep finds the lines common to both files; the outer grep then keeps only the lines which don't match those common lines.
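A sketch of the two-step grep described above (a reconstruction; -Fx makes both greps do whole-line, fixed-string matches, and <( ) requires bash):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# Inner grep: lines common to both files.
# Outer grep: drop those common lines from the concatenation of both files.
cat test1.csv test2.csv | grep -vFxf <(grep -Fxf test1.csv test2.csv) > results.csv
```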
Use the --*-line-format=... options of diff.
You can tell diff exactly what you need, as explained below. It is possible to specify the output of diff in a very detailed way, similar to a printf number format.

The lines from the first file, test1.csv, are called "old" lines, and the lines from the second, test2.csv, are "new" lines. That makes sense when diff is used to see what changed in a file.

The options we need are the ones that set the format for "old" lines, "new" lines, and "unchanged" lines.
The formats we need are very simple: for the changed lines, new and old, we want to output only the text of the line; %L is the format symbol for the line text. For the unchanged lines, we want to show nothing.
With this, we can write options like --old-line-format='%L' and put it all together, using your example data.

Notes on performance
Because the files differ in size, try exchanging the input files if the order does not matter; it could be that the inner workings of diff handle one direction better than the other, where "better" means needing either less memory or less computation.

There is an optimisation option for using diff with large files: --speed-large-files. It uses assumptions about the file structure, so it's not clear whether it helps in your case, but it's worth trying.

The format options are described in man diff under --LTYPE-line-format=LFMT.
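Putting the three format options together with the example data (the answer's original command was lost in this copy; this is a reconstruction):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# Old and new lines print as bare text (%L); unchanged lines print nothing.
# diff exits with status 1 when the files differ, hence the || true.
diff --old-line-format='%L' --new-line-format='%L' \
     --unchanged-line-format='' test1.csv test2.csv > results.csv || true
```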
Since the order doesn't need to be preserved, simply:
- sort test1.csv test2.csv: merges and sorts test1.csv and test2.csv
- uniq -u: prints only the lines which have no duplicate
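Put together, it looks like this (reconstructed; note my added caveat that uniq -u only gives the symmetric difference if neither file contains internal duplicate lines):

```shell
# Recreate the question's sample files
printf '%s\n' 200,400,600,800 100,300,500,700 50,25,125,310 > test1.csv
printf '%s\n' 100,4,2,1,7 200,400,600,800 21,22,23,24,25 50,25,125,310 50,25,700,5 > test2.csv

# Lines common to both files become adjacent duplicates after sorting,
# and uniq -u drops every line that has a duplicate
sort test1.csv test2.csv | uniq -u > results.csv
```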