I need to compare the contents of text files as a whole (not line-wise like the diff command) and print out the missing text. Is there a command to do so? Thanks in advance.
EDIT: As an example, say file1 has:
1 2 3
4 5
file2 has:
1 5
2 3 4 6
I want to compare these files and print as output:
6
The diff command compares the text files line by line, in which case almost the entire file will be printed out. (My actual files are more complicated and lengthy, so I'm giving a simple example.)
Since the order does not matter, you can use awk to print the unique "lines", treating any whitespace as the line separator:
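A sketch of such a command, matching the explanation below (GNU awk is assumed, since a regular expression as the record separator is a gawk extension; mawk 1.3.4+ accepts it too):

```shell
# Count tokens from file1, discount tokens from file2,
# then print tokens whose counts do not cancel out.
awk -v RS='[[:space:]]+' \
    'FNR == NR {a[$0]++; next} {a[$0]--} END {for (i in a) if (a[i] != 0) print i}' \
    file1 file2
```

With the sample files from the question this prints 6.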
Here:

- -v RS='[[:space:]]+' sets the record separator (RS) to any run of whitespace, so each "line" (record) is separated by any whitespace, including newlines.
- FNR == NR: FNR is the record number (or line number, if you will) within the current file, and NR is the overall record number across all input files. The two are equal only while reading the first file.
- {a[$0]++; next} sets and increments the count of appearances of the current "line", then moves on to the next record without processing any more rules. Because of the preceding condition, this block only runs for the first file; the next block applies to all other files.
- {a[$0]--} decrements the count of appearances of the current "line".
- END {for (i in a) if (a[i] != 0) print i}: at the END of all input, for each entry in the array a, print that entry if its count of appearances is not 0. So, any "line" seen an equal number of times in both files is skipped.

This is another solution:
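A sketch of that pipeline, assuming bash for the process substitution:

```shell
# Split each file into one token per line and sort it, then keep
# only the lines not common to both; sort -u removes duplicates.
comm -3 <(tr ' \t' '\n' < file1 | sort) <(tr ' \t' '\n' < file2 | sort) | sort -u
```

Note that comm indents lines unique to the second file with a leading tab; pipe through tr -d '\t' if you want them flush left.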
where:
where:

- tr ' \t' '\n' | sort replaces spaces and tabs with newlines and reorders the result.
- comm compares the sorted token streams of file1 and file2 line by line; with the -3 option it suppresses lines that appear in both files.
- sort -u, at last, removes duplicate lines; this is necessary in case of duplicate tokens.

In this case the output of tr ' \t' '\n' | sort is used as standard input for the comm command.

I'll assume the following: your files consist of data separated by spaces or newlines, AND you do not care about knowing where the data is missing (or you know it is always, e.g., in file2).
What we will do is simple: Replace every space with a newline in both files, concatenate them, then search for single (unique) entries only:
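A minimal sketch of those steps, using uniq -u to keep only entries that occur exactly once across both files:

```shell
# One token per line, both files concatenated, then keep
# only the entries that appear exactly once overall.
cat file1 file2 | tr ' ' '\n' | sort | uniq -u
```

For the sample files from the question this prints 6.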
Preparation of the files
If I understand correctly, you want to print the unique numbers or words in the comparison. I would convert the files so that each number/word is on its own line, sort them, remove blank lines and duplicates, and after that compare the files.

I assume that space characters separate the numbers or words. For each file:
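A sketch of the conversion for one file (filex and x here are placeholders for the input and output file names):

```shell
# One token per line, sorted, with duplicates and blank lines removed.
tr ' ' '\n' < filex | sort | uniq | grep -v '^$' > x
```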
See man tr for details.

If you want to sort numerically, you can add the option -n or -h, depending on the numeric format. See man sort for details.

In the example of the original question, I would use -n, where x can be a and b for the two files to compare, so for example:
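For instance, with the sample files from the question:

```shell
# Convert each file to one sorted number per line,
# writing the results to a and b.
tr ' ' '\n' < file1 | sort -n | uniq | grep -v '^$' > a
tr ' ' '\n' < file2 | sort -n | uniq | grep -v '^$' > b
```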
You can inspect these converted and sorted files, if you wish.
Finally, compare the files with the following command line:
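Presumably a plain diff of the two converted files a and b:

```shell
# Compare the converted, sorted files line by line.
diff a b
```

With the sample input from the question, the converted files hold 1-5 and 1-6 respectively, so this prints "5a6" followed by "> 6".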
This identifies the number that is only found in the second file of your sample input from the original question. See man diff for details, if you want a modified output from diff.

My example
This is an example with a unique number in each of the files.
This identifies the unique number in each of the files.
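For illustration, a hypothetical pair of files, each containing one number missing from the other (these sample values are mine, not from the original post):

```shell
# Two made-up files: 1 occurs only in file1, 5 only in file2.
printf '1 2 3 4\n' > file1
printf '2 3 4 5\n' > file2
tr ' ' '\n' < file1 | sort -n | uniq | grep -v '^$' > a
tr ' ' '\n' < file2 | sort -n | uniq | grep -v '^$' > b
diff a b
```

Here diff prints "< 1" and "> 5", the numbers unique to the first and second file respectively.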
The example in Lety's comment
In this case my method shows a different result compared to the methods of @Lety and @muru. Let us wait for the OP, @samhitha, to tell us what the desired output of the comparison is.
Another solution (python):
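One way to sketch this in Python, treating each file's contents as a multiset of whitespace-separated tokens via collections.Counter, so that repeated tokens are handled like in the awk approach (the function name missing_tokens is my own):

```python
from collections import Counter

def missing_tokens(text1, text2):
    """Return tokens whose multiset counts differ between the two texts."""
    c1, c2 = Counter(text1.split()), Counter(text2.split())
    # Counter subtraction keeps only positive counts, so summing both
    # directions gives the symmetric difference of the two multisets.
    return sorted(((c1 - c2) + (c2 - c1)).elements())

# Sample input from the question:
print(missing_tokens("1 2 3\n4 5", "1 5\n2 3 4 6"))  # prints ['6']
```

To run it on actual files, read each one with open(path).read() and pass the two strings in.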