I have my references in a text file: a long list of entries, each with two (or more) fields. The first field is the reference's URL; the second is the title, which may vary a bit depending on how the entry was made. The same goes for the third field, which may or may not be present.
I want to identify, but not remove, entries whose first field (the reference URL) is identical. I know about sort -k1,1 -u,
but that will automatically (non-interactively) remove all but the first hit. Is there a way to just let me know, so I can choose which to retain?
In the extract below of three lines that have the same first field (http://unix.stackexchange.com/questions/49569/), I would like to keep line #2, because it has additional tags (sort, CLI), and delete lines #1 and #3:
http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field
Is there a program to help identify such "duplicates"? Then I can clean up manually by deleting lines #1 and #3 myself.
This is a classical problem that can be solved with the uniq command. uniq can detect duplicate consecutive lines and remove the duplicates (-u, --unique) or keep only the duplicated lines (-d, --repeated).

Since the ordering of duplicate lines is not important for you, you should sort the file first. Then use uniq to print unique lines only.
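For example (a sketch; the file is assumed to be called file.txt, and note that uniq here compares whole lines, so only entries that are completely identical count as duplicates):

    sort file.txt | uniq -u    # print only the lines that are never repeated
    sort file.txt | uniq -d    # or: print one copy of each line that is repeated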
There is also a -c (--count) option that prints the number of duplicates for the -d option. See the manual page of uniq for details.

If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print each line number for it (append another | sort -n to have the output sorted by line):
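For instance (a sketch, assuming GNU uniq and whitespace-separated fields in file.txt; it prefixes each line number to its first field, sorts those pairs by the key, and keeps only the repeated keys):

    awk '{ print NR, $1 }' file.txt | sort -k2 | uniq -f 1 -D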
Since you want to see duplicate lines (using the first field as key), you cannot directly use uniq. The issue that makes automation difficult is that the title parts vary, but a program cannot automatically determine which title should be considered the final one.

Here is an AWK script (save it to script.awk) that takes your text file as input and prints all duplicate lines so you can decide which to delete; run it with awk -f script.awk yourfile.txt.
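A minimal sketch of one such script (one possible version only, assuming whitespace-separated fields with the URL in the first column):

    # script.awk: collect every line under its first field and, at the end,
    # print each group whose first field occurs more than once, with line
    # numbers, so you can decide manually which line of each group to keep.
    {
        count[$1]++
        lines[$1] = lines[$1] NR ": " $0 "\n"
    }
    END {
        for (key in lines)
            if (count[key] > 1)
                printf "%s", lines[key]
    }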
If I understand your question correctly, I think you need something like:
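For example (a sketch: list the duplicated first fields, then grep for them with line numbers; bash process substitution, space-separated fields and a file named file.txt are assumed):

    grep -n -F -f <(cut -d ' ' -f 1 file.txt | sort | uniq -d) file.txt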
or:
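for instance a two-pass awk one-liner (a sketch; it reads file.txt twice, counting first fields on the first pass and printing every line with a repeated first field, plus its line number, on the second):

    awk 'NR == FNR { count[$1]++; next } count[$1] > 1 { print FNR ": " $0 }' file.txt file.txt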
where file.txt is your file containing the data you are interested in.

In the output you will see the line numbers and the lines where the first field is found two or more times.
If I read this correctly, all you need is something like the pipeline sketched below.
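Here is one way to write it, pieced together from the explanation further down (the data file is assumed to be named file.txt):

    awk '{ print $1 }' file.txt | sort | uniq -c |
    while read num dupe; do
        # for every first field seen more than once, print the matching lines
        # together with their line numbers
        [ "$num" -gt 1 ] && grep -n -- "$dupe" file.txt
    done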
That will print out the number of the line that contains the dupe and the line itself. For example, run against the three duplicate lines shown in the question, it would print all three of them, each prefixed with its line number.
To print only the number of the line, you could do
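something along these lines (a sketch that keeps only what comes before the first colon of grep -n's output):

    grep -n -- "$dupe" file.txt | cut -d : -f 1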
And to print only the line:
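One option in this sketch is simply to drop grep's -n flag:

    grep -- "$dupe" file.txt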
Explanation:

The awk script just prints the 1st space-separated field of the file (use $N to print the Nth field). sort sorts it, and uniq -c counts the occurrences of each line.

This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe; if $num is greater than one (so the key is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command line option, which is useful when $dupe can start with -.

No doubt the most verbose one in the list; it could probably be shorter:
gives, on a text file like:
an output like:
Once you have picked the lines to remove:
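For that step, one sketch (assuming, as in the question's example, that lines 1 and 3 are the ones to drop, and using GNU sed's in-place editing):

    sed -i '1d;3d' file.txt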
See the following sorted file.txt:

Because the list is short, I can see (after sorting) that there are three sets of duplicates.
Then, for example, I can choose to keep:
rather than
But for a longer list this will be difficult. Based on the two answers, one suggesting uniq and the other suggesting cut, I find that this command gives me the output I would like:
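A sketch of that kind of combination (space-separated fields assumed, with file.txt already sorted as above):

    cut -d ' ' -f 1 file.txt | uniq -d

It prints each first field that appears on two or more consecutive lines, i.e. exactly the URLs whose entries need a manual look.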
Here is how I solved it, using file_with_duplicates as the input. The three steps:

File sorted and deduped by columns 1 and 2.
File sorted only by columns 1 and 2.
Show the difference only.
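A sketch of those three steps (the output file names are arbitrary; -k1,2 tells sort to use columns 1 through 2 as the key, and -u keeps only the first line per key):

    sort -k1,2 -u file_with_duplicates > dedup_1_2     # sorted and deduped by columns 1 and 2
    sort -k1,2    file_with_duplicates > sorted_1_2    # sorted only by columns 1 and 2
    diff sorted_1_2 dedup_1_2                          # show only the difference: the duplicates that -u dropped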