I have my references in a text file: a long list of entries, each with two (or more) fields. The first field is the reference's URL; the second is the title, which may vary a bit depending on how the entry was made. The same goes for the third field, which may or may not be present.
I want to identify, but not remove, entries whose first field (the reference URL) is identical. I know about sort -k1,1 -u,
but that will automatically (non-interactively) remove all but the first hit. Is there a way to just let me know, so I can choose which to retain?
In the extract below of three lines that have the same first field (http://unix.stackexchange.com/questions/49569/), I would like to keep line #2, because it has additional tags (sort, CLI), and delete lines #1 and #3:
http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field
Is there a program to help identify such "duplicates"? Then I can clean up manually by deleting lines #1 and #3 myself.
This is a classical problem that can be solved with the uniq command. uniq can detect duplicate consecutive lines and remove the duplicates (-u, --unique) or keep only the duplicated lines (-d, --repeated).

Since the ordering of duplicate lines is not important for you, you should sort the file first. Then use uniq to print unique lines only.
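For example (a sketch; the file is assumed to be called file.txt, and note that uniq here compares whole lines, so only entries that are completely identical count as duplicates):

    sort file.txt | uniq -u    # print only the lines that are never repeated
    sort file.txt | uniq -d    # or: print one copy of each line that is repeated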
There is also a -c (--count) option that prints the number of duplicates for the -d option. See the manual page of uniq for details.

If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print each line number for it (append another | sort -n to have the output sorted by line):
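For instance (a sketch, assuming GNU uniq and whitespace-separated fields in file.txt; it prefixes each line number to its first field, sorts those pairs by the key, and keeps only the repeated keys):

    awk '{ print NR, $1 }' file.txt | sort -k2 | uniq -f 1 -D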
Since you want to see duplicate lines (using the first field as key), you cannot directly use uniq. The issue that makes automation difficult is that the title parts vary, but a program cannot automatically determine which title should be considered the final one.

Here is an AWK script (save it to script.awk) that takes your text file as input and prints all duplicate lines so you can decide which to delete; run it with awk -f script.awk yourfile.txt.
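A minimal sketch of one such script (one possible version only, assuming whitespace-separated fields with the URL in the first column):

    # script.awk: collect every line under its first field and, at the end,
    # print each group whose first field occurs more than once, with line
    # numbers, so you can decide manually which line of each group to keep.
    {
        count[$1]++
        lines[$1] = lines[$1] NR ": " $0 "\n"
    }
    END {
        for (key in lines)
            if (count[key] > 1)
                printf "%s", lines[key]
    }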
If I understand your question correctly, I think you need something like:
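For example (a sketch: list the duplicated first fields, then grep for them with line numbers; bash process substitution, space-separated fields and a file named file.txt are assumed):

    grep -n -F -f <(cut -d ' ' -f 1 file.txt | sort | uniq -d) file.txt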
or:
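for instance a two-pass awk one-liner (a sketch; it reads file.txt twice, counting first fields on the first pass and printing every line with a repeated first field, plus its line number, on the second):

    awk 'NR == FNR { count[$1]++; next } count[$1] > 1 { print FNR ": " $0 }' file.txt file.txt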
where file.txt is your file containing the data you are interested in.

In the output you will see the line numbers and the lines where the first field is found two or more times.
If I read this correctly, all you need is something like the pipeline sketched below.
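Here is one way to write it, pieced together from the explanation further down (the data file is assumed to be named file.txt):

    awk '{ print $1 }' file.txt | sort | uniq -c |
    while read num dupe; do
        # for every first field seen more than once, print the matching lines
        # together with their line numbers
        [ "$num" -gt 1 ] && grep -n -- "$dupe" file.txt
    done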
That will print out the number of the line that contains the dupe and the line itself. For example, run against the three duplicate lines shown in the question, it would print all three of them, each prefixed with its line number.
To print only the number of the line, you could do
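something along these lines (a sketch that keeps only what comes before the first colon of grep -n's output):

    grep -n -- "$dupe" file.txt | cut -d : -f 1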
And to print only the line:
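One option in this sketch is simply to drop grep's -n flag:

    grep -- "$dupe" file.txt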
Explanation:

The awk script just prints the 1st space-separated field of the file (use $N to print the Nth field). sort sorts it, and uniq -c counts the occurrences of each line.

This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe; if $num is greater than one (so the key is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command line option, which is useful when $dupe can start with -.

No doubt the most verbose one in the list; it could probably be shorter:
gives, on a text file like:
an output like:
Once you have picked the lines to remove:
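For that step, one sketch (assuming, as in the question's example, that lines 1 and 3 are the ones to drop, and using GNU sed's in-place editing):

    sed -i '1d;3d' file.txt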
See the following sorted file.txt:

Because the list is short, I can see (after sorting) that there are three sets of duplicates.
Then, for example, I can choose to keep:
rather than
But for a longer list this will be difficult. Based on the two answers, one suggesting uniq and the other suggesting cut, I find that this command gives me the output I would like:
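A sketch of that kind of combination (space-separated fields assumed, with file.txt already sorted as above):

    cut -d ' ' -f 1 file.txt | uniq -d

It prints each first field that appears on two or more consecutive lines, i.e. exactly the URLs whose entries need a manual look.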
Here is how I solved it, using file_with_duplicates as the input. The three steps:

File sorted and deduped by columns 1 and 2.
File sorted only by columns 1 and 2.
Show the difference only.
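A sketch of those three steps (the output file names are arbitrary; -k1,2 tells sort to use columns 1 through 2 as the key, and -u keeps only the first line per key):

    sort -k1,2 -u file_with_duplicates > dedup_1_2     # sorted and deduped by columns 1 and 2
    sort -k1,2    file_with_duplicates > sorted_1_2    # sorted only by columns 1 and 2
    diff sorted_1_2 dedup_1_2                          # show only the difference: the duplicates that -u dropped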