I have one xlsx file (a 110725x9 matrix) and I saved it as text (tab delimited) because I don't know whether Unix tools can handle xlsx files. Duplicate rows are always consecutive.
For example, suppose the text file is as follows. You will see that rows 3 and 4, rows 7 and 8, and rows 17 and 18 are duplicates of each other (same leading columns). I'd like to always remove the upper line of each duplicate pair, not the lower one.
2009,37214611872 2009 135 20 17,1 17,4 19,2 21,8 24,1
2009,37237442922 2009 135 22 16,5 14,5 12,6 11,2 10,5
2009,37260273973 2009 136 0 7,7 7,2 7,1 7,3 7,5
2009,37260273973 2009 136 0 7,7 7,2 7,0 7,2 7,4
2009,37488584475 2009 136 20 14,6 15,1 16,4 18,3 20,1
2009,37511415525 2009 136 22 15,9 14,6 12,8 10,9 9,4
2009,37534246575 2009 137 0 8,2 6,9 6,2 6,2 6,4
2009,37534246575 2009 137 0 8,1 6,8 6,1 6,0 6,3
2009,37557077626 2009 137 2 6,8 6,7 6,5 6,3 6,2
2009,37579908676 2009 137 4 5,8 5,6 5,4 5,4 5,7
2009,37602739726 2009 137 6 6,3 6,1 5,9 5,8 5,8
2009,37625570776 2009 137 8 4,5 5,2 6,0 6,6 7,2
2009,37648401826 2009 137 10 9,6 9,0 8,4 8,4 9,1
2009,37671232877 2009 137 12 11,4 11,7 12,4 13,4 14,4
2009,37694063927 2009 137 14 12,4 13,1 14,2 15,4 16,7
2009,37785388128 2009 137 22 15,5 14,0 12,2 10,3 8,7
2009,37808219178 2009 138 0 6,3 5,8 5,5 5,5 5,8
2009,37808219178 2009 138 0 6,2 5,7 5, 4 5,4 5,7
So the output should look like this:
2009,37214611872 2009 135 20 17,1 17,4 19,2 21,8 24,1
2009,37237442922 2009 135 22 16,5 14,5 12,6 11,2 10,5
2009,37260273973 2009 136 0 7,7 7,2 7,0 7,2 7,4
2009,37488584475 2009 136 20 14,6 15,1 16,4 18,3 20,1
2009,37511415525 2009 136 22 15,9 14,6 12,8 10,9 9,4
2009,37534246575 2009 137 0 8,1 6,8 6,1 6,0 6,3
2009,37557077626 2009 137 2 6,8 6,7 6,5 6,3 6,2
2009,37579908676 2009 137 4 5,8 5,6 5,4 5,4 5,7
2009,37602739726 2009 137 6 6,3 6,1 5,9 5,8 5,8
2009,37625570776 2009 137 8 4,5 5,2 6,0 6,6 7,2
2009,37648401826 2009 137 10 9,6 9,0 8,4 8,4 9,1
2009,37671232877 2009 137 12 11,4 11,7 12,4 13,4 14,4
2009,37694063927 2009 137 14 12,4 13,1 14,2 15,4 16,7
2009,37785388128 2009 137 22 15,5 14,0 12,2 10,3 8,7
2009,37808219178 2009 138 0 6,2 5,7 5, 4 5,4 5,7
How can I do that without sorting?
To remove duplicates based on a single column, you can use awk. You can see an explanation for this in this Unix & Linux post.
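The command itself is missing from this copy of the answer. A minimal sketch of the usual awk idiom for this, not necessarily the exact command the answer used, is below. It keeps the first line seen for each value of column 1, which is why it is not enough for this question. The wrapper function name keep_first is hypothetical and only there so the filter can be reused.

```shell
# keep_first: hypothetical helper, reads lines on stdin.
# seen[$1]++ is 0 (false) the first time a key appears, so !seen[$1]++
# is true only for the first line with that key, and awk prints it.
keep_first() {
    awk '!seen[$1]++'
}
```

Usage would be something like keep_first < file.txt > out.txt, where file.txt is a placeholder for the tab-delimited export.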
Removing the older lines is more complicated. Given that duplicates always come together, you can do:
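The original command did not survive in this copy, so what follows is a sketch reconstructed from the explanation, assuming whitespace or tab separated fields and duplicates that are always adjacent. The function name keep_last and the guard on NR are my additions, not part of the answer.

```shell
# keep_last: hypothetical helper, reads lines on stdin and keeps the
# last line of each run of lines sharing the same first field.
keep_last() {
    awk '
        # First block: the key changed, so the stored line is final; print it.
        prev != "" && $1 != prev { print seen[prev] }
        # Middle block: store the current line under its key, remember the key.
        { seen[$1] = $0; prev = $1 }
        # END: the last group is still pending (skip if the input was empty).
        END { if (NR) print seen[prev] }
    '
}
```

Usage would be something like keep_last < file.txt > out.txt. Nothing is sorted, so the original order of the rows is preserved.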
Here, in the middle block, {seen[$1] = $0} saves the current line ($0) to the seen array with the first field ($1) as index, then saves the first field in the prev variable. This prev is used in the first block when processing the next line.
In the first block, then, we check if prev is set (only true for the second line onwards) and not equal to the current first field (here prev was set while processing the previous line). If both conditions hold, we have moved past the duplicates and can print the previous line. At the END, we do that again for the last line.
Using tac and uniq.
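The command for this answer is also missing here. One way it could look, assuming GNU coreutils (tac, and uniq with the -w option) and assuming the first field has a fixed width, is sketched below. In the sample data the first field is 16 characters long (for example 2009,37214611872), but that is only an observation from the sample, not something guaranteed for the whole file. The function name keep_last_tac is hypothetical.

```shell
# keep_last_tac WIDTH: hypothetical helper, reads lines on stdin.
# tac reverses the lines, uniq -w WIDTH keeps the first line of each run
# whose first WIDTH characters match (which is the last line in the
# original order), and the second tac restores the original order.
keep_last_tac() {
    tac | uniq -w "$1" | tac
}
```

Usage would be something like keep_last_tac 16 < file.txt > out.txt. If the key width varies between rows, the awk approach is the safer choice.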