I am looking to sort a list of domain names (a web filter whitelist) starting from the TLD and working leftwards. I am looking for any *nix or Windows tools that can do this easily, though a script would be fine too.
So if this is the list you are given:
www.activityvillage.co.uk
ajax.googleapis.com
akhet.co.uk
alchemy.l8r.pl
au.af.mil
bbc.co.uk
bensguide.gpo.gov
chrome.angrybirds.com
cms.hss.gov
crl.godaddy.com
digitalhistory.uh.edu
digital.library.okstate.edu
digital.olivesoftware.com
This is what I want as the output.
chrome.angrybirds.com
crl.godaddy.com
ajax.googleapis.com
digital.olivesoftware.com
digital.library.okstate.edu
digitalhistory.uh.edu
bensguide.gpo.gov
cms.hss.gov
au.af.mil
alchemy.l8r.pl
www.activityvillage.co.uk
akhet.co.uk
bbc.co.uk
Just in case you are wondering why: SquidGuard has a bug/design flaw. If both www.example.com and example.com are included in a list, then the example.com entry is ignored and you can only visit content from www.example.com. I have several large lists that need some cleanup because someone added entries without looking first.
This simple Python script will do what you want. In this example I name the file domain-sort.py. To run it use:
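The script itself did not survive in this copy of the answer. A sketch consistent with the description that follows (more or less a one-liner built around `[::-1]`; reading domains from stdin is an assumption, since the original invocation is also missing):

```python
import sys

# The sort key: split on dots and reverse with [::-1], which makes a
# reversed copy of the list, so the TLD compares first, then the
# second-level domain, and so on.
domain_key = lambda d: d.split('.')[::-1]

if __name__ == '__main__':
    # Read one domain per line from stdin and print them sorted.
    print('\n'.join(sorted(sys.stdin.read().split(), key=domain_key)))
```

Assuming stdin input, the invocation would look something like `python domain-sort.py < domains.txt > sorted.txt`.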
Note that this looks a little uglier: since I wrote this as more or less a simple one-liner, I had to use the slice notation [::-1], where negative values work to make a copy of the same list in reverse order, instead of using the more declarative reverse(), which does it in-place in a way that breaks composability.

And here's a slightly longer, but maybe more readable, version that uses reversed(), which returns an iterator, hence the need to also wrap it in list() to consume the iterator and produce a list. On a file with 1,500 randomly sorted lines it takes ~0.02 seconds; on a file with 150,000 randomly sorted lines it takes a little over 3 seconds.
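The longer version is also missing from this copy; a sketch matching the reversed()-plus-list() description:

```python
import sys

def domain_key(domain):
    # reversed() returns an iterator, so wrap it in list() to get a
    # comparable sort key (TLD first, then each label moving left).
    return list(reversed(domain.split('.')))

if __name__ == '__main__':
    for d in sorted(sys.stdin.read().split(), key=domain_key):
        print(d)
```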
Here is an arguably more readable version that does the reverse() and sort() in-place, but it runs in the same amount of time and actually takes slightly more memory. On a file with 1,500 randomly sorted lines it takes ~0.02 seconds; on a file with 150,000 randomly sorted lines it takes a little over 3 seconds.
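The in-place version is not preserved above either; a sketch of the decorate/sort/undecorate shape it describes, again assuming stdin input:

```python
import sys

def sort_domains(domains):
    # Decorate: split each domain and flip its parts with in-place reverse().
    keyed = [d.split('.') for d in domains]
    for parts in keyed:
        parts.reverse()
    # Sort the decorated lists with in-place sort().
    keyed.sort()
    # Undecorate: reverse each back to its original part order and rejoin.
    for parts in keyed:
        parts.reverse()
    return ['.'.join(parts) for parts in keyed]

if __name__ == '__main__':
    print('\n'.join(sort_domains(sys.stdin.read().split())))
```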
Here's a PowerShell script that should do what you want. Basically it throws all the domains into an array, reverses each one, sorts the array, reverses each entry back to its original order, and then saves the result to another file.
Ran it on 1,500 records - took 5 seconds on a reasonably powerful desktop.
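The PowerShell script itself is not reproduced in this copy; the reverse-sort-reverse idea it describes can be sketched (in Python here, not PowerShell) as:

```python
def sort_by_reversed_string(domains):
    # Reverse every entry character-by-character, sort the reversed
    # strings, then reverse each one back to its original spelling.
    flipped = [d[::-1] for d in domains]
    flipped.sort()
    return [f[::-1] for f in flipped]

if __name__ == '__main__':
    # Hypothetical input; note example.com lands next to www.example.com,
    # which is what matters for the whitelist cleanup in the question.
    for d in sort_by_reversed_string(['www.example.com', 'bbc.co.uk', 'example.com']):
        print(d)
```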
cat domain.txt | rev | sort | rev
Slightly less cryptic, or at least prettier, Perl:
This is a simple example of a Guttman–Rosler transform: we convert the lines into the appropriate sortable form (here, split the domain name on periods and reverse the order of the parts), sort them using the native lexicographic sort and then convert the lines back to their original form.
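The Perl did not survive in this copy of the answer; a Python rendering of the transform just described (split, reverse, sort, then undo):

```python
def grt_sort(lines):
    # Decorate: rewrite each domain with its parts in reverse order (TLD first).
    decorated = ['.'.join(reversed(line.split('.'))) for line in lines]
    # Sort with the native lexicographic sort.
    decorated.sort()
    # Undecorate: restore the original part order.
    return ['.'.join(reversed(d.split('.'))) for d in decorated]
```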
In Unix scripting: reverse, sort and reverse:
Here it is in (short and cryptic) Perl:
What this does is reverse each field in the domain name, sort, and reverse back.
This truly sorts the domain list lexicographically, based on each part of the domain name, from right to left.
The reverse solution (rev <<<filename>>> | sort | rev) does not; I've tried it.
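A small, hypothetical illustration of that claim (not from the original answer): character-level reversal lets the comparison run across label boundaries, so it can disagree with a part-by-part sort.

```python
domains = ['x.ab.com', 'b.com']

# Part-wise sort compares ('com', 'ab', 'x') with ('com', 'b'):
# 'ab' < 'b', so x.ab.com sorts first.
partwise = sorted(domains, key=lambda d: d.split('.')[::-1])

# Character-wise rev | sort | rev compares 'moc.ba.x' with 'moc.b':
# 'moc.b' is a prefix of 'moc.ba.x', so b.com sorts first.
charwise = [s[::-1] for s in sorted(d[::-1] for d in domains)]

print(partwise)  # ['x.ab.com', 'b.com']
print(charwise)  # ['b.com', 'x.ab.com']
```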