So I have an enormously large file (around 10 GB) and need to sort it, just as with the 'sort' utility, but somewhat more efficiently.
The problem is that I don't have the memory, CPU power, time, or free swap space to run the whole sort.
The good thing is that the file is already partially ordered (I can say that every line's distance from its final position is less than some value N). This reminds me of the classic computer-class example of using heapsort with a heap of size N for exactly this purpose (see the sketch below).
Question: Is there some unix tool that already does this efficiently, or do I need to code one myself?
Thanks -mk
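If it does come down to coding it yourself, here is a minimal sketch of the size-N heap idea described in the question. It is only a sketch under a few assumptions: N is known up front, byte-wise line comparison (roughly what sort does under LC_ALL=C) is acceptable, and the script name and arguments are placeholders.

    #!/usr/bin/env python3
    # Sliding-window heapsort for a file in which every line is at most
    # N positions away from its final sorted position.
    # Memory use is O(N) lines; time is O(n log N).
    # Assumptions for this sketch: N is known, lines compare byte-wise
    # (roughly LC_ALL=C order), and file names come from argv.
    import heapq
    import sys

    def sort_nearly_sorted(infile, outfile, n):
        heap = []
        with open(infile, "rb") as src, open(outfile, "wb") as dst:
            for line in src:
                heapq.heappush(heap, line)
                # Once the window holds more than N lines, its smallest
                # entry must be the next line of the final output.
                if len(heap) > n:
                    dst.write(heapq.heappop(heap))
            # Drain the remaining window at end of file.
            while heap:
                dst.write(heapq.heappop(heap))

    if __name__ == "__main__":
        # usage: nearly_sorted_sort.py input output N
        sort_nearly_sorted(sys.argv[1], sys.argv[2], int(sys.argv[3]))

This streams the file once and never holds more than N+1 lines in memory, which is what makes it cheaper than a full sort when N is small.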
It would be easier to split the file into smaller sections and sort those. To split, use split(1).
Then sort each of those pieces using a normal sort.
You can then combine the sorted pieces by merge sorting them with sort -m.
That should be much easier on your machine.
sort uses an R-way merge sort algorithm. The fastest way to do your work would be to run sort directly on the whole file; this implies O(n log n) time complexity and O(n) space.
If you partition the data first, you will probably pay for it in terms of time.
The code above has an issue: with sort -m, the files are not guaranteed to be mutually sorted.
From the unix manual:

    -m, --merge
           merge already sorted files; do not sort
e.g.

file1: a b c k l q
file2: d e m

gives:

a b c k l q d e m

which is not sorted.
Also, the fact that the elements are fewer than N places from their final positions does not guarantee a sorted output with the above code:

file: a e b c d h f g

In this file N = 3 and all elements are fewer than 3 places from their proper position. Splitting gives

file1: h f g, file2: b c d, file3: a e

sorting each piece produces

file1: f g h, file2: b c d, file3: a e

and merging outputs:

a e b c d f g h

which is wrong.