So I have an enormously large file (around 10 GB) and need to sort it, just as with the 'sort' utility, but somewhat more efficiently.
The problem is that I don't have the memory, CPU power, time, or free swap space to run the whole sort.
The good thing is that the file is already partially ordered (I can say that every line's distance from its final position is less than some value N). This reminds me of the classic computer-class example of using heapsort with a heap of size N for exactly this purpose (see the sketch below).
Question: Is there some unix tool that already does this efficiently, or do I need to code one myself?
Thanks -mk
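If it does come down to coding it yourself, here is a minimal sketch of the size-N heap idea described in the question. It is only a sketch under a few assumptions: N is known up front, byte-wise line comparison (roughly what sort does under LC_ALL=C) is acceptable, and the script name and arguments are placeholders.

    #!/usr/bin/env python3
    # Sliding-window heapsort for a file in which every line is at most
    # N positions away from its final sorted position.
    # Memory use is O(N) lines; time is O(n log N).
    # Assumptions for this sketch: N is known, lines compare byte-wise
    # (roughly LC_ALL=C order), and file names come from argv.
    import heapq
    import sys

    def sort_nearly_sorted(infile, outfile, n):
        heap = []
        with open(infile, "rb") as src, open(outfile, "wb") as dst:
            for line in src:
                heapq.heappush(heap, line)
                # Once the window holds more than N lines, its smallest
                # entry must be the next line of the final output.
                if len(heap) > n:
                    dst.write(heapq.heappop(heap))
            # Drain the remaining window at end of file.
            while heap:
                dst.write(heapq.heappop(heap))

    if __name__ == "__main__":
        # usage: nearly_sorted_sort.py input output N
        sort_nearly_sorted(sys.argv[1], sys.argv[2], int(sys.argv[3]))

This streams the file once and never holds more than N+1 lines in memory, which is what makes it cheaper than a full sort when N is small.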
It would be easier to split the file into smaller sections and sort those. To split, use split(1).
Then sort each of those pieces using a normal sort.
You can then combine the sorted pieces by merge sorting them with sort -m.
That should be much easier on your machine.
sort uses an R-way merge sort algorithm. The fastest way to do your work would be to run sort directly on the whole file; this implies O(n log n) time complexity and O(n) space.
If you partition the data first, you will probably pay for it in terms of time.
The code above has an issue: with sort -m, the files are not guaranteed to be mutually sorted.
From the unix manual:

    -m, --merge
           merge already sorted files; do not sort
e.g.

file1: a b c k l q
file2: d e m

gives:

a b c k l q d e m

which is not sorted.
Also, the fact that the elements are fewer than N places from their final positions does not guarantee a sorted output with the above code:

file: a e b c d h f g

In this file N = 3 and all elements are fewer than 3 places from their proper position. Splitting gives

file1: h f g, file2: b c d, file3: a e

sorting each piece produces

file1: f g h, file2: b c d, file3: a e

and merging outputs:

a e b c d f g h

which is wrong.