Say I have a file named jobs.csv and I would like to get the first 50k unique jobs processed by Foo.
I can either do:
# cat jobs.csv | sort -u | head -n 50000 > /tmp/jobs.csv
# cat /tmp/jobs.csv | while read -r line; do Foo --job="$line"; done
Or
# cat jobs.csv | sort -u | head -n 50000 | while read -r line; do Foo --job="$line"; done
Can one tell which one is better in terms of the system's I/O and memory efficiency?
Or, even better, can one come up with a better solution for this?
I normally go for the second option (pipes all the way) unless one of the intermediate outputs is useful to me for another task. For example, if after running Foo against the 50k jobs you then wanted to run Bar against the same jobs, it would be useful to have /tmp/jobs.csv available.

Using pipes all the way gives the system the ability to forget about data at the earliest possible time, so it is a more efficient use of memory. It also bypasses the VFS and tmpfs stacks, so it uses marginally less CPU. The overall performance of the chain is faster too, because you don't have to wait for one step to finish before the next one starts (unless a particular program needs all of its input before it can produce any output).
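If you do want /tmp/jobs.csv around for a later Bar run but would still rather keep a single pipeline, tee gives you both at once. A sketch, with Foo and Bar standing in for the question's placeholder commands:

# cat jobs.csv | sort -u | head -n 50000 | tee /tmp/jobs.csv | while read -r line; do Foo --job="$line"; done
# cat /tmp/jobs.csv | while read -r line; do Bar --job="$line"; done

tee writes the 50k selected lines to /tmp/jobs.csv as they stream past, so the loop still starts immediately and the file is there for the second pass.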
By the way, in your example the biggest user of memory would be the sort stage, because it needs to keep the entire contents of jobs.csv in memory in order to sort it. You can make it more efficient by improving whatever creates jobs.csv in the first place, so that you no longer need the sort -u.
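If you can't change the producer, GNU sort at least lets you cap its in-core buffer and spill to temporary files instead. A sketch, assuming GNU coreutils sort (its -S and -T options):

# cat jobs.csv | sort -u -S 64M -T /tmp | head -n 50000 | while read -r line; do Foo --job="$line"; done

That trades memory for disk I/O, so it is only worth doing when jobs.csv is genuinely large.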