In this answer (How can I remove the first line of a file with sed?) there are two ways to delete the first record in a file:
sed '1d' $file >> headerless.txt
** ---------------- OR ----------------**
tail -n +2 $file >> headerless.txt
Personally I think the tail option is cosmetically more pleasing and more readable, but probably because I'm sed-challenged.
Which method is fastest?
Performance of sed vs. tail to remove the first line of a file
TL;DR
sed is very powerful and versatile, but this is what makes it slow, especially for large files with many lines.
tail does just one simple thing, but that one it does well and fast, even for bigger files with many lines.
For small and medium sized files, sed and tail are performing similarly fast (or slow, depending on your expectations). However, for larger input files (multiple MBs), the performance difference grows significantly (an order of magnitude for files in the range of hundreds of MBs), with tail clearly outperforming sed.
Experiment
General Preparations:
Our commands to analyze are:
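As a minimal sketch, the two candidates from the question can be run like this. The file name testfile and its three-line content are placeholders chosen only for illustration:

```shell
# Placeholder input; any text file works here.
testfile=$(mktemp)
printf 'header\nrow1\nrow2\n' > "$testfile"

# Candidate 1: sed deletes line 1 and prints the rest.
sed '1d' "$testfile" > /dev/null

# Candidate 2: tail prints everything starting at line 2.
tail -n +2 "$testfile" > /dev/null

# Sanity check: both commands should emit identical output.
sed_out=$(sed '1d' "$testfile")
tail_out=$(tail -n +2 "$testfile")
[ "$sed_out" = "$tail_out" ] && echo "outputs match"
```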
Note that I'm redirecting the output to /dev/null each time to eliminate the terminal output or file writes as a performance bottleneck.
Let's set up a RAM disk to eliminate disk I/O as a potential bottleneck. I personally have a tmpfs mounted at /tmp, so I simply placed my testfile there for this experiment.
Then I create, once, a random test file containing a specified number of lines $numoflines with random line length and random data using this command (note that it's definitely not optimal, it becomes really slow for about >2M lines, but who cares, it's not the thing we're analyzing):
Oh, btw. my test laptop is running Ubuntu 16.04, 64 bit on an Intel i5-6200U CPU. Just for comparison.
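A random test file of the kind described above could be generated along these lines. This is only a sketch and not necessarily the command used for the measurements. The small line count and the use of mktemp for the file location are assumptions for illustration:

```shell
# Assumption: a small line count for illustration; the text uses up to 10000000.
numoflines=100
testfile=$(mktemp)

# Keep only alphanumerics and newlines from the random byte stream, so that
# line lengths vary randomly; head stops after the requested number of lines.
LC_ALL=C tr -dc 'A-Za-z0-9\n' < /dev/urandom | head -n "$numoflines" > "$testfile"

# Report how many lines were written.
wc -l < "$testfile"
```

Since roughly one in 63 of the retained characters is a newline, lines come out at about 60 characters on average.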
Timing big files:
Setting up a huge testfile:
Running the command above with numoflines=10000000 produced a random file containing 10M lines, occupying a bit over 600 MB - it's quite huge, but let's start with it, because we can:
Perform the timed run with our huge testfile:
Now let's do just a single timed run with both commands first to estimate with what magnitudes we're working.
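A single timed run of each command might look like the following sketch. It assumes Bash (for the time keyword) and uses a small seq-generated stand-in file rather than the 600 MB file, so the numbers it prints are not comparable to the ones discussed here:

```shell
# Assumption: a small stand-in file; the text uses a ~600 MB file instead.
testfile=$(mktemp)
seq 1 100000 > "$testfile"

# One timed run per tool; Bash prints the timing report on stderr.
time sed '1d' "$testfile" > /dev/null
time tail -n +2 "$testfile" > /dev/null
```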
We already see a really clear result for big files: tail is an order of magnitude faster than sed. But just for fun and to be sure there are no random side effects making a big difference, let's do it 100 times:
The conclusion stays the same: sed is inefficient at removing the first line of a big file, and tail should be used there.
And yes, I know Bash's loop constructs are slow, but we're only doing relatively few iterations here and the time a plain loop takes is not significant compared to the sed/tail runtimes anyway.
Timing small files:
Setting up a small testfile:
Now for completeness, let's look at the more common case that you have a small input file in the kB range. Let's create a random input file with numoflines=100, looking like this:
Perform the timed run with our small testfile:
As we can expect the timings for such small files to be in the range of a few milliseconds from experience, let's just do 1000 iterations right away:
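The repeated runs can be sketched as a timed loop. This assumes Bash, a 100-line stand-in file created with seq, and a hypothetical iterations variable. The same pattern with iterations=100 would fit the big-file case:

```shell
# Assumptions: Bash, a 100-line stand-in file, 1000 iterations as in the text.
testfile=$(mktemp)
seq 1 100 > "$testfile"
iterations=1000

# Time the whole loop rather than each run, so per-run noise averages out.
time for i in $(seq 1 "$iterations"); do sed '1d' "$testfile" > /dev/null; done
time for i in $(seq 1 "$iterations"); do tail -n +2 "$testfile" > /dev/null; done
```

Timing the loop as a whole also includes the loop overhead itself, which is the same for both tools.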
As you can see, the timings are quite similar, so there's not much to interpret or wonder about. For small files, both tools are equally well suited.
Here's another alternative, using just bash builtins and cat:
$file is redirected into the { } command grouping. The read simply reads and discards the first line. The rest of the stream is then passed on to cat, which writes it to the destination file.
On my Ubuntu 16.04 the performance of this and the tail solution are very similar. I created a largish test file with seq:
tail solution:
cat/brace solution:
I only have an Ubuntu VM handy right now though, and saw significant variation in the timings of both, though they're all in the same ballpark.
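The brace-grouping approach described above might look like this minimal sketch. The variable names file, out and discarded_first_line are placeholders, and the five-line input is only for illustration:

```shell
# Placeholder input and output files.
file=$(mktemp)
out=$(mktemp)
seq 1 5 > "$file"

# read consumes the first line from the shared stdin; cat then copies
# whatever is left of the stream to the destination file.
{ read -r discarded_first_line; cat > "$out"; } < "$file"

cat "$out"
```

Naming a variable for read keeps the sketch working in shells other than Bash, where a bare read may not be accepted.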
Trying it on my system, and prefixing each command with time, I got the following results:
sed:
and tail:
which suggest that, on my system at least (AMD FX 8250 running Ubuntu 16.04), tail is significantly faster. The test file had 10,000 lines with a size of 540k. The file was read from an HDD.
There is no objective way to say which is better, because sed and tail aren't the only things that run on a system during program execution. A lot of factors such as disk I/O, network I/O, and CPU interrupts for higher priority processes all influence how fast your program will run.
Both of them are written in C, so this is not a language issue, but more of an environmental one. For example, I have an SSD and on my system this will take time in microseconds, but for the same file on a hard drive it will take more time because HDDs are significantly slower. So hardware plays a role in this, too.
There are a few things that you may want to keep in mind when considering which command to choose:
sed is a stream editor for transforming text. tail is for outputting specific lines of text. If you want to deal with lines and only print them out, use tail. If you want to edit the text, use sed.
tail has far simpler syntax than sed, so use what you can read yourself and what others can read.
Another important factor is the amount of data you're processing. Small files won't give you any performance difference. The picture gets interesting when you're dealing with big files. With a 2 GB BIGFILE.txt, we can see that sed has far more system calls than tail, and runs considerably slower.

The top answer didn't take the disk into account, because it redirects to > /dev/null. If you have a large file and don't want to create a temporary duplicate on your disk, try vim -c.
Edit: if the file is larger than available memory, vim -c doesn't work. It looks like it's not smart enough to do an incremental load of the file.

Other answers show well what is better for creating a new file with the first line missing. If you want to edit a file as opposed to creating a new file though, I bet ed would be faster because it shouldn't create a new file at all. But you have to search how to remove a line with ed because I used it only once.
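For reference, removing the first line with ed can be sketched as follows. The file here is a five-line placeholder, and the check for ed is there because it is not installed on every system. This only shows the editing commands and makes no claim about speed:

```shell
# Placeholder file with five lines.
file=$(mktemp)
seq 1 5 > "$file"

if command -v ed > /dev/null 2>&1; then
    # 1d deletes line 1, w writes the buffer back to the same file, q quits.
    # -s suppresses the byte counts ed would otherwise print.
    printf '%s\n' 1d w q | ed -s "$file"
    cat "$file"
else
    echo "ed is not installed" >&2
fi
```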