I have a 2958616 byte text file. When I run sort < file.txt | uniq > sorted-file.txt, I get a 3213965 byte text file. Why is my sorted text file bigger?
You can download the text files here.
While your original file has lines that end with \n, your sorted file has \r\n. The addition of the \r is what changes the size.
To illustrate, here's what happens when I run your command on my Linux system:
As you can see, the sorted de-duped file is a few lines shorter and, consequently, a few bytes smaller. Your file, however, is different:
The two files have exactly the same number of lines, but:
The sorted-file.txt, the one I downloaded from your link, is larger. If we now examine the first line, we can see the extra \r:
It isn't present in the one I created on Linux:
If we now remove the \r from your file:
We get the expected result, a file that is smaller than the original, just like the one I created on my system:
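The effect can also be reproduced without the downloaded files. What follows is a minimal sketch on a small synthetic sample, not the commands used in this answer: the sample contents, the file names under the temporary directory, and the use of awk to simulate the CRLF output are all assumptions made for illustration.

```shell
#!/bin/sh
# Throwaway directory so no real files are touched.
tmp=$(mktemp -d)

# Hypothetical sample: five LF-terminated lines, one of them a duplicate.
printf 'pear\napple\nfig\napple\nkiwi\n' > "$tmp/file.txt"

# The command from the question, run where line endings stay LF.
sort < "$tmp/file.txt" | uniq > "$tmp/sorted-lf.txt"

# Simulate the asker's result: same lines, but each one ends in \r\n.
awk '{ printf "%s\r\n", $0 }' "$tmp/sorted-lf.txt" > "$tmp/sorted-crlf.txt"

# Inspect the first line byte by byte; the \r shows up right before the \n.
head -n 1 "$tmp/sorted-crlf.txt" | od -c

# Strip the carriage returns again.
tr -d '\r' < "$tmp/sorted-crlf.txt" > "$tmp/stripped.txt"

# $(( ... )) normalizes the whitespace some wc implementations print.
lines=$(($(wc -l < "$tmp/sorted-lf.txt")))
lf_size=$(($(wc -c < "$tmp/sorted-lf.txt")))
crlf_size=$(($(wc -c < "$tmp/sorted-crlf.txt")))
stripped_size=$(($(wc -c < "$tmp/stripped.txt")))

echo "lines=$lines lf=$lf_size crlf=$crlf_size stripped=$stripped_size"

rm -rf "$tmp"
```

The CRLF copy should be larger than the LF copy by exactly one byte per line, and stripping the \r should bring it back to the LF size.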
hexdump reveals it! Your sorted file is bigger because it uses Windows line endings \r\n (two bytes) instead of Linux line endings \n (one byte).
Could it be that you were running that command above under Windows, using tools like cygwin or the new Linux subsystem for Windows 10? Or did you maybe run something in Wine?
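The one-byte-versus-two-byte explanation can be sanity-checked against the sizes given in the question. The sketch below assumes that every line gained exactly one \r and that uniq removed no lines. In that case the size difference should equal the number of carriage returns in the sorted file. The helper count_cr is a hypothetical name introduced here, and the small CRLF file is a synthetic stand-in for the real one.

```shell
#!/bin/sh
original=2958616   # size of file.txt, from the question
sorted=3213965     # size of sorted-file.txt, from the question

# Extra bytes in the sorted file; one per line if each \n became \r\n.
extra=$((sorted - original))
echo "extra bytes: $extra"   # prints "extra bytes: 255349"

# Hypothetical helper: count the \r bytes in a file.
count_cr() {
    # tr -cd keeps only \r characters; wc -c then counts them.
    echo $(($(tr -cd '\r' < "$1" | wc -c)))
}

# Synthetic two-line CRLF file standing in for the asker's sorted-file.txt.
demo=$(mktemp)
printf 'a\r\nb\r\n' > "$demo"
demo_cr=$(count_cr "$demo")
echo "carriage returns in demo file: $demo_cr"
rm -f "$demo"
```

If running count_cr on the real sorted-file.txt returned the same number as the size difference, that would be consistent with the line endings accounting for the whole increase.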