I would expect bash sort to compare strings like this:
- Start at the first char (of both strings)
- If the chars are equal, proceed to the next char
- If they are unequal, return greater/lesser result to sort algorithm
- If there are no more chars, return equals
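The steps above can be written out as a small Python sketch. The function name `compare_charwise` is made up for illustration, and the tie-break for strings of different length (the shorter one sorts first) is an assumption that the list above does not spell out.

```python
from functools import cmp_to_key


def compare_charwise(a: str, b: str) -> int:
    """Return -1, 0 or 1, in the spirit of C's strcmp (hypothetical helper)."""
    for ca, cb in zip(a, b):
        if ca != cb:
            # The first differing position decides the order (by code point).
            return -1 if ca < cb else 1
    # Assumption: if one string is a prefix of the other, the shorter sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


lines = ["bb.de", "b.de"]
print(sorted(lines, key=cmp_to_key(compare_charwise)))  # ['b.de', 'bb.de']
```

With this comparison, b.de comes before bb.de because '.' has a lower code point than 'b'.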
For some reason, it seems like this is not the case.
Let's take the following input:
a
b
.
-
This is sorted by bash sort as
-
.
a
b
Now, for input
b.de
bb.de
I would expect the following sort result:
b.de
bb.de
Because the first char is equal, and for the second char, . comes before b (as seen in the first test).
For some reason, this is not the case, the strings are sorted like this:
bb.de
b.de
Why is sort behaving this way, and is there a way to make it behave "as expected"?
I have tested the same examples with python, and python sorts as expected.
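For reference, a minimal version of that Python test might look like the following. It relies on Python's built-in `sorted`, which compares `str` values by Unicode code point and ignores the locale; the variable names are only illustrative.

```python
# Python compares str values code point by code point, independent of locale.
first = sorted(["a", "b", ".", "-"])
second = sorted(["b.de", "bb.de"])

print(first)   # ['-', '.', 'a', 'b']
print(second)  # ['b.de', 'bb.de']

# The code points explain the order: '-' is 45, '.' is 46, 'a' is 97, 'b' is 98.
print([ord(c) for c in "-.ab"])  # [45, 46, 97, 98]
```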
By default, sort does a locale-aware comparison that uses the collation rules of your locale; see strcoll(3).
ltrace(1) got me this:
strcoll("b.de", "bb.de") = 20
Locale-aware comparisons seem to split strings into words and sort on those. As words never start with '.', sort sees a zero-length word and puts that at the start of the list. However, '.' is allowed inside words, e.g. "Jr." and "Ph.D".
If you require a byte-wise comparison instead, export LC_COLLATE=C or LC_COLLATE=POSIX (note that LC_ALL, if set, overrides LC_COLLATE).
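The effect of the collation locale can also be checked from Python, whose `locale.strcoll` is a wrapper around the C library's collation function. This is only a sketch: the helper name `collate_in` is made up, the locale name `en_US.UTF-8` is an assumption that may not be installed on a given system, and the value returned under a non-C locale depends on the platform's collation tables.

```python
import locale


def collate_in(locale_name, a, b):
    """Return locale.strcoll(a, b) under locale_name, or None if it is unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return locale.strcoll(a, b)
    except locale.Error:
        # The requested locale is not installed on this system.
        return None
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


# In the C locale the result is negative: "b.de" sorts before "bb.de".
print(collate_in("C", "b.de", "bb.de"))

# In a natural-language locale the sign may differ, as in the ltrace output above.
print(collate_in("en_US.UTF-8", "b.de", "bb.de"))
```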
I checked the coreutils package, and if you don't provide any arguments, it looks as if it (eventually) uses the C strcmp routine. The only case where that isn't true is where the values in lines can be interpreted as integers. The man page of which says:
This means that the strcmp of bb.de and b.de really is down to the last character. That is, if 'd' < 'e', which (in ASCII at least) would be if 100 < 101, which is true.