I'm developing a bash script and came across the following strange behaviour!
$ echo £ |cut -c 1
�
The £ sign is passed to the next command, cut, whose filter picks one character only.
When I modify the filter in the cut command to pick 2 characters, the £ is passed through!
$ echo £ |cut -c 1-2
£
It's not a severe problem, and I have a workaround in the script, but why does the filter in the cut command require 2 positions instead of 1 when picking a £ sign?
The cut command in Ubuntu is not multi-byte character aware. For this version of cut, characters are the same as bytes.
The pound sign (£) is a UTF-8 character that consists of two bytes (c2 and a3). Note: echo also appends a 0a byte, which is the "New Line" (ASCII "Line Feed") character.
When you cut the first character from the line, you are selecting only the c2 part of £, which is not a valid UTF-8 sequence on its own. As a result, you get the strange question mark � (the replacement character) on screen.
Note: The above was tested with the latest version of cut in Ubuntu 20.10 (GNU coreutils version 8.32).
If you want to select multi-byte characters, you can use the grep command (GNU grep version 3.4) instead.
This answer was improved with the help of the comments.
In UTF-8 encoding, £ is the two-byte sequence 0xC2 0xA3 (c2a3), which is 11000010 10100011 in binary.
So it is two bytes (like two characters). cut -c treats each byte as a character, so selecting only the first byte produces �.
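To see the two bytes and their binary form for yourself, something like the sketch below should work in any POSIX shell. The helper name to_binary is hypothetical, introduced only for this illustration, and the pound sign is again written with octal escapes rather than typed literally.

```shell
# Hypothetical helper: print an integer 0-255 as eight binary digits.
to_binary() {
    n=$1
    bits=""
    i=0
    while [ "$i" -lt 8 ]; do
        bits="$((n % 2))$bits"   # prepend the lowest remaining bit
        n=$((n / 2))
        i=$((i + 1))
    done
    printf '%s' "$bits"
}

# Count the bytes in the UTF-8 encoding of the pound sign.
printf '\302\243' | wc -c

# Print each byte in binary; od -tu1 lists the bytes as unsigned decimals.
for byte in $(printf '\302\243' | od -An -tu1); do
    printf '%s ' "$(to_binary "$byte")"
done
echo
```

The byte count should be 2, and the loop should print the same bit patterns quoted above, one group per byte.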