For official Ubuntu documentation, where the source English files are in DocBook XML, there is a requirement that the files contain only ASCII characters. We use a "checker" command line (see here):
grep --color='auto' -P -n "[\x80-\xFF]" *.xml
However, the command has a flaw (apparently not on all computers): it misses some lines with non-ASCII characters, potentially resulting in a false O.K. result.

Does anyone have a better suggestion for an ASCII checker command line?

Interested persons might consider using this file (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 were missed in the check:
$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?
You can print all non-ASCII lines of a file using my Python 3 script, which I am hosting on GitHub here:
GitHub: ByteCommander/encoding-check
You can either clone or download the entire repository, or simply save the file encoding-check and make it executable using chmod +x encoding-check.

Then you can run it like this, with the file to check as the only argument:

./encoding-check FILENAME

if it's located in your current working directory, or...

/path/to/encoding-check FILENAME

if it's located in /path/to/, or...

encoding-check FILENAME

if it's located in a directory that is part of the $PATH environment variable, e.g. /usr/local/bin or ~/bin.

Without any optional arguments, it will print each line and its number where it found non-ASCII characters. Finally, there's a summary line that tells you how many lines the file had in total and how many of them contained non-ASCII characters.
This method is guaranteed to properly decode all ASCII characters and detect everything that is definitely not ASCII.
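Conceptually, the check boils down to something like this minimal sketch (my own illustration, not the actual encoding-check script): read the file in binary mode and strictly decode each line, which either succeeds or raises UnicodeDecodeError.

#!/usr/bin/env python3
# Minimal sketch, NOT the real encoding-check script:
# strict decoding as ASCII either succeeds or raises UnicodeDecodeError.
import sys

def check(path, encoding="ascii"):
    total = bad = 0
    with open(path, "rb") as f:
        for total, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad += 1
                # Line number plus the line, undecodable bytes shown as U+FFFD
                print(total, raw.decode(encoding, errors="replace").rstrip())
    print(f"{bad} of {total} lines contained non-{encoding.upper()} characters")

if __name__ == "__main__":
    check(sys.argv[1])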
Here's an example run on a file containing the first 20 lines of your given install.en.txt: it flags lines 9, 14 and 18 and reports that 3 of the 20 lines contained non-ASCII characters.

But the script has some additional arguments to tweak the checked encoding and the output format. View the help and try them.
As --encoding, every codec that Python 3 knows is valid. Just try one; in the worst case you get a little error message...

If you want to look for non-ASCII characters, perhaps you should invert the search to exclude ASCII characters.
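For example, one way to write that inverted search (a sketch, assuming GNU grep with PCRE support via -P; the class [^\x00-\x7F] matches anything outside the 7-bit ASCII range):

grep --color='auto' -P -n "[^\x00-\x7F]" *.xml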
In lines 9, 330, 337 and 359, Unicode non-breaking space characters are present.
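To hunt for those specifically (my own addition, not part of the original check), you can match U+00A0 with PCRE in a UTF-8 locale:

grep --color='auto' -n -P '\x{00A0}' install.en.txt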
The particular output you get may be due to grep's support for UTF-8. In a Unicode locale, some of those characters may compare equal to a normal ASCII character. Forcing the C locale will show the expected results in that case:
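(A sketch: the LC_ALL=C prefix forces the C locale for this one invocation, keeping the question's pattern unchanged.)

LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" *.xml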
This Perl command mostly replaces that grep command (the only thing missing being the colors):
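(Reassembled from the switch-by-switch explanation below; the *.xml arguments are my assumption, mirroring the original grep.)

perl -n -e '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' *.xml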
-n: causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed -n or awk.

-e: may be used to enter one line of program.

/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_): if the line contains a character in the range \x80-\xFF, prints the current file's name, the current file's line number, a ":\t^" string and the current line's content.

Output on a sample directory containing the sample file in the question and a file containing only
ààààà
and a newline character:
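(For illustration, my assumption of what one such line looks like: given the print format above, checking just the second file, assumed here to be named aaa.txt, would print the following, with a literal tab coming from \t.)

aaa.txt(1):	^ààààà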