I have a lot of plain text files which come from a Windows environment.
Many of them use a whacky default Windows code-page, which is neither ASCII (7 bits) nor UTF-8.
gvim has no problem opening these files, but gedit fails to do so.
gvim reports the encoding as latin1.
I assume that gvim is making a "smart" assumption about the code-page.
(I believe this code-page still has international variants).
Some questions arise from this:
(1). Is there some way gedit can be told to recognize this code-page?
** NB. [Update] For this point (1), see my answer, below.
** For points (2) and (3), see Oli's answer.
(2). Is there a way to scan the file system to identify these problem files?
(3). Is there a batch converting tool to convert these files to UTF-8?
(.. this old-world text mayhem was actually the final straw which brought me over to Ubuntu... UTF-8 system-wide by default. Brilliant!)
[UPDATE]
** NB: ** I now consider the following Update to be partially irrelevant, because the "problem" files aren't the "problem" (see my answer below).
I've left it here, because it may be of some general use to someone.
I've worked out a rough and ready way to identify the problem files...
The file command was not suitable, because it identified my example file as ASCII... but an ASCII file is 100% UTF-8 compliant...
As I mentioned in a comment below, the test for an invalid first byte of a UTF-8 codepoint is:
- if the first byte (of a UTF-8 codepoint) is between 0x80 and 0xBF (reserved for continuation bytes), or greater than 0xF7 (never a valid lead byte in UTF-8), that is considered an error
I know sed (a bit, via a Win32 port), so I've managed to cobble together a RegEx pattern which finds these offending bytes.
It's an ugly line, so look away now if regular expressions scare you :)
I'd really appreciate it if someone points out how to use hex values in a range [] expression.. I've just used the or operator \|
fqfn="/my/fully/qualified/filename"
sed -n "/\x80\|\x81\|\x82\|\x83\|\x84\|\x85\|\x86\|\x87\|\x88\|\x89\|\x8A\|\x8B\|\x8C\|\x8D\|\x8E\|\x8F\|\x90\|\x91\|\x92\|\x93\|\x94\|\x95\|\x96\|\x97\|\x98\|\x99\|\x9A\|\x9B\|\x9C\|\x9D\|\x9E\|\x9F\|\xA0\|\xA1\|\xA2\|\xA3\|\xA4\|\xA5\|\xA6\|\xA7\|\xA8\|\xA9\|\xAA\|\xAB\|\xAC\|\xAD\|\xAE\|\xAF\|\xB0\|\xB1\|\xB2\|\xB3\|\xB4\|\xB5\|\xB6\|\xB7\|\xB8\|\xB9\|\xBA\|\xBB\|\xBC\|\xBD\|\xBE\|\xBF\|\xF8\|\xF9\|\xFA\|\xFB\|\xFC\|\xFD\|\xFE\|\xFF/p" "${fqfn}"
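On the question of hex values inside a range [] expression: a minimal sketch, assuming GNU sed (which accepts \xHH escapes inside a bracket expression) and a C locale so that sed works byte by byte. The function name find_suspect_lines is made up for illustration. Like the long pattern above, it matches these bytes anywhere on a line, not only in the lead-byte position, so a line holding a valid multi-byte UTF-8 sequence would be reported too.

```shell
# Hypothetical helper: print lines containing bytes 0x80-0xBF or 0xF8-0xFF.
# Assumes GNU sed; \xHH inside [] is a GNU extension, not POSIX.
# LC_ALL=C makes sed treat the input as single bytes, not multi-byte characters.
find_suspect_lines() {
    LC_ALL=C sed -n '/[\x80-\xBF\xF8-\xFF]/p' "$1"
}

# Same placeholder path as above; only run if the file actually exists.
fqfn="/my/fully/qualified/filename"
if [ -f "$fqfn" ]; then
    find_suspect_lines "$fqfn"
fi
```

The two ranges cover exactly the same bytes as the \| alternation, just written more compactly.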
So, I'll now graft this into Oli's batch solution... Thanks Oli!
PS. Here is the invalid UTF-8 byte it found in my sample file ...
"H.Bork, Gøte-borg." ... the "ø" = F8 hex... which is an invalid UTF-8 byte.
iconv is probably what you'll want to use. iconv -l will show you the available encodings and then you can use a couple of commands to recode them all:
If you want to do this with files you don't know the encoding of (because they're all over the place), you want to bring in a few more commands: find, file, awk and sed. The last two are just there to process the output of file.
I've no idea if this actually works so I certainly wouldn't run it from anything but the least important directory you have (make a testing folder with some known ASCII files in). The syntax of find might preclude it from being within a for loop. I'd hope that somebody else with more bash experience could jump in there and sort it out so it does the right thing.
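As a rough illustration of the batch idea (not the original commands from this answer), here is a sketch that swaps the file/awk/sed guessing for a simpler test: iconv itself decides whether a file is already valid UTF-8. The function names, the .utf8 suffix, and the source encoding passed in (WINDOWS-1252 in the example call) are all assumptions; iconv cannot tell you which code-page a file really uses. File names containing newlines are not handled.

```shell
# Hypothetical helper: succeeds if the file is already valid UTF-8
# (plain ASCII counts as valid UTF-8).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: convert every non-UTF-8 file under a directory.
#   $1 = directory to scan
#   $2 = ASSUMED source encoding (a name listed by iconv -l)
# Originals are left untouched; converted copies get a .utf8 suffix.
convert_tree() {
    find "$1" -type f ! -name '*.utf8' | while IFS= read -r f; do
        if ! is_utf8 "$f"; then
            if iconv -f "$2" -t UTF-8 "$f" > "$f.utf8"; then
                echo "converted: $f"
            else
                # Conversion failed part-way: discard the incomplete copy.
                rm -f "$f.utf8"
            fi
        fi
    done
}

# Example call on a scratch directory (the path is a placeholder):
# convert_tree ./testing WINDOWS-1252
```

Writing to a separate copy rather than over the original keeps a wrong encoding guess from destroying anything; check a few converted files by eye before replacing the originals.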
Gedit can detect the correct character set only if it is listed at "File-Open-Character encoding". You can alter this list but keep in mind that the order is important.
I've been thinking about this a bit more...
Yes, the "ø" = 0xF8 hex was definitely the reason why gedit would not open the file...
Why? Because it is not a valid UTF-8 byte.
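A small sketch of what is going on with that byte, assuming iconv and od are available. In Latin-1 (and the Windows Western code-page) "ø" is the single byte 0xF8, while in UTF-8 the same character is the two bytes C3 B8. The sample strings are made up for illustration.

```shell
# "ø" as stored in a Latin-1 file is the single byte 0xF8 (octal 370).
# Converting it to UTF-8 gives the two-byte sequence c3 b8.
printf '\370' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1

# Fed to a strict UTF-8 decoder, the raw 0xF8 byte is rejected.
if printf 'G\370te-borg' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```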
By default, gedit will only open UTF-8 files...
However, gedit does have a codepage auto-detect feature, but you must first Add codepages to its list of "possibles".
The bright red dialog which appears when gedit can't recognize the code-page has a button on it which allows you to Add another codepage...
Problem solved!... almost ...
The gnarly issue now raises its head again.... Which codepage is it?
In my situation, I can reasonably assume that it is the standard English Windows codepage (for my region? or for the region of the file's origin? .. I did mention "gnarly" :)....
Anyhow, gedit will allow you to load a file once you have Added the codepage to its list...
So, although all the Terminal commands are useful and interesting in their own right, it seems that that line of thought was heading up the wrong track.
There is nothing intrinsically wrong in these files...
The issue seems to be purely about codepages.
gedit can open the file, just as gvim can.
...but the relevant codepage must first be Added to its codepage list.
e.g. via the File-Open dialog, or the red warning dialog I encountered.
You can use any of the 3 command lines :