I had some problems with subtitle files in the omxplayer video player. To solve them I had to convert the files from windows-1250 to UTF-8 encoding. My question is: how can I find out which encoding a specific file uses?
You cannot really find out automatically whether a file was originally written with encoding X.

What you can easily do, though, is verify whether the complete file can be successfully decoded somehow (but not necessarily correctly) using a specific codec. If you find any bytes that are not valid for a given encoding, it must be something else.
The problem is that many codecs are similar and share the same "valid byte patterns", just interpreting them as different characters. For example, an `ä` in one encoding might correspond to an `é` in another or an `ø` in a third. The computer can't really detect which interpretation of the bytes results in correctly human-readable text (unless maybe you add dictionaries for all kinds of languages and let it perform spell checks...). You should also know that some character sets are actually subsets of others: ASCII, for example, is contained in most commonly used codecs, such as members of the ANSI family or UTF-8. That means a file saved as UTF-8 that contains only simple Latin characters is identical to the same file saved as ASCII. A quick demonstration of this overlap follows below.
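As a small illustration (the byte value here is just an arbitrary example), one and the same byte can decode successfully, yet to different characters, under different 8-bit codecs:

```python
raw = b"\xf8"                      # a single example byte
print(raw.decode("iso-8859-1"))    # 'ø' in Latin-1
print(raw.decode("windows-1250"))  # 'ř' in Windows-1250
# Both decodes succeed, so the byte alone cannot tell you the encoding.
```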
However, let's get back from explaining what you can't do to what you actually can do:

For a basic check on ASCII vs. non-ASCII (normally UTF-8) text files, you can use the `file` command. It does not know many codecs, though, and it only examines the first few kB of a file, assuming that the rest will not contain any new characters. On the other hand, it also recognizes other common file types like various scripts, HTML/XML documents and many binary data formats (all of which is uninteresting for comparing text files), and it might print additional information, such as whether there are extremely long lines or which newline sequence (e.g. UNIX: LF, Windows: CR+LF) is used.

If that is not enough, I can offer you the Python script I wrote for this answer, which scans complete files and tries to decode them using a specified character set. If it succeeds, that encoding is a potential candidate. Otherwise, if there are any bytes that cannot be decoded with it, you can remove that character set from your list.
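The script itself is not reproduced in this excerpt, but a minimal sketch of that brute-force decode-and-check idea could look like the following (the candidate list is an arbitrary example, not the original script):

```python
#!/usr/bin/env python3
"""Report which candidate encodings can decode a whole file without errors.

A minimal sketch of the approach described above -- not the original
script from the answer. Adjust CANDIDATES to the encodings you expect.
"""
import sys

# Candidate encodings to test; any codec known to Python works here.
CANDIDATES = ["ascii", "utf-8", "windows-1250", "iso-8859-2"]

def possible_encodings(path):
    """Yield every candidate that decodes the complete file cleanly."""
    data = open(path, "rb").read()
    for encoding in CANDIDATES:
        try:
            data.decode(encoding)   # full decode succeeds -> keep candidate
            yield encoding
        except UnicodeDecodeError:  # invalid byte for this codec -> rule out
            pass

if __name__ == "__main__":
    for encoding in possible_encodings(sys.argv[1]):
        print(encoding)
```

Note that a successful decode only means the encoding is *possible*, not that it is the one the file was written with.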
A program named `file` can do this. Example:
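The original answer's sample output is not preserved here; with GNU `file`, a run might look like this (the file name is just an example):

```
$ file -bi mysubtitles.srt
text/plain; charset=utf-8
```

(`file --mime-encoding` prints only the charset, if that is all you need.)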
If you're interested in how it's done, see `src/encoding.c`.

If you're looking for an alternative to `file`, I really recommend detect-file-encoding-and-language! The downside is that it requires some extra steps: you have to have Node.js and NPM installed in order to be able to use it.
You can install Node.js and NPM like this:
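For example, on a Debian/Ubuntu-based system (the original answer's exact commands are not preserved here):

```
sudo apt install nodejs npm
```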
Then install detect-file-encoding-and-language:
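The package is installed globally so its command ends up on your PATH:

```
npm install -g detect-file-encoding-and-language
```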
Finally, detect the encoding like so:
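Assuming the package's `dfeal` command-line tool and an example file path:

```
dfeal "/home/user/Documents/subtitle file.srt"
```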