I have a directory with ~10,000 image files from an external source.
Many of the filenames contain spaces and punctuation marks that are not DB- or web-friendly. I also want to append a SKU number to the end of every filename (for accounting purposes). Many, if not most, of the filenames also contain extended Latin characters, which I want to keep for SEO purposes (specifically, so that the filenames accurately represent the file contents in Google Images).
I have made a bash script which renames (copies) all the files to my desired result. The bash script is saved in UTF-8. When I run it, it omits approximately 500 of the files ("unable to stat file ...").
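Roughly, the script does something like this (a simplified sketch, with a placeholder SKU value):

    #!/bin/bash
    # Sketch: copy each image to a sanitized name with a SKU appended.
    # "12345" is a placeholder; the real SKU comes from an accounting lookup.
    sku="12345"
    for f in *.jpg; do
        base="${f%.*}"
        ext="${f##*.}"
        # Collapse spaces and punctuation to hyphens, leaving extended
        # Latin letters untouched (assumes a UTF-8 locale).
        clean=$(printf '%s' "$base" | sed -E 's/[[:space:][:punct:]]+/-/g')
        cp -- "$f" "${clean}-${sku}.${ext}"
    done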
I have run convmv -f UTF-8 -t UTF-8 on the directory and discovered that these 500 filenames are not encoded in UTF-8 (convmv is able to detect and ignore filenames that are already in UTF-8).
Is there an easy way I can find out which language encoding they are currently using?
The only way I've been able to figure out myself is by setting my terminal encoding to UTF-8, then iterating through all the likely candidate encodings with convmv until it displays a converted name that 'looks right'. I have no way to be certain that these 500 files all use the same encoding, so I would need to repeat this process 500 times. I would like a more automated method than 'looks right' !!!
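For reference, that manual approach amounts to a loop roughly like this (the candidate encodings are only examples; convmv just previews the renames unless --notest is given):

    # Preview each candidate encoding and eyeball which output 'looks right'.
    for enc in ISO-8859-1 ISO-8859-15 CP1252; do
        echo "=== $enc ==="
        convmv -f "$enc" -t UTF-8 -r . 2>/dev/null | head -n 20
    done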
There's no 100% accurate way really, but there's a way to give a good guess.
There is a Python library, chardet, which is available here: https://pypi.python.org/pypi/chardet
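If it isn't already installed, it can usually be installed with pip:

    pip install chardet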
e.g.
See what the current LANG variable is set to:
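    $ echo $LANG      # your value will probably differ
    en_US.UTF-8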
Create a filename that'll need to be encoded with UTF-8:
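    $ touch "mañana.txt"   # example name containing a non-ASCII character (ñ)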
Change our encoding and see what happens when we try to list it:
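    $ export LANG=C
    $ ls m*                # exactly how the name renders depends on your ls version
    ma??ana.txt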
OK, so now we have a filename encoded in UTF-8 and our current locale is C (standard Unix codepage).
So start up Python, import chardet, and get it to read the filename. I'm using some shell globbing (i.e. expansion through the * wildcard character) to get my file. Change "ls m*" to whatever will match one of your example files.
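A sketch of that session (Python 2 era, where os.popen().read() returns a byte string; the confidence value shown is only illustrative):

    $ python
    >>> import os
    >>> import chardet
    >>> chardet.detect(os.popen("ls m*").read())
    {'confidence': 0.83, 'encoding': 'utf-8'}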
As you can see, it's only a guess; the "confidence" value shows how good a guess it is.
You may find this useful for testing the current working directory (Python 2.7):
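    # A sketch: guess the encoding of every filename in the current directory.
    # In Python 2.7, os.listdir('.') returns byte strings, which chardet.detect() accepts.
    import os
    import chardet

    for filename in os.listdir('.'):
        print filename, chardet.detect(filename)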
The result looks something like this (the filenames and confidence values shown are illustrative):
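    some-ascii-name.jpg {'confidence': 1.0, 'encoding': 'ascii'}
    some-latin1-name.jpg {'confidence': 0.73, 'encoding': 'ISO-8859-1'}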
To recurse through the path from the current directory, cut and paste this into a little Python script:
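    # A sketch, again for Python 2.7: walk the tree below the current
    # directory and guess the encoding of each filename it finds.
    import os
    import chardet

    for root, dirs, names in os.walk('.'):
        for name in names:
            print os.path.join(root, name), chardet.detect(name)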
Landing here in 2021 using Python 3, I found the answers from @philip-reynoldsn and @klaus-kappel useful but no longer functional, as chardet.detect() expects a bytes-like object. I slightly edited the code to get the encoding of all files in the current working directory, as follows:
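    # A sketch for Python 3: ask for the listing as bytes, so that
    # chardet.detect() receives the bytes-like objects it expects.
    import os
    import chardet

    for name in os.listdir(b'.'):
        print(name, chardet.detect(name))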