I have files with invalid characters in their names, like this one:
009_-_�%86ndringshåndtering.html
It is an Æ where something has gone wrong in the filename.
Is there a way to just remove all invalid characters, or could tr be used somehow?
echo "009_-_�%86ndringshåndtering.html" | tr ???
I had some Japanese files with broken filenames recovered from a broken USB stick, and the solutions above didn't work for me.
I recommend the detox package.
Example usage:
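Something like this cleans a whole tree (the path is a placeholder; -r recurses, -v prints each rename, and -n only previews):

detox -r -v -n /path/to/your/files    # dry run: show what would be renamed
detox -r -v /path/to/your/files       # actually rename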
One way would be with sed:
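For a single file, something along these lines should do it (bash; file is just a placeholder name):

mv "file" "$(echo "file" | sed -e 's/[^A-Za-z0-9._-]/_/g')"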
Replace file with your filename, of course. This will replace anything that isn't a letter, number, period, underscore, or dash with an underscore. You can add or remove characters to keep as you like, and/or change the replacement character to anything else, or to nothing at all.

I assume you are on a Linux box and the files were made on a Windows box. Linux uses UTF-8 as the character encoding for filenames, while Windows uses something else. I think this is the cause of the problem.
I would use "convmv". This is a tool that can convert filenames from one character encoding to another. For Western Europe one of these normally works:
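For example (a sketch; yourfile is a placeholder, pick the source encoding that matches where the files came from, and note that convmv only previews changes unless you pass --notest):

convmv -f windows-1252 -t utf-8 --notest yourfile
convmv -f iso-8859-1 -t utf-8 --notest yourfile
convmv -f cp850 -t utf-8 --notest yourfile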
If you need to install it on a Debian-based Linux, you can do so by running:
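sudo apt-get install convmv    # convmv is the package name on Debian/Ubuntu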
It works for me every time and it does recover the original filename.
Source: LeaseWebLabs
I assume you mean you want to traverse the filesystem and fix all such files?
Here's the way I'd do it:
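Roughly like this (a sketch; /path/to/files is a placeholder, and it rewrites any name containing bytes outside printable ASCII):

find /path/to/files -depth -print0 | while IFS= read -r -d '' f; do
    dir=$(dirname "$f")
    name=$(basename "$f")
    # Map every byte outside the printable ASCII range to an underscore.
    new=$(printf '%s' "$name" | LC_ALL=C tr -c ' -~' '_')
    if [ "$name" != "$new" ]; then
        mv -- "$f" "$dir/$new"
    fi
done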
That would find all files with non-ASCII characters in their names and replace those characters with underscores (_). Use caution, though: if a file with the new name already exists, it will be overwritten. The script could be modified to check for such a case, but I didn't put that in, to keep it simple.

Following the answers at https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters, you can use:
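For instance, with the Perl rename (on some distributions the command is file-rename or prename) and the non-ASCII character class from that question:

rename 's/[^\x00-\x7F]//g' *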
where * matches the files you want to rename. If you want to do it over multiple directories, you can do something like the find variant sketched below. You can use the -n argument to rename to do a dry run and see what would be changed, without changing it.
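A sketch of that multi-directory variant (again assuming the Perl rename):

find . -type f -exec rename 's/[^\x00-\x7F]//g' {} +    # put -n right after rename for a dry run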
This shell script sanitizes a directory recursively, to make files portable between Linux/Windows and FAT/NTFS/exFAT. It removes control characters, /:*?"<>\| and some reserved Windows names like COM0. Linux is less restrictive in theory (only / and \0 are strictly forbidden in filenames), but in practice several characters interfere with bash commands (like *...), so they should also be avoided in filenames. There are great sources on file naming restrictions worth consulting for the details.
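A sketch of such a script (bash with GNU find and coreutils assumed; target_dir is a placeholder):

#!/bin/bash
# Recursively sanitize names under target_dir for FAT/NTFS/exFAT portability.
find "target_dir" -mindepth 1 -depth -print0 | while IFS= read -r -d '' path; do
    dir=$(dirname "$path")
    name=$(basename "$path")
    # Drop control characters, replace the Windows-forbidden set : * ? " < > \ |
    # with underscores, and strip trailing dots/spaces (not allowed on FAT/exFAT).
    new=$(printf '%s' "$name" | tr -d '\000-\037\177' \
          | sed -e 's/[*?"<>|:]/_/g' -e 's/\\/_/g' -e 's/[. ]*$//')
    # Reserved Windows device names (CON, PRN, AUX, NUL, COM0-9, LPT0-9) get a prefix.
    base=${new%%.*}
    case "${base^^}" in
        CON|PRN|AUX|NUL|COM[0-9]|LPT[0-9]) new="_$new" ;;
    esac
    [ -z "$new" ] && new="_"
    if [ "$new" != "$name" ] && [ ! -e "$dir/$new" ]; then
        mv -- "$path" "$dir/$new"
    fi
done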
I use this one-liner to remove invalid characters in subtitle files:
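Something along these lines (a sketch; it assumes .srt files in the current directory and keeps only letters, digits, dots, and dashes):

for file in *.srt; do new=$(echo "$file" | sed -e 's/[^A-Za-z0-9.-]//g'); [ "$new" != "$file" ] && mv -n -- "$file" "$new"; done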
It works to normalize directory names of movies:
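For example (a sketch: anything that isn't a letter, digit, or dash becomes a dot, runs of dots are squeezed, and the extra sed at the end strips a trailing dot):

for dir in */; do
    dir=${dir%/}
    new=$(printf '%s' "$dir" | sed -e 's/[^A-Za-z0-9-]/./g' -e 's/\.\.*/./g' -e 's/\.$//')
    [ "$new" != "$dir" ] && mv -n -- "$dir" "$new"
done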
Same steps as above, but I added one more sed command to remove a period at the end of the directory name. For example, the directory
X-Men Days of Future Past (2014) [1080p]
is modified to:
X-Men.Days.of.Future.Past.2014.1080p
If you want to handle embedded newlines, multibyte characters, spaces, leading dashes, and backslashes, you are going to need something more robust; see this answer:
https://superuser.com/a/858671/365691
I put the script up on code.google.com if anyone is interested: r-n-f-bash-rename-script
I know this is a bit old, but I recently discovered that Google's translate-shell really helps with foreign-named files whose Unicode names choke other tools. It's helpful for batch renaming with translation in the shell.
https://github.com/soimort/translate-shell
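A rough sketch of that kind of batch rename (trans is the command translate-shell installs; :en sets English as the target language; the echo just previews the moves):

for file in *; do
    new=$(trans -b :en "$file")
    echo mv -- "$file" "$new"
done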
[UPDATE] The Google Translate API tends to block you if you hit it too many times, but I also found a convenient local option called uconv that converts between alphabets. It helps phonetically, but it isn't translation:
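For example, transliterating names to plain Latin/ASCII with ICU's uconv (a sketch; on Debian/Ubuntu the tool is in the icu-devtools package):

for file in *; do
    new=$(printf '%s' "$file" | uconv -f UTF-8 -t UTF-8 -x 'Any-Latin; Latin-ASCII')
    [ "$new" != "$file" ] && mv -n -- "$file" "$new"
done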
If you just want to strip everything outside a plain ASCII whitelist instead, a loop like this also works (note the quotes around the command substitution):

for file in *; do mv -- "$file" "$(echo "$file" | sed -e 's/[^A-Za-z0-9.-]//g')"; done