Weird filename encoding issues on windows
Time: 02/08/16 4:51:36 am
Tags: Encoding, UTF16, NTFS [1], UTF8 [2], Convert [4], Cygwin [7], Windows [20], Quick Fixes [33]

Somehow all of the file names in my Rammstein music directory, and some in my Daft Punk directory, had their characters with diacritics replaced by an invalid character. I pasted one such filename into a hex editor to see what the problem was. First, I should note that Windows encodes its filenames (and pretty much everything else) in UTF-16. Most of the rest of the world has settled on UTF-8, which is a much better encoding for many reasons. So at some point in the directories’ lifetime, a file copy or conversion must have done a freakish (utf16*)(utf16->utf8) rename, or something to that effect. I noticed that all I needed to do was replace the first 2 bytes of each damaged character with a single different byte, namely “EF 8x” with “Cx”; the remaining bytes of the character were fine. So if anyone ever needs it, here is the bash script.

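# Match raw bytes, not UTF-8 characters.
export LC_ALL=C
find . -type f -print0 | while IFS= read -r -d '' FROM; do
	DIR=${FROM%/*}
	NAME=${FROM##*/}
	# Turn every "EF 8x" pair in the file name into the single byte "Cx" (0x8x + 0x40).
	FIXED=$(printf '%s' "$NAME" | perl -pe 's/\xEF([\x80-\x8F])/pack("C", ord($1)+(0xC0-0x80))/ge')
	[ "$NAME" = "$FIXED" ] && continue
	TO="$DIR/$FIXED"
	echo "Renaming '$FROM' to '$TO'"
	# -n refuses to overwrite an existing file.
	mv -n -- "$FROM" "$TO"
done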

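I may need to expand the match beyond the x80-x8F range, but I am unsure at this point. I have only confirmed x82 and x83. Note that x80 and x81 would turn into C0 and C1, which are never valid lead bytes in UTF-8, so a match on those would need a closer look.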

