Bash: Fixing an ASCII text file contaminated with Unicode character sequences

File encoding issues can be difficult to diagnose and troubleshoot.  Most files in the operations world are expected to be plain 7-bit ASCII text, so if a file ends up UTF-8 encoded with embedded Unicode characters, it can throw off the tool chain or the systems consuming the file.

Below are two example files.  We can see how their encodings differ, and then determine the embedded Unicode character that might be hidden from your editor.

# create two files, one with embedded unicode
$ printf 'Hello, World!' > test-ascii.txt
$ printf 'Hello,\xE2\x98\xA0World!' > test-utf8.txt

# show file types
$ file -bi test-ascii.txt
text/plain; charset=us-ascii

$ file -bi test-utf8.txt
text/plain; charset=utf-8

# conversion fails and reports the position of the first non-ASCII sequence
$ iconv -f UTF-8 -t ASCII test-utf8.txt
Hello,iconv: illegal input sequence at position 6
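As a quick alternative to iconv, grep can flag the exact lines containing non-ASCII bytes.  This sketch assumes GNU grep, whose -P flag enables Perl-compatible byte classes:

```shell
# recreate the sample file with an embedded U+2620 (skull) character
printf 'Hello,\xE2\x98\xA0World!\n' > test-utf8.txt

# -n prints line numbers; [^\x00-\x7F] matches any byte outside 7-bit ASCII
grep -nP '[^\x00-\x7F]' test-utf8.txt

# dump the raw bytes to see the e2 98 a0 UTF-8 sequence
od -An -tx1 test-utf8.txt
```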

You can open the file in an editor of choice and delete these embedded Unicode characters.  But beware: unless you are in hex mode, you might not see some Unicode characters (e.g. a BOM, byte-order mark).
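A UTF-8 BOM, for example, is three bytes (EF BB BF) that most editors render as nothing at all, yet it is enough to change the detected charset.  A small sketch:

```shell
# write a file whose only non-ASCII content is a leading UTF-8 BOM
printf '\xEF\xBB\xBFHello, World!\n' > test-bom.txt

# the text looks identical to the ASCII file, but the charset differs
file -bi test-bom.txt

# dump the first three bytes to reveal the hidden ef bb bf marker
od -An -tx1 -N3 test-bom.txt
```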

To convert instead, you can use iconv to drop or transliterate these characters and produce an ASCII version of the file.

# throw away unicode sequences, ignore stderr
iconv -f UTF-8 -t ASCII//IGNORE test-utf8.txt > test-utf-to-ascii.txt

# attempt transliteration of unicode sequences, falls back to '?' if no mapping
iconv -f UTF-8 -t ASCII//TRANSLIT test-utf8.txt > test-utf-to-ascii.txt

# validate that file encoding changed
file -bi test-utf-to-ascii.txt
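The detect-then-convert steps above can be wrapped in a small helper.  clean_ascii is a hypothetical function name, and the sketch assumes glibc iconv, where //TRANSLIT falls back to '?' for characters with no ASCII mapping:

```shell
# hypothetical helper: copy pure-ASCII files as-is, transliterate anything else
clean_ascii() {
  src=$1 dst=$2
  if file -bi "$src" | grep -q 'charset=us-ascii'; then
    cp "$src" "$dst"                                   # nothing to convert
  else
    iconv -f UTF-8 -t ASCII//TRANSLIT "$src" > "$dst"  # '?' for unmappable chars
  fi
}

printf 'Hello,\xE2\x98\xA0World!\n' > in.txt   # embedded U+2620 (skull)
clean_ascii in.txt out.txt
file -bi out.txt                               # should now report charset=us-ascii
```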

Here is a link to my supporting GitHub bash script.

REFERENCES

iconv man page showing //IGNORE and //TRANSLIT

Stack Exchange, using vim as a hex editor with ":%!xxd" and converting back with ":%!xxd -r"

Stack Overflow, showing file encoding type using 'file -bi'

tecmint.com, iconv conversions