Re: Need help with multibyte UTF-8 characters




Thank you for your advice on setting my locale to en_US.UTF-8.  Unfortunately, Cygwin still has trouble displaying some three-byte UTF-8 characters.  For example, see the following excerpt from a sed script that converts percent-encoded (URL-encoded) filenames to UTF-8.  As you can see, it handles one- and two-byte encodings correctly, but some three-byte encodings fail: the en dash, the em dash, and the horizontal ellipsis are all displayed as a filled-in rectangle.

# Match longest strings first

# Three-byte encodings:

# En dash
s/%[Ee]2%80%93/–/g

# Em dash
s/%[Ee]2%80%94/—/g

# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g

# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g

# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g

# Two-byte encodings:

# Non-break space
#s/%[Cc]2%[Aa]0/⎵/g

# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g

# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g

# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g

# Lowercase i with acute accent
s/%[Cc]3%[Aa][Dd]/í/g

# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g
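
One way to narrow this down might be to look at the bytes sed actually writes, rather than at what the terminal draws. The sketch below applies only the en dash rule from the script above and prints the result in hex. The function name dump_converted is made up for illustration, and the sketch assumes the script file itself is saved as UTF-8. If the output contains "e28093", the conversion is probably fine and the rectangle may instead come from the terminal font lacking that glyph; that is a guess, not something this check can confirm.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): apply the en dash
# rule to its first argument and print the resulting bytes as one hex string.
# LC_ALL=C makes sed and od handle the data as raw bytes, so locale and
# terminal font settings play no part in the result.
dump_converted() {
    printf '%s' "$1" \
        | LC_ALL=C sed 's/%[Ee]2%80%93/–/g' \
        | LC_ALL=C od -An -tx1 \
        | tr -d ' \n'
    echo
}

# Sample input: the letter a, a percent-encoded en dash, the letter b.
dump_converted 'a%E2%80%93b'
```

With a UTF-8 script file, the sample call should print 61e2809362, that is, "a" (61), the three bytes of U+2013 (e2 80 93), and "b" (62). The same check can be repeated for the em dash and ellipsis rules by swapping in the corresponding s/// command.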



--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple