Re: UTF-8 character encoding




On 6/24/18, L A Walsh <cygwin@xxxxxxxxx> wrote:
> Lee wrote:
>> So... keep it simple, set
>>   LANG=en_US.UTF-8
>> and use vi or something else that comes with cygwin to create the file
>> and I'll have a file with UTF-8 character encoding - correct?
> ---
> 	The first 128 characters of UTF-8 (0x00 - 0x7f) are identical to
> ASCII, and to the first 128 characters of latin1 (iso-8859-1).
>
> If you don't use any characters that need accents or special symbols,
> then nothing will need a multibyte UTF-8 encoding, because it's only
> the characters OVER the first 128 that are encoded differently
> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
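
A quick way to check the point above is to encode the same text in several encodings and compare the bytes. This is only a sketch using Python's built-in codecs; the sample strings are made up for illustration and are not from the thread.

```python
# Sample strings are illustrative assumptions, not taken from the thread.
ascii_text = "plain text"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e (U+00E9)

# Pure ASCII text produces the same bytes in all three encodings.
assert ascii_text.encode("ascii") == ascii_text.encode("latin-1")
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A character above 0x7f is one byte in Latin-1 but two bytes in UTF-8.
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")
print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'
```

So a file containing only ASCII characters is already valid UTF-8, byte for byte.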

I'm still trying to figure UTF-8 out, but it seems to me that 0x00 -
0x7f is part of the UTF-8 encoding too; those characters just take a
single byte each.  This chart makes things clearer ... at least for me :)
    http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
 The proposed UCS transformation format encodes UCS values in the range
 [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
 bytes.  For all encodings of more than one byte, the initial byte
 determines the number of bytes used and the high-order bit in each byte
 is set.

 An easy way to remember this transformation format is to note that the
 number of high-order 1's in the first byte is the same as the number of
 subsequent bytes in the multibyte character:

 Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
 1       7    00000000  0000007f  0zzzzzzz
 2      13    00000080  0000207f  10zzzzzz 1yyyyyyy
 3      19    00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
 4      25    00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
 5      31    02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv
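
As far as I can tell, the table quoted above is the earlier proposal recorded in that history file, not the layout that was eventually standardized. In UTF-8 as defined by RFC 3629, the number of leading 1 bits in the first byte equals the total length of the sequence, every continuation byte looks like 10xxxxxx, and sequences stop at 4 bytes (U+10FFFF). Below is a minimal sketch of that standardized layout; the function name utf8_encode is my own, and the result is checked against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the RFC 3629 UTF-8 layout.

    Hypothetical helper for illustration; in real code use str.encode("utf-8").
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against the built-in encoder for one sample per sequence length.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
# U+0041 -> 41
# U+00E9 -> c3 a9
# U+20AC -> e2 82 ac
# U+1F600 -> f0 9f 98 80
```

Note that the single-byte case is exactly the ASCII range 0x00 - 0x7f, which is why an ASCII-only file is unchanged.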

Thanks
Lee
