
Re: utf


On Tue, Apr 03, 2018 at 09:14:22PM +1200, Richard Hector wrote:
> On 03/04/18 20:55, Darac Marjal wrote:
> > If these things matter to you, it's better to convert from UTF-8 to
> > Unicode, first. I tend to think of Unicode as an arbitrarily large code
> > page. Each character maps to a number, but that number could be 1, 1000
> > or 500_000 (Unicode seems to be growing without an end in sight).
> > Internally, you might store those code points as Integers or Quad Words
> > or whatever you like. Only once you're ready to transfer the text to
> > another process (print on screen, save to a file, stream across a
> > network), do you convert the Unicode back into UTF-8.
> > 
> > Basically, you consider UTF-8 to be a transfer-only format (like
> > Base64). If you want to do anything non-trivial with it, decode it into
> > Unicode.
> Eh? UTF-8 is an encoding of Unicode. You can't "convert UTF-8 to
> Unicode" - it already is Unicode. You could convert it to another
> encoding, eg UTF-16 or UTF-32. Perhaps UTF-32 is what you mean, being
> fixed-width.

I think Darac was talking about UTF-32 [1], which is a fixed-width encoding
of Unicode. Yes, Unicode is strictly speaking the abstract "mapping" between
integers ("code points") and characters. A computer has no integers...
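
To make the distinction concrete, here is a minimal Python sketch (the sample string is just an assumption picked for illustration). A Python str already behaves like a sequence of code points, and the encodings are only different byte layouts of those same integers:

```python
# Sample text written with escapes so the exact code points are unambiguous:
# "naive" with U+00EF (i with diaeresis), a space, then U+4E2D and U+6587.
text = "na\u00efve \u4e2d\u6587"

# The abstract integers ("code points") behind the characters.
code_points = [ord(ch) for ch in text]

utf8 = text.encode("utf-8")       # variable width: 1 to 4 bytes per code point
utf32 = text.encode("utf-32-be")  # fixed width: 4 bytes per code point, no BOM

print(code_points)
print(len(text), len(utf8), len(utf32))

# Decoding either byte string gives back the same code points.
assert utf8.decode("utf-8") == utf32.decode("utf-32-be") == text
```

So "decoding UTF-8 into Unicode" in Darac's sense is roughly what .decode() does here; what you hold afterwards is code points, however the language chooses to store them.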

What's curious is that there's no UTF-24 (although Unicode currently has
all its code points below 2^21). That would make for a slightly more
compact fixed-width encoding.
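
Just to illustrate the idea: a rough sketch of what such a "UTF-24" could look like. This is purely hypothetical; no such encoding is standardised, and the function names and the big-endian byte order are my own assumptions:

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode currently defines


def encode_utf24(text: str) -> bytes:
    """Hypothetical 'UTF-24': every code point as 3 big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def decode_utf24(data: bytes) -> str:
    """Inverse of encode_utf24; rejects input that is not a multiple of 3 bytes."""
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )


# All code points fit in 21 bits, so 3 bytes (24 bits) are enough.
assert MAX_CODE_POINT < 2**21

sample = "a\u4e2d\U0001f600"  # one ASCII, one CJK, one non-BMP character
packed = encode_utf24(sample)
assert decode_utf24(packed) == sample

# 3 bytes per code point instead of 4.
print(len(packed), len(sample.encode("utf-32-be")))
```

The price is that 3-byte units are not naturally aligned, which is presumably part of why nobody bothered.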

I think these days fixed-width encodings are losing their charm a bit,
since memory access is getting much more expensive relative to CPU cycles.

Things might change once again when the Chinese dominate culturally,
since UTF-8 shows its advantage only with ASCII-dominated text.
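
A quick way to see this for yourself, as a small Python sketch (the two sample strings and the helper name encoded_sizes are arbitrary choices of mine, and real text will of course be more mixed):

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in each encoding (big-endian variants, so no BOM)."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(text.encode(enc)) for enc in encodings}


english = "hello world"               # 11 ASCII characters
chinese = "\u4f60\u597d\u4e16\u754c"  # 4 CJK characters from the BMP

for label, text in (("english", english), ("chinese", chinese)):
    print(label, len(text), encoded_sizes(text))
```

For the ASCII string UTF-8 needs one byte per character; for the Chinese one it needs three per character, where UTF-16 gets by with two.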

But perhaps by then another encoding will make more sense. Or UTF-24
is simply born, for a 25% saving over UTF-32 :-)


[1] https://en.wikipedia.org/wiki/UTF-32
-- tomás
