-----BEGIN PGP SIGNED MESSAGE-----
On Tue, Apr 03, 2018 at 09:14:22PM +1200, Richard Hector wrote:
> On 03/04/18 20:55, Darac Marjal wrote:
> > If these things matter to you, it's better to convert from UTF-8 to
> > Unicode, first. I tend to think of Unicode as an arbitrarily large code
> > page. Each character maps to a number, but that number could be 1, 1000
> > or 500_000 (Unicode seems to be growing without end in sight).
> > Internally, you might store those code points as Integers or Quad Words
> > or whatever you like. Only once you're ready to transfer the text to
> > another process (print on screen, save to a file, stream across a
> > network), do you convert the Unicode back into UTF-8.
> > Basically, you consider UTF-8 to be a transfer-only format (like
> > Base64). If you want to do anything non-trivial with it, decode it into
> > Unicode.
> Eh? UTF-8 is an encoding of Unicode. You can't "convert UTF-8 to
> Unicode" - it already is Unicode. You could convert it to another
> encoding, eg UTF-16 or UTF-32. Perhaps UTF-32 is what you mean, being
I think Darac was talking about UTF-32, which is a fixed-width encoding
of Unicode. Yes, Unicode is strictly speaking the abstract "mapping" between
integers ("code points") and characters. A computer has no integers...
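To make the distinction concrete, here is a minimal Python sketch of the
round trip described above: UTF-8 bytes are decoded into code points, and
the same text is then written out as UTF-32. The sample string and the
variable names are just my own illustration, nothing more.

```python
# Sample text mixing ASCII, a Latin-1 letter and a CJK character.
text = "tomás 中"

# UTF-8: variable width, 1 to 4 bytes per code point.
utf8_bytes = text.encode("utf-8")

# "Converting UTF-8 to Unicode": decode the bytes, then look at the
# abstract integers (code points) behind each character.
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# UTF-32: fixed width, always 4 bytes per code point.
# "utf-32-be" is used so that no byte order mark is prepended.
utf32_bytes = text.encode("utf-32-be")

print(len(utf8_bytes), len(code_points), len(utf32_bytes))
print([hex(cp) for cp in code_points])
```

With a fixed-width encoding, the n-th character sits at byte offset 4*n;
with UTF-8 you have to scan from the start to find it.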
What's curious is that there's no UTF-24 (although Unicode currently has
all its code points below 2^21). That would make for a slightly more
compact fixed-width encoding.
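Just for fun, a sketch of what such an encoding could look like. To be
clear, "UTF-24" does not exist as a standard; the functions encode_utf24
and decode_utf24 below are made up for this mail and simply pack each
code point into three big-endian bytes.

```python
# Highest code point Unicode currently allows.
MAX_CODE_POINT = 0x10FFFF

# It fits in 21 bits, so 24 bits (3 bytes) are more than enough.
assert MAX_CODE_POINT < 2**21


def encode_utf24(text):
    """Hypothetical 'UTF-24': 3 big-endian bytes per code point."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def decode_utf24(data):
    """Inverse of encode_utf24; rejects input that is not a multiple of 3."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3 bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )


sample = "tomás 中"
packed = encode_utf24(sample)
print(len(packed), decode_utf24(packed) == sample)
```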
I think these days fixed-width encodings are losing their charm a bit,
since memory access keeps getting more expensive relative to CPU cycles.
Things might change once again when the Chinese dominate culturally,
since UTF-8 plays out its advantage only with ASCII-dominated text.
But perhaps then, another encoding will make more sense. Or UTF-24
is simply born, for a 25% saving over UTF-32 :-)
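A rough way to check the size argument, in Python. The two sample strings
are my own picks, and the "utf-24" row is purely hypothetical: it is
computed as 3 bytes per code point, since no such codec exists.

```python
def sizes(text):
    """Byte counts of text in several encodings (BOM-free variants)."""
    n = len(text)  # number of code points in Python 3
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-24 (hypothetical)": 3 * n,
        "utf-32": len(text.encode("utf-32-be")),
    }


ascii_text = "Hello, world"
chinese_text = "你好，世界"

for label, text in (("ASCII", ascii_text), ("Chinese", chinese_text)):
    print(label, sizes(text))
```

For these samples, the 3-byte form is always 25% smaller than UTF-32. For
the Chinese sample it merely ties with UTF-8, and UTF-16 comes out
smaller than both, so the case for a new encoding is perhaps weaker than
my smiley suggests.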
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
-----END PGP SIGNATURE-----