Re: utf8 options under Mysql
- Date: Fri, 22 Apr 2016 10:49:30 +0200
- From: Jigal van Hemert <jigal@xxxxxxxxx>
- Subject: Re: utf8 options under Mysql
On 22/04/2016 04:50, Martin Mueller wrote:
A collation contains definitions for sorting order and comparison. For
most purposes one wants "crème brûlée" to be the same as "creme brulee".
For unicode characters these rules can be complex. A character set (in
your case UTF-8) defines which character can be stored.
MySQL has a bewildering variety of unicode collation choices. Most of them are language specific, but what is the difference between "utf8-general-ci", "utf8-unicode-ci", and "utf8-unicode-520-ci." Do they differ in the range of characters they can handle or is it just a matter of the cort order. I understand that utf8-bin is different because it is case sensitive, but the other differences elude me.
Under what circumstances does it make a difference to use on or the other? I work with a lot of Early Modern print data and the weird symbols of various kinds they use. I've had trouble at times with the "utf8-general-ci" setting, but it may have been more a matter of settings on my front end tool than of the choice of this rather than unicode collation.
Under character sets, there is just one utf8 setting. The simplest way to make sense of the choices would be to say that given a character set (utf8) the collation only makes a difference to the sort but makes no difference to what can be displayed. Is that correct.
utf8-general-ci contains a simplified version of those conversion rules.
It works for a lot of Western European languages very well, but in some
cases there are problems. For Asian languages there are a lot more
problems. For example, 'ß' isn't considered the same as 'ss'.
utf8-unicode-ci has more complex rules and works fine for more
languages. Due to the more complex rule set it is a bit slower than
utf8-unicode-520-ci uses a newer version of the rule set that is used in
Other utf8-* collations may contain specific rules for specific languages
utf8-general-ci is the default collation for utf-8 in MySQL. If you use
literal strings MySQL may assume that these have the default collation
and comparing them to columns with other collations or performing things
like cast operations may produce errors about invalid combinations of
Met vriendelijke groet,
Jigal van Hemert.
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql