Re: G_UTF8String: Boxed Type Proposal
- Date: Thu, 17 Mar 2016 11:26:21 -0700
- From: "Jasper St. Pierre" <jstpierre@xxxxxxxxxxx>
- Subject: Re: G_UTF8String: Boxed Type Proposal
I'll also ask what "character" means in this case, even though I know
GLib has the same confusion. Are you talking about the number of
Unicode code points in the string, or the number of grapheme clusters
as defined by Unicode TR29? The number of code points isn't useful
for editing in all cases, even after NFC normalization: some grapheme
clusters simply have no single-code-point representation.
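
To make the distinction concrete, here is a minimal sketch in plain C (no
GLib), assuming the input is already valid UTF-8. The helper names
count_code_points and print_counts are made up for illustration. The sample
string is "q" followed by U+0303 COMBINING TILDE, which, as far as I know, has
no precomposed form, so NFC leaves it as two code points even though a user
sees one character.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated string.
 * Assumption: the string is already valid UTF-8, so every byte that is
 * not a continuation byte (bit pattern 10xxxxxx) starts a new code point. */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Print the byte length and the code point count side by side. */
void print_counts(const char *label, const char *s)
{
    printf("%s: %zu bytes, %zu code points\n",
           label, strlen(s), count_code_points(s));
}
```

Calling print_counts("q+tilde", "q\xCC\x83") reports 3 bytes and 2 code
points, while print_counts("e-acute", "\xC3\xA9") reports 2 bytes and 1 code
point. Neither number is the grapheme cluster count; that needs the
segmentation rules from TR29, which this sketch does not attempt.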
On Thu, Mar 17, 2016 at 11:18 AM, Randall Sawyer wrote:
> On 03/17/2016 10:39 AM, Randall Sawyer wrote:
>> On 03/17/2016 09:30 AM, Matthias Clasen wrote:
>>>> I believe that you haven't found such a proposal because most people
>>>> don't see much use in a separate boxed type for utf8 strings. Every
>>>> string we pass around in GLib and GTK+, and every char * in their APIs
>>>> is expected to be in utf8. The few exceptions to this rule are
>>>> explicitly documented.
>>> GLib already provides a number of utilities for dealing with utf8
>>> strings in terms of characters, such as g_utf8_strlen,
>>> g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
>>> adding to that list, if there are glaring omissions.
>> Here is the vision: Once raw string data - or a gunichar value - has been
>> passed into, and validated during, the construction of a "G_UTF8String"
>> structure, the contents of two or more of these can be easily combined
>> without the need for additional measuring or validating.
> Alright Matthias, after your thoughtful response, I have come to the
> following conclusion: When considering management of dynamically allocated
> UTF-8 strings, there are actually two points to consider: 1) Whether the
> byte sequences are valid per IETF RFC 3629 Section 4 - and - 2) The number
> of distinct characters represented in the string vs. the total number of
> bytes used to represent these.
> If someone were to write a widget library or an application using libraries
> which ensure valid UTF-8 as input - Gdk key-press events and GtkIMContexts
> for example - then it wouldn't make sense to run those strings through yet
> another course of validation. That addresses the first issue.
> There is still the question of character length vs. byte length.
> Therefore - from what you have told me - I will be sure to present methods
> which feature validation as an option and not as the rule.
> Thank you.
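
For readers following the thread, the idea discussed above (cache both
lengths once, make validation optional, then combine strings without
re-measuring) could be sketched roughly as follows. This is only an
illustration under stated assumptions: Utf8Str, utf8_seq_len, utf8str_init,
utf8str_concat and utf8str_clear are hypothetical names, not GLib API and not
the proposal's actual design. The validation branch follows the byte ranges
in the RFC 3629 Section 4 grammar.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration only; not part of GLib. */
typedef struct {
    char  *data;     /* NUL-terminated UTF-8 bytes, owned by the struct */
    size_t n_bytes;  /* byte length, excluding the NUL */
    size_t n_chars;  /* number of code points (not grapheme clusters) */
} Utf8Str;

/* Length (1..4) of the well-formed sequence at p per the RFC 3629
 * Section 4 grammar, or 0 if the bytes are malformed or truncated. */
size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for 2nd byte */
    size_t need, i;

    if (avail == 0) return 0;
    if (p[0] <= 0x7F) return 1;
    if (p[0] >= 0xC2 && p[0] <= 0xDF) {
        need = 2;
    } else if (p[0] >= 0xE0 && p[0] <= 0xEF) {
        need = 3;
        if (p[0] == 0xE0) lo = 0xA0;     /* reject overlong forms */
        if (p[0] == 0xED) hi = 0x9F;     /* reject surrogates */
    } else if (p[0] >= 0xF0 && p[0] <= 0xF4) {
        need = 4;
        if (p[0] == 0xF0) lo = 0x90;     /* reject overlong forms */
        if (p[0] == 0xF4) hi = 0x8F;     /* reject values above U+10FFFF */
    } else {
        return 0;                        /* 0x80-0xC1 and 0xF5-0xFF */
    }
    if (avail < need) return 0;
    if (p[1] < lo || p[1] > hi) return 0;
    for (i = 2; i < need; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;
    return need;
}

/* Copy s into out and record both lengths. With validate == true the
 * bytes are checked and false is returned on malformed input. With
 * validate == false the caller vouches for validity and only the code
 * points are counted. Also returns false if allocation fails. */
bool utf8str_init(Utf8Str *out, const char *s, bool validate)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len = strlen(s), chars = 0, i = 0;

    if (validate) {
        while (i < len) {
            size_t n = utf8_seq_len(p + i, len - i);
            if (n == 0) return false;
            i += n;
            chars++;
        }
    } else {
        for (i = 0; i < len; i++)
            if ((p[i] & 0xC0) != 0x80) chars++;
    }
    out->data = malloc(len + 1);
    if (out->data == NULL) return false;
    memcpy(out->data, s, len + 1);
    out->n_bytes = len;
    out->n_chars = chars;
    return true;
}

/* Concatenate a and b into out (out must not alias a or b). Both
 * cached lengths are simply added; nothing is re-scanned. */
bool utf8str_concat(Utf8Str *out, const Utf8Str *a, const Utf8Str *b)
{
    char *buf = malloc(a->n_bytes + b->n_bytes + 1);

    if (buf == NULL) return false;
    memcpy(buf, a->data, a->n_bytes);
    memcpy(buf + a->n_bytes, b->data, b->n_bytes + 1);  /* copies the NUL */
    out->data = buf;
    out->n_bytes = a->n_bytes + b->n_bytes;
    out->n_chars = a->n_chars + b->n_chars;
    return true;
}

/* Release the buffer and reset the cached lengths. */
void utf8str_clear(Utf8Str *s)
{
    free(s->data);
    s->data = NULL;
    s->n_bytes = s->n_chars = 0;
}
```

Adding the cached counts is safe for bytes and for code points, because
joining two valid UTF-8 strings always yields valid UTF-8. It would not be
safe for grapheme clusters: a string that starts with a combining mark merges
with the end of the string before it, so a cluster count cannot simply be
summed.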
> gtk-devel-list mailing list