
Re: G_UTF8String: Boxed Type Proposal

On 03/17/2016 10:39 AM, Randall Sawyer wrote:

On 03/17/2016 09:30 AM, Matthias Clasen wrote:
I believe that you haven't found such a proposal because most people
don't see much use in a separate boxed type for utf8 strings. Every
string we pass around in GLib and GTK+, and every char * in their APIs
is expected to be in utf8. The few exceptions to this rule are
explicitly documented.
GLib already provides a number of utilities for dealing with utf8
strings in terms of characters, such as g_utf8_strlen,
g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
adding to that list, if there are glaring omissions.
Here is the vision: Once raw string data, or a gunichar value, has been validated and stored in a "G_UTF8String" structure, the contents of two or more such structures can be combined easily, without any additional measuring or validating.

Alright Matthias, after your thoughtful response, I have come to the following conclusion. When managing dynamically allocated UTF-8 strings, there are actually two points to consider: 1) whether the byte sequences are valid per IETF RFC 3629, Section 4, and 2) the number of distinct characters represented in the string versus the total number of bytes used to represent them.
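To make the two points concrete, here is a minimal sketch in plain C (deliberately not using GLib) that answers both questions in a single pass: it checks each sequence against the byte ranges in RFC 3629, Section 4, and counts characters alongside bytes. The names Utf8Scan and utf8_scan are hypothetical, invented for this illustration; in GLib itself, g_utf8_validate and g_utf8_strlen cover the same ground.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type for illustration; not part of GLib. */
typedef struct {
    bool   valid;  /* true if every sequence is well formed per RFC 3629 */
    size_t bytes;  /* bytes in the valid prefix (excluding the NUL) */
    size_t chars;  /* characters (code points) in the valid prefix */
} Utf8Scan;

/* Scan a NUL-terminated buffer once, validating and counting together. */
static Utf8Scan utf8_scan(const char *str)
{
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;
    Utf8Scan r = { true, 0, 0 };

    while (s[r.bytes] != 0) {
        unsigned char c = s[r.bytes];
        size_t len;                          /* length of this sequence */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range, 2nd byte */

        if (c <= 0x7F)                   { len = 1; }
        else if (c >= 0xC2 && c <= 0xDF) { len = 2; }
        else if (c == 0xE0)              { len = 3; lo = 0xA0; } /* overlong */
        else if (c == 0xED)              { len = 3; hi = 0x9F; } /* surrogates */
        else if (c >= 0xE1 && c <= 0xEF) { len = 3; }
        else if (c == 0xF0)              { len = 4; lo = 0x90; } /* overlong */
        else if (c >= 0xF1 && c <= 0xF3) { len = 4; }
        else if (c == 0xF4)              { len = 4; hi = 0x8F; } /* > U+10FFFF */
        else { r.valid = false; return r; }  /* 0x80..0xC1, 0xF5..0xFF */

        /* A NUL in place of a continuation byte fails the range check,
         * so a truncated sequence never reads past the terminator. */
        for (size_t i = 1; i < len; i++) {
            unsigned char t = s[r.bytes + i];
            unsigned char tmin = (i == 1) ? lo : 0x80;
            unsigned char tmax = (i == 1) ? hi : 0xBF;
            if (t < tmin || t > tmax) { r.valid = false; return r; }
        }
        r.bytes += len;
        r.chars += 1;
    }
    return r;
}
```

For the six-byte string "h\xC3\xA9llo" this reports five characters, which is exactly the gap between the two lengths described in point 2.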

If someone were to write a widget library or an application on top of libraries that already ensure valid UTF-8 input (Gdk key-press events and GtkIMContexts, for example), then it wouldn't make sense to run those strings through yet another round of validation. That addresses the first issue.

There is still the question of character length vs. byte length.

Therefore, based on what you have told me, I will be sure to present methods that offer validation as an option and not as the rule.
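As a rough sketch of what "validation as an option" could look like, the plain C below caches both lengths at construction time and takes the validator as an optional callback, so trusted input can skip it by passing NULL. Every name here (Utf8String, Utf8Validator, utf8_string_new, utf8_string_concat, ascii_only) is hypothetical and exists only for this illustration; none of it is GLib API. The ascii_only check is a deliberately strict stand-in; real code would more likely wrap a full RFC 3629 check such as g_utf8_validate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration; GLib defines no such structure. */
typedef struct {
    char  *data;   /* NUL-terminated UTF-8 bytes, owned by the struct */
    size_t bytes;  /* byte length, excluding the NUL */
    size_t chars;  /* character (code point) count */
} Utf8String;

/* Caller-supplied check; pass NULL to skip validation for trusted input. */
typedef bool (*Utf8Validator)(const char *str);

/* Stand-in validator for the demo: accepts 7-bit ASCII only. */
static bool ascii_only(const char *str)
{
    for (; *str != '\0'; str++)
        if ((unsigned char)*str > 0x7F)
            return false;
    return true;
}

/* Count characters by counting bytes that are not continuation bytes
 * (10xxxxxx). Only meaningful if the input is already valid UTF-8. */
static size_t count_chars(const char *str, size_t bytes)
{
    size_t n = 0;
    for (size_t i = 0; i < bytes; i++)
        if (((unsigned char)str[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Copy and measure once. Returns NULL if rejected or out of memory. */
static Utf8String *utf8_string_new(const char *str, Utf8Validator validate)
{
    assert(str != NULL);
    if (validate != NULL && !validate(str))
        return NULL;

    Utf8String *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->bytes = strlen(str);
    s->data = malloc(s->bytes + 1);
    if (s->data == NULL) { free(s); return NULL; }
    memcpy(s->data, str, s->bytes + 1);
    s->chars = count_chars(str, s->bytes);
    return s;
}

/* Concatenate: both lengths are simple sums, so nothing is rescanned. */
static Utf8String *utf8_string_concat(const Utf8String *a, const Utf8String *b)
{
    assert(a != NULL && b != NULL);
    Utf8String *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->bytes = a->bytes + b->bytes;
    s->chars = a->chars + b->chars;
    s->data = malloc(s->bytes + 1);
    if (s->data == NULL) { free(s); return NULL; }
    memcpy(s->data, a->data, a->bytes);
    memcpy(s->data + a->bytes, b->data, b->bytes + 1);  /* copies the NUL */
    return s;
}

static void utf8_string_free(Utf8String *s)
{
    if (s != NULL) { free(s->data); free(s); }
}
```

A caller holding text from an input method might write utf8_string_new(text, NULL), while text read from an untrusted file would go through a validator. The trade-off is that skipping validation makes the cached character count only as trustworthy as the source of the bytes.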

Thank you.

gtk-devel-list mailing list