
Re: G_UTF8String: Boxed Type Proposal

Thank you once again to all who have responded.

I have changed my mind.

I DO grasp the nature of responders' objections.

My understanding has now reached a "tipping point".

What is the tipping point?

On 03/21/2016 04:30 PM, Behdad Esfahbod wrote:
I like to voice my opinion as well:

  - Bundling data and its length in a boxed type is useful, but that's gblob,

  - Bundling number-of-Unicode-character is rarely useful indeed,

  - A string API that would require any changes to the string content to go through editing function calls is painful and will remain unused,

I also have a piece of a more personal opinion:  Many processes that simply *reject* invalid Unicode text are useless in many situations.  For example, gedit used to refuse to open a file if it had even a single invalidly-encoded byte.  I find that annoyingly limited.  Same thing about Pango.  Fortunately, both have been fixed for many years now.


On Mon, Mar 21, 2016 at 6:32 AM, Matthias Clasen <matthias.clasen@xxxxxxxxx> wrote:
On Fri, Mar 18, 2016 at 9:57 AM, Randall Sawyer
<srandallsawyer@xxxxxxxxxxx> wrote:

> 2) If the former is true - which it is - then the developer will need to
> call g_utf8_strlen() to determine if there are multi-byte sequences to
> navigate - and if there are - g_utf8_offset_to_pointer() to locate the array
> index. Doesn't this increase processing demand?

It does. But whether that is a problem (in general, or for your
particular use case) can only be answered by profiling. My theory is
that you won't be able to notice this on the profile at all, unless
all your application does is constantly operating on large amounts of
text. In which case, you really shouldn't be using GString to begin

Matthias, I comprehend what you are saying here.

As Christian pointed out recently (https://mail.gnome.org/archives/gtk-devel-list/2016-March/msg00037.html), "DRY alone is not a sufficient argument."

> 3) Wouldn't it be helpful to keep track of how many code points
> ("characters") are stored in the GString - a number which may be less than
> the value of GString.len - without needing to call g_utf8_strlen() each time
> to find out?
> 4) Would my efforts be better spent editing patches of "gstring.h" and
> "gstring.c" - or - to proceed as I am to introduce a parallel alternative?

I think we haven't gotten past the 'what is the problem you are trying
to solve - and is it a problem in the first place ?' part yet.
gtk-devel-list mailing list

The tipping point is the function g_utf8_normalize() - which is called by objects which DO possess a length-of-string in units of UTF-8 code points ("characters" in GLib parlance).

If my proposed idea were to be adopted in a useful way - then every call to any g_utf8_*() function would require that it be wrapped in a g_ustring_*() [previously g_utf8_string_*()] function in order for GUString [previously G_UTF8String] to be truly useful.

Time to move on.

Along the way - however - I have come up with two functions which I will be proposing and which may well be useful in certain cases:

g_utf8_unilen() - which measures the length of a string in UTF-8 sequences ("characters") primarily and in non-nul bytes secondarily
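
To make the idea concrete, here is a rough sketch in plain C of what a one-pass measurement could look like. It is not the proposed patch and not GLib API: the name utf8_unilen_sketch() and its out-parameter are hypothetical, and it assumes the input is valid, nul-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not GLib API.
 * Returns the number of UTF-8 sequences ("characters") in s and, if
 * n_bytes is not NULL, stores the number of non-nul bytes there.
 * Assumes s is valid, nul-terminated UTF-8. */
size_t
utf8_unilen_sketch (const char *s, size_t *n_bytes)
{
  size_t chars = 0;
  size_t bytes = 0;

  assert (s != NULL);

  for (; s[bytes] != '\0'; bytes++)
    {
      /* Continuation bytes have the form 10xxxxxx; any other byte
       * starts a new sequence. */
      if (((unsigned char) s[bytes] & 0xC0) != 0x80)
        chars++;
    }

  if (n_bytes != NULL)
    *n_bytes = bytes;

  return chars;
}
```

When both numbers come back equal, the string holds only single-byte sequences, which is the case the second function takes advantage of.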

g_utf8_offset_to_pointer_sized () - which optimizes its return value by first comparing byte length to UTF-8 length [for the cases when these are both known] - opting for pointer arithmetic when equal - and then compares UTF-8 offset to UTF-8 length in order to decide whether to parse the first 3/4 or the last 1/4 when calling g_utf8_offset_to_pointer()
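
The decision logic described above might be sketched in plain C roughly as follows. This is only an illustration under stated assumptions, not the patch itself: the function name, the parameter order and the exact 3/4 cut-off are my guesses, the forward and backward walks are hand-written rather than calls to g_utf8_offset_to_pointer(), and the input is assumed to be valid UTF-8 whose byte and character lengths are already known.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not GLib API.
 * str:     valid UTF-8, n_bytes long (not counting the nul)
 * n_chars: number of UTF-8 sequences in str
 * offset:  character offset to locate, 0 <= offset <= n_chars */
const char *
utf8_offset_to_pointer_sized_sketch (const char *str,
                                     size_t      n_bytes,
                                     size_t      n_chars,
                                     size_t      offset)
{
  const char *p;
  size_t steps;

  assert (str != NULL);
  assert (offset <= n_chars);

  /* Every sequence is one byte long: plain pointer arithmetic. */
  if (n_bytes == n_chars)
    return str + offset;

  if (offset < n_chars - n_chars / 4)
    {
      /* Target is in the first 3/4: walk forward from the start. */
      p = str;
      for (steps = offset; steps > 0; steps--)
        {
          p++;
          while (((unsigned char) *p & 0xC0) == 0x80)
            p++;
        }
    }
  else
    {
      /* Target is in the last 1/4: walk backward from the end. */
      p = str + n_bytes;
      for (steps = n_chars - offset; steps > 0; steps--)
        {
          p--;
          while (((unsigned char) *p & 0xC0) == 0x80)
            p--;
        }
    }

  return p;
}
```

Whether the backward walk pays off in practice is exactly the kind of question that profiling would have to answer.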

These last two, I will definitely be submitting as a patch.

