Re: gtk_text_iter_forward_search() comparison

On Mon, 30 Jan 2017 13:53:46 +0200
"Andrew W. Nosenko" <andrew.w.nosenko@xxxxxxxxx> wrote:
> On Sun, Jan 29, 2017 at 2:16 AM, Eric Cashon via gtk-devel-list <
> gtk-devel-list@xxxxxxxxx> wrote:
> >
> > I have been working on a little search experimentation. Gave
> > writing a case-insensitive gtk_text_iter_forward_search() a try.
> > The code is shorter than what is in gtktextiter.c and it works a
> > little faster. If a word is searched that isn't very frequent the
> > time is about the same. If you just look for single chars or words
> > that are frequent it looks to be quicker. Not sure if this is a
> > suitable method though. I know little of the textbuffer internals.
> > UTF-8 gives me some trouble also.
> >
> > There is a test program at
> >
> > https://github.com/cecashon/OrderedSetVelociRaptor/blob/
> > master/Misc/Csamples/search_textbuffer2.c
> >
> > that does a side by side comparison of the two search methods. If
> > there is an inherent problem with the test forward search please
> > say so. If not, maybe it can be used. I would be glad to work on it
> > a little more if corrections need to be made.
> >  
> Sorry, but your approach just doesn't work.
> You falsely assume that if you bring two characters to the same case
> (both to lower or both to upper), then it's enough for
> case-insensitive search. While it's indeed enough for English, it's
> not true in the general case. Just try to compare "Straße" and
> "STRASSE", which mean "street" in German, using your code.  (The
> second string is an uppercased version of the first, so searching for
> one should match another.)

Problems like this can arise with Unicode in any writing system,
including English.  Printed words in modern English can have lower-case
ligatures similar to ß (which a few hundred years ago was also a
ligature used in printed English), viz:

ﬁeld          - four code points (the first is the ligature U+FB01)

FIELD         - five code points

Comparing Unicode strings is fraught with difficulty, starting with the
mistaken assumption that a "character" is the same thing as a "code
point".
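
To make the code point counts above concrete, here is a minimal sketch
in plain C (no GLib) that counts code points in a UTF-8 string.  The
name count_code_points() is just something I made up for illustration,
and it assumes the input is already valid UTF-8:

```c
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string by
 * skipping continuation bytes (those of the form 10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done. */
static size_t
count_code_points (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;

  return n;
}
```

With the fi ligature written out as its UTF-8 bytes,
count_code_points ("\xEF\xAC\x81" "eld") gives 4 and
count_code_points ("FIELD") gives 5, so a search that walks both
strings one code point at a time and compares them after a per-character
case change cannot line the two words up.  (The literal is split in two
because 'e' is a hex digit and would otherwise be swallowed by the \x81
escape.)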

