Re: [Mingw-users] Idea/Discussion of unicode filepath support for C++ STL on Windows




> From: Emily Leiviskä <emily.leiviska@xxxxxxxxxxxxxxx>
> Date: Fri, 7 Oct 2016 07:59:18 +0000
> 
> #if defined _WIN32
>     // Byte count including the terminating null, so that it is converted too.
>     int __inlen = static_cast<int>(strlen(__file_name) + 1);
>     // UTF-8 input yields at most as many UTF-16 code units as it has bytes.
>     wchar_t* __buffer = new wchar_t [__inlen];
>     if(0 == MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, __file_name, __inlen, __buffer, __inlen)){
>         delete [] __buffer;
>         set_fail();
>         return;
>     }
>     // NOTE: _wfopen expects a wide-character mode string; __c_mode needs the same conversion if it is a const char*.
>     _M_c_file = _wfopen(__buffer, __c_mode);
>     delete [] __buffer;
>     if(_M_c_file)
> #elif defined _GLIBCXX_USE_LFS
>     
> If I'm not mistaken, this change would make any function that accepts a const char* filename
> UTF-8 aware. The downside is that it changes the current (undocumented and unspecified)
> behaviour from using the current Active Code Page for character encoding in const char* filenames to
> using UTF-8.
> 
> This might break some applications that rely on this undocumented feature; however, it might also fix
> some applications that currently assume that fstreams etc. are UTF-8 capable, as they are on Linux.
> 
> I'm looking for comments: what do you think of such a change, and is it worth
> trying to get it into MinGW (or upstream)?
> 
> Would you find it justified to accept some (hopefully minor) breakage of undocumented behaviour wrt
> ACP paths in exchange for UTF-8 support? That would be in line with Microsoft's recommendation to
> use UTF-8 or UTF-16 when possible.
> If not, would you consider it OK if it could be enabled by setting the locale or codepage? For example, if
> std::locale() contains "UTF-8", use the above conversion; otherwise use the old behaviour.
> The standard locale on startup is "C" on Windows AFAICT. Other ideas?

Doing this only for file-open functionality is a start, but it is a
partial solution at best.  Applications that manipulate file names
almost always need to do string processing with file names, like
finding only the base part of a file name, constructing a full
absolute file name from a directory, a file name, and an extension,
comparing file names in a case-insensitive manner, etc.  All of this
will become subtly broken if you use UTF-8 encoded strings, because
Windows locales cannot use UTF-8 as their codeset, which means
functions like isalpha, isupper, strcasecmp, strcoll, mbstowcs,
etc. will not work for any non-ASCII character encoded as a UTF-8
sequence.
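
To make the problem concrete, here is a minimal sketch of the kind of
byte-wise processing that file-name code typically does.  The helper
names count_alpha_bytes and upper_bytewise are made up for this
illustration.  The sketch assumes the program is still in the default
"C" locale; with a single-byte Windows codepage selected, each byte
of the UTF-8 sequence would instead be classified as a separate
character of that codepage, which is equally wrong.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes that std::isalpha accepts in the
// current C locale.
std::size_t count_alpha_bytes(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if (std::isalpha(c)) ++n;
    return n;
}

// Hypothetical helper: upper-cases a string one byte at a time, the way
// naive file-name code usually does.
std::string upper_bytewise(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}
```

With "\xC3\xA9" (the UTF-8 encoding of U+00E9, e with acute accent),
count_alpha_bytes sees two non-alphabetic bytes rather than one
letter, and upper_bytewise changes only the ASCII part of the name,
so a case-insensitive comparison built on it treats names that
differ only in the case of the accented letter as different.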

So if we want to make MinGW Unicode-compatible, we need to have
locale-aware functions that support UTF-8, which means replacements
for all of them, starting with 'setlocale'.  Anything less than that
will get us a semi-broken implementation full of caveats.

I do agree that this is the right direction, though.  I just think
that more than a single API needs to be fixed for it to become a
reliable feature.
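
For the file-open part on its own, the conversion could be sketched
as a standalone helper along the following lines.  This is not the
libstdc++ change quoted above, only an illustration: the names
open_utf8 and utf8_to_wide are invented here, and the non-Windows
branch simply assumes that fopen already accepts UTF-8 names.  It
asks MultiByteToWideChar for the required buffer size first and
converts the mode string as well, since _wfopen takes both arguments
as wide strings.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>

// Hypothetical helper: converts a null-terminated UTF-8 string to UTF-16.
// Returns false if the input is not valid UTF-8.
static bool utf8_to_wide(const char* in, std::wstring& out) {
    // First call: query the required length, terminating null included.
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, in, -1, nullptr, 0);
    if (n <= 0) return false;
    out.resize(static_cast<std::size_t>(n));
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, in, -1, &out[0], n) == 0)
        return false;
    out.resize(static_cast<std::size_t>(n) - 1);  // drop the converted null
    return true;
}
#endif

// Hypothetical helper: opens a file whose name is given in UTF-8.
// Returns a null pointer on failure, like fopen.
std::FILE* open_utf8(const char* path, const char* mode) {
#ifdef _WIN32
    std::wstring wpath, wmode;
    if (!utf8_to_wide(path, wpath) || !utf8_to_wide(mode, wmode))
        return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Assumption: on other systems the name is passed through unchanged.
    return std::fopen(path, mode);
#endif
}
```

A caller would use open_utf8("caf\xC3\xA9.txt", "r") exactly like
fopen.  Note that this only covers opening the file; the string
processing problems described above remain.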
