
[Mingw-users] Idea/Discussion of unicode filepath support for C++ STL on Windows

First off, I'm not sure I'm on the right list... if this is not the place, please pardon me;
I would be much obliged if you could point me in the right direction.

I've exhausted my Google-Fu trying to find a good solution to reading files with Unicode
file paths and complex characters on MinGW. A short summary of what I've found:

1) MSVC provides non-standard wchar_t* overloads for the relevant file-opening methods
     in the C++ STL. These are not available in GCC/libstdc++, and hence not in MinGW. I can't
     find the link anymore, but I recall someone submitting a patch to libstdc++ implementing
     these overloads; it was ultimately rejected.
2) One can use the "short name hack" to convert the name to DOS8.3 format and pray that
     there will not be a name collision in the converted namespace. This is brittle at best, and
     I heard a rumour that this functionality might be going away in the future. 
3) The Microsoft CRT provides _wfopen that takes UTF-16 encoded wchar_t paths which can
     be used for opening complex paths. One suggestion is to use this and go back to C-style IO.

Out of the above, #3 seems the most reliable but possibly the least desirable solution.

I've grokked around the libstdc++ source, and after some detective work I've found that fstreams
(through some intermediaries) eventually delegate to `__basic_file` for the actual file I/O operations.

See: https://github.com/gcc-mirror/gcc/blob/e11be3ea01eaf8acd8cd86d3f9c427621b64e6b4/libstdc%2B%2B-v3/config/io/basic_file_stdio.cc#L230

To me it looks like an easy patch to add something along the lines of:

#if defined _WIN32
    const int __inlen = strlen(__file_name) + 1; // include the terminating null
    // A UTF-8 string needs at most as many UTF-16 code units as it has bytes.
    wchar_t* __buffer = new wchar_t [__inlen];
    if (0 == MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                 __file_name, __inlen, __buffer, __inlen))
      {
        delete [] __buffer;
        return 0; // __file_name is not valid UTF-8
      }
    _M_cfile = _wfopen(__buffer, __wc_mode); // mode string widened the same way (omitted)
    delete [] __buffer;
#elif defined _GLIBCXX_USE_LFS
If I'm not mistaken, this change would make any function that accepts a const char* filename
UTF-8 aware. The downside is that it changes the current (undocumented and unspecified)
behaviour from interpreting const char* filenames in the current Active Code Page (ACP) to
interpreting them as UTF-8.

This might break some applications that rely on this undocumented behaviour, but it might also fix
some applications that currently assume fstreams etc. are UTF-8 capable, as they are on Linux.

I'm looking for comments: what do you think of such a change, and is it worth pursuing an
attempt to get it into MinGW (or upstream)?

Would you consider some (hopefully minor) breakage of undocumented behaviour wrt ACP paths
justified in exchange for UTF-8 support? That would be in line with Microsoft's recommendation to
use UTF-8 or UTF-16 where possible.
If not, would you consider it OK if it could be enabled via the locale or code page? For example,
if std::locale() contains "UTF-8", use the above conversion; otherwise keep the old behaviour.
The startup locale on Windows is "C" AFAICT. Other ideas?

(p.s. the archive search isn't working for me, search.gmane.org can't be reached)
