From: alf.p.steinbach+usenet@gmail.com   
      
   On 20.12.2011 21:07, Chris Vine wrote:   
   > On Sun, 18 Dec 2011 00:48:58 -0800 (PST)   
   > Vadim Zeitlin wrote:   
   > [snip]   
   >> AFAICS file_system library uses the current locale encoding. This is   
   >> a perfectly reasonable choice (and what wxWidgets does too, FWIW) but   
   >> I can't help feeling that using UTF-8 could be nicer in many cases.   
   >> Unfortunately, short of waiting for the new UTF-8 literals in C++1x,   
   >> I don't see how this could be implemented in a nice way. One   
   >> possibility would be to try to interpret char* strings as UTF-8 and   
   >> fall back to current locale encoding if this fails but this seems   
   >> dangerous.   
   >>   
   >> Does anybody have any thoughts about this? Are there any libraries   
   >> that assume UTF-8 for their char* input and, if so, what is their   
   >> experience with it?   
   >   
   > There are two different issues here. One is the encoding for narrow   
   > string literals in a particular program. In C++98/03, this depends on   
   > how your particular code editor stores its narrow characters: if it   
   > stores them in utf-8 encoding then that same code sequence will be   
   > stored in the binary; if it uses the locale narrow encoding (if   
   > different) then the locale narrow encoding will be stored in the   
   > binary.   
      
   No, that's not how it works.   
      
   The C++ standard requires the compiler to translate narrow character literals
   from the source code encoding to the compiler's narrow execution character
   set.
      
   In C++11 this is specified in §2.2/5:   
      
    "Each source character set member in a character literal or a string   
    literal, as well as each escape sequence and universal-character-name   
    in a character literal or a non-raw string literal, is converted to   
    the corresponding member of the execution character set (2.14.3,   
    2.14.5); if there is no corresponding member, it is converted to an   
    implementation-defined member other than the null (wide) character."   
      
   For example, Visual C++ has (unfortunately undocumented) Windows ANSI as its   
   narrow character execution character set, and UTF-16 as its wide character   
   execution character set.   
      
   Thus, whether the source code is Windows ANSI encoded or UTF-8 with BOM,
   which are two of the source code encodings that the Visual C++ compiler
   recognizes, the narrow character literals end up as Windows ANSI encoded
   values in the executable.
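
   One way to see this translation for yourself is to dump the bytes of a
   literal. The following is only a minimal sketch: hex_bytes, translated and
   fixed_bytes are hypothetical names made up for the illustration, and I'm
   assuming an ASCII-based execution character set. The universal-character-name
   is subject to the conversion quoted above, while the hex escapes specify the
   byte values directly.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not from any library): renders the bytes of a
// NUL-terminated narrow string as space-separated lowercase hex.
std::string hex_bytes(const char* s)
{
    assert(s != nullptr);
    std::string out;
    const std::size_t n = std::strlen(s);
    for (std::size_t i = 0; i < n; ++i) {
        char buf[4];
        // Go via unsigned char so that bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}

// U+00E9 (e with acute accent) as a universal-character-name: the compiler
// converts it to the narrow execution character set, whatever that is.
const char* const translated = "\u00e9";

// The UTF-8 encoding of the same character, spelled out with hex escapes:
// these give the byte values directly, independent of the character sets.
const char* const fixed_bytes = "\xC3\xA9";
```

   With a Windows ANSI (Western) execution character set I would expect
   hex_bytes(translated) to give "e9", and with a UTF-8 execution character
   set "c3 a9". hex_bytes(fixed_bytes) should give "c3 a9" either way.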
      
   The g++ compiler has options to specify these encodings (-finput-charset,
   -fexec-charset and -fwide-exec-charset), but alas, they're not well
   documented (in particular the syntax).
      
      
   > With a good code editor, you can set the narrow encoding it   
   > will use to match the expectations of the code you are writing.   
      
   No, don't rely on that.   
      
   Even if you can fool the compiler, the compiler may then in turn fool you by   
   garbling wide character literals.   
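
   To illustrate the garbling, suppose the source is UTF-8 without BOM and the
   compiler reads it as Windows ANSI (Western). The two UTF-8 bytes of an "é"
   pass through a narrow literal unchanged, which looks like success, but in a
   wide literal each byte is taken as a separate character. The sketch below
   simulates both outcomes with universal-character-names so that it does not
   depend on how this text is stored. code_units, intended and garbled are
   hypothetical names for the illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the code units of a NUL-terminated wide string.
std::vector<unsigned long> code_units(const wchar_t* s)
{
    assert(s != nullptr);
    std::vector<unsigned long> out;
    for (std::size_t i = 0; s[i] != L'\0'; ++i) {
        out.push_back(static_cast<unsigned long>(s[i]));
    }
    return out;
}

// What the programmer meant: a single character, U+00E9.
const wchar_t* const intended = L"\u00e9";

// What a compiler produces if it reads the UTF-8 bytes C3 A9 as two
// Windows ANSI (Western) characters: U+00C3 followed by U+00A9.
const wchar_t* const garbled = L"\u00c3\u00a9";
```

   Here code_units(intended) holds the single value 0xE9, while
   code_units(garbled) holds 0xC3 and 0xA9, i.e. two wrong characters where
   one was meant.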
      
   [snip]   
      
      
   Cheers & hth.,   
      
   - Alf   
      
      
   --   
    [ See http://www.gotw.ca/resources/clcm.htm for info about ]   
    [ comp.lang.c++.moderated. First time posters: Do this! ]   
      