comp.lang.c++.moderated
Moderated discussion of C++ superhackery
Alf P. Steinbach to Jean-Marc Bourguet
Re: unicode and string
14 Jan 12 13:47:13
   From: alf.p.steinbach+usenet@gmail.com   
      
   On 14.01.2012 03:25, Jean-Marc Bourguet wrote:   
   > "Alf P. Steinbach"  writes:   
   >   
   >> C++11 §2.14.5/7 defines the type of a u8 string literal as an array of
   >> `char`. That works nicely for the *nix world, where `char` now by default
   >> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI   
   >> encoding (e.g., that's the execution character set for MSVC).   
   >   
   > char in C++ means encoded in a multibyte stateful encoding which depends on
   > the locale.  In "C" locale it is often just ASCII (7 bits).
      
   Yes, you can say that the problem resides with the standard not
   reflecting and catering for actual practice.
      
      
   >> That means that the C++ type checking does not prevent you from ending up
   >> with gobbledygook, treating a UTF-8 encoded string as Windows ANSI.
   >   
   > What is ANSI?  Code page 1250? 1251? 1252? 1253? Something else (Shift JIS   
   > for instance)?  One of those dependent on the Windows version and   
   > configuration? I fear it is the latter.
      
   Right you are: it's a locale dependent encoding.   
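To make the locale dependence concrete, here is a minimal sketch of how one and the same `char` value means different characters under two of the code pages listed above. The names `CodePage`, `to_code_point` and `check_code_pages` are made up for this illustration, and the mapping only covers ASCII plus the byte range 0xC0..0xFF, where code page 1252 matches Latin-1 and code page 1251 holds the basic Cyrillic letters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class CodePage { cp1251, cp1252 };

// Maps one byte to a Unicode code point under the given code page.
// Only ASCII and bytes 0xC0..0xFF are handled; anything else yields 0.
std::uint32_t to_code_point(unsigned char byte, CodePage cp)
{
    if (byte < 0x80) { return byte; }   // ASCII is common to both
    if (byte < 0xC0) { return 0; }      // not covered by this sketch
    switch (cp) {
        case CodePage::cp1252:
            return byte;                          // same as Latin-1 in this range
        case CodePage::cp1251:
            return 0x0410u + (byte - 0xC0u);      // U+0410..U+044F, Cyrillic
    }
    return 0;
}

void check_code_pages()
{
    const unsigned char byte = 0xE9;
    // U+00E9 LATIN SMALL LETTER E WITH ACUTE under code page 1252.
    assert(to_code_point(byte, CodePage::cp1252) == 0x00E9u);
    // U+0439 CYRILLIC SMALL LETTER SHORT I under code page 1251.
    assert(to_code_point(byte, CodePage::cp1251) == 0x0439u);
}
```

Nothing in the type `char` records which of the two interpretations was intended; that knowledge lives only in the configuration of the machine.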
      
   As a result, with Visual C++ having that encoding as its C++ narrow
   character execution character set, the executable that you get when you
   build with the Visual C++ compiler depends on the configured locale,
   i.e. the same source code, byte for byte, produces different binary
   executables.
      
   Oh, that was just trivia, but it serves to illustrate that this is truly   
   a mess, not just in the C++ standard.   
      
      
   >> One might say that the `char` type was inadvertently too much overloaded   
   >> with meanings (default single-byte character set encoding value type,   
   >> byte, UTF-8 value type), but given that the problems with overloaded   
   >> `char` meanings are well known the addition of an extra meaning that will   
   >> only surface as problematic in Windows, smells a bit of politics to me --   
   >> and if so it probably means: difficult to fix...   
   >   
   > In the standard model (from C89 and Amd1 in 94/95) there was never an   
   > intention to have the encoding part of the type.  Just two encodings per   
   > locale, a stateful multibyte one in char, a one unit per character one in
   > wchar_t.  And the precise encoding choice has been dependent on the locale   
   > for as long as there was encoding support in C and C++ (you can setup   
   > things so that wchar_t in some locales is related to one of the EUC   
   > encodings -- those are variable length but IIRC just prepending 0 bytes to   
   > make them fixed width will work --, in others UTF-32 for instance).  The
   > encoding used for literals has always been implementation dependent.
      
   The encoding for `char` literals is implementation dependent, yes,
   because the C++ execution character set is implementation dependent.
      
   But you could rely on the encoding for `char` literals being the C++   
   execution character set.   
      
   Now with C++11 you cannot rely on that: a u8 literal is also an array
   of `char`, but it is always UTF-8 encoded.
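A minimal sketch of the consequence, under stated assumptions: the macro `U8_AS_CHAR` and the functions `count_characters_assuming_single_byte` and `check_u8_literal` are invented for this illustration. The function stands in for any code that assumes a single-byte execution character set such as Windows ANSI. The cast branch exists only because C++20 later gave u8 literals the type `char8_t`; with a C++11 to C++17 compiler the literal converts to `const char*` with no cast at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// In C++11..C++17 a u8 literal is an array of const char. C++20 changed
// that to char8_t, so a cast is needed there to keep this sketch compiling.
#if defined(__cpp_char8_t)
#   define U8_AS_CHAR(lit) reinterpret_cast<const char*>(u8##lit)
#else
#   define U8_AS_CHAR(lit) u8##lit
#endif

// Hypothetical legacy function: assumes one byte per character, as is the
// case for single-byte code pages such as 1252.
std::size_t count_characters_assuming_single_byte(const char* text)
{
    return std::strlen(text);
}

void check_u8_literal()
{
    // One character, U+00E9, written as a universal character name.
    const char* utf8 = U8_AS_CHAR("\u00E9");

    // The u8 literal always holds the UTF-8 bytes C3 A9 ...
    assert(static_cast<unsigned char>(utf8[0]) == 0xC3);
    assert(static_cast<unsigned char>(utf8[1]) == 0xA9);

    // ... and the legacy function accepts it without complaint, reporting
    // two characters where there is one.
    assert(count_characters_assuming_single_byte(utf8) == 2);
}
```

Read as code page 1252, those two bytes display as "Ã©", which is exactly the kind of gobbledygook the type checker could have caught if the literal had a type of its own.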
      
      
   > IMHO, the problem isn't choosing between alternate models better suited to   
   > the current days, it is finding one which provides a path of transition   
   > from the current one without impacting those who depend on it.   
      
   That sentence sounds as if finding a better way would be somehow
   difficult. Well, that's meaningless and highly misleading: the C++11
   standard does employ a better way for the other new prefixes, just not
   for "u8", which will probably cause a bit of trouble for Windows
   programmers. The sentence above also sounds as if finding a better
   way is somehow in conflict with finding something better suited to the
   current days, and that too is meaningless and highly misleading.
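The "better way" used for the other new prefixes can be sketched as follows: `u"..."` and `U"..."` literals have their own element types, `char16_t` and `char32_t`, so overload resolution can tell them apart from narrow strings. The names `LiteralKind`, `kind_of` and `check_literal_types` are made up for this illustration.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical names, for illustration only.
enum class LiteralKind { narrow, utf16, utf32 };

// One overload per character type: the type system keeps them apart.
LiteralKind kind_of(const char*)     { return LiteralKind::narrow; }
LiteralKind kind_of(const char16_t*) { return LiteralKind::utf16; }
LiteralKind kind_of(const char32_t*) { return LiteralKind::utf32; }

// A string literal is an lvalue, so decltype yields a reference to array.
static_assert(std::is_same<decltype(u"x"), const char16_t(&)[2]>::value,
              "u\"...\" is an array of const char16_t");
static_assert(std::is_same<decltype(U"x"), const char32_t(&)[2]>::value,
              "U\"...\" is an array of const char32_t");

void check_literal_types()
{
    assert(kind_of("x")  == LiteralKind::narrow);
    assert(kind_of(u"x") == LiteralKind::utf16);
    assert(kind_of(U"x") == LiteralKind::utf32);
}
```

A u8 literal in C++11 would select the `const char*` overload, the same one as an ordinary literal, which is the gap discussed above.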
      
      
   Cheers & hth.,   
      
   - Alf   
      
      
   --   
         [ See http://www.gotw.ca/resources/clcm.htm for info about ]   
         [ comp.lang.c++.moderated.    First time posters: Do this! ]   
      