
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.c++.moderated      Moderated discussion of C++ superhackery      33,346 messages   


   Message 31,790 of 33,346   
   Jean-Marc Bourguet to Alf P. Steinbach   
   Re: unicode and string   
   13 Jan 12 18:25:06   
   
   From: jm@bourguet.org   
      
   "Alf P. Steinbach"  writes:   
      
   > C++11 §2.14.5/7 defines the type of a u8 string literal as an array of
   > `char`. That works nicely for the *nix world, where `char` now by default
   > means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
   > encoding (e.g., that's the execution character set for MSVC).
      
   char in C++ means a multibyte, possibly stateful encoding which depends on
   the locale.  In the "C" locale it is often just ASCII (7 bits).
      
   > That means that the C++ type checking does not prevent you ending up with   
   > gobbledygook, treating an UTF-8 encoded string as Windows ANSI.   
      
   What is ANSI?  Code page 1250? 1251? 1252? 1253? Something else (Shift JIS
   for instance)?  One of those, dependent on the Windows version and
   configuration? I fear it is the latter.
      
   > One might say that the `char` type was inadvertently overloaded with
   > too many meanings (default single-byte character set encoding value
   > type, byte, UTF-8 value type), but given that the problems with
   > overloaded `char` meanings are well known, the addition of an extra
   > meaning that will only surface as problematic in Windows smells a bit
   > of politics to me -- and if so it probably means: difficult to fix...
      
   In the standard model (from C89 and Amd1 in 94/95) there was never an
   intention to make the encoding part of the type.  There are just two
   encodings per locale: a stateful multibyte one in char, and one with one
   unit per character in wchar_t.  And the precise choice of encoding has
   depended on the locale for as long as there has been encoding support in
   C and C++ (you can set things up so that wchar_t in some locales is
   related to one of the EUC encodings -- those are variable length, but
   IIRC just prepending 0 bytes to make them fixed width will work -- and
   in others to UTF-32, for instance).  The encoding used for literals has
   always been implementation dependent.
      
   IMHO, the problem isn't choosing between alternate models better suited
   to the present day; it is finding one which provides a transition path
   from the current model without impacting those who depend on it.
      
   Yours,   
      
   --   
   Jean-Marc   
      
      
         [ See http://www.gotw.ca/resources/clcm.htm for info about ]   
         [ comp.lang.c++.moderated.    First time posters: Do this! ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca