

   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   


   Message 242,466 of 243,242   
   BGB to Thiago Adams   
   Re: u8"" c11 c23   
   16 Dec 25 14:59:01   
   
   From: cr88192@gmail.com   
      
   On 10/20/2025 1:35 PM, Thiago Adams wrote:   
   > speaking on signed x unsigned,   
   >   
   > u8"a"  in C11 had the type char [N]. Normally char is signed
   >
   > in C23 it is char8_t [N], and char8_t is unsigned char.
   >
   > when converting code from c11 to c23 we have an error here
   > const char* s = u8""
   >
   > I generally cast "char*" to "unsigned char*" when handling something
   > with utf8. I am not using u8""; I just use " " with utf8 encoded source code
   > and I just assume const char*  is utf8.
   >   
   It may not be so simple, as source-code bytes don't necessarily map 1:1   
   with string literal bytes (and are more likely to be translated than   
   passed through as-is).   
      
   Implicitly, it may depend on the default locale and similar assumed by   
   the C compiler.   
      
   If the source-code is UTF-8, and the default locale is UTF-8, then OK.   
      
   More conservative, though, is to assume that the default locale's
   character encoding is potentially something like 8859-1 or 1252. Such an
   encoding will not preserve UTF-8 text unless every character happens to
   fall in the range it supports (so, things may get remapped).
      
   So, you need a UTF-8 string literal or similar to specify that the
   string does in fact encode text as UTF-8.
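A minimal sketch of how one might keep using u8"" literals while compiling as either C11 or C23. The idea is to go through "const unsigned char *", which accepts both the C11 type (char [N]) and the C23 type (char8_t [N]) with a cast. The names UTF8_LIT, utf8_byte_len and example_text are made up for illustration. The non-ASCII character is written as a universal character name so the result does not depend on how the source file is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* u8"..." is char[N] in C11 and char8_t[N] (an unsigned char type) in C23.
   The cast to const unsigned char * is valid under both standards.
   UTF8_LIT is a hypothetical helper macro, not a standard one. */
#define UTF8_LIT(s) ((const unsigned char *)(u8"" s))

/* Hypothetical helper: length in bytes (not codepoints) of a
   NUL-terminated UTF-8 string. */
static size_t utf8_byte_len(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}

/* "cafe" with U+00E9 as the last letter; in UTF-8 that letter is the
   two bytes C3 A9, whatever the source or execution character set is. */
static const unsigned char *example_text(void)
{
    return UTF8_LIT("caf\u00E9");
}
```

With this, code that already handles UTF-8 through "unsigned char *" does not need to care which of the two standards the file was compiled as.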
      
      
      
   In a compiler, one may need to try to detect and deal with text   
   encoding, say:   
      ASCII text:   
        No BOM, limited range of characters   
          (0x20..0x7E, 0x09, 0x0D, 0x0A, etc).
      UTF-8:   
        Also includes bytes 80..F4 (C0, C1, and F5..FF never occur)
        Only allow valid codepoint sequences   
        May include a BOM   
      8859-1 or 1252:   
        Includes 80..FF, excludes text which is also valid as UTF-8.   
        No BOM.   
        Other encodings possible,   
          Like 437 / KOI-8 / JIS / etc,   
          but far less common than 1252.   
          No good way to distinguish them reliably.   
      UTF-16 (*1):   
        Even number of bytes   
        Strongly hinted if bytes at even or odd offsets (counting from 0) are frequently NUL;
          Frequent even NUL: UTF-16, likely big-endian;   
          Frequent odd NUL: UTF-16, likely little-endian;   
        Excluded if matching the pattern for one of the above;   
          If text is valid ASCII or UTF-8, assume these instead.   
        May include a BOM.   
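The heuristics above can be sketched roughly as follows. This is only an illustration, not how any particular compiler does it: the enum, the function names, and the "a quarter of the 16-bit units" threshold are all assumptions, and the ASCII test is simplified to "no NUL and no byte >= 0x80" rather than the exact printable range listed above. A leading FF FE is treated as a UTF-16 BOM, ignoring the (rarer) UTF-32 case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the guess. */
enum text_enc { ENC_ASCII, ENC_UTF8, ENC_LATIN1_OR_1252,
                ENC_UTF16LE, ENC_UTF16BE };

/* Returns 1 if p[0..len) is well-formed UTF-8: rejects stray
   continuation bytes, overlong forms, surrogates, and > U+10FFFF. */
static int is_valid_utf8(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        size_t n;                 /* number of continuation bytes */
        unsigned long cp, min;    /* decoded value, smallest legal value */
        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if (c >= 0xE0 && c <= 0xEF) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if (c >= 0xF0 && c <= 0xF4) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;            /* 80..C1 or F5..FF as a lead byte */
        if (len - i <= n) return 0;            /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}

/* Guess the encoding of a source buffer using the rules from the list. */
enum text_enc guess_encoding(const unsigned char *buf, size_t len)
{
    size_t nul_even = 0, nul_odd = 0, high = 0;
    assert(buf != NULL || len == 0);

    /* A BOM, if present, decides immediately. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return ENC_UTF16BE;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0) { if (i & 1) nul_odd++; else nul_even++; }
        else if (buf[i] >= 0x80) high++;
    }

    /* No NULs: prefer ASCII, then UTF-8, else fall back to 8859-1/1252. */
    if (nul_even + nul_odd == 0) {
        if (high == 0) return ENC_ASCII;
        return is_valid_utf8(buf, len) ? ENC_UTF8 : ENC_LATIN1_OR_1252;
    }

    /* NULs present: UTF-16 if the length is even and NULs cluster on one
       side. NULs at even offsets (0-based) are high bytes, so big-endian. */
    if (len % 2 == 0) {
        size_t units = len / 2;
        if (nul_even > nul_odd && nul_even * 4 >= units) return ENC_UTF16BE;
        if (nul_odd > nul_even && nul_odd * 4 >= units) return ENC_UTF16LE;
    }
    return ENC_LATIN1_OR_1252;
}
```

Note that the fallback case is exactly the weak spot mentioned in the list: anything with bytes in 80..FF that is not valid UTF-8 gets called 8859-1/1252, since the other single-byte encodings can't be told apart this way.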
      
   *1: More commonly produced by older versions of Visual Studio or   
   Notepad, if a non-ASCII codepoint was present. Newer versions tend to   
   default to UTF-8 instead.   
      
   The compiler may normalize on UTF-8 or similar internally, but this again
   doesn't mean UTF-8 can be assumed for string literals (which are more
   likely to be mashed into 1252 or something, such as with a compiler like
   MSVC).
      
      
   Though, that said, it does seem that GCC defaults to assuming UTF-8 if
   nothing else is specified. So, UTF-8 => UTF-8 with default string   
   literals may be workable if one also assumes that the code is always   
   compiled with GCC or similar.   
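If one does rely on "UTF-8 source in, UTF-8 plain literal out", it may be worth making that assumption checkable rather than silent. A small sketch, assuming this source file itself is saved as UTF-8; the function names are hypothetical. It looks at the bytes the compiler actually produced for an ordinary (unprefixed) literal containing U+00E9, which is C3 A9 in UTF-8 but the single byte E9 in 8859-1 or 1252.

```c
#include <assert.h>

/* Returns 1 if the ordinary literal below ended up as UTF-8 bytes in
   this build. The character is typed directly (not as \u00E9) on
   purpose, so the whole source-to-literal path is what gets checked. */
int plain_literals_are_utf8(void)
{
    static const char probe[] = "é";
    return sizeof probe == 3                    /* 2 bytes + NUL */
        && (unsigned char)probe[0] == 0xC3
        && (unsigned char)probe[1] == 0xA9;
}

/* Hypothetical startup guard: stop early (in builds without NDEBUG)
   if the build does not match the assumption. */
void require_utf8_literals(void)
{
    assert(plain_literals_are_utf8());
}
```

If the check fails, the usual knobs are compiler options such as GCC's -finput-charset / -fexec-charset or MSVC's /utf-8, or switching the affected literals to u8"".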
      
   Though, curiously, it seems newer MSVC will still use UTF-8 with a   
   default string literal if the character is given as "\uXXXX", but will   
   use a single-byte encoding in other cases.   
      
   Checking, newer versions of MSVC are also aware of u8 literals.   
      
   ...   
      



(c) 1994,  bbs@darkrealms.ca