   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   

   Message 241,976 of 243,242   
   Michael Sanders to Mikko   
   Re: Unicode...   
   15 Nov 25 19:09:16   
   
   From: porkchop@invalid.foo   
      
   On Sat, 15 Nov 2025 12:47:03 +0200, Mikko wrote:   
      
   > On 2025-11-14 21:03:38 +0000, Michael Sanders said:   
   >   
   >> Well, I finally got bitten by Unicode.   
   >>   
   >> Managed a work around, but I don't have enough experience   
   >> with Unicode to know just exactly what I'm doing...   
   >>   
   >> #include    
   >> #include    
   >>   
   >> static int utf8_width(const char *s) {   
   >>     int w = 0;   
   >>     const unsigned char *p = (const unsigned char *)s;   
   >>   
   >>     while (*p) {   
   >>         if (*p < 0x80) { w++; p++; } // ASCII 1-byte   
   >>         else if ((*p & 0xE0) == 0xC0) { w++; p += 2; } // 2-byte UTF-8   
   >>         else if ((*p & 0xF0) == 0xE0) { w++; p += 3; } // 3-byte UTF-8   
   >>         else if ((*p & 0xF8) == 0xF0) { w++; p += 4; } // 4-byte UTF-8   
   >>         else { w++; p++; } // fallback   
   >>     }   
   >>   
   >>     return w;   
   >> }   
   >   
   > The code above may cause problems if the argument string is not
   > well-formed UTF-8. For example, the zero terminator could be missed. Of
   > course an invalid string can be expected to cause problems anyway, but
   > some errors are harder to debug than others.
   >   
   > Another way is   
   >   
   > static int utf8_width(const char *s) {   
   >     int w = 0;   
   >     const unsigned char *p = (const unsigned char *)s;   
   >   
   >     while (*p) {
   >       if ((*p & 0xC0) != 0x80) w++; // count the first byte of each character
   >       p++; // advance one byte at a time so the terminator is never skipped
   >     }
   >   
   >     return w;   
   > }   
   >   
   > One could also add a check that each character has the right number of   
   > bytes of the right kind and if not regard that as the end of the string.   
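
   A minimal sketch of the check suggested above, assuming that "the right
   number of bytes of the right kind" means each lead byte must be followed
   by the expected count of continuation bytes (10xxxxxx). The name
   utf8_width_checked is hypothetical and not from the thread. On the first
   malformed sequence it stops and returns the count so far, treating that
   point as the end of the string. It does not reject overlong encodings or
   surrogates, and like the versions above it counts code points rather than
   display columns.

```c
/* Hypothetical helper (not from the thread): counts UTF-8 code points in s,
   treating the first malformed sequence as the end of the string. */
static int utf8_width_checked(const char *s) {
    int w = 0;
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        int n; /* expected length of this sequence in bytes */

        if (*p < 0x80) n = 1;                 /* ASCII */
        else if ((*p & 0xE0) == 0xC0) n = 2;  /* 110xxxxx */
        else if ((*p & 0xF0) == 0xE0) n = 3;  /* 1110xxxx */
        else if ((*p & 0xF8) == 0xF0) n = 4;  /* 11110xxx */
        else break; /* stray continuation byte or invalid lead byte */

        /* Each trailing byte must look like 10xxxxxx. A NUL fails this
           test, so the loop never reads past the terminator. */
        for (int i = 1; i < n; i++) {
            if ((p[i] & 0xC0) != 0x80) return w;
        }

        w++;
        p += n;
    }

    return w;
}
```

   With this version a truncated sequence such as "ab\xE2\x82" yields 2
   instead of stepping over the terminator.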
      
   Excellent, I've added your reply to my notes, thank you Mikko.   
      
   --   
   :wq   
   Mike Sanders   
      


