From: porkchop@invalid.foo   
      
   On Sat, 15 Nov 2025 12:47:03 +0200, Mikko wrote:   
      
   > On 2025-11-14 21:03:38 +0000, Michael Sanders said:   
   >   
   >> Well, I finally got bitten by Unicode.   
   >>   
   >> Managed a work around, but I don't have enough experience   
   >> with Unicode to know just exactly what I'm doing...   
   >>   
   >> #include    
   >> #include    
   >>   
   >> static int utf8_width(const char *s) {   
   >> int w = 0;   
   >> const unsigned char *p = (const unsigned char *)s;   
   >>   
   >> while (*p) {   
   >> if (*p < 0x80) { w++; p++; } // ASCII 1-byte   
   >> else if ((*p & 0xE0) == 0xC0) { w++; p += 2; } // 2-byte UTF-8   
   >> else if ((*p & 0xF0) == 0xE0) { w++; p += 3; } // 3-byte UTF-8   
   >> else if ((*p & 0xF8) == 0xF0) { w++; p += 4; } // 4-byte UTF-8   
   >> else { w++; p++; } // fallback   
   >> }   
   >>   
   >> return w;   
   >> }   
   >   
   > The code above may cause problems if the argument string is not
   > well-formed UTF-8. For example, the zero terminator could be missed. Of
   > course an invalid string can be expected to cause problems anyway, but
   > some errors are harder to debug than others.
   >   
   > Another way is   
   >   
   > static int utf8_width(const char *s) {   
   > int w = 0;   
   > const unsigned char *p = (const unsigned char *)s;   
   >   
   > while (*p) {
   > if ((*p & 0xC0) != 0x80) w++; // count the first byte of each character
   > p++; // advance one byte at a time so the terminator is never skipped
   > }
   >   
   > return w;   
   > }   
   >   
   > One could also add a check that each character has the right number of   
   > bytes of the right kind and if not regard that as the end of the string.   
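
A minimal sketch of the check suggested above, assuming the goal is only to
count code points and to treat the first malformed sequence as the end of the
string. The name utf8_width_checked is hypothetical and not from the thread.
It does not reject overlong encodings or surrogates, and like both versions
above it counts code points, not display columns.

```c
/* Hypothetical helper (not from the thread): counts UTF-8 code points in s.
   The first malformed sequence is treated as the end of the string. */
static int utf8_width_checked(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    int w = 0;

    while (*p) {
        int n; /* continuation bytes expected after the lead byte */

        if (*p < 0x80) n = 0;                 /* ASCII */
        else if ((*p & 0xE0) == 0xC0) n = 1;  /* 2-byte sequence */
        else if ((*p & 0xF0) == 0xE0) n = 2;  /* 3-byte sequence */
        else if ((*p & 0xF8) == 0xF0) n = 3;  /* 4-byte sequence */
        else break; /* stray continuation byte or invalid lead byte */

        p++;
        for (int i = 0; i < n; i++, p++) {
            /* A '\0' is not a continuation byte, so a truncated
               sequence stops here and the terminator is never skipped. */
            if ((*p & 0xC0) != 0x80) return w;
        }
        w++; /* count the character only once it is complete */
    }
    return w;
}
```

Because every byte is inspected before the pointer moves past it, a string
that ends in the middle of a multi-byte sequence simply ends the count early
instead of reading beyond the terminator.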
      
   Excellent, I've added your reply to my notes. Thank you, Mikko.
      
   --   
   :wq   
   Mike Sanders   
      