comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
Message 3,016 of 4,675
Terje Mathisen to Robert Wessel
Re: Translating blocks of memory contain
11 Oct 17 14:09:45
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Robert Wessel wrote:   
   > On Wed, 11 Oct 2017 09:29:35 +0200, Terje Mathisen   
   >> The latter is actually quite easy to do since all such multi-byte   
   >> utf8 chars will have a prefix from a small range of chars, followed   
   >> by zero or more intermediate chars and a final character which is   
   >> always from a separate set than those previous. This feature of the   
   >> encoding is so that you can always know where you are if you start   
   >> at a random byte of a long utf8 text.   
   >   
   > That's worded a bit ambiguously:  All the bytes in a UTF-8 sequence   
   > after the first byte will start with 10 - there's no distinction   
   > between the intermediate and final bytes.   
      
   OK, thanks. I misremembered the code I wrote many years ago. :-(   
      
   The first byte is from a unique set, and its leading bits determine
   how many following bytes it will need. Each of those bytes is also
   from a separate unique set (10xxxxxx), so it is easy to verify that
   the encoding is correct.
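
   As a rough illustration of those byte sets, here is a minimal C
   sketch. The names utf8_byte_class and utf8_shape_ok are made up for
   this example and do not come from the code discussed in this thread,
   and the F8h cutoff assumes the 4-byte limit of RFC 3629. It looks
   only at leading bits, so overlong forms and surrogates still pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: classify one byte by its leading bits only.
   Returns 1 for 7-bit ASCII, 0 for a continuation byte (10xxxxxx),
   2..4 for a lead byte announcing that total sequence length,
   and -1 for F8h..FFh, which cannot start a sequence of up to 4 bytes. */
int utf8_byte_class(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx */
    if (b < 0xC0) return 0;   /* 10xxxxxx */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    return -1;
}

/* Hypothetical helper: check the shape of one sequence of exactly n bytes,
   i.e. a first byte followed by the continuation bytes it announces.
   Overlong encodings and surrogates are NOT rejected here. */
int utf8_shape_ok(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    if (n == 0) return 0;

    int cls = utf8_byte_class(s[0]);
    if (cls < 1 || (size_t)cls != n) return 0;

    for (size_t i = 1; i < n; i++)
        if (utf8_byte_class(s[i]) != 0) return 0;
    return 1;
}
```

   For example, utf8_byte_class(0xE2) gives 3, and the three bytes
   E2 82 AC (the euro sign) pass utf8_shape_ok.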
      
   What this means is that starting from a random byte you can determine   
   where that character (utf code point) starts like this:   
      
    ; In:  SI -> starting search position
    ; Out: AL = first byte of the character, SI -> the byte following it
      
      lodsb   
      test al,al   
       jns done	; 7-bit ascii   
      
   check_first:   
      cmp al,0C0h	; 11000000b
       jae done	; First byte of a UTF-8 sequence

    ; [80h .. 0BFh] = continuation byte
      
   prev:   
      mov al,[si-2]	; fetch the byte before the one just examined
      dec si		; step SI back by one to match
      
      test al,al   
       js check_first   
      
   error:	; Not a UTF-8 sequence!   
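
   The same backward scan can be sketched in C. This is only an
   illustration: utf8_find_start and its parameters are hypothetical
   names, and the check against the start of the buffer is an addition
   that the assembly fragment does not have. Like the assembly, it does
   not count how many continuation bytes it steps over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first byte of the character
   that contains buf[pos], or -1 if no valid start is found.
   len is used only for the sanity check on pos. */
ptrdiff_t utf8_find_start(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos < len);
    (void)len;

    size_t i = pos;
    for (;;) {
        unsigned char b = buf[i];

        if (b < 0x80) {
            /* 7-bit ASCII is fine as the starting byte itself, but hitting
               it while backing up means the continuation bytes had no lead. */
            return (i == pos) ? (ptrdiff_t)i : -1;
        }
        if (b >= 0xC0)
            return (ptrdiff_t)i;   /* lead byte: start of the character */

        /* 80h..BFh: continuation byte, so step back one byte */
        if (i == 0)
            return -1;             /* ran off the front of the buffer */
        i--;
    }
}
```

   With the bytes 'a' E2 82 AC 'b', any pos from 1 to 3 returns 1,
   while pos 0 and pos 4 return themselves.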
      
   We could also write a small function to verify a given code point and   
   return the numeric value and the length in bytes:   
      
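   typedef unsigned char byte;    // assumed: the post does not define these types
   typedef unsigned int unicode;  // assumed to be 32 bits wide

   unicode getutf8(byte *src, unsigned *len)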
   {   
      byte *s = src;   
      unicode u = *s++;   
      *len = 0; // Error flag for bad encoding!   
      
      if (u > 0x80) {   
        if (u < 0xC0) return 0; // Not a UTF-8 encoding!   
      
        signed char leading_bits = (signed char) (u + u); // Sign bit set!   
        unsigned mask = 0x3F;	// Bits to keep   
      
        while (leading_bits < 0) {   
          unicode f = (*s++) ^ 0x80;   
          if (f >= 0x40) return 0; // Invalid encoding!   
      
          u = (u << 6) | f;   
          leading_bits += leading_bits;   
          mask = (mask << 5) | 31; // Each byte adds 5 more bits   
        }   
        // Get rid of the initial prefix   
        u &= mask;   
      
        // In order to be canonical, the encoding must use   
        // as few bytes as possible, so check if it would   
        // have fit in the previous encoding:   
        if (u <= (mask >> 5)) return u; // len == 0 still   
      }   
      *len = s - src;   
      return u;   
   }   
      
   There is one relatively obvious bug in the error checking here. Can
   any of you spot it?
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      