comp.lang.asm.x86
Robert Prins to Terje Mathisen
Re: Translating blocks of memory contain
11 Oct 17 17:03:23
   From: robert@nospicedham.prino.org   
      
   On 2017-10-11 12:09, Terje Mathisen wrote:   
    > Robert Wessel wrote:   
    >> On Wed, 11 Oct 2017 09:29:35 +0200, Terje Mathisen   
    >>> The latter is actually quite easy to do since all such multi-byte   
    >>> utf8 chars will have a prefix from a small range of chars, followed   
    >>> by zero or more intermediate chars and a final character which is   
    >>> always from a separate set than those previous. This feature of the   
    >>> encoding is so that you can always know where you are if you start   
    >>> at a random byte of a long utf8 text.   
    >>   
    >> That's worded a bit ambiguously:  All the bytes in a UTF-8 sequence   
    >> after the first byte will start with 10 - there's no distinction   
    >> between the intermediate and final bytes.   
    >   
    > OK, thanks. I misremembered the code I wrote many years ago. :-(   
    >   
    > The first char is from a unique set, and the leading bits in it determine
    > how many following bytes it will need. Each of those bytes is also from a
    > separate unique set, so it is easy to verify that the encoding is correct.
    >
    > What this means is that starting from a random byte you can determine
    > where that character (utf code point) starts like this:
    >   
    > ; SI -> starting search pos, return with AL = first char   
    > ; and SI-> at next byte   
    >   
    >    lodsb   
    >    test al,al   
    >     jns done    ; 7-bit ascii   
    >   
    > check_first:   
    >    cmp al,0C0h    ; 11000000b
    >     jae done    ; First char of UTF8
    >
    > ; [80h .. 0BFh] = secondary byte
    >
    > prev:
    >    mov al,[si-2]
    >    dec si
    >   
    >    test al,al   
    >     js check_first   
    >   
    > error:    ; Not a UTF-8 sequence!   
    >   
    > We could also write a small function to verify a given code point and
    > return the numeric value and the length in bytes:
    >   
    > unicode getutf8(byte *src, unsigned *len)   
    > {   
    >    byte *s = src;   
    >    unicode u = *s++;   
    >    *len = 0; // Error flag for bad encoding!   
    >   
    >    if (u > 0x80) {   
    >      if (u < 0xC0) return 0; // Not a UTF-8 encoding!   
    >   
    >      signed char leading_bits = (signed char) (u + u); // Sign bit set!   
    >      unsigned mask = 0x3F;    // Bits to keep   
    >   
    >      while (leading_bits < 0) {   
    >        unicode f = (*s++) ^ 0x80;   
    >        if (f >= 0x40) return 0; // Invalid encoding!   
    >   
    >        u = (u << 6) | f;   
    >        leading_bits += leading_bits;   
    >        mask = (mask << 5) | 31; // Each byte adds 5 more bits   
    >      }   
    >      // Get rid of the initial prefix   
    >      u &= mask;   
    >   
    >      // In order to be canonical, the encoding must use   
    >      // as few bytes as possible, so check if it would   
    >      // have fit in the previous encoding:   
    >      if (u <= (mask >> 5)) return u; // len == 0 still   
    >    }   
    >    *len = s - src;   
    >    return u;   
    > }   
    >   
    > There is one relatively obvious bug in the error checking here, can any
    > of you spot it?
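
For comparison, the backward scan in the quoted assembly can be sketched in C. This is only a sketch under assumptions: the name utf8_find_start, the explicit buffer length, and the -1 error return are made up for illustration and do not come from the quoted code. Like the assembly, it only looks for a lead byte (0C0h or above) behind a run of continuation bytes (80h..0BFh); it does not check that the lead byte announces the right number of continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first byte of the code
   point that contains buf[pos], or -1 if the bytes around pos cannot
   be part of a UTF-8 sequence. */
static long utf8_find_start(const unsigned char *buf, size_t len, size_t pos)
{
    size_t start = pos;

    assert(buf != NULL);
    if (pos >= len)
        return -1;

    /* Step back over continuation bytes (10xxxxxx), at most 3 of them,
       since a sequence is at most 4 bytes long. */
    while ((buf[start] & 0xC0) == 0x80) {
        if (start == 0 || pos - start == 3)
            return -1;          /* ran off the buffer or too many bytes */
        start--;
    }

    /* If we stepped back at all, we must have landed on a lead byte;
       landing on 7-bit ASCII means the continuation bytes were stray. */
    if (start != pos && buf[start] < 0xC0)
        return -1;

    return (long)start;
}
```

With the bytes 61h C3h A9h E2h 82h ACh 7Ah ("a", U+00E9, U+20AC, "z"), a call with pos = 5 should return 3, and a call with pos = 2 should return 1.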
      
   This is what I've used while working on it:   
      
   https://en.wikipedia.org/wiki/UTF-8   
   http://canonical.org/~kragen/strlen-utf8.html   
   https://web.archive.org/web/20160314125024/https://porg.es/blog/idiculous-utf-8-character-counting
   http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html   
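
The basic idea behind counting characters in UTF-8 text, as I understand it, is to count every byte that is not a continuation byte. A minimal C sketch of that, assuming the input is already valid UTF-8 (the name utf8_count and the explicit length parameter are my own, not taken from any of the pages above, and none of their speed tricks are attempted):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a buffer that is assumed
   to hold valid UTF-8. Every code point has exactly one byte that is
   not of the form 10xxxxxx, so counting those bytes counts code points. */
static size_t utf8_count(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            n++;
    }
    return n;
}
```

On invalid input the result is simply the number of non-continuation bytes, so validation has to happen elsewhere if it matters.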
      
   Robert   
   --   
   Robert AH Prins   
   robert(a)prino(d)org   
      