|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
|    Message 3,018 of 4,675    |
|    Robert Prins to Terje Mathisen    |
|    Re: Translating blocks of memory contain    |
|    11 Oct 17 17:03:23    |
   
   From: robert@nospicedham.prino.org   
      
   On 2017-10-11 12:09, Terje Mathisen wrote:   
    > Robert Wessel wrote:   
    >> On Wed, 11 Oct 2017 09:29:35 +0200, Terje Mathisen   
    >>> The latter is actually quite easy to do since all such multi-byte   
    >>> utf8 chars will have a prefix from a small range of chars, followed   
    >>> by zero or more intermediate chars and a final character which is   
    >>> always from a separate set than those previous. This feature of the   
    >>> encoding is so that you can always know where you are if you start   
    >>> at a random byte of a long utf8 text.   
    >>   
    >> That's worded a bit ambiguously: All the bytes in a UTF-8 sequence   
    >> after the first byte will start with 10 - there's no distinction   
    >> between the intermediate and final bytes.   
    >   
    > OK, thanks. I misremembered the code I wrote many years ago. :-(   
    >   
    > The first char is from a unique set, and the leading bits in it determine
    > how many following bytes it will need. Each of those bytes is also from a
    > separate unique set, so it is easy to verify that the encoding is correct.
    >   
    > What this means is that starting from a random byte you can determine
    > where that character (utf code point) starts like this:
    >   
    > ; SI -> starting search pos, return with AL = first char   
    > ; and SI-> at next byte   
    >   
    > lodsb   
    > test al,al   
    > jns done ; 7-bit ascii   
    >   
    > check_first:   
    > cmp al,0C0h ; 011000000b   
    > jae done ; First char of UTF8   
    >   
    > ; [80h .. 0BFh] = secondary byte
    >   
    > prev:   
    > mov al,[si-2]   
    > dec si
    >   
    > test al,al   
    > js check_first   
    >   
    > error: ; Not a UTF-8 sequence!   
    >   
    > We could also write a small function to verify a given code point and
    > return the numeric value and the length in bytes:
    >   
    > unicode getutf8(byte *src, unsigned *len)
    > {
    >     byte *s = src;
    >     unicode u = *s++;
    >     *len = 0; // Error flag for bad encoding!
    >
    >     if (u > 0x80) {
    >         if (u < 0xC0) return 0; // Not a UTF-8 encoding!
    >
    >         signed char leading_bits = (signed char) (u + u); // Sign bit set!
    >         unsigned mask = 0x3F; // Bits to keep
    >
    >         while (leading_bits < 0) {
    >             unicode f = (*s++) ^ 0x80;
    >             if (f >= 0x40) return 0; // Invalid encoding!
    >
    >             u = (u << 6) | f;
    >             leading_bits += leading_bits;
    >             mask = (mask << 5) | 31; // Each byte adds 5 more bits
    >         }
    >         // Get rid of the initial prefix
    >         u &= mask;
    >
    >         // In order to be canonical, the encoding must use
    >         // as few bytes as possible, so check if it would
    >         // have fit in the previous encoding:
    >         if (u <= (mask >> 5)) return u; // len == 0 still
    >     }
    >     *len = s - src;
    >     return u;
    > }
    >   
    > There is one relatively obvious bug in the error checking here, can any
    > of you spot it?
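
For anyone who prefers to read the resynchronisation step in C rather than assembly, here is a rough sketch of the same idea: step backwards over continuation bytes (80h..0BFh) until a lead byte turns up. The names utf8_seq_start and is_continuation are made up for this post, and the limit of three backward steps is my own assumption (based on the four-byte maximum of current UTF-8); the quoted assembly has no such limit. It does not check that the lead byte actually announces enough following bytes.

```c
#include <stddef.h>

/* Hypothetical helper: true for bytes of the form 10xxxxxx (80h..0BFh). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: return the index of the first byte of the sequence
 * that contains buf[pos], or (size_t)-1 if none is found.
 * Assumption: a sequence has at most 3 continuation bytes. */
size_t utf8_seq_start(const unsigned char *buf, size_t pos)
{
    size_t i = pos;
    int steps = 0;

    /* Walk backwards while we are still inside a sequence. */
    while (is_continuation(buf[i])) {
        if (i == 0 || steps == 3)
            return (size_t)-1;      /* ran out of buffer or too many bytes */
        i--;
        steps++;
    }

    /* A 7-bit ASCII byte is only acceptable if we did not step back:
     * continuation bytes must follow a lead byte, not plain ASCII. */
    if (buf[i] < 0x80)
        return (i == pos) ? i : (size_t)-1;

    return i;                       /* buf[i] >= 0C0h: a lead byte */
}
```

With the bytes 'a', 0C3h, 0A9h, 'z' this should report index 1 for both the second and the third byte, and reject a continuation byte that directly follows plain ASCII.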
      
   This is what I've used while working on it:   
      
   https://en.wikipedia.org/wiki/UTF-8   
   http://canonical.org/~kragen/strlen-utf8.html   
   https://web.archive.org/web/20160314125024/https://porg.es/blog/   
   idiculous-utf-8-character-counting   
   http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html   
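
The simplest form of the character-counting idea, as I understand it, follows directly from the "every byte after the first starts with 10" rule: count the bytes that are not continuation bytes. A minimal sketch, assuming the input is valid, NUL-terminated UTF-8; the name utf8_count is just something I picked for illustration, and this is the naive byte-at-a-time loop, not one of the faster variants.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
 * Assumption: the input is already valid UTF-8; nothing is verified here. */
size_t utf8_count(const unsigned char *s)
{
    size_t n = 0;

    for (; *s != 0; s++) {
        /* Continuation bytes look like 10xxxxxx; everything else
         * (ASCII or a lead byte) starts a new code point. */
        if ((*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the bytes 'a', 0C3h, 0A9h, 'z', 0 this gives 3, against a plain strlen of 4.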
      
   Robert   
   --   
   Robert AH Prins   
   robert(a)prino(d)org   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
(c) 1994, bbs@darkrealms.ca