|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
|    Message 3,150 of 4,675    |
|    Terje Mathisen to Robert Prins    |
|    Re: More UTF-8 woes - UTF-8 to "\uN" RTF    |
|    02 Dec 17 23:01:24    |
From: terje.mathisen@nospicedham.tmsw.no

Robert Prins wrote:
> // if cur_char <= 194 then
>
> cmp al, 194
> jb @06
>
> mov ecx, [esi + ebx]
> and ecx, $3f3f3fff
> mov dword ptr c, ecx // c: array [1..4] of char
>
> inc ebx
>
> // if cur_char <= 223 then
>
> cmp eax, 223
> jg @03
>
> // _u:= ((cur_char and $0000001f) shl 6) or
> // (longint(c[2]) and $0000003f);
>
> and eax, $1f
> shl eax, 6
> movzx ecx, byte ptr c[1]
> jmp @05
>
> @03:
> inc ebx
>
> // if cur_char <= 239 then
[snip]
> Probably can be improved, feel free to make suggestions.

Why don't you decode utf8 chars based on the number of leading 1 bits?

I.e. it should be possible to have a single decoder which picks up the
relevant number of trailing bytes based on the first char. The following
(totally untested!) code assumes no error in the inputs:

        movzx eax, byte ptr [esi]
        inc esi
        cmp al,0C0h
        jb plain_ascii

        mov edx,040h      ; This mask bit corresponds to the second byte

next_trailing_byte:
        shl eax,6
        movzx ebx, byte ptr [esi]
        inc esi

        shl edx,5         ; We get 5 more bits in total for each utf8 byte
        and ebx,63

        or eax,ebx

        test eax,edx      ; More tail bytes?
        jnz next_trailing_byte

        dec edx           ; Turn the position of the zero bit into a mask
        and eax,edx

; Do any special stuff for utf8 wide chars here
; ...

plain_ascii:

Terje
--
-
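One way to sanity-check the loop above without an assembler is to transcribe it into C, one statement per instruction. The sketch below is only that: the function name utf8_decode and its pointer-to-pointer interface are made up for illustration and are not from the post, and like the assembly it assumes well-formed input (a stray continuation byte 80h..BFh falls through the "plain" path unchanged).

```c
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence at *p and advance *p
   past it. Mirrors the posted loop; does no validation at all. */
static uint32_t utf8_decode(const unsigned char **p)
{
    uint32_t c = *(*p)++;            /* movzx eax,[esi] / inc esi     */
    if (c < 0xC0)                    /* cmp al,0C0h / jb plain_ascii  */
        return c;

    uint32_t mask = 0x40;            /* mov edx,040h                  */
    do {
        c <<= 6;                     /* shl eax,6                     */
        uint32_t t = *(*p)++ & 63;   /* movzx ebx,[esi] / inc esi /
                                        and ebx,63                    */
        mask <<= 5;                  /* shl edx,5                     */
        c |= t;                      /* or eax,ebx                    */
    } while (c & mask);              /* test eax,edx / jnz            */

    /* mask now sits on the first 0 bit of the lead byte's length
       prefix, so mask-1 keeps exactly the payload bits below it. */
    return c & (mask - 1);           /* dec edx / and eax,edx         */
}
```

Tracing it by hand: after each shl by 6 the lead byte moves up six places while the mask moves up only five, so the mask bit steps one position down the lead byte's prefix each round. It stops on the 0 that terminates the run of 1 bits, leaving 11, 16 or 21 payload bits for 2-, 3- and 4-byte sequences respectively.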