

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   


   Message 3,150 of 4,675   
   Terje Mathisen to Robert Prins   
   Re: More UTF-8 woes - UTF-8 to "\uN" RTF   
   02 Dec 17 23:01:24   
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Robert Prins wrote:   
   >   // if cur_char <= 194 then   
   >   
   >   cmp   al, 194   
   >   jb    @06   
   >   
   >   mov   ecx, [esi + ebx]   
   >   and   ecx, $3f3f3fff   
   >   mov   dword ptr c, ecx  // c: array [1..4] of char   
   >   
   >   inc   ebx   
   >   
   >   // if cur_char <= 223 then   
   >   
   >   cmp   eax, 223   
   >   jg    @03   
   >   
   >   // _u:= ((cur_char      and $0000001f) shl 6) or   
   >   //       (longint(c[2]) and $0000003f);   
   >   
   >   and   eax, $1f   
   >   shl   eax, 6   
   >   movzx ecx, byte ptr c[1]   
   >   jmp   @05   
   >   
   > @03:   
   >   inc   ebx   
   >   
   >   // if cur_char <= 239 then   
   [snip]   
   > Probably can be improved, feel free to make suggestions.   
      
   Why don't you decode utf8 chars based on the number of leading 1 bits?   
      
   I.e. it should be possible to have a single decoder which picks up the   
   relevant number of trailing bytes based on the first char. The following   
   (totally untested!) code assumes no error in the inputs:   
      
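      movzx eax, byte ptr [esi]   ; Load the lead byte
      inc esi
      cmp al,0C0h                 ; Below 0C0h: single byte (valid input assumed)
       jb plain_ascii

      mov edx,040h    ; Marker bit: after each "shl edx,5" below it lines up
                      ; with the next length bit of the lead byte (bit 5 first)

   next_trailing_byte:
      shl eax,6                   ; Make room for 6 payload bits
      movzx ebx, byte ptr [esi]
      inc esi

      shl edx,5       ; Net gain is 5 bits per tail byte: 6 payload bits in,
                      ; one lead-byte bit used up as a length marker
      and ebx,63                  ; Strip the 10xxxxxx tag bits

      or eax,ebx

      test eax,edx    ; Length bit still set? Then there are more tail bytes
       jnz next_trailing_byte

      dec edx         ; Turn the position of the zero bit into a mask
      and eax,edx     ; Drop the leftover length bits: eax = code point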
      
   ; Do any special stuff for utf8 wide chars here   
   ; ...   
      
   plain_ascii:   
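
   Since the assembly above is untested, here is a line-for-line C
   transliteration of the same loop that can be compiled and checked. It is
   only a sketch: the function name utf8_decode_sketch and its pointer-to-pointer
   interface are made up for illustration, and like the assembly it assumes
   well-formed input and does no validation.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): decode one UTF-8
 * sequence at *p, advance *p past it, and return the code point.
 * Assumes valid input, exactly like the assembly it mirrors. */
static uint32_t utf8_decode_sketch(const unsigned char **p)
{
    uint32_t cp = *(*p)++;            /* movzx eax,[esi] / inc esi */

    if (cp < 0xC0)                    /* cmp al,0C0h / jb plain_ascii */
        return cp;

    uint32_t mask = 0x40;             /* mov edx,040h */

    do {
        cp <<= 6;                     /* shl eax,6 */
        cp |= (uint32_t)(*(*p)++ & 0x3F); /* movzx ebx / and ebx,63 / or */
        mask <<= 5;                   /* shl edx,5 */
    } while (cp & mask);              /* test eax,edx / jnz */

    /* dec edx / and eax,edx: keep only the payload bits below the
     * first zero length bit of the lead byte. */
    return cp & (mask - 1);
}
```

   For the three-byte sequence E2 82 AC the loop runs twice: after the first
   tail byte the marker sits on bit 11, which is still set, and after the
   second it sits on bit 16, which is clear, so the final mask is 0FFFFh and
   the result is 20ACh.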
      
   Terje   
   --   
   "almost all programming can be viewed as an exercise in caching"   
      


