... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 3,152 of 4,675
Terje Mathisen to Alex
Re: More UTF-8 woes - UTF-8 to "\uN" RTF
05 Dec 17 08:52:17
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Alex wrote:   
   > On 02-Dec-17 22:01, Terje Mathisen wrote:   
   >>   
   >> Why don't you decode utf8 chars based on the number of leading 1 bits?   
   >   
   > I didn't invent this and there may be already someone who's posted   
   > something along these lines, but you can do this below. The example   
   > code counts characters in a null terminated UTF-8 string pointed at   
   > by ECX, result in EAX.   
   >   
   >        or      eax -1   
   > @@1:   add     eax 1                \ add 1   
   > @@2:   add     ecx 1                \ next char   
   >        movzx   edx byte { ecx }     \ fetch the byte   
   >        shl     edx 25               \ shift sign bit out to carry   
   >        js      short @@1               \ x1xxxxxx, count   
   >        jc      short @@2               \ 10xxxxxx, don't count   
   >        jnz     short @@1               \ <>0, count   
   >   
   > That is, shift the high order XX........ into the carry & sign bits   
   > and use them to branch. Without testing, I have not a clue how   
   > performant or otherwise this code might be.   
   >   
      
   First of all, there is no need to use the entire EDX register, it would   
   be perfectly fine to keep it in DL:   
      
   count:   
   	inc eax   
   skip:   
   	mov dl,[esi]   
   	inc esi   
   	add dl,dl   
   	 js count   
   	 jc skip   
   	 jnz count   
   ; Found \0 end of C string marker   
      
   The main problem is of course the fact that all those (possibly hard to   
   predict?) branches would take some time. We could instead use the bit   
   patterns to do the counting, i.e.   
      
   	cnt += (((c >> 6) ^ 1) & (c >> 5)) & 1;   
      
   This is because we want to increment the count for all patterns except   
   10 in the two leading bits, so by inverting the top bit and and'ing that   
   with the next bit we get a 0 only for a trailing utf-8 byte.   
      
   We can then do the counting 16 or 32 bytes at once with SSE/AVX parallel   
   byte operations, it looks like 6-7 instructions per block, so about 2.5   
   to 5 bytes/cycle.   
      
   After up to a max of 255 blocks (i.e. either 4080 or 8160 bytes) we have   
   to collect the 16/32 individual byte-sized counters into wider   
   accumulators to avoid potential overflows.   
      
   Just doing the exact same thing with a pair of 64-bit integer registers   
   would also be quite a bit faster than the branching code, and that would   
   also work as perfectly portable C code that handles multiple bytes/cycle:   
      
      mov r15,0x8080808080808080 ; Useful bit mask!   
   ...   
      
   next_16:   
      mov rdx,[rsi]   
      mov r9,[rsi+8]   
      add rsi,16   
      
      lea rbx,[rdx*2]   
      xor rdx,r15		;; Invert the top bits   
      lea r10,[r9*2]   
      xor r9,r15   
      
      and rdx,rbx   
      and r9,r10   
      
      and rdx,r15   
      and r9,r15   
      
      shr rdx,7   
      shr r9,7   
      
      add rax,rdx   
      add r8,r9   
      
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]