Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,172 of 4,675    |
|    Terje Mathisen to All    |
|    Re: Palindromic number    |
|    08 Dec 17 23:31:24    |
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   I decided to go in and start writing a real 64-bit to ascii conversion,   
   I started by checking that I could use 32-bit code for all 7-digit numbers:   
      
   I use a scaled reciprocal for 1/1000000, so that I get a 52-bit product   
   which I then scale down with a SHRD to make a 4:28 fixed point results   
   that will fit in a 32-bit register.   
      
   This way I can extract each digit with 6 very simple instructions, for a   
   benchmark I timed generating all possible values from 0 to 9,999,999 and   
   got about 14.6 cycles/iteration, including the loop overhead which means   
   that I actually get a steady 3 instructions/cycle!   
      
   This only seems possible if the cpu runs MOV in zero cycles, using the   
   rename network instead since I get just two cycles/digit. I have checked   
   the actual asm code generated by the C test harness and it does seem   
   correct. I could obviously improve the output code by avoiding the 7   
   individual adds of '0' to convert from BCD to ascii, but this is not   
   part of the critical path and the cpu has enough execution units to do   
   this totally overlapped.   
      
   With three such conversions in parallel the total would probably not   
   improve too much, but even 47 cycles for the digit generating part would   
   make it likely that I could get close to my originally claimed 60   
   cycles. :-)   
      
   Terje   
      
   char *sevendigits_asm(uint32_t n, char *buf)   
   {   
    __asm {   
    mov eax,[n]   
    mov edx,281474977   
    mul edx   
    shrd eax,edx,20   
    inc eax   
    mov edi,[buf]   
      
    mov edx,eax   
    shr eax,28   
    and edx,0x0fffffff   
    add al,'0'   
    lea edx,[edx+edx*4]   
    mov [edi],al   
      
    mov eax,edx   
    shr edx,27   
    and eax,0x07ffffff   
    add dl,'0'   
    lea eax, [eax + eax * 4]   
    mov [edi+1], dl   
      
    mov edx, eax   
    shr eax, 26   
    and edx, 0x03ffffff   
    add al, '0'   
    lea edx, [edx + edx * 4]   
    mov [edi+2], al   
      
    mov eax, edx   
    shr edx, 25   
    and eax, 0x01ffffff   
    add dl, '0'   
    lea eax, [eax + eax * 4]   
    mov [edi + 3], dl   
      
    mov edx, eax   
    shr eax, 24   
    and edx, 0x00ffffff   
    add al, '0'   
    lea edx, [edx + edx * 4]   
    mov [edi+4], al   
      
    mov eax, edx   
    shr edx, 23   
    and eax, 0x007fffff   
    add dl, '0'   
    lea eax, [eax + eax * 4]   
    mov [edi + 5], dl   
      
    shr eax, 22   
    add ax, '0'   
    mov [edi+6], ax   
      
    }   
    return buf;   
   }   
      
      
   --   
   -
|
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca