|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,505 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Fast Fizz Buzz program    |
|    23 Jul 18 11:24:59    |
      From: schornak@nospicedham.web.de

      Terje Mathisen wrote:

      > Bernhard Schornak wrote:
      >> Why not SSE registers? Using a fixed width of 16 bytes can
      >> serve all possible characters including LF, CR or CRLF. With
      >> variable width, you would have to play around with the content
      >> of the XMM registers and use MOVDQU rather than MOVDQA
      >> (introducing penalties for addresses not divisible by 16), but
      >> this applies to all integer registers as well, because
      >> unaligned memory accesses are always punished with extra
      >> cycles. (Which might be the reason why your attempts are
      >> slower than the reference...)
      >>
      >> Thinking it over, it should be fastest to do all the work in
      >> one 15-step loop. If you preload XMM0 through XMM2 with the
      >> FIZZes and BUZZes, XMM3 with a 16 * 0x20 (space) pattern, and
      >> store 0x20 in as many integer registers as digits are
      >> required, you can skip the entire hex-to-decimal conversion as
      >> well. You just need a routine to increment the LSB register
      >> from 0x30 to 0x39, then increment the next higher digit
      >> register, reset the LSB register to 0x30, and so on. As the
      >> last step of the routine, the byte portions of all digit
      >> registers and the leading spaces (MOVQ XMM3) plus CRLF/LF/CR
      >> (move a BYTE/WORD to 0x14/0x15[address]) are written to their
      >> proper places before the routine returns to the main loop.
      >> This introduces one conditional jump per number, two on every
      >> tenth (whenever the digit counter becomes zero), three on
      >> every hundredth, and so on (much faster than any conversion
      >> could be).
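      [The space-padded ASCII counter quoted above can be modelled in a
      few lines of Python. This is a sketch with illustrative helper
      names, not the poster's code: the number lives as ASCII digit
      bytes, the ones digit is simply incremented, and a carry ripples
      left only when a digit runs past '9'. Unused high digits hold a
      space (0x20), so no leading zeros appear.]

```python
# Model of the INC-based ASCII counter (illustrative names, not the
# poster's code). Digit bytes are kept most significant first; unused
# slots hold a space so the printed number has no leading zeros.

SPACE, ZERO, ONE, NINE = 0x20, 0x30, 0x31, 0x39

def make_counter(width=7):
    """Digit bytes, most significant first, starting at the value 1."""
    return [SPACE] * (width - 1) + [ONE]

def increment(digits):
    """Add one, rippling the carry leftwards as the INC routine would."""
    for i in range(len(digits) - 1, -1, -1):
        if digits[i] == SPACE:     # slot unused so far: becomes '1'
            digits[i] = ONE
            return
        digits[i] += 1
        if digits[i] <= NINE:      # no overflow past '9': done
            return
        digits[i] = ZERO           # wrapped: write '0', carry leftwards

def render(digits):
    """The printable number without the leading spaces."""
    return bytes(digits).replace(b"\x20", b"").decode()

counter = make_counter()
numbers = []
for _ in range(100):
    numbers.append(render(counter))
    increment(counter)
```

      [Running the model yields the strings "1" through "100" without
      ever converting a binary value to decimal.]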
      >
      > This is similar to what I suggested a week or so ago; today I
      > decided to write a version and check the actual speed:
      >
      > My algorithm uses 5 SSE regs, one with "Fizz\n", one with
      > "Buzz\n", one with "FizzBuzz\n" and one that contains the
      > current number as ascii with a trailing newline but no leading
      > spaces/zero digits.

      Here we differ. I think of putting all required digits into
      integer registers and writing their byte portions to memory.
      Assuming R15 is the current address in the output buffer,
      EAX...EBP hold the digits 0...6 (expand with R9...R10 if
      required), R14 is the loop count, R13 holds the digit counter
      (0...9), R12 holds 0x30, R11 holds 0x31 and XMM3 holds 16
      spaces:

      fizzbuzz_loop:

              [align to next multiple of 32]

              call   write_routine       # => 1st
              call   write_routine       # => 2nd
              movdqa %xmm0, 0x00(%r15)   # => FIZZ (3rd)
              addq   $0x10, %r15         # R15 = current EA
              call   write_routine       # => 4th
              movdqa %xmm1, 0x00(%r15)   # => BUZZ (5th)
              movdqa %xmm0, 0x10(%r15)   # => FIZZ (6th)
              addq   $0x20, %r15         # R15 = current EA
              call   write_routine       # => 7th
              call   write_routine       # => 8th
              movdqa %xmm0, 0x00(%r15)   # => FIZZ (9th)
              movdqa %xmm1, 0x10(%r15)   # => BUZZ (10th)
              addq   $0x20, %r15         # R15 = current EA
              call   write_routine       # => 11th
              movdqa %xmm0, 0x00(%r15)   # => FIZZ (12th)
              addq   $0x10, %r15         # R15 = current EA
              call   write_routine       # => 13th
              call   write_routine       # => 14th
              movdqa %xmm2, 0x00(%r15)   # => FIZZBUZZ (15th)
              addq   $0x10, %r15         # R15 = current EA
              decq   %r14                # R14 = loop count
              jne    fizzbuzz_loop

              [END]

      write_routine:

              [align to next multiple of 32]

              movq   %xmm3, 0x00(%r15)   # eight leading spaces
              movb   %bpl, 0x08(%r15)    # 7th (most significant) digit
              movb   %sil, 0x09(%r15)
              movb   %dil, 0x0A(%r15)
              movb   %dl,  0x0B(%r15)
              movb   %cl,  0x0C(%r15)
              movb   %bl,  0x0D(%r15)
              movb   %al,  0x0E(%r15)    # 1st (least significant) digit
              movb   $0x0A, 0x0F(%r15)   # trailing LF
              addq   $0x10, %r15
              incb   %al                 # RAX = 1st digit
              decq   %r13
              jne    RET
              movb   %r12b, %al          # reset 1st digit to '0'
              incb   %bl                 # RBX = 2nd digit
              movq   $0x09, %r13         # R13 = digit counter
              cmpb   $0x21, %bl          # was a space before the INC?
              cmove  %r11d, %ebx         # yes -> '1' (no 8 bit CMOV)
              je     RET
              cmpb   $0x3A, %bl          # incremented past '9'?
              cmove  %r12d, %ebx         # yes -> '0', carry leftwards
              jne    RET
              .
              .                          # repeat for remaining digits
              .

              [align to next multiple of 32]

      RET:    nop
              ret

      Because we just count from n to m, we do not need a conversion,
      but can count up our number with simple INC instructions. The
      compares (preferring CMOV over a conditional jump) are much
      faster than any calculation.

      Writing to sequentially ascending addresses is far faster than
      writing out of order, not to mention overwriting already
      invalidated memory locations within one and the same function.

      Greetings from Augsburg

      Bernhard Schornak

      --- SoupGate-Win32 v1.05
       * Origin: you cannot sedate... all the things you hate (1:229/2)
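      [Since FizzBuzz repeats every 15 numbers, the unrolled main loop
      in the message can be modelled as a fixed 15-slot table. This
      Python sketch uses illustrative names of its own; the "num" slots
      stand for the CALLs to the write routine, the others for the
      preformatted SSE stores.]

```python
# Model of the 15-step unrolled loop, slot order as in the message.
# "num" slots print the counter; the rest store a canned string.

PATTERN = ("num", "num", "Fizz", "num", "Buzz", "Fizz", "num", "num",
           "Fizz", "Buzz", "num", "Fizz", "num", "num", "FizzBuzz")

def fizzbuzz(loops):
    """Run `loops` passes of the 15-slot pattern, starting at 1."""
    out, n = [], 1
    for _ in range(loops):
        for slot in PATTERN:
            out.append(str(n) if slot == "num" else slot)
            n += 1
    return out
```

      [Two passes of the pattern reproduce the classic output for 1
      through 30 with no modulo operations in the loop body.]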
(c) 1994, bbs@darkrealms.ca