|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,505 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Fast Fizz Buzz program    |
|    23 Jul 18 11:24:59    |
      From: schornak@nospicedham.web.de

      Terje Mathisen wrote:

      > Bernhard Schornak wrote:
      >> Why not SSE registers? Using a fixed width of 16 bytes can
      >> serve all possible characters including LF, CR or CRLF. With
      >> variable width, you would have to play around with the content
      >> of the XMM registers and use MOVDQU rather than MOVDQA
      >> (introducing penalties for addresses not divisible by 16), but
      >> this applies to all integer registers as well, because
      >> unaligned memory accesses are always punished with extra
      >> cycles. (Which might be the reason why your attempts are
      >> slower than the reference...)
      >>
      >> Thinking it over, it should be fastest to do all the work in
      >> one 15-step loop. If you preload XMM0 through XMM2 with the
      >> FIZZes and BUZZes, XMM3 with a 16 * 0x20 (space) pattern, and
      >> store 0x20 in as many integer registers as digits are
      >> required, you can skip the entire hex-to-decimal conversion as
      >> well. You just need a routine to increment the LSB register
      >> from 0x30 to 0x39, then increment the next higher digit
      >> register, reset the LSB register to 0x30, and so on. As the
      >> last step of the routine, the byte portions of all digit
      >> registers and the leading spaces (MOVQ XMM3) plus CRLF/LF/CR
      >> (move a BYTE/WORD to 0x14/0x15[address]) are written to their
      >> proper places before the routine returns to the main loop.
      >> This introduces one conditional jump per number, two on every
      >> tenth (whenever the digit counter becomes zero), three on
      >> every hundredth, and so on (much faster than any conversion
      >> could be).
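      [The space-padded ASCII counter quoted above can be modelled in a
      few lines of Python. This is a sketch with illustrative helper
      names, not the poster's code: the number lives as ASCII digit
      bytes, the ones digit is simply incremented, and a carry ripples
      left only when a digit runs past '9'. Unused high digits hold a
      space (0x20), so no leading zeros appear.]

```python
# Model of the INC-based ASCII counter (illustrative names, not the
# poster's code). Digit bytes are kept most significant first; unused
# slots hold a space so the printed number has no leading zeros.

SPACE, ZERO, ONE, NINE = 0x20, 0x30, 0x31, 0x39

def make_counter(width=7):
    """Digit bytes, most significant first, starting at the value 1."""
    return [SPACE] * (width - 1) + [ONE]

def increment(digits):
    """Add one, rippling the carry leftwards as the INC routine would."""
    for i in range(len(digits) - 1, -1, -1):
        if digits[i] == SPACE:     # slot unused so far: becomes '1'
            digits[i] = ONE
            return
        digits[i] += 1
        if digits[i] <= NINE:      # no overflow past '9': done
            return
        digits[i] = ZERO           # wrapped: write '0', carry leftwards

def render(digits):
    """The printable number without the leading spaces."""
    return bytes(digits).replace(b"\x20", b"").decode()

counter = make_counter()
numbers = []
for _ in range(100):
    numbers.append(render(counter))
    increment(counter)
```

      [Running the model yields the strings "1" through "100" without
      ever converting a binary value to decimal.]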
      >
      > This is similar to what I suggested a week or so ago; today I
      > decided to write a version and check the actual speed:
      >
      > My algorithm uses 5 SSE regs, one with "Fizz\n", one with
      > "Buzz\n", one with "FizzBuzz\n" and one that contains the
      > current number as ascii with a trailing newline but no leading
      > spaces/zero digits.

      Here we differ. I think of putting all required digits into
      integer registers and writing their byte portions to memory.
      Assuming R15 is the current address in the output buffer,
      EAX...EBP hold the digits 0...6 (expand with R9...R10 if
      required), R14 is the loop count, R13 holds the digit counter
      (0...9), R12 holds 0x30, R11 holds 0x31 and XMM3 holds 16
      spaces:

      fizzbuzz_loop:

              [align to next multiple of 32]

              call   write_routine       # => 1st
              call   write_routine       # => 2nd
              movdqa %xmm0, 0x00(%r15)   # => FIZZ (3rd)
              addq   $0x10, %r15         # R15 = current EA
              call   write_routine       # => 4th
              movdqa %xmm1, 0x00(%r15)   # => BUZZ (5th)
              movdqa %xmm0, 0x10(%r15)   # => FIZZ (6th)
              addq   $0x20, %r15         # R15 = current EA
              call   write_routine       # => 7th
              call   write_routine       # => 8th
              movdqa %xmm0, 0x00(%r15)   # => FIZZ (9th)
              movdqa %xmm1, 0x10(%r15)   # => BUZZ (10th)
              addq   $0x20, %r15         # R15 = current EA
              call   write_routine       # => 11th
              movdqa %xmm0, 0x00(%r15)   # => FIZZ (12th)
              addq   $0x10, %r15         # R15 = current EA
              call   write_routine       # => 13th
              call   write_routine       # => 14th
              movdqa %xmm2, 0x00(%r15)   # => FIZZBUZZ (15th)
              addq   $0x10, %r15         # R15 = current EA
              decq   %r14                # R14 = loop count
              jne    fizzbuzz_loop

              [END]

      write_routine:

              [align to next multiple of 32]

              movq   %xmm3, 0x00(%r15)   # eight leading spaces
              movb   %bpl, 0x08(%r15)    # 7th (most significant) digit
              movb   %sil, 0x09(%r15)
              movb   %dil, 0x0A(%r15)
              movb   %dl,  0x0B(%r15)
              movb   %cl,  0x0C(%r15)
              movb   %bl,  0x0D(%r15)
              movb   %al,  0x0E(%r15)    # 1st (least significant) digit
              movb   $0x0A, 0x0F(%r15)   # trailing LF
              addq   $0x10, %r15
              incb   %al                 # RAX = 1st digit
              decq   %r13
              jne    RET
              movb   %r12b, %al          # reset 1st digit to '0'
              incb   %bl                 # RBX = 2nd digit
              movq   $0x09, %r13         # R13 = digit counter
              cmpb   $0x21, %bl          # was a space before the INC?
              cmove  %r11d, %ebx         # yes -> '1' (no 8 bit CMOV)
              je     RET
              cmpb   $0x3A, %bl          # incremented past '9'?
              cmove  %r12d, %ebx         # yes -> '0', carry leftwards
              jne    RET
              .
              .                          # repeat for remaining digits
              .

              [align to next multiple of 32]

      RET:    nop
              ret

      Because we just count from n to m, we do not need a conversion,
      but can count up our number with simple INC instructions. The
      compares (preferring CMOV over a conditional jump) are much
      faster than any calculation.

      Writing to sequentially ascending addresses is far faster than
      writing out of order, not to mention overwriting already
      invalidated memory locations within one and the same function.

      Greetings from Augsburg

      Bernhard Schornak

      --- SoupGate-Win32 v1.05
       * Origin: you cannot sedate... all the things you hate (1:229/2)
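      [Since FizzBuzz repeats every 15 numbers, the unrolled main loop
      in the message can be modelled as a fixed 15-slot table. This
      Python sketch uses illustrative names of its own; the "num" slots
      stand for the CALLs to the write routine, the others for the
      preformatted SSE stores.]

```python
# Model of the 15-step unrolled loop, slot order as in the message.
# "num" slots print the counter; the rest store a canned string.

PATTERN = ("num", "num", "Fizz", "num", "Buzz", "Fizz", "num", "num",
           "Fizz", "Buzz", "num", "Fizz", "num", "num", "FizzBuzz")

def fizzbuzz(loops):
    """Run `loops` passes of the 15-slot pattern, starting at 1."""
    out, n = [], 1
    for _ in range(loops):
        for slot in PATTERN:
            out.append(str(n) if slot == "num" else slot)
            n += 1
    return out
```

      [Two passes of the pattern reproduce the classic output for 1
      through 30 with no modulo operations in the loop body.]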
(c) 1994, bbs@darkrealms.ca