... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 3,500 of 4,675
Terje Mathisen to Bernhard Schornak
Re: Fast Fizz Buzz program
22 Jul 18 17:37:57
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Bernhard Schornak wrote:   
   > Why not SSE registers? Using a fixed width of 16 byte can serve   
   > all possible characters including LF, CR or CRLF. With variable   
   > width, you had to play around with the content of XMM registers   
   > and MOVDQU rather than MOVDQA (introducing penalties for uneven   
   > addresses not divisible by 16), but this applies to all integer   
   > registers as well, because unaligned memory accesses always are   
   > punished with extra cycles. (Which might be the reason why your   
   > attempts are slower than the reference...)   
   >   
   > Thinking it over, it should be fastest to do all work in one 15   
   > step loop. If you preload XMM0 through XMM2 with the FIZZes and   
   > BUZZes and XMM3 with a 0x20 * 16 pattern, store 0x20 in as much   
   > integer registers as digits required, you could skip the entire   
   > hex to decimal conversation, as well. You just needed a routine   
   > to increment the LSB register from 0x30 to 0x39, then increment   
   > the next higher digit register, reset the LSB register to 0x30,   
   > and so on. As last step of the routine the byte-portions of all   
   > digit registers and the leading spaces (MOVQ XMM3) plus CRLF/LF   
   > /CR (move BYTE/WORD to 0x14/0x15[address]) are written to their   
   > proper places before the routine returns to the main loop. This   
   > introduces one conditional jump per digit, two each tenth digit   
   > (whenever the digit counter becomes zero), three each hundredth   
   > digit, and so on (much faster than any conversion could be).   
      
   This is similar to what I suggested a week or so ago, today I decided to   
   write a version and check the actual speed:   
      
   My algorithm uses 5 SSE regs, one with "Fizz\n", one with "Buzz\n", one   
   with "FizzBuzz\n" and one that contains the current number as ascii with   
   a trailing newline but no leading spaces/zero digits.   
      
   The final reg contains zero in all positions except the one that   
   corresponds to the last digit of the ascii number reg where it has a 1 byte.   
      
   I always output 3 blocks of 10 lines, using unaligned writes, and after   
   each line I update the ascii number by adding that increment register.   
      
   When I finish a block I have to handle the carry, so I go back to the   
   memory version of the number where I increment the second-to-last digit,   
   if that generates a carry I set it to '0' and step back one more digit   
   and retry the increment.   
      
   When I do this I also have to update the increment register since the   
   last digit has now been pushed one byte further out.   
      
   My first attempt was somewhat disappointing, it ran in almost exactly 10   
   cycles/line, i.e. 3+ ms on my 3+ GHz cpu. I suspect this is partly   
   because I have to write everything to a single 7 MB buffer, so I'm   
   getting a write speed of 2+ GB/s to RAM.   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]