... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 3,501 of 4,675
Terje Mathisen to Terje Mathisen
Re: Fast Fizz Buzz program
22 Jul 18 20:18:51
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Terje Mathisen wrote:   
   > This is similar to what I suggested a week or so ago, today I decided   
   > to write a version and check the actual speed:   
   >   
   > My algorithm uses 5 SSE regs, one with "Fizz\n", one with "Buzz\n",   
   > one with "FizzBuzz\n" and one that contains the current number as   
   > ascii with a trailing newline but no leading spaces/zero digits.   
   >   
   > The final reg contains zero in all positions except the one that   
   > corresponds to the last digit of the ascii number reg where it has a   
   > 1 byte.   
   >   
   > I always output 3 blocks of 10 lines, using unaligned writes, and   
   > after each line I update the ascii number by adding that increment   
   > register.   
   >   
   > When I finish a block I have to handle the carry, so I go back to   
   > the memory version of the number where I increment the second-to-last   
   > digit, if that generates a carry I set it to '0' and step back one   
   > more digit and retry the increment.   
   >   
   > When I do this I also have to update the increment register since   
   > the last digit has now been pushed one byte further out.   
   >   
   > My first attempt was somewhat disappointing, it ran in almost exactly   
   > 10 cycles/line, i.e. 3+ ms on my 3+ GHz cpu. I suspect this is   
   > partly because I have to write everything to a single 7 MB buffer, so   
   > I'm getting a write speed of 2+ GB/s to RAM.   
      
   The actual write speed is quite a bit higher since I'm rewriting a lot   
   of output, i.e. for the range from 100K to 1M there are 190 output bytes   
   per 30 lines, or just 6.33 bytes/line so almost 10 of the 16 bytes are   
   wasted.   
      
   I made a small optimization where I stored "fuzz\n" directly after the   
   number value, then I could just save 5 more bytes for each of the 4   
   places in a 30-block where a number is followed by a fuzz line. I also   
   had to generate an inc2 register which would increment by 2 instead of 1   
   each time I skipped a line.   
      
   The end result was a drop from 10 to 8 cycles/line so now I'm getting   
   close to 1 cycle/byte.   
      
   I also considered storing those 190 output bytes in 12 SSE or 6 AVX   
   registers, but at that point I would also need to update the 18   
   different locations where a number is the output, and I could not reload   
   everything from memory for each iteration, i.e. I would need a full   
   190-byte increment array (rounded up to 192 and properly aligned) so   
   that I could increment all of them with 12/6 SSE/AVX byte adds.   
      
   This is still not enough since I would also have to handle carries into   
   at least one more digit, using compares against '9', byte shuffles and   
   another set of adds. I.e. this would be a lot more work for very   
   marginal gains. :-(   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]