Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,501 of 4,675    |
|    Terje Mathisen to Terje Mathisen    |
|    Re: Fast Fizz Buzz program    |
|    22 Jul 18 20:18:51    |
      From: terje.mathisen@nospicedham.tmsw.no              Terje Mathisen wrote:       > This is similar to what I suggested a week or so ago, today I decided       > to write a version and check the actual speed:       >       > My algorithm uses 5 SSE regs, one with "Fizz\n", one with "Buzz\n",       > one with "FizzBuzz\n" and one that contains the current number as       > ascii with a trailing newline but no leading spaces/zero digits.       >       > The final reg contains zero in all positions except the one that       > corresponds to the last digit of the ascii number reg where it has a       > 1 byte.       >       > I always output 3 blocks of 10 lines, using unaligned writes, and       > after each line I update the ascii number by adding that increment       > register.       >       > When I finish a block I have to handle the carry, so I go back to       > the memory version of the number where I increment the second-to-last       > digit, if that generates a carry I set it to '0' and step back one       > more digit and retry the increment.       >       > When I do this I also have to update the increment register since       > the last digit has now been pushed one byte further out.       >       > My first attempt was somewhat disappointing, it ran in almost exactly       > 10 cycles/line, i.e. 3+ ms on my 3+ GHz cpu. I suspect this is       > partly because I have to write everything to a single 7 MB buffer, so       > I'm getting a write speed of 2+ GB/s to RAM.              The actual write speed is quite a bit higher since I'm rewriting a lot       of output, i.e. for the range from 100K to 1M there are 190 output bytes       per 30 lines, or just 6.33 bytes/line so almost 10 of the 16 bytes are       wasted.              I made a small optimization where I stored "fuzz\n" directly after the       number value, then I could just save 5 more bytes for each of the 4       places in a 30-block where a number is followed by a fuzz line. I also       had to generate an inc2 register which would increment by 2 instead of 1       each time I skipped a line.              The end result was a drop from 10 to 8 cycles/line so now I'm getting       close to 1 cycle/byte.              I also considered storing those 190 output bytes in 12 SSE or 6 AVX       registers, but at that point I would also need to update the 18       different locations where a number is the output, and I could not reload       everything from memory for each iteration, i.e. I would need a full       190-byte increment array (rounded up to 192 and properly aligned) so       that I could increment all of them with 12/6 SSE/AVX byte adds.              This is still not enough since I would also have to handle carries into       at least one more digit, using compares against '9', byte shuffles and       another set of adds. I.e. this would be a lot more work for very       marginal gains. :-(              Terje              --       - |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca