... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 3,454 of 4,675
fizz buzz to Terje Mathisen
Re: Fast Fizz Buzz program
18 Jul 18 05:31:42
   From: demon.adramelek@nospicedham.gmail.com   
      
   On Tuesday, July 17, 2018 at 10:56:04 AM UTC-5, Terje Mathisen wrote:   
      
   > count = 0;   
   > for (j = 0; j < 30; j+= 10) {   
   >   number = sprintf("%d", count+j);   
   >   lastdigit = number + strlen(number)-1;   
   >   
   >   for (i = j; i < j+10; i++) {   
   >   
   >    if ((i % 15) == 0) {   
   >      printf("FizzBuzz");   
   >    }   
   >    else if ((i % 5) == 0) {   
   >      printf("Buzz");   
   >    }   
   >    else if ((i %3) == 0) {   
   >      printf("Fizz");   
   >    }   
   >    else {   
   >      printf(number);   
   >    }   
   >    printf("\n");   
   >    lastdigit++;   
   >   }   
   > }   
   >   
   > Instead of the printf() statements you would output the actual ASM code   
   > to handle this particular case, i.e. use a C program to generate an   
   > unrolled ASM program.   
   >   
   > By only generating the ascii numeric value once for every 10 lines, the   
   > bin-to-ascii overhead pretty much goes away and you end up with asm code   
   > like this:   
      
   Yes, I understand now, thank you! I used your idea and I was able to   
   considerably   
   reduce execution time. On my laptop it went down just a bit, but on my Linux   
   server   
   user execution time shows as 0.0001 s! Six times faster than before.   
      
   However, I should note couple things:   
      
   > mod_14:   
   >    mov esi, offset number   
   >    mov ecx, edx			; Current number length   
   >    rep movsb   
   >    inc bl			; Last digit value   
   >    mov [ebp],bl			; Update last digit   
   >   
   > mod_15:   
   >    mov esi, offset FizzBuzz   
   >    mov ecx, 10			; FizzBuzz+CRLF length   
   >    rep movsb   
   >    inc bl			; Last digit value   
   >    mov [ebp],bl			; Update last digit   
      
   I don't use REP MOVSB neither when printing a number nor when   
   printing a word. I found that it's faster to keep everything in registers, copy   
   the whole register (8 bytes) to a buffer, and then adjust buffer length by   
   actual length of data.   
      
   If you check my first source code, you'll see that I don't do any memory reads   
   except in PrintNumber routine. All constants (strings "Fizz", "Buzz",   
   "FizzBuzz",   
   etc.) are kept in registers. Even in PrintNumber routine I copy and accumulate   
   ASCII digits in a register instead of memory buffer. That's why I can print   
   numbers   
   only up to seven digits (seven digits plus newline character is eight bytes,   
   exactly   
   the size of 64 bit register).   
      
   > The next obvious tweak is to only update the ascii number value when   
   > needed, i.e. increment by two when the next line will output Fizz/Buzz   
   > and then skip the update for those lines.   
      
   Yes, that's what I'm doing.   
      
   > Total time should be ~5 cycles per line, so 5 million cycles or 1.5 ms   
   > to count from 1 to 1E6.   
   >   
   > The next idea would use SSE regs to write the output bytes, with   
   > unaligned store operations but no memory value updates:   
   >   
   > We can fit two 6-digit numbers+CRLF   
      
   Linux needs only one character to represent new line.   
      
   > in a 16-byte SSE reg, 4 in AVX256   
   > and 8 in AVX512, so we only need 8 AVX regs to handle all 30 output   
   > lines for an unrolled block. Even if we have to specialcase all values   
   > below 100 000, the 900K values from there and up would run at pretty   
   > much streaming memory store speed, so several GB/s and the possibility   
   > to get down to less than a cycle/line.   
      
   I have to check if my laptop and server both support AVX512.   
      
   However, I suspect it won't be as easy as it seems. Even after first 100000   
   values there   
   still will be words that have to be printed (i.e. stored in AVX register mixed   
   with numeric   
   values), and those words have different lengths. Plus, if I remember   
   correctly, it's   
   possible to OR memory content and AVX reg, but not regular reg and AVX reg. In   
   other   
   words, every time there is a need to shift AVX reg and load new string in it,   
   one will have   
   to do memory read, and suffer speed penalty for that.   
      
   > Maybe 0.2 to 0.3 ms in total, a bit more if we have to be able to save   
   > the output to a file?   
      
   Current speed on server:   
      
   bash$ time ./asm-unfold-mul > /dev/null   
      
   real    0m0.007s   
   user    0m0.001s   
   sys     0m0.006s   
      
   On laptop:   
      
   bash$ time ./asm-unfold-mul > /dev/null   
      
   real	0m0.008s   
   user	0m0.004s   
   sys	0m0.003s   
      
   P.S. My current code is quite long. Is it OK to post (almost 10 Kb), would it   
   be OK to post it here?   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]