Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,450 of 4,675    |
|    Terje Mathisen to fizz buzz    |
|    Re: Fast Fizz Buzz program    |
|    17 Jul 18 17:48:18    |
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   fizz buzz wrote:   
   > On Monday, July 16, 2018 at 4:01:23 AM UTC-5, Terje Mathisen wrote:   
   >> It is tempting to preformat the output, unroll by 30 and work in   
   >> ascii since that would leave all the ascii carries in fixed   
   >> positions, i.e. the bottom digit would stay fixed and then you   
   >> would add '3' to the second digit and handle any carries. I do   
   >> suspect thought that you can in fact work in binary and do the   
   >> bin-to-ascii conversion on the fly for the 8 out of 15 values that   
   >> require this.   
   >   
   > I'm not sure I fully understand this paragraph. Can you please give a   
   > code example? (Don't have to be actual assembler code, just   
   > pseudo-code.)   
      
      
   count = 0;   
   for (j = 0; j < 30; j+= 10) {   
    number = sprintf("%d", count+j);   
    lastdigit = number + strlen(number)-1;   
      
    for (i = j; i < j+10; i++) {   
      
    if ((i % 15) == 0) {   
    printf("FizzBuzz");   
    }   
    else if ((i % 5) == 0) {   
    printf("Buzz");   
    }   
    else if ((i %3) == 0) {   
    printf("Fizz");   
    }   
    else {   
    printf(number);   
    }   
    printf("\n");   
    lastdigit++;   
    }   
   }   
      
   Instead of the printf() statements you would output the actual ASM code   
   to handle this particular case, i.e. use a C program to generate an   
   unrolled ASM program.   
      
   By only generating the ascii numeric value once for every 10 lines, the   
   bin-to-ascii overhead pretty much goes away and you end up with asm code   
   like this:   
      
   mod_0:   
    mov esi, offset FizzBuzz   
    mov ecx, 10 ; FizzBuzz+CRLF length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   mod_1:   
    mov esi, offset number   
    mov ecx, edx ; Current number length, including CRLF   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   mod_2:   
    mov esi, offset number   
    mov ecx, edx ; Current number length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   mod_3:   
    mov esi, offset Fizz   
    mov ecx, 6 ; Fizz+CRLF length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   mod_4:   
    mov esi, offset number   
    mov ecx, edx ; Current number length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   mod_5:   
    mov esi, offset Buzz   
    mov ecx, 6 ; Buzz+CRLF length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   etc up to   
      
   mod_14:   
    mov esi, offset number   
    mov ecx, edx ; Current number length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   mod_15:   
    mov esi, offset FizzBuzz   
    mov ecx, 10 ; FizzBuzz+CRLF length   
    rep movsb   
    inc bl ; Last digit value   
    mov [ebp],bl ; Update last digit   
      
   repeat for 16 to 29   
      
   The next obvious tweak is to only update the ascii number value when   
   needed, i.e. increment by two when the next line will output Fizz/Buzz   
   and then skip the update for those lines.   
      
   Total time should be ~5 cycles per line, so 5 million cycles or 1.5 ms   
   to count from 1 to 1E6.   
      
   The next idea would use SSE regs to write the output bytes, with   
   unaligned store operations but no memory value updates:   
      
   We can fit two 6-digit numbers+CRLF in a 16-byte SSE reg, 4 in AVX256   
   and 8 in AVX512, so we only need 8 AVX regs to handle all 30 output   
   lines for an unrolled block. Even if we have to specialcase all values   
   below 100 000, the 900K values from there and up would run at pretty   
   much streaming memory store speed, so several GB/s and the possibility   
   to get down to less than a cycle/line.   
      
   Maybe 0.2 to 0.3 ms in total, a bit more if we have to be able to save   
   the output to a file?   
      
   Terje   
      
   --   
   -
|
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca