... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 3,461 of 4,675
Rod Pemberton to fizz buzz
Re: Fast Fizz Buzz program (1/2)
18 Jul 18 16:20:02
   From: invalid@nospicedham.lkntrgzxc.com   
      
   On Wed, 18 Jul 2018 06:00:13 -0700 (PDT)   
   fizz buzz  wrote:   
      
   > On Wednesday, July 18, 2018 at 12:41:53 AM UTC-5, Rod Pemberton wrote:   
      
   > > Second, since Terje and Rick posted some C, I'll do so as well.  As   
   > > you can see, no division or modulo is required for the base   
   > > problem.  You can use down counters.  The code also falls through   
   > > instead of if-else.   
   > >   
   > > Since it's not unrolled like Terje's example, nor optimized assembly   
   > > with shifts like yours, I'm curious as to how slow it is.  Could you   
   > > time it?   
   >   
   > On my laptop:   
   >   
   > bash$ time ./fizzbuzz > /dev/null   
   >   
   > real	0m0.168s   
   > user	0m0.154s   
   > sys	0m0.002s   
   >   
      
   Well, I was interested in timings for the other C versions ...  So, I'll   
   post timings for assembly (your 2 posts) and C code (mine, Terje, Rick)   
   for 64-bit Linux on my older processor.  I also post the modified and   
   completed versions of Terje's, Rick's, and my C code that I used for   
   testing.   
      
      
   (3.2Ghz AMD Phenom II X2 555 dual-core, 64-bit Linux)   
      
   Please note that results from "time" can vary +/- 0m0.002/0m0.003 or   
   more on my machine.  It's not real precise.  I've not averaged the   
   results of multiple runs, but simply picked the most representative or   
   frequent result.   
      
      
   For 100 to 1M, for my C version, I get:   
   (gcc -Wall -ansi -pedantic -O2)   
      
   :# time ./fizzbuzz > /dev/null   
      
   real	0m0.083s   
   user	0m0.081s   
   sys	0m0.000s   
      
      
   For 100 to 1M, for Terje's C version (reworked, completed), I get:   
   (gcc -Wall -ansi -pedantic -O2)   
      
   :# time ./fizzbuzz > /dev/null   
      
   real	0m0.085s   
   user	0m0.083s   
   sys	0m0.001s   
      
      
   For 0 (not 100) to 1M for Rick's C version (extended cycle), I get:   
   (gcc -Wall -ansi -pedantic -O2)   
      
   :# time ./rick > /dev/null   
      
   real	0m0.077s   
   user	0m0.074s   
   sys	0m0.002s   
      
   Unfortunately, there is no easy way to adjust the bitmaps in Rick's code   
   for a different initial value, i.e., 100 instead of 1.  Shifting the   
   bitmaps or creating them at runtime adds time too.  So, we'll just   
   subtract Rick's time for 0 to 100 from his 0 to 1M time:   
      
   real	0m0.001s   
   user	0m0.000s   
   sys	0m0.000s   
      
   For 100 to 1M, net time for Rick's C version (extended cycle):   
      
   real	0m0.076s   
   user	0m0.074s   
   sys	0m0.002s   
      
   Rick wins the C version for 100 to 1M, even after having the cycle   
   extended, i.e., the code is doing much more work now.  From the   
   assembly, it seems that this code generates well above average   
   instruction pairing and near full 64-bit register usage.  So, the   
   substantially longer, i.e., huge, instruction sequence is apparently   
   nullified.  I've never before seen C compilers like GCC generate code   
   like this, but I'm used to code from the 32-bit GCC 3.x series, not   
   64-bit 4.x series.   
      
      
   Now for your two assembly versions.   
      
   For 100 to 1M, for your assembly version asm-bitmask-mul, I get:   
      
   root:# time ./asm-bitmask-mul > /dev/null   
   (gcc -nostdlib)   
      
   real	0m0.014s   
   user	0m0.010s   
   sys	0m0.004s   
      
   That's slightly slower than your server:   
      
   > bash$ time ./asm-bitmask-mul > /dev/null   
   >   
   > real        0m0.010s   
   > user        0m0.006s   
   > sys         0m0.003s   
      
   For 100 to 1M, for your assembly version asm-unfold-mul, I get:   
   (gcc -nostdlib)   
      
   :# time ./asm-unfold-mul > /dev/null   
      
   real	0m0.007s   
   user	0m0.003s   
   sys	0m0.003s   
      
   That's slightly faster than you laptop and very close to your server:   
      
   > Current speed on server:   
   >   
   > bash$ time ./asm-unfold-mul > /dev/null   
   >   
   > real    0m0.007s   
   > user    0m0.001s   
   > sys     0m0.006s   
   >   
   > On laptop:   
   >   
   > bash$ time ./asm-unfold-mul > /dev/null   
   >   
   > real        0m0.008s   
   > user        0m0.004s   
   > sys         0m0.003s   
      
   My 3.2Ghz AMD Phenom II X2 555 dual-core processor is from 2009 which is   
   well over a decade old.  Your C results are twice as slow on your   
   laptop versus my machine, but the assembly is slightly faster for both   
   your server and laptop.  How can that be?  ...  I don't know why there   
   would be such a huge discrepancy for the C code, while your assembly is   
   slightly faster.   
      
   Did you use -O2 for the C code? (gcc -Wall -ansi -pedantic -O2).  Even   
   if I use -O0 on my C code, it only slows down very slightly:   
      
   :# time ./fizzbuzz > /dev/null   
   (gcc -Wall -ansi -pedantic -O0)   
      
   real	0m0.087s   
   user	0m0.083s   
   sys	0m0.002s   
      
   > On my laptop:   
   >   
   > bash$ time ./fizzbuzz > /dev/null   
   >   
   > real	0m0.168s   
   > user	0m0.154s   
   > sys	0m0.002s   
   >   
      
   Clearly, it doesn't slow down to the speed you're getting for C on your   
   laptop, as I would expect, given that the assembly results are so close   
   together.   
      
   What C compiler are you using?  (GCC v. 4.7.2 here, 64-bit Linux)   
      
   What processors are being used on your laptop and server?  I.e., are   
   they a decade old like mine is or are they new and underpowered?   
      
      
   Here's the reduced and completed C version of Terje's:   
      
   #include    
   #include    
      
   int main(void)   
   {   
     int i;   
      
     for(i=100;i<1000001;i++) {   
      
       if((i%15)==0) {   
         printf("FizzBuzz");   
       }   
       else if((i%5)==0) {   
         printf("Buzz");   
       }   
       else if((i%3)==0) {   
         printf("Fizz");   
       }   
       else {   
         printf("%d",i);   
      }   
      printf("\n");   
     }   
      
   return(0);   
   }   
      
      
   Here's the adjusted C version of mine:   
      
   #include    
      
   int main(void)   
   {   
     int i,i3,i5;   
      
     i=100;   
     i3=((3-(i%3))%3)+1;   
     i5=((5-(i%5))%5)+1;   
     while(1)   
     {   
       i3--;   
       i5--;   
       if(!i3)   
       {   
         printf("Fizz");   
         i3=3;   
       }   
       if(!i5)   
       {   
         printf("Buzz");   
         i5=5;   
       }   
       if((i3<3)&(i5<5))   
         printf("%d",i);   
       printf("\n");   
       i++;   
       if(i==1000001)   
         break;   
     }   
      
   return(0);   
   }   
      
      
   Here's the corrected and bitmask extended C version of Rick's:   
      
   #include    
   #include    
   #include    
      
   void fizz(uint32_t num) {printf("Fizz\n");}   
   void buzz(uint32_t num) {printf("Buzz\n");}   
   void fizzbuzz(uint32_t num) {printf("FizzBuzz\n");}   
   void other(uint32_t num) {printf("%d\n",num);}   
      
   struct STargets {void (*func)(uint32_t num);};   
   struct STargets funcs[4]={{other},{fizz},{buzz},{fizzbuzz}};   
      
   uint64_t three_data[3]={0x4924924924924924,0x2492492492492492,   
                           0x9249249249249249};   
   uint64_t five_data[5]={0x0842108421084210,0x1084210842108421,   
                          0x2108421084210842,0x4210842108421084,   
                          0x8421084210842108};   
      
   int main(int argc, char* argv[])   
   {   
     uint32_t lnI;   
     uint64_t saved_data;   
      
   /* WARNING bitmasks start at value one */   
   /* bitmasks must be shifted for lnI != 1 */   
      
     for(lnI=1;lnI<=1000000;++lnI)   
     {   
       funcs[(three_data[0]&0x1)+(2*(five_data[0]&0x1))].func(lnI);   
      
       saved_data=three_data[0];   
       three_data[0]=(three_data[0]>>1)|((three_data[1]&1)<<63);   
       three_data[1]=(three_data[1]>>1)|((three_data[2]&1)<<63);   
       three_data[2]=(three_data[2]>>1)|((saved_data&1)<<63);   
      
       saved_data=five_data[0];   
       five_data[0]=(five_data[0]>>1)|((five_data[1]&1)<<63);   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]