
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 129,303 of 131,241   
   Michael S to Terje Mathisen   
   3-way long addition (was: VAX)   
   06 Aug 25 20:43:33   
   
   From: already5chosen@yahoo.com   
      
   On Wed, 6 Aug 2025 16:19:11 +0200   
   Terje Mathisen  wrote:   
      
   > Michael S wrote:   
   > > On Tue, 5 Aug 2025 22:17:00 +0200   
   > > Terje Mathisen  wrote:   
   > >   
   > >> Michael S wrote:   
   > >>> On Tue, 5 Aug 2025 17:31:34 +0200   
   > >>> Terje Mathisen  wrote:   
   > >>> In this case 'adc edx,edx' is just a slightly shorter encoding
   > >>> of 'adc edx,0'. The EDX register is zeroized a few lines above.
   > >>   
   > >> OK, nice.   
   > >   
   > > BTW, it seems that in your code fragment above you forgot to
   > > zeroize EDX at the beginning of the iteration. Or am I missing
   > > something?
   >   
   > No, you are not. I skipped pretty much all the setup code. :-)   
      
   It's not the setup code that looks missing to me, but the zeroing of
   RDX in the body of the loop.
      
   > >   
   > >>>   
   > >>>> Anyway, the three main ADD RAX,... operations still define the   
   > >>>> minimum possible latency, right?   
   > >>>>   
   > >>>   
   > >>> I don't think so.   
   > >>> It seems to me that there is only one chain of data dependencies
   > >>> between iterations of the loop - a trivial dependency through RCX.
   > >>> Some modern processors are already capable of eliminating this
   > >>> sort of dependency in the renamer. Probably not yet when it is
   > >>> coded as 'inc', but when coded as 'add' or 'lea'.
   > >>>   
   > >>> The dependency through RDX/RBX does not form a chain. The next
   > >>> value of [rdi+rcx*8] does depend on the value of rbx from the
   > >>> previous iteration, but the next value of rbx depends only on
   > >>> [rsi+rcx*8], [r8+rcx*8] and [r9+rcx*8]. It does not depend on the
   > >>> previous value of rbx, except for a control dependency that
   > >>> hopefully would be speculated around.
   > >>   
   > >> I believe we are doing a bigint three-way add, so each result word
   > >> depends on the three corresponding input words, plus any carries
   > >> from the previous round.
   > >>   
   > >> This is the carry chain that I don't see any obvious way to   
   > >> break...   
   > >   
   > > You break the chain by *predicting* that
   > > carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
   > > CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
   > > pay the heavy price of a branch misprediction. But outside of
   > > specially crafted inputs that is extremely rare.
   >   
   > Aha!   
   >   
   > That's _very_ nice.   
   >   
   > Terje   
   >   
   >   
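   The carry-prediction idea quoted above can be sketched in portable C
   (an illustrative sketch under my own assumptions, not anyone's actual
   measured code; it assumes 64-bit limbs and GCC/Clang's
   __builtin_add_overflow). The carry-out 'cout' is computed from the
   current limbs only; the incoming carry enters only through the
   rarely-true 'r < t' correction, which is the branch one hopes the CPU
   predicts as not-taken:

   ```c
   #include <stdint.h>
   #include <stddef.h>

   /* Illustrative sketch only: predict that adding the incoming carry to
    * a[i]+b[i]+c[i] does not generate an extra carry-out, and correct in
    * the rare case where it does. Returns the final carry-out. */
   uint64_t add3_predict(uint64_t *dst, const uint64_t *a,
                         const uint64_t *b, const uint64_t *c, size_t n) {
       uint64_t carry = 0;                  /* carry into limb i, 0..2 */
       for (size_t i = 0; i < n; i++) {
           uint64_t s, t;
           /* carry-out of a[i]+b[i]+c[i] alone, independent of 'carry' */
           uint64_t cout = __builtin_add_overflow(a[i], b[i], &s);
           cout += __builtin_add_overflow(s, c[i], &t);
           uint64_t r = t + carry;
           if (r < t)       /* misprediction path: incoming carry rippled */
               cout += 1;
           dst[i] = r;
           carry = cout;    /* (almost always) independent of carry[i-1] */
       }
       return carry;
   }
   ```

   Note that when the per-limb sum a[i]+b[i]+c[i] already carries out 2,
   t is at most 2^64-3, so adding an incoming carry of at most 2 cannot
   overflow again; the running carry therefore stays in 0..2.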
      
   I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
   Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
   7543P).
   In order to see the effects more clearly I had to modify Anton's
   function to one that operates on pointers, because otherwise too much
   time was spent at the caller's site copying things around, which made
   the measurements too noisy.
      
   void add3(uintNN_t *dst, const uintNN_t *a, const uintNN_t *b,
             const uintNN_t *c) {
     *dst = *a + *b + *c;
   }
      
      
   After the change I saw a significant speed-up on 3 out of 4 platforms.
   The only platform where the speed-up was insignificant was Skylake,
   probably because its rename stage is too narrow to profit from the
   change. The widest machine (Raptor Cove) benefited most.
   The results appear inconclusive with regard to the question of whether
   the dependency between loop iterations is eliminated completely or
   just shortened to 1-2 clock cycles per iteration. Even the widest of
   my cores is relatively narrow. Considering that my variant of the loop
   contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
   likes of the Apple M4 would be too narrow :(
      
   Here are results in nanoseconds for N=65472   
   Platform     RC      GM      SK      Z3
   clang       896.1  1476.7  1453.2  1348.0
   gcc         879.2  1661.4  1662.9  1655.0
   x86         585.8  1489.3   901.5   672.0
   Terje's     772.6  1293.2  1012.6  1127.0
   My          397.5   803.8   965.3   660.0
   ADX         579.1  1650.1   728.9   853.0
   x86/u2      581.5  1246.2   679.9   584.0
   Terje's/u3  503.7   954.3   630.9   755.0
   My/u3       266.6   487.2   486.5   440.0
   ADX/u8      350.4   839.3   490.4   451.0
      
   'x86' is a variant that was sketched in one of my posts above. It
   calculates the sum in two passes over the arrays.
   'ADX' is a variant that uses the ADCX/ADOX instructions as suggested
   by Anton, but unlike his suggestion it does so in a loop rather than
   in a long straight-line code sequence.
   /u2, /u3, /u8 indicate the unroll factors of the inner loop.
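   For reference, the two-pass structure behind the 'x86' variant might
   look like the following C sketch (my guess at the scheme, not the
   measured code; the assumption is that pass 1 computes dst = a + b and
   pass 2 adds c into dst, using GCC/Clang's __builtin_add_overflow).
   Each pass propagates at most a single carry bit, which maps onto one
   plain ADC chain per pass:

   ```c
   #include <stdint.h>
   #include <stddef.h>

   /* Hypothetical reconstruction of the two-pass scheme: pass 1 computes
    * dst = a + b, pass 2 computes dst += c. Each pass carries at most
    * one bit between limbs. */
   void add3_twopass(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                     const uint64_t *c, size_t n) {
       uint64_t carry = 0;
       for (size_t i = 0; i < n; i++) {     /* pass 1: dst = a + b */
           uint64_t s;
           uint64_t c1 = __builtin_add_overflow(a[i], b[i], &s);
           carry = c1 + __builtin_add_overflow(s, carry, &dst[i]);
       }
       carry = 0;
       for (size_t i = 0; i < n; i++) {     /* pass 2: dst += c */
           uint64_t s;
           uint64_t c1 = __builtin_add_overflow(dst[i], c[i], &s);
           carry = c1 + __builtin_add_overflow(s, carry, &dst[i]);
       }
   }
   ```

   The trade-off versus the single-pass versions is touching dst twice,
   in exchange for each pass being a simple one-bit carry chain.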
      
   Frequency:   
   RC 5.30 GHz (Est)   
   GM 4.20 GHz (Est)   
   SK 4.25 GHz   
   Z3 3.70 GHz   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca