From: terje.mathisen@tmsw.no   
      
   Michael S wrote:   
   > On Wed, 6 Aug 2025 16:19:11 +0200   
   > Terje Mathisen wrote:   
   >   
   >> Michael S wrote:   
   >>> On Tue, 5 Aug 2025 22:17:00 +0200   
   >>> Terje Mathisen wrote:   
   >>>   
   >>>> Michael S wrote:   
   >>>>> On Tue, 5 Aug 2025 17:31:34 +0200   
   >>>>> Terje Mathisen wrote:   
   >>>>> In this case 'adc edx,edx' is just a slightly shorter encoding   
   >>>>> of 'adc edx,0'. The EDX register is zeroized a few lines above.   
   >>>>   
   >>>> OK, nice.   
   >>>   
   >>> BTW, it seems that in your code fragment above you forgot to   
   >>> zeroize EDX at the beginning of the iteration. Or am I missing   
   >>> something?   
   >>   
   >> No, you are not. I skipped pretty much all the setup code. :-)   
   >   
   > It's not setup code that looks missing to me, but the zeroing of RDX   
   > in the body of the loop.   
      
   I don't remember my code exactly, but the intent was that RDX would   
   contain any incoming carries (0,1,2) from the previous iteration.   
      
   Using ADCX/ADOX would not be an obvious speedup, at least not obvious to me.   
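A minimal portable C sketch of that idea (my reconstruction, not the actual asm): a `carry` variable plays the role of RDX, holding the incoming carries (0, 1, or 2) from the previous iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: one 64-bit word of the three-way add per
   iteration; 'carry' corresponds to RDX and stays in the range 0..2. */
static void add3_words(uint64_t *dst, const uint64_t *a,
                       const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t carry = 0;                /* incoming carries from last word */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t cy = s < carry;       /* carry out of the first add */
        s += b[i];
        cy += s < b[i];                /* carry out of the second add */
        s += c[i];
        cy += s < c[i];                /* carry out of the third add */
        dst[i] = s;
        carry = cy;                    /* 0, 1, or 2 */
    }
}
```

Since three words plus an incoming carry of at most 2 sum to less than 3*2^64, the outgoing carry can never exceed 2, which is why a single register suffices.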
      
   Terje   
   >   
   > I did a few tests on a few machines: Raptor Cove (i7-14700 P core),   
   > Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC   
   > 7543P).   
   > In order to see the effects more clearly I had to modify Anton's   
   > function to one that operates on pointers, because otherwise too much   
   > time was spent at the caller's site copying things around, which made   
   > the measurements too noisy.   
   >   
   > void add3(uintNN_t *dst, const uintNN_t *a,   
   >           const uintNN_t *b, const uintNN_t *c) {   
   >     *dst = *a + *b + *c;   
   > }   
   >   
   >   
   > After the change I saw a significant speed-up on 3 of the 4   
   > platforms. The only platform where the speed-up was insignificant was   
   > Skylake, probably because its rename stage is too narrow to profit   
   > from the change. The widest machine (Raptor Cove) benefited most.   
   > The results appear inconclusive on the question of whether the   
   > dependency between loop iterations is eliminated completely or just   
   > shortened to 1-2 clock cycles per iteration. Even the widest of my   
   > cores is relatively narrow. Considering that my variant of the loop   
   > contains 13 x86-64 instructions and 16 uOps, I am afraid that even   
   > the likes of Apple M4 would be too narrow :(   
   >   
   > Here are the results in nanoseconds for N=65472:   
   >   
   > Variant        RC      GM      SK      Z3   
   > clang        896.1  1476.7  1453.2  1348.0   
   > gcc          879.2  1661.4  1662.9  1655.0   
   > x86          585.8  1489.3   901.5   672.0   
   > Terje's      772.6  1293.2  1012.6  1127.0   
   > My           397.5   803.8   965.3   660.0   
   > ADX          579.1  1650.1   728.9   853.0   
   > x86/u2       581.5  1246.2   679.9   584.0   
   > Terje's/u3   503.7   954.3   630.9   755.0   
   > My/u3        266.6   487.2   486.5   440.0   
   > ADX/u8       350.4   839.3   490.4   451.0   
   >   
   > 'x86' is the variant that was sketched in one of my posts above.   
   > It calculates the sum in two passes over the arrays.   
   > 'ADX' is a variant that uses the ADCX/ADOX instructions as suggested   
   > by Anton, but unlike his suggestion does it in a loop rather than in   
   > one long straight-line code sequence.   
   > /u2, /u3, /u8 indicate unroll factors of the inner loop.   
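The structure behind the ADX variant can be sketched in portable C (a hypothetical rendering; the measured variant uses the instructions directly): ADCX carries through CF while ADOX carries through OF, so the a+b chain and the +c chain do not serialize on a single flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable sketch of the ADCX/ADOX structure: 'cf' models the CF chain
   (a + b + CF), 'of' models the OF chain (+ c + OF); each stays 0 or 1. */
static void add3_two_chains(uint64_t *dst, const uint64_t *a,
                            const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t cf = 0, of = 0;           /* two independent carry chains */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + cf;        /* ADCX-style step */
        uint64_t ncf = t < cf;
        t += b[i];
        ncf += t < b[i];               /* a + b + cf carries out 0 or 1 */
        uint64_t r = t + of;           /* ADOX-style step */
        uint64_t nof = r < of;
        r += c[i];
        nof += r < c[i];               /* t + c + of carries out 0 or 1 */
        dst[i] = r;
        cf = ncf;
        of = nof;
    }
}
```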
   >   
   > Frequency:   
   > RC 5.30 GHz (Est)   
   > GM 4.20 GHz (Est)   
   > SK 4.25 GHz   
   > Z3 3.70 GHz   
   >   
      
   Thanks for an interesting set of tests/results!   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      