From: terje.mathisen@tmsw.no   
      
   Michael S wrote:   
   > On Wed, 6 Aug 2025 16:19:11 +0200   
   > Terje Mathisen wrote:   
   >   
   >> Michael S wrote:   
   >>> On Tue, 5 Aug 2025 22:17:00 +0200   
   >>> Terje Mathisen wrote:   
   >>>   
   >>>> Michael S wrote:   
   >>>>> On Tue, 5 Aug 2025 17:31:34 +0200   
   >>>>> Terje Mathisen wrote:   
   >>>>> In this case 'adc edx,edx' is just a slightly shorter encoding   
   >>>>> of 'adc edx,0'. The EDX register is zeroized a few lines above.   
   >>>>   
   >>>> OK, nice.   
   >>>   
   >>> BTW, it seems that in your code fragment above you forgot to   
   >>> zeroize EDX at the beginning of the iteration. Or am I missing   
   >>> something?   
   >>   
   >> No, you are not. I skipped pretty much all the setup code. :-)   
   >   
   > It's not setup code that looks missing to me, but the zeroing of RDX   
   > in the body of the loop.   
      
   I don't remember my code exactly, but the intent was that RDX would   
   contain any incoming carries (0,1,2) from the previous iteration.   
      
   Using ADCX/ADOX would not be an obvious speedup, at least not obvious to me.   
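A minimal portable C sketch of that idea (my reconstruction, not the actual asm): a `carry` variable plays the role of RDX, holding the incoming carries (0, 1, or 2) from the previous iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: one 64-bit word of the three-way add per
   iteration; 'carry' corresponds to RDX and stays in the range 0..2. */
static void add3_words(uint64_t *dst, const uint64_t *a,
                       const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t carry = 0;                /* incoming carries from last word */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t cy = s < carry;       /* carry out of the first add */
        s += b[i];
        cy += s < b[i];                /* carry out of the second add */
        s += c[i];
        cy += s < c[i];                /* carry out of the third add */
        dst[i] = s;
        carry = cy;                    /* 0, 1, or 2 */
    }
}
```

Since three words plus an incoming carry of at most 2 sum to less than 3*2^64, the outgoing carry can never exceed 2, which is why a single register suffices.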
      
   Terje   
   >   
   > I did a few tests on a few machines: Raptor Cove (i7-14700 P core),   
   > Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC   
   > 7543P).   
   > In order to see the effects more clearly I had to modify Anton's   
   > function to one that operates on pointers, because otherwise too much   
   > time was spent at the caller's site copying things around, which made   
   > the measurements too noisy.   
   >   
   > void add3(uintNN_t *dst, const uintNN_t *a,   
   >           const uintNN_t *b, const uintNN_t *c) {   
   >     *dst = *a + *b + *c;   
   > }   
   >   
   >   
   > After the change I saw a significant speed-up on 3 of the 4   
   > platforms. The only platform where the speed-up was insignificant was   
   > Skylake, probably because its rename stage is too narrow to profit   
   > from the change. The widest machine (Raptor Cove) benefited most.   
   > The results appear inconclusive on the question of whether the   
   > dependency between loop iterations is eliminated completely or just   
   > shortened to 1-2 clock cycles per iteration. Even the widest of my   
   > cores is relatively narrow. Considering that my variant of the loop   
   > contains 13 x86-64 instructions and 16 uOps, I am afraid that even   
   > the likes of Apple M4 would be too narrow :(   
   >   
   > Here are the results in nanoseconds for N=65472:   
   >   
   > Variant        RC      GM      SK      Z3   
   > clang        896.1  1476.7  1453.2  1348.0   
   > gcc          879.2  1661.4  1662.9  1655.0   
   > x86          585.8  1489.3   901.5   672.0   
   > Terje's      772.6  1293.2  1012.6  1127.0   
   > My           397.5   803.8   965.3   660.0   
   > ADX          579.1  1650.1   728.9   853.0   
   > x86/u2       581.5  1246.2   679.9   584.0   
   > Terje's/u3   503.7   954.3   630.9   755.0   
   > My/u3        266.6   487.2   486.5   440.0   
   > ADX/u8       350.4   839.3   490.4   451.0   
   >   
   > 'x86' is the variant that was sketched in one of my posts above.   
   > It calculates the sum in two passes over the arrays.   
   > 'ADX' is a variant that uses the ADCX/ADOX instructions as suggested   
   > by Anton, but unlike his suggestion does it in a loop rather than in   
   > one long straight-line code sequence.   
   > /u2, /u3, /u8 indicate unroll factors of the inner loop.   
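The structure behind the ADX variant can be sketched in portable C (a hypothetical rendering; the measured variant uses the instructions directly): ADCX carries through CF while ADOX carries through OF, so the a+b chain and the +c chain do not serialize on a single flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable sketch of the ADCX/ADOX structure: 'cf' models the CF chain
   (a + b + CF), 'of' models the OF chain (+ c + OF); each stays 0 or 1. */
static void add3_two_chains(uint64_t *dst, const uint64_t *a,
                            const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t cf = 0, of = 0;           /* two independent carry chains */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + cf;        /* ADCX-style step */
        uint64_t ncf = t < cf;
        t += b[i];
        ncf += t < b[i];               /* a + b + cf carries out 0 or 1 */
        uint64_t r = t + of;           /* ADOX-style step */
        uint64_t nof = r < of;
        r += c[i];
        nof += r < c[i];               /* t + c + of carries out 0 or 1 */
        dst[i] = r;
        cf = ncf;
        of = nof;
    }
}
```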
   >   
   > Frequency:   
   > RC 5.30 GHz (Est)   
   > GM 4.20 GHz (Est)   
   > SK 4.25 GHz   
   > Z3 3.70 GHz   
   >   
      
   Thanks for an interesting set of tests/results!   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      