comp.arch
Message 129,446 of 131,241
Terje Mathisen to Michael S
Re: 3-way long addition
20 Aug 25 10:50:39
   From: terje.mathisen@tmsw.no   
      
   Michael S wrote:   
   > On Tue, 19 Aug 2025 05:47:01 GMT   
   > anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
   >   
   >>   
   >> One other problem is that according to Agner Fog's instruction tables,   
   >> even the latest and greatest CPUs from AMD and Intel that he measured   
   >> (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,   
   >   
   > I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3   
   > are certainly capable of more than 1 adcx|adox per cycle.   
   >   
   > Below are execution times of very heavily unrolled adcx/adox code, with   
   > the dependency broken by a trick similar to the one above:   
   >   
   > Platform         RC     GM     SK     Z3   
   > add3_my_adx_u17  244.5  471.1  482.4  407.0   
   >   
   > Considering that there are 2166 adcx/adox/adc instructions, we have the   
   > following number of adcx/adox/adc instructions per clock:   
   > Platform         RC     GM     SK     Z3   
   >                  1.67   1.10   1.05   1.44   
   >   
   > For Gracemont and Skylake there is a possibility of a small   
   > measurement error, but Raptor Cove appears to be capable of at least 2   
   > instructions of this type per clock, while Zen3 is capable of at least   
   > 1.5 but more likely also 2.   
   > It looks to me that the bottleneck on both RC and Z3 is either the rename   
   > phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove   
   > can occasionally issue 3 loads + 2 stores per clock, it cannot sustain   
   > more than 3 load-or-store accesses per clock.   
   >   
   >   
   > Code:   
   >   
   >    .file "add3_my_adx_u17.s"   
   >    .text   
   >    .p2align 4   
   >    .globl  add3   
   >    .def  add3; .scl  2;  .type 32; .endef   
   >    .seh_proc add3   
   > add3:   
   >    pushq %rsi   
   >    .seh_pushreg  %rsi   
   >    pushq %rbx   
   >    .seh_pushreg  %rbx   
   >    .seh_endprologue   
   >    # %rcx - dst   
   >    # %rdx - a   
   >    # %r8  - b   
   >    # %r9  - c   
   >    sub %rdx, %rcx   
   >    mov %rcx, %r10  # r10 = dst - a   
   >    sub %rdx, %r8   # r8  = b - a   
   >    sub %rdx, %r9   # r9  = c - a   
   >    mov %rdx, %r11  # r11 - a   
   >    mov $60,  %edx   
   >    xor %ecx, %ecx   
   >    .p2align 4   
   >    .loop:   
   >      xor   %ebx,       %ebx # CF <= 0, OF <= 0, EBX <= 0   
   >      mov  (%r11),      %rsi   
   >      adcx (%r11,%r8),  %rsi   
   >      adox (%r11,%r9),  %rsi   
   >   
   >      mov  8(%r11),     %rax   
   >      adcx 8(%r11,%r8), %rax   
   >      adox 8(%r11,%r9), %rax   
   >      mov    %rax, 8(%r10,%r11)   
      
   [snipped the rest]   
      
      
   Very impressive Michael!   
      
   I particularly like how you are interleaving ADOX and ADCX to gain two   
   carry bits without having to save them off to an additional register.   
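   
As a rough illustration of the two-carry-chain idea (this is not Michael's code, which is snipped above), here is a minimal portable C sketch. The names `add_with_carry` and `add3_sketch` are made up for this example, and the variables `cf` and `of` merely stand in for the CF and OF flags that ADCX and ADOX use. A compiler is not guaranteed to turn this into adcx/adox, and the sketch does not attempt the dependency-breaking trick mentioned in the quoted text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns x + y + carry_in (carry_in is 0 or 1)
 * and writes the carry out (0 or 1) to *carry_out. */
static uint64_t add_with_carry(uint64_t x, uint64_t y,
                               unsigned carry_in, unsigned *carry_out)
{
    uint64_t s = x + y;
    unsigned c1 = s < x;          /* carry from x + y */
    uint64_t r = s + carry_in;
    unsigned c2 = r < s;          /* carry from adding carry_in */
    *carry_out = c1 | c2;         /* at most one of c1, c2 can be set */
    return r;
}

/* dst[0..n-1] = a + b + c over n 64-bit limbs, least significant first.
 * Returns the carry out of the top limb, which is 0, 1 or 2. */
unsigned add3_sketch(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                     const uint64_t *c, size_t n)
{
    unsigned cf = 0;  /* models the CF chain (a + b), as with adcx */
    unsigned of = 0;  /* models the OF chain (+ c), as with adox */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = add_with_carry(a[i], b[i], cf, &cf);
        dst[i] = add_with_carry(t, c[i], of, &of);
    }
    return cf + of;
}
```

Keeping the two carries in separate one-bit chains is what avoids a single carry value of up to 2 that would otherwise have to be kept in a general-purpose register between limbs.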
      
   Terje   
      
   --   
   "almost all programming can be viewed as an exercise in caching"   
      