|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 129,446 of 131,241    |
|    Terje Mathisen to Michael S    |
|    Re: 3-way long addition    |
|    20 Aug 25 10:50:39    |
From: terje.mathisen@tmsw.no

Michael S wrote:
> On Tue, 19 Aug 2025 05:47:01 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>>
>> One other problem is that according to Agner Fog's instruction tables,
>> even the latest and greatest CPUs from AMD and Intel that he measured
>> (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
>
> I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
> are certainly capable of more than 1 adcx|adox per cycle.
>
> Below are execution times of very heavily unrolled adcx/adox code with
> the dependency broken by a trick similar to the one above:
>
> Platform          RC     GM     SK     Z3
> add3_my_adx_u17   244.5  471.1  482.4  407.0
>
> Considering that there are 2166 adcx/adox/adc instructions, we have
> the following number of adcx/adox/adc instructions per clock:
>
> Platform          RC     GM     SK     Z3
>                   1.67   1.10   1.05   1.44
>
> For Gracemont and Skylake there exists a possibility of a small
> measurement mistake, but Raptor Cove appears to be capable of at least 2
> instructions of this type per clock, while Zen3 is capable of at least
> 1.5 but more likely also 2.
> It looks to me that the bottleneck on both RC and Z3 is either the
> rename phase or, more likely, L1$ access. It seems that while
> Golden/Raptor Cove can occasionally issue 3 loads + 2 stores per clock,
> it cannot sustain more than 3 load-or-store accesses per clock.
>
>
> Code:
>
>         .file   "add3_my_adx_u17.s"
>         .text
>         .p2align 4
>         .globl  add3
>         .def    add3; .scl 2; .type 32; .endef
>         .seh_proc add3
> add3:
>         pushq   %rsi
>         .seh_pushreg %rsi
>         pushq   %rbx
>         .seh_pushreg %rbx
>         .seh_endprologue
>         # %rcx - dst
>         # %rdx - a
>         # %r8  - b
>         # %r9  - c
>         sub     %rdx, %rcx
>         mov     %rcx, %r10      # r10 = dst - a
>         sub     %rdx, %r8       # r8  = b - a
>         sub     %rdx, %r9       # r9  = c - a
>         mov     %rdx, %r11      # r11 = a
>         mov     $60, %edx
>         xor     %ecx, %ecx
>         .p2align 4
> .loop:
>         xor     %ebx, %ebx      # CF <= 0, OF <= 0, EBX <= 0
>         mov     (%r11), %rsi
>         adcx    (%r11,%r8), %rsi
>         adox    (%r11,%r9), %rsi
>
>         mov     8(%r11), %rax
>         adcx    8(%r11,%r8), %rax
>         adox    8(%r11,%r9), %rax
>         mov     %rax, 8(%r10,%r11)

[snipped the rest]

Very impressive, Michael!

I particularly like how you are interleaving ADOX and ADCX to gain two
carry bits without having to save them off to an additional register.

Terje
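
For readers following along, here is a rough portable C model of the arithmetic that the two interleaved carry chains compute. It is only a sketch and not Michael's code: the function name `add3_ref`, its argument order, and the limb layout (least-significant 64-bit word first) are assumptions made for illustration, and it does not model the unrolling or the dependency-breaking trick. The variable `cf` plays the role that CF plays for ADCX (the carry of a + b), and `of` plays the role that OF plays for ADOX (the carry of that sum plus c).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: dst = a + b + c over n 64-bit limbs,
 * least-significant limb first. Returns the carry out of the top limb,
 * which can be 0, 1 or 2. */
unsigned add3_ref(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                  const uint64_t *c, size_t n)
{
    unsigned cf = 0; /* carry chain of a + b (CF in the adcx chain)        */
    unsigned of = 0; /* carry chain of (a + b) + c (OF in the adox chain)  */

    for (size_t i = 0; i < n; i++) {
        /* First chain: s = a[i] + b[i] + cf, carry out at most 1. */
        uint64_t s = a[i] + b[i];
        unsigned c1 = s < a[i];
        s += cf;
        c1 += s < cf;
        cf = c1;

        /* Second chain: t = s + c[i] + of, carry out at most 1. */
        uint64_t t = s + c[i];
        unsigned c2 = t < s;
        t += of;
        c2 += t < of;
        of = c2;

        dst[i] = t;
    }
    /* Each chain contributes its own final carry. */
    return cf + of;
}
```

Keeping the two carries in separate variables mirrors the point above: each chain needs only a single carry bit per limb, so neither has to be spilled to a register and combined with the other until the very end.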