From: peter.noreply@tin.it   
      
   On Thu, 17 Jul 2025 12:54:29 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > peter writes:   
   > >Ryzen 9950X   
   > >   
   > > lxf64   
   > > 5,010,566,495 NAI cycles:u   
   > > 2,011,359,782 UNR cycles:u   
   > > 646,926,001 REC cycles:u   
   > > 3,589,863,082 SR cycles:u   
   > >   
   > > lxf64
   > > 7,019,247,519 NAI instructions:u
   > > 4,128,689,843 UNR instructions:u
   > > 4,643,499,656 REC instructions:u
   > > 25,019,182,759 SR instructions:u
   > >   
   > >   
   > > gforth-fast 20250219   
   > > 2,048,316,578 NAI cycles:u   
   > > 7,157,520,448 UNR cycles:u   
   > > 3,589,638,677 REC cycles:u   
   > > 17,199,889,916 SR cycles:u   
   > >   
   > > gforth-fast 20250219   
   > > 13,107,999,739 NAI instructions:u
   > > 6,789,041,049 UNR instructions:u   
   > > 9,348,969,966 REC instructions:u
   > > 50,108,032,223 SR instructions:u
   > >   
   > > lxf   
   > > 6,005,617,374 NAI cycles:u   
   > > 6,004,157,635 UNR cycles:u   
   > > 1,303,627,835 REC cycles:u   
   > > 9,187,422,499 SR cycles:u   
   > >   
   > > lxf   
   > > 9,010,888,196 NAI instructions:u   
   > > 4,237,679,129 UNR instructions:u
   > > 4,955,258,040 REC instructions:u
   > > 26,018,680,499 SR instructions:u   
   >   
   > >lxf uses the x87 builtin fp stack, lxf64 uses sse4 and a large fp stack
   >   
   > Apparently the latency of ADDSD (SSE2) is down to 2 cycles on Zen5   
   > (visible in lxf64 UNR and gforth-fast NAI) while the latency of FADD   
   > (387) is still 6 cycles (lxf NAI and UNR). I have no explanation why   
   > on lxf64 NAI performs so much worse than UNR, and in gforth-fast UNR   
   > so much worse than NAI.   
   >   
   > For REC the latency should not play a role. There lxf64 performs at   
   > 7.2IPC and 1.55 F+/cycle, whereas lxf performs only at 3.8IPC and 0.77   
   > F+/cycle. My guess is that FADD can only be performed by one FPU, and   
   > that's connected to one dispatch port, and other instructions also   
   > need or are at least assigned to this dispatch port.   
   >   
   > - anton   
      
   I did a test, coding sum128 as a code word with AVX-512 instructions,
   and got the following results:
      
    285,584,376 cycles:u   
    941,856,077 instructions:u   
      
   timing was   
   timer-reset ' recursive-sum bench .elapsed 51 ms elapsed   
      
   So roughly half the time of the original recursive version.
   With 32 zmm registers I could also have done a sum256.
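
As a rough cross-check of the counters in this thread, the sketch below recomputes IPC and additions per cycle. The cycle and instruction counts are copied from the perf output above; the figure of 1e9 F+ per run is an assumption, inferred from the quoted 1.55 F+/cycle for lxf64 REC, and it is also assumed that the asum128 run performs the same number of additions. The names `runs`, `ipc` and `fadds_per_cycle` are only illustrative.

```python
# Counter values copied from the perf output in this thread:
# name -> (cycles:u, instructions:u)
runs = {
    "lxf64 REC": (646_926_001, 4_643_499_656),
    "lxf REC": (1_303_627_835, 4_955_258_040),
    "asum128": (285_584_376, 941_856_077),
}

# ASSUMPTION: each run performs about 1e9 floating-point additions.
# This is inferred from the quoted 1.55 F+/cycle for lxf64 REC and is
# not stated explicitly anywhere in the thread.
N_FADDS = 1_000_000_000


def ipc(cycles, instructions):
    """Instructions retired per cycle."""
    return instructions / cycles


def fadds_per_cycle(cycles, n_fadds=N_FADDS):
    """Floating-point additions per cycle under the N_FADDS assumption."""
    return n_fadds / cycles


for name, (cycles, instructions) in runs.items():
    print(f"{name:10s} IPC {ipc(cycles, instructions):.2f}  "
          f"F+/cycle {fadds_per_cycle(cycles):.2f}")

# Ratio of cycle counts between the original recursive version and asum128.
speedup = runs["lxf64 REC"][0] / runs["asum128"][0]
print(f"asum128 needs {speedup:.2f}x fewer cycles than lxf64 REC")
```

The cycle ratio is the "roughly half the time" mentioned above; the F+/cycle column only means something if the 1e9 assumption holds.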
      
   The code is below for reference. Register usage:
   r13 is the fp stack pointer
   rbx is the top of stack
   xmm0 is the top of the fp stack
      
   code asum128   
      
   movsd [r13-0x8], xmm0   
   lea r13, [r13-0x8]   
      
   vmovapd zmm0, [rbx]   
   vmovapd zmm1, [rbx+64]   
   vmovapd zmm2, [rbx+128]   
   vmovapd zmm3, [rbx+192]   
   vmovapd zmm4, [rbx+256]   
   vmovapd zmm5, [rbx+320]   
   vmovapd zmm6, [rbx+384]   
   vmovapd zmm7, [rbx+448]   
   vmovapd zmm8, [rbx+512]   
   vmovapd zmm9, [rbx+576]   
   vmovapd zmm10, [rbx+640]   
   vmovapd zmm11, [rbx+704]   
   vmovapd zmm12, [rbx+768]   
   vmovapd zmm13, [rbx+832]   
   vmovapd zmm14, [rbx+896]   
   vmovapd zmm15, [rbx+960]   
      
   vaddpd zmm0, zmm0, zmm1   
   vaddpd zmm2, zmm2, zmm3   
   vaddpd zmm4, zmm4, zmm5   
   vaddpd zmm6, zmm6, zmm7   
   vaddpd zmm8, zmm8, zmm9   
   vaddpd zmm10, zmm10, zmm11   
   vaddpd zmm12, zmm12, zmm13   
   vaddpd zmm14, zmm14, zmm15   
      
   vaddpd zmm0, zmm0, zmm2   
   vaddpd zmm4, zmm4, zmm6   
   vaddpd zmm8, zmm8, zmm10   
   vaddpd zmm12, zmm12, zmm14   
      
   vaddpd zmm0, zmm0, zmm4   
   vaddpd zmm8, zmm8, zmm12   
      
   vaddpd zmm0, zmm0, zmm8   
      
   ; Horizontal sum of zmm0
      
   vextractf64x4 ymm1, zmm0, 1   
   vaddpd ymm2, ymm1, ymm0   
      
   vextractf64x2 xmm3, ymm2, 1   
   vaddpd xmm4, xmm3, xmm2
      
   vhaddpd xmm0, xmm4, xmm4   
      
   ret   
   end-code   
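
For anyone who wants to check the summation order without AVX-512 hardware, here is a scalar Python model of the reduction tree used above. It is only a sketch, not the code that was benchmarked: `tree_sum` and `LANES` are hypothetical names, and each "register" is a plain list of 8 doubles. The same function also models a sum256 if given 256 values (32 registers), which just adds one more level to the vertical tree.

```python
LANES = 8  # doubles per zmm register


def tree_sum(a):
    """Sum len(a) doubles in the same order as the asm above (scalar model).

    len(a) must be LANES times a power of two: 128 values model asum128,
    256 values would model a sum256 using 32 registers.
    """
    nregs, rem = divmod(len(a), LANES)
    if rem or nregs == 0 or nregs & (nregs - 1):
        raise ValueError("length must be 8 times a power of two")

    # Load: regs[i] plays the role of zmm<i>.
    regs = [list(a[i * LANES:(i + 1) * LANES]) for i in range(nregs)]

    # Vertical tree: zmm0+=zmm1, zmm2+=zmm3, ... then zmm0+=zmm2, ... etc.
    stride = 1
    while stride < nregs:
        for i in range(0, nregs, 2 * stride):
            regs[i] = [x + y for x, y in zip(regs[i], regs[i + stride])]
        stride *= 2

    # Horizontal sum of regs[0]: 8 lanes -> 4 -> 2 -> 1.
    v = regs[0]
    v4 = [v[k + 4] + v[k] for k in range(4)]    # upper 256 bits + lower 256
    v2 = [v4[k + 2] + v4[k] for k in range(2)]  # upper 128 bits + lower 128
    return v2[0] + v2[1]                        # vhaddpd, low element


print(tree_sum([1.0] * 128))  # 128.0
```

Since floating-point addition is not associative, this order can give a slightly different result from a naive left-to-right sum on general data; for small integers stored as doubles both are exact.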
      
   lxf64 uses a modified fasm as the backend assembler,
   so it has full support for all instructions.
      
   BR   
   Peter   
      