comp.arch

Message 130,613 of 131,241

Anton Ertl to Robert Finch

Re: A typical non-loop use case for SIMD

27 Dec 25 07:46:33

   From: anton@mips.complang.tuwien.ac.at   

   Robert Finch writes:
   >RISCV IIRC has several reduction operations including min that finds the   
   >minimum of all the values in a vector register   

   That's very appropriate for a vector extension that allows different   
   SIMD-widths such as the RISC-V vector extensions and ARM SVE.  It also   
   reduces the amount of code at the end of reduction loops and may   
   reduce the latency (depending on the implementation).   

   >so I think it does not   
   >need a tree.   

   A reduction loop may have to perform several independent reduction   
   recurrences in parallel to have a good utilization of the SIMD units.   
   E.g., if the microarchitecture has an FP addition latency of 3 cycles   
   and can perform 2 SIMD-width FP additions per cycle (and has enough   
   other resources, e.g., 2 SIMD-width loads per cycle), 3*2=6   
   parallel strands of reduction operations are necessary.  And then   
   in the end you combine these 6 strands with a tree to minimize the   
   latency, and then you can combine the SIMD result with a reduction   
   instruction.   

   The approach outlined above makes the code somewhat   
   microarchitecture-dependent; catering for more parallel strands than
   necessary does not hurt much, mainly in needing more registers, but if
   you have too few strands, the result will be slower than the CPU is   
   capable of.   
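
   A minimal sketch of the multi-strand approach for the example above
   (3 cycles of latency times 2 additions per cycle = 6 strands), using
   GNU C vector extensions.  The 4-float vector width, the names v4f,
   load_v4f and sum_6_strands, and the requirement that n be a multiple
   of 24 are assumptions made for illustration; they are not from the
   original post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed SIMD width for this sketch: 4 floats (16 bytes). */
typedef float v4f __attribute__((vector_size(16)));

/* Load one SIMD-width chunk without requiring alignment. */
static v4f load_v4f(const float *p)
{
    v4f v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Sum n floats with 6 independent strands.
   Assumption: n is a multiple of 6*4 = 24. */
float sum_6_strands(const float *a, size_t n)
{
    const v4f zero = {0.0f, 0.0f, 0.0f, 0.0f};
    v4f s0 = zero, s1 = zero, s2 = zero, s3 = zero, s4 = zero, s5 = zero;

    assert(n % 24 == 0);
    for (size_t i = 0; i < n; i += 24) {
        /* six recurrences that do not depend on each other */
        s0 += load_v4f(a + i +  0);
        s1 += load_v4f(a + i +  4);
        s2 += load_v4f(a + i +  8);
        s3 += load_v4f(a + i + 12);
        s4 += load_v4f(a + i + 16);
        s5 += load_v4f(a + i + 20);
    }

    /* combine the strands with a tree: depth 3 instead of 5 */
    s0 += s1;
    s2 += s3;
    s4 += s5;
    s0 += s2;
    s0 += s4;

    /* horizontal step; a reduction instruction could replace this */
    return (s0[0] + s0[1]) + (s0[2] + s0[3]);
}
```

   Because FP addition is not associative, the result of such a loop can
   differ in rounding from a strictly sequential sum.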

   An alternative is to use a tree reduction (to SIMD width) inside the   
   loop at every step.  E.g., for an FP addition reduction of 8 SIMD   
   widths per loop iteration:   

   /* all r variables and array elements are SIMD width,   
      see GNU C vector extensions */   
   r0 = 0;   
   for (...) {   
     r1 =  a[i+0];   
     r1 += a[i+1];   
     r2 =  a[i+2];   
     r2 += a[i+3];   
     r1 += r2;   
     r2 =  a[i+4];   
     r2 += a[i+5];   
     r3 =  a[i+6];   
     r3 += a[i+7];   
     r2 += r3;   
     r1 += r2;   
     r0 += r1; /* the only recurrence in this loop */   
   }   
   ... /* now reduce r0 to a scalar */   

   Note that fewer register names are needed than for an 8-strand   
   reduction loop that offers the same amount of instruction-level   
   parallelism (ILP).   
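
   The loop above could be made compilable roughly as follows.  This is
   a sketch under assumptions that are not in the original post: a
   4-float vector width, a plain float array as input, n being a
   multiple of 8*4 = 32, and the names v4f, load_v4f and
   sum_tree_in_loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed SIMD width for this sketch: 4 floats (16 bytes). */
typedef float v4f __attribute__((vector_size(16)));

/* Load one SIMD-width chunk without requiring alignment. */
static v4f load_v4f(const float *p)
{
    v4f v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Sum n floats, reducing 8 SIMD widths per iteration with a tree.
   Assumption: n is a multiple of 8*4 = 32. */
float sum_tree_in_loop(const float *a, size_t n)
{
    v4f r0 = {0.0f, 0.0f, 0.0f, 0.0f};
    v4f r1, r2, r3;

    assert(n % 32 == 0);
    for (size_t i = 0; i < n; i += 32) {
        const float *p = a + i;
        r1  = load_v4f(p +  0);
        r1 += load_v4f(p +  4);
        r2  = load_v4f(p +  8);
        r2 += load_v4f(p + 12);
        r1 += r2;
        r2  = load_v4f(p + 16);
        r2 += load_v4f(p + 20);
        r3  = load_v4f(p + 24);
        r3 += load_v4f(p + 28);
        r2 += r3;
        r1 += r2;
        r0 += r1; /* the only recurrence in this loop */
    }

    /* now reduce r0 to a scalar */
    return (r0[0] + r0[1]) + (r0[2] + r0[3]);
}
```

   Only r0 is carried from one iteration to the next; r1 to r3 are
   recomputed from memory each time, so their additions in different
   iterations can overlap.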

   You can easily increase this to 32 SIMD widths per iteration, which   
   should be good enough for CPUs in the next decade or two.   

   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
     Mitch Alsup,    
