comp.arch
MitchAlsup to All
Re: A typical non-loop use case for SIMD
29 Dec 25 19:59:38
   From: user5857@newsgrouper.org.invalid   
      
   Stephen Fuld  posted:   
      
   > On 12/26/2025 1:57 PM, Thomas Koenig wrote:   
   > > (This might be blindingly obvious to most regulars, but I thought   
   > > I'd post this, just in case for some discussion)   
   > >   
   > > SIMD is not always about vectorizing loops; it can also be used   
   > > for tree-shaped reductions (not sure what the canonical name is).   
   > >   
   > > Consider the following problem:  You have 128 consecutive bytes and   
   > > want to find the minimum value, and you have 512-bit SIMD registers.   
   >   
   > Thomas, this is an excellent "test case" as it brings out at least two   
   > issues.  There has been discussion in this thread about the "reduction"   
   > problem.  Let me start on the other problem, that I call ALU   
   > underutilization.  It is caused by requiring lots of simple operations   
   > on small data elements.  For this example, I assume a four wide My 66000.   
   >   
   > Let's look at just the first pass.  I think the simplest coding would   
   > have the VVM loop consisting of two load instructions, two add   
   > instructions to increment the addresses and a min instruction.  Letting   
   > VVM do its magic, this would generate 4 byte min operations at a time,   
   > (one per ALU) and thus the loop would be executed 64/4 = 16 times.  I   
   > don't know how your hypothetical SIMD machine would do this, but it   
   > might do all 64 min operations in a single operation, or perhaps 2.   
   > This puts VVM at a substantial performance disadvantage.   
   >   
   > I have a possible suggestion to help this.  I don't claim it is the best   
   > solution.   
   >   
   > The problem stems from using only 8 bits of the 64 bit integer ALU for   
   > each operation, leading to more operations.  So one possible solution   
   > would be to add a new instruction modifier that tells the system that   
   > any relevant operations under its mask will do the whole register worth   
   > of operations using the size already specified in the operation.   
      
   This is exactly what VVM does, BTW. Operand widths smaller than a   
   register are SIMDed into single "units of work" up to register width   
   and performed with the carry-chains clipped.   
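
For readers who want to see what "carry-chains clipped" buys you, here is
a rough software analogy in C: a byte-wise unsigned min over two 64-bit
words using ordinary integer instructions. The name swar_min_u8 is made up
for illustration, and this is only a sketch of the idea. Hardware simply
cuts the carry chain at the lane boundary; the bit tricks below are what
software needs to get the same effect, and they say nothing about how
My 66000 implements it.

```c
#include <stdint.h>

/* Lane-wise unsigned minimum of the eight bytes packed in x and y.
   Software stand-in for an ALU whose carry chain is cut at byte
   boundaries (hypothetical helper, not a My 66000 primitive). */
uint64_t swar_min_u8(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ULL;  /* top bit of every byte */

    /* Per byte: (128 + low7(x)) - low7(y) lies in 1..255, so no borrow
       ever crosses a byte boundary.  Bit 7 of each byte of t is set
       iff low7(x) >= low7(y). */
    uint64_t t = (x | H) - (y & ~H);

    /* Bit 7 set iff byte(x) < byte(y): either the top bits decide it,
       or they are equal and the low seven bits decide it. */
    uint64_t lt = ((~x & y) | (~(x ^ y) & ~t)) & H;

    /* Spread each flag into a full-byte mask (0x00 or 0xFF), then select. */
    uint64_t mask = (lt >> 7) * 0xFFu;
    return (x & mask) | (y & ~mask);
}
```

One call does eight byte-min operations in a single 64-bit "unit of work".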
      
   > Since the min instruction would already have specified bytes,   
      
   It is the memory instruction that specifies data width.   
      
   >                                                               with the   
   > modification, the instruction would do 8 byte min operations at once,   
   > thus reducing the loop count by a factor of 8.   
      
   Yes, exactly.   
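
The loop-count arithmetic can be written out as a small C helper. The
function name loop_trips and the model (every ALU retires the same number
of lanes per trip) are assumptions made for illustration, using Stephen's
four-wide machine and the 64 byte-min operations of the first pass.

```c
/* Trips needed to issue `ops` element operations when each of `alus`
   ALUs handles `lanes` elements per trip (hypothetical helper). */
unsigned loop_trips(unsigned ops, unsigned alus, unsigned lanes)
{
    unsigned per_trip = alus * lanes;
    return (ops + per_trip - 1) / per_trip;  /* round up */
}
```

With one byte per ALU, loop_trips(64, 4, 1) gives the 16 trips quoted
above; with eight bytes per ALU, loop_trips(64, 4, 8) gives 2, which is
the factor-of-8 reduction.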
      
   >                                                 Of course, this   
   > generalizes to half words and words as well, and to similar "simple"   
   > instructions such as add/subtract, etc.   
      
   All that is modified is where the carry-chains get clipped.   
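
A minimal C sketch of that point, again as a software analogy rather than
a description of the hardware: a lane-wise add where the only thing that
changes between byte, half-word, word and double-word lanes is the mask
marking where the carry chain stops. The names lane_top_bits and swar_add
are hypothetical.

```c
#include <stdint.h>

/* Top bit of every lane for a given lane width in bits.  Any width
   other than 8, 16 or 32 is treated as a single 64-bit lane. */
static uint64_t lane_top_bits(unsigned width)
{
    switch (width) {
    case 8:  return 0x8080808080808080ULL;
    case 16: return 0x8000800080008000ULL;
    case 32: return 0x8000000080000000ULL;
    default: return 0x8000000000000000ULL;
    }
}

/* Lane-wise wrapping add: carries propagate inside a lane but are
   clipped at each lane's top bit. */
uint64_t swar_add(uint64_t x, uint64_t y, unsigned width)
{
    uint64_t H = lane_top_bits(width);
    /* Add the bits below each lane's top bit; the sum cannot overflow
       out of the lane because both top bits were cleared first. */
    uint64_t low = (x & ~H) + (y & ~H);
    /* Fold the top bits back in with XOR, which discards the carry out. */
    return low ^ ((x ^ y) & H);
}
```

The same two inputs give different results purely because of where the
chain is cut: adding 1 to 0x00000000FFFFFFFF wraps only the low byte at
width 8, the low half word at width 16, the low word at width 32, and
carries into bit 32 at width 64.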
      
   >                                          Note that this already "fits"   
   > in the existing 64 bit ALUs, with the addition of a little logic to   
   > suppress carries, etc. to allow the simultaneous use of all the ALU bits.   
   >   
   > Comments?   
   >   
   >   
   >   
      