

   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 130,646 of 131,241   
   Stephen Fuld to Thomas Koenig   
   Re: A typical non-loop use case for SIMD   
   28 Dec 25 18:51:31   
   
   From: sfuld@alumni.cmu.edu.invalid   
      
   On 12/26/2025 1:57 PM, Thomas Koenig wrote:   
   > (This might be blindingly obvious to most regulars, but I thought   
   > I'd post this, just in case for some discussion)   
   >   
   > SIMD is not always about vectorizing loops, they can also be used   
   > for tree-shaped reductions (not sure what the canonical name is).   
   >   
   > Consider the following problem:  You have 128 consecutive bytes and   
   > want to find the minimum value, and you have 512-bit SIMD registers.   
      
   Thomas, this is an excellent "test case" as it brings out at least two   
   issues.  There has been discussion in this thread about the "reduction"   
   problem.  Let me start on the other problem, that I call ALU   
   underutilization.  It is caused by requiring lots of simple operations   
   on small data elements.  For this example, I assume a four-wide My 66000.   
      
   Let's look at just the first pass.  I think the simplest coding would   
   have the VVM loop consisting of two load instructions, two add   
   instructions to increment the addresses and a min instruction.  Letting   
   VVM do its magic, this would generate four byte-wide min operations at a   
   time (one per ALU), and thus the loop would be executed 64/4 = 16 times.   
   I don't know how your hypothetical SIMD machine would do this, but it   
   might do all 64 min operations in a single instruction, or perhaps 2.   
   This puts VVM at a substantial performance disadvantage.   
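
   To make the counting concrete, here is a rough C sketch of that first
   pass.  It is not My 66000 code.  The function names are made up for
   illustration, and pairing byte i with byte i+64 is my assumption about
   how the two load streams would be arranged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N_BYTES = 128, HALF = N_BYTES / 2 };

/* First pass of the reduction: 64 byte-wide min operations.
 * Assumption: byte i of the lower half is paired with byte i of the
 * upper half, which matches two load streams with two address
 * increments per iteration. */
static void first_pass_bytes(const uint8_t *in, uint8_t *out)
{
    for (size_t i = 0; i < HALF; i++) {
        uint8_t a = in[i];          /* load 1 */
        uint8_t b = in[i + HALF];   /* load 2 */
        out[i] = (a < b) ? a : b;   /* byte min */
    }
}

/* Loop trips needed when each trip retires ops_per_trip min operations
 * (rounded up). */
static unsigned first_pass_iterations(unsigned min_ops, unsigned ops_per_trip)
{
    assert(ops_per_trip > 0);
    return (min_ops + ops_per_trip - 1) / ops_per_trip;
}
```

   With one byte min per ALU on a four-wide machine,
   first_pass_iterations(64, 4) gives the 16 trips above; a machine that
   handles all 64 bytes per instruction gets first_pass_iterations(64, 64),
   i.e. one.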
      
   I have a possible suggestion to help this.  I don't claim it is the best   
   solution.   
      
   The problem stems from using only 8 bits of the 64-bit integer ALU for   
   each operation, leading to more operations.  So one possible solution   
   would be to add a new instruction modifier that tells the system that   
   any relevant operations under its mask will do the whole register's worth   
   of operations using the size already specified in the operation.   
   Since the min instruction would already have specified bytes, with the   
   modifier, the instruction would do 8 byte-wide min operations at once,   
   thus reducing the loop count by a factor of 8.  Of course, this   
   generalizes to half words and words as well, and to similar "simple"   
   instructions such as add/subtract, etc.  Note that this already "fits"   
   in the existing 64-bit ALUs, with the addition of a little logic to   
   suppress carries, etc. to allow the simultaneous use of all the ALU bits.   
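
   As a rough illustration of what one such modified min would compute,
   here is a software emulation in portable C (a "SIMD within a register"
   sketch).  This is only my approximation of the semantics, not a
   definition of the proposed modifier; the names min_u8x8 and
   first_pass_packed are hypothetical, and I assume unsigned bytes.  The
   extra masking is the software stand-in for the "suppress carries" logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANE_HI UINT64_C(0x8080808080808080)  /* top bit of each byte lane */

/* Eight independent unsigned byte mins inside one 64-bit word. */
static uint64_t min_u8x8(uint64_t x, uint64_t y)
{
    /* Compare the low 7 bits of each lane.  Forcing bit 7 of x on and
     * bit 7 of y off keeps every lane difference positive, so no borrow
     * crosses a lane boundary.  Bit 7 of d is set iff x_low7 >= y_low7. */
    uint64_t d = (x | LANE_HI) - (y & ~LANE_HI);

    /* x >= y per lane: x has the top bit and y does not, or the top
     * bits agree and the low 7 bits of x are >= those of y. */
    uint64_t ge = ((x & ~y) | (~(x ^ y) & d)) & LANE_HI;

    /* Widen each lane's flag to 0xFF or 0x00, then select y where x >= y. */
    uint64_t pick_y = (ge >> 7) * UINT64_C(0xFF);
    return (y & pick_y) | (x & ~pick_y);
}

/* First pass over n bytes, 8 byte mins per 64-bit operation.
 * Assumption: byte i is paired with byte i + n/2. */
static void first_pass_packed(const uint8_t *in, uint8_t *out, size_t n)
{
    assert(n % 16 == 0);
    size_t half = n / 2;
    for (size_t i = 0; i < half; i += 8) {
        uint64_t a, b, m;
        memcpy(&a, in + i, sizeof a);         /* load 8 bytes */
        memcpy(&b, in + half + i, sizeof b);  /* load 8 bytes */
        m = min_u8x8(a, b);
        memcpy(out + i, &m, sizeof m);
    }
}
```

   For the 128-byte case this is 8 word-wide mins instead of 64 byte mins,
   so with four ALUs the first pass drops from 16 loop trips to 2.  In
   software the lane isolation costs several extra ALU operations per min;
   the point of doing it in the ALU is that it would cost none.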
      
   Comments?   
      
      
      
   --   
     - Stephen Fuld   
   (e-mail address disguised to prevent spam)   
      


