|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 130,646 of 131,241    |
|    Stephen Fuld to Thomas Koenig    |
|    Re: A typical non-loop use case for SIMD    |
|    28 Dec 25 18:51:31    |
From: sfuld@alumni.cmu.edu.invalid

On 12/26/2025 1:57 PM, Thomas Koenig wrote:
> (This might be blindingly obvious to most regulars, but I thought
> I'd post this, just in case for some discussion)
>
> SIMD is not always about vectorizing loops, they can also be used
> for tree-shaped reductions (not sure what the canonical name is).
>
> Consider the following problem: You have 128 consecutive bytes and
> want to find the minimum value, and you have 512-bit SIMD registers.

Thomas, this is an excellent "test case" as it brings out at least two
issues. There has been discussion in this thread about the "reduction"
problem. Let me start with the other problem, which I call ALU
underutilization. It is caused by requiring lots of simple operations
on small data elements. For this example, I assume a four-wide My 66000.

Let's look at just the first pass, which takes the min of 64 pairs of
bytes. I think the simplest coding would have the VVM loop consisting of
two load instructions, two add instructions to increment the addresses,
and a min instruction. Letting VVM do its magic, this would generate 4
byte min operations at a time (one per ALU), and thus the loop would be
executed 64/4 = 16 times. I don't know how your hypothetical SIMD
machine would do this, but it might do all 64 min operations in a single
operation, or perhaps 2. This puts VVM at a substantial performance
disadvantage.

I have a possible suggestion to help this. I don't claim it is the best
solution.

The problem stems from using only 8 bits of the 64-bit integer ALU for
each operation, leading to more operations. So one possible solution
would be to add a new instruction modifier that tells the system that
any relevant operations under its mask will do the whole register's
worth of operations, using the size already specified in the operation.
Since the min instruction would already have specified bytes, with the
modification the instruction would do eight byte-sized min operations at
once, thus reducing the loop count by a factor of 8. Of course, this
generalizes to half words and words as well, and to similar "simple"
instructions such as add/subtract, etc. Note that this already "fits"
in the existing 64-bit ALUs, with the addition of a little logic to
suppress carries, etc., to allow the simultaneous use of all the ALU
bits.

Comments?

--
 - Stephen Fuld
(e-mail address disguised to prevent spam)
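To make the "eight byte mins per 64-bit ALU operation" idea concrete, here is a rough software sketch in portable C. It is not My 66000 or VVM code and says nothing about how the proposed modifier would be encoded. It only emulates the effect with ordinary 64-bit integer operations, where the masking plays the role of the "suppress carries" logic. The names packed_min_u8 and min_of_128_bytes are made up for this illustration, and unsigned bytes are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HI_BITS 0x8080808080808080ULL  /* bit 7 of every byte lane */

/* Byte-wise unsigned min of two 64-bit words: eight lanes in one pass. */
static uint64_t packed_min_u8(uint64_t x, uint64_t y)
{
    /* Lane-local subtract: each byte of (x | HI_BITS) is >= 0x80 and each
       byte of (y & ~HI_BITS) is <= 0x7F, so no borrow crosses a lane.
       Bit 7 of each result byte is set iff low7(x) >= low7(y). */
    uint64_t d = (x | HI_BITS) - (y & ~HI_BITS);

    /* x >= y in a lane if only x has its top bit set, or the top bits
       are equal and the low 7 bits compare >=. */
    uint64_t ge = ((x & ~y) | (~(x ^ y) & d)) & HI_BITS;

    uint64_t lt   = ge ^ HI_BITS;          /* bit 7 set where x < y      */
    uint64_t mask = (lt >> 7) * 0xFFULL;   /* widen each flag to 0xFF    */
    return (x & mask) | (y & ~mask);       /* pick x where x < y, else y */
}

/* Minimum of 128 consecutive bytes using a tree-shaped reduction. */
static uint8_t min_of_128_bytes(const uint8_t *src)
{
    uint64_t w[16];
    assert(src != NULL);
    memcpy(w, src, sizeof w);              /* 128 bytes = 16 words */

    /* 16 -> 8 -> 4 -> 2 -> 1 words; each call does 8 byte mins. */
    for (size_t n = 8; n >= 1; n /= 2)
        for (size_t i = 0; i < n; i++)
            w[i] = packed_min_u8(w[i], w[i + n]);

    /* Fold the eight lanes of the last word into lane 0. */
    uint64_t r = w[0];
    r = packed_min_u8(r, r >> 32);
    r = packed_min_u8(r, r >> 16);
    r = packed_min_u8(r, r >> 8);
    return (uint8_t)(r & 0xFF);
}
```

In this sketch the first pass over 128 bytes takes 8 packed operations rather than 64 byte-sized ones, which is the factor-of-8 reduction described above. A hardware version would presumably need far fewer gates than the software masking suggests, since it only has to break the carry chain at lane boundaries.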