... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.arch
Apparently more than just beeps & boops
131,241 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 130,614 of 131,241
MitchAlsup to All
Re: A typical non-loop use case for SIMD
27 Dec 25 21:07:10
   From: user5857@newsgrouper.org.invalid   
      
   MitchAlsup  posted:   
      
   >   
   > Thomas Koenig  posted:   
   >   
   > > (This might be blindingly obvious to most regulars, but I thought   
   > > I'd post this, just in case for some discussion)   
   > >   
   > > SIMD is not always about vectorizing loops, they can also be used   
   > > for tree-shaped reductions (not sure what the canonical name is).   
   >   
   > I use the term "treeification" for this. It is useful in both SW and   
   > HW applications.   
      
   Without the logarithmic attempts, the general word used to describe   
   these things is "reduction".   
      
   > > Consider the following problem:  You have 128 consecutive bytes and   
   > > want to find the minimum value, and you have 512-bit SIMD registers.   
   > >   
   > > You load the values into two 512-bit = 64 byte registers A and B, into   
   > > positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingyl.   
   >   
   > Since B is A+512 and 512 = ½ of the size, it is akin to how Cooley   
   > and Tukey turned DFT into FFT--AND THIS is the key to many large scale   
   > reduction processes. It also happens to be the way to Vectorize FFTs   
   > for event faster performance on machines with vector capabilities.   
   >   
   > > You do a SIMD 8-bit "min" operation.  The target c register contains   
   > > the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>, ...   
   > > min(A<504:511>,b<504:511>).  C need not be distinct from A or B.   
   > > You then have 64 values.   
   > >   
   > > You then move the upper values of C into register D (which need not   
   > > be distinct from A or B), giving you D<0:7=min(A<256:263>,B<256:263)   
   > > etc.   
   > >   
   > > You then do the min operation with half the length of the registers,   
   > > giving you 32 values.   
   > >   
   > > And so on, until you have the single value, which is reached in   
   > > seven steps.   
   > >   
   > > This kind of thing is, AFAIK, used in high-performance code such   
   > > as JSON parsers or regexp matchers.  An example (in principle)   
   > > can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .   
   > >   
   > > Just wondering... is this somethig that VVM or similar could   
   > > also do? Or does this actually require SIMD and the necessary   
   > > shift, rotate or permute intructions?   
   >   
   > I will get back to you on this.   
      
   After considerable thought::   
      
   VVM has the ability to choose execution width (based on HW resources   
   and based on data recurrence). In the past I have given examples   
   where VVM is executing at "width" and then because of a memory   
   "alias" has to drop back to 1-wide until the pointers cross before   
   reverting back to full width.   
      
   This algorithm (reduction) is a modification to dynamic width control,   
   where width is constant until the final K iterations and then decreases   
   by ½ each iteration thereafter. So, fundamentally, VVM does not have   
   a problem with reductions "expressed right".   
      
   However, the given problem of 512-bits (64-bytes) might not find much   
   if any speedup, due to initialization, and a potential stutter step   
   on each DIV-2 iteration.   
      
   It might be better to allow the HW to recognize some inst have   
   certain properties and integrate those into VVM recognition so   
   that VVM performs a wide calculation; roughly akin to the   
   following:   
      
      for(...)   
        local_minimum[ k>>3 ] = MIN( a[k,k+7] ); k+=8;   
      for(...)   
        global_minimum = MIN( local_minimum[i,i+7] ); i+=8;   
      
   For sizes as small as 512-bits, VVM might not have an advantage. On the   
   other hand, if HW knew certain things about some instructions, the top   
   loop might be performed simultaneously with the bottom loop--more or   
   less like having an adder that performs {8×64, 16×32, 32×16, 64×8}   
   calculations simultaneously in reduction form {Exact for integer,   
   Single rounding for 2^n FPs} and this wide adder feeds the second   
   calculation 1 K× reduction per cycle in a single merged loop.   
      
   Needs more thought. {Known problem}   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]