   comp.lang.forth      Forth programmers eat a lot of Bratwurst      117,927 messages   

   Message 117,482 of 117,927   
   peter to Anton Ertl   
   Re: Vector sum (was: Parsing timestamps?)
   19 Jul 25 15:24:48   
   
   From: peter.noreply@tin.it   
      
   On Sat, 19 Jul 2025 10:18:15 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > peter  writes:   
   > >I did a test coding the sum128 as a code word with avx-512 instructions   
   > >and got the following results   
   > >   
   > >       285,584,376      cycles:u   
   > >       941,856,077      instructions:u   
   > >   
   > >timing was   
   > >timer-reset ' recursive-sum bench .elapsed 51 ms elapsed   
   > >   
   > >so half the time of the original recursive.   
   > >with 32 zmm registers I could have done a sum256 also   
   >   
   > One could do sum128 with just 8 registers by performing the adds ASAP,   
   > i.e., for sum32   
   >   
   > vmovapd   zmm0,  [rbx]   
   > vmovapd   zmm1,  [rbx+64]   
   > vaddpd  zmm0, zmm0, zmm1   
   > vmovapd   zmm1,  [rbx+128]   
   > vmovapd   zmm2,  [rbx+192]   
   > vaddpd  zmm1, zmm1, zmm2   
   > vaddpd  zmm0, zmm0, zmm1   
   > ; and then the Horizontal sum   
   >   
   > And you can code this as:   
   >   
   > vmovapd   zmm0,  [rbx]   
   > vaddpd  zmm0, zmm0, [rbx+64]   
   > vmovapd   zmm1,  [rbx+128]   
   > vaddpd  zmm1, zmm1, [rbx+192]   
   > vaddpd  zmm0, zmm0, zmm1   
   > ; and then the Horizontal sum   
   >   
   > >; Horizontal sum of zmm0   
   > >   
   > >vextractf64x4 ymm1, zmm0, 1   
   > >vaddpd ymm2, ymm1, ymm0   
   > >   
   > >vextractf64x2 xmm3, ymm2, 1   
   > >vaddpd ymm4, ymm3, ymm2   
   > >   
   > >vhaddpd xmm0, xmm4, xmm4   
      
   The SIMD instructions can also take a memory operand, so
   I can do sum128 as
      
   code asum128b   
      
   movsd [r13-0x8], xmm0   
   lea r13, [r13-0x8]   
      
   vmovapd zmm0,  [rbx]   
   vaddpd  zmm0, zmm0,  [rbx+64]   
   vaddpd  zmm0, zmm0,  [rbx+128]   
   vaddpd  zmm0, zmm0,  [rbx+192]   
   vaddpd  zmm0, zmm0,  [rbx+256]   
   vaddpd  zmm0, zmm0,  [rbx+320]   
   vaddpd  zmm0, zmm0,  [rbx+384]   
   vaddpd  zmm0, zmm0,  [rbx+448]   
   vaddpd  zmm0, zmm0,  [rbx+512]   
   vaddpd  zmm0, zmm0,  [rbx+576]   
   vaddpd  zmm0, zmm0,  [rbx+640]   
   vaddpd  zmm0, zmm0,  [rbx+704]   
   vaddpd  zmm0, zmm0,  [rbx+768]   
   vaddpd  zmm0, zmm0,  [rbx+832]   
   vaddpd  zmm0, zmm0,  [rbx+896]   
   vaddpd  zmm0, zmm0,  [rbx+960]   
      
      
   ; Horizontal sum of zmm0
      
   vextractf64x4 ymm1, zmm0, 1   
   vaddpd ymm2, ymm1, ymm0   
      
   vextractf64x2 xmm3, ymm2, 1   
   vaddpd ymm4, ymm3, ymm2   
      
   vpermilpd  xmm5, xmm4, 1   
   vaddsd xmm0, xmm4, xmm5   
      
      
   ret   
   end-code   
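
   For readers without AVX-512 hardware, the data flow of asum128b can
   be followed in a plain C model. This is only a sketch: the names
   sum128_model, LANES and BLOCKS are made up for the illustration, each
   zmm register is modelled as 8 doubles, and the first two instructions
   (the movsd/lea pair that appears to save the previous xmm0) are left out.

```c
#include <assert.h>
#include <stddef.h>

/* 8 doubles per zmm register; 16 blocks of 64 bytes = 128 doubles */
enum { LANES = 8, BLOCKS = 16 };

/* Scalar model of asum128b (hypothetical name): sums 128 doubles at a. */
double sum128_model(const double *a)
{
    double acc[LANES];

    assert(a != NULL);

    /* vmovapd zmm0, [rbx] */
    for (size_t i = 0; i < LANES; i++)
        acc[i] = a[i];

    /* the fifteen vaddpd zmm0, zmm0, [rbx+64*k], k = 1..15 */
    for (size_t k = 1; k < BLOCKS; k++)
        for (size_t i = 0; i < LANES; i++)
            acc[i] += a[k * LANES + i];

    /* horizontal sum, 8 -> 4 lanes: vextractf64x4 + vaddpd */
    for (size_t i = 0; i < 4; i++)
        acc[i] += acc[i + 4];

    /* 4 -> 2 lanes: vextractf64x2 + vaddpd */
    for (size_t i = 0; i < 2; i++)
        acc[i] += acc[i + 2];

    /* 2 -> 1 lane: vpermilpd swaps the two lanes, vaddsd adds them */
    return acc[0] + acc[1];
}
```

   Filling an array with 1.0 .. 128.0 and calling sum128_model on it
   gives 8256.0. Note that every vaddpd in the chain depends on the
   previous one through zmm0; the model has the same single accumulator.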
      
   This compiles to 154 bytes and 25 instructions.
   The original sum128 is 2157 bytes and 513 instructions!
      
   Yes, the horizontal sum should be done just once.
   So far I have only replaced sum128 with SIMD as a test.
   Later I will do a complete example.
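
   What "just once" could look like, again as a scalar C model rather
   than real SIMD code. The sketch assumes the recursion splits the
   array in halves down to blocks of 128 doubles; that structure, and
   the names vec8, vadd, sum128v, rec_sumv and recursive_sum_model, are
   assumptions made for the illustration, not the code from this thread.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };                       /* doubles per zmm register */

typedef struct { double lane[LANES]; } vec8;   /* models one zmm register */

/* lane-wise add, the job of one vaddpd */
static vec8 vadd(vec8 x, vec8 y)
{
    for (size_t i = 0; i < LANES; i++)
        x.lane[i] += y.lane[i];
    return x;
}

/* lane-wise sums of 128 doubles, with NO horizontal sum at the end */
static vec8 sum128v(const double *a)
{
    vec8 acc = {{0.0}};
    for (size_t k = 0; k < 128; k += LANES)
        for (size_t i = 0; i < LANES; i++)
            acc.lane[i] += a[k + i];
    return acc;
}

/* recursive part: the partial results stay a full SIMD width */
static vec8 rec_sumv(const double *a, size_t n)
{
    if (n == 128)
        return sum128v(a);
    return vadd(rec_sumv(a, n / 2), rec_sumv(a + n / 2, n / 2));
}

/* n must be a power of two and at least 128 */
double recursive_sum_model(const double *a, size_t n)
{
    double total = 0.0;
    vec8 v;

    assert(n >= 128 && (n & (n - 1)) == 0);
    v = rec_sumv(a, n);

    /* the single horizontal sum */
    for (size_t i = 0; i < LANES; i++)
        total += v.lane[i];
    return total;
}
```

   Keeping the partial sums as vectors changes the order of the
   floating-point additions, so the last bits of the result may differ
   from a version that reduces to a scalar after every sum128.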
      
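   This asum128b does not noticeably change the timing compared with
   my first AVX-512 version (285,584,376 cycles, 941,856,077
   instructions), but it reduces the number of instructions: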
      
          277,333,790      cycles:u   
          834,846,183      instructions:u    #    3.01  insn per cycle   
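
   The perf figures can be checked with a few lines of C. The helper
   names ipc and reduction are made up for this sketch; the counts are
   the ones quoted in this message.

```c
#include <assert.h>

/* instructions per cycle */
double ipc(double instructions, double cycles)
{
    assert(cycles > 0.0);
    return instructions / cycles;
}

/* relative reduction from "before" to "after", as a fraction of "before" */
double reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}
```

   ipc(834846183.0, 277333790.0) comes out at about 3.01, matching the
   perf output, and the first AVX-512 version gives about 3.30.
   reduction(941856077.0, 834846183.0) is about 0.114, i.e. roughly 11%
   fewer instructions, while the cycle count drops by only about 3%.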
      
      
   >   
   > Instead of doing the horizontal sum once for every sum128, it might be   
   > more efficient (assuming the whole thing is not   
   > cache-bandwidth-limited) to have the result of sum128 be a full SIMD   
   > width, and then add them up with vaddpd instead of addsd, and do the   
   > horizontal sum once in the end.   
   >   
   > But if the recursive part is to be programmed in Forth, we would need   
   > a way to represent a SIMD width of data in Forth, maybe with a SIMD   
   > stack.  I see a few problems there:   
   >   
   > * What to do about the mask registers of AVX-512?  In the RISC-V   
   >   vector extension masks are stored in regular SIMD registers.   
   >   
   > * There is a trend visible in ARM SVE and the RISC-V Vector extension   
   >   to have support for dealing with loops across longer vectors.  Do we   
   >   also need to support something like that?
   >   
   > For the RISC-V vector extension, see   
   >    
   >   
   > One way to deal with all that would be to have a long-vector stack and   
   > have something like my vector wordset   
   > , where the sum of a vector   
   > would be a word that is implemented in some lower-level way (e.g.,   
   > assembly language); the sum of a vector is actually a planned, but not   
   > yet existing feature of this wordset.   
   >   
   > An advantage of having a (short) SIMD stack would be that one could   
   > use SIMD operations for other uses where the long-vector wordset looks   
   > too heavy-weight (or would need optimizations to get rid of the   
   > long-vector overhead).  The question is if enough such uses exist to   
   > justify adding such a stack.   
   >   
   > - anton   
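
   To make the idea of a (short) SIMD stack concrete, here is a small
   scalar model in C. Everything in it is hypothetical: the word names
   v_fetch, v_add and v_hsum, the fixed width of 8 doubles and the depth
   of 16 are assumptions for the illustration, and the sketch says
   nothing about mask registers or longer vectors.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8, DEPTH = 16 };           /* assumed width and stack depth */

typedef struct { double lane[LANES]; } vec8;

static vec8   vstack[DEPTH];              /* the SIMD stack */
static size_t vsp = 0;                    /* number of items on it */

/* ( addr -- ) ( V: -- v )  push 8 doubles from memory */
void v_fetch(const double *addr)
{
    assert(vsp < DEPTH);
    for (size_t i = 0; i < LANES; i++)
        vstack[vsp].lane[i] = addr[i];
    vsp++;
}

/* ( V: v1 v2 -- v3 )  lane-wise add, what one vaddpd would do */
void v_add(void)
{
    assert(vsp >= 2);
    vsp--;
    for (size_t i = 0; i < LANES; i++)
        vstack[vsp - 1].lane[i] += vstack[vsp].lane[i];
}

/* ( V: v -- ) ( F: -- r )  horizontal sum of the top item */
double v_hsum(void)
{
    double r = 0.0;

    assert(vsp >= 1);
    vsp--;
    for (size_t i = 0; i < LANES; i++)
        r += vstack[vsp].lane[i];
    return r;
}
```

   With 16 doubles 1.0 .. 16.0 at a, the sequence v_fetch(a),
   v_fetch(a + 8), v_add(), v_hsum() returns 136.0. In a real system
   the top few stack items would presumably live in zmm registers.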
      
   I will take a look at your vector implementation and see if it can be used
   in lxf64.
      
   BR   
   Peter   
      
