From: peter.noreply@tin.it   
      
   On Mon, 26 Jan 2026 19:24:43 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > peter writes:   
   > >On Sat, 24 Jan 2026 16:47:16 GMT   
   > >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
   > >> I have now also tried it with gcc-14.2, and that produces better code.   
   > >> Results from a Xeon E-2388G (Rocket Lake):   
   > >>   
   > >> sieve bubble matrix  fib   fft  gcc options
   > >> 0.032 0.032  0.015  0.037 0.014 -O2
   > >> 0.035 0.032  0.015  0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
   > >> 0.033 0.034  0.016  0.032 0.014 -O3 (with auto vectorization)
   > >>   
   > >> The code for ROT and 2SWAP does not use auto-vectorization, and the   
   > >> code for 2! uses auto-vectorization in a way that reduces the   
   > >> instruction count:   
   > >>   
   > >> -O3 (auto-vectorized)      -O3 -fno-tree-vectorize
   > >> add $0x8,%rbx              add $0x8,%rbx
   > >> movq 0x8(%r13),%xmm0       mov 0x10(%r13),%rax
   > >> add $0x18,%r13             mov 0x8(%r13),%rdx
   > >> movhps -0x8(%r13),%xmm0    add $0x18,%r13
   > >> movups %xmm0,(%r8)         mov %rdx,(%r8)
   > >> mov 0x0(%r13),%r8          mov %rax,0x8(%r8)
   > >> mov (%rbx),%rax            mov 0x0(%r13),%r8
   > >> jmp *%rax                  mov (%rbx),%rax
   > >>                            jmp *%rax
   > >>   
   > >> And the common tail with all these move instructions is gone.   
   > >>   
   > >> - anton   
   > >   
   > >What does your C code look like? I could not get clang or gcc to
   > >auto-vectorize my existing code:
   > >   
   > > UNS64 *tmp64 = (UNS64*)TOP;   
   > > tmp64[0] = sp[0];   
   > > tmp64[1] = sp[1];   
   > > TOP = sp[2];   
   > > sp += 3;   
   >   
   > Gforth's source code for 2! is:   
   >   
   > 2! ( w1 w2 a_addr -- ) core two_store   
   > ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""   
   > a_addr[0] = w2;   
   > a_addr[1] = w1;   
   >   
   > A generator produces the following from that, which is passed to gcc:   
   >   
   > LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */   
   > /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */   
   > NAME("2!")   
   > ip += 1;   
   > LABEL1(two_store)   
   > {   
   > DEF_CA   
   > MAYBE_UNUSED Cell w1;   
   > MAYBE_UNUSED Cell w2;   
   > MAYBE_UNUSED Cell * a_addr;   
   > NEXT_P0;   
   > vm_Cell2w(sp[2],w1);   
   > vm_Cell2w(sp[1],w2);   
   > vm_Cell2a_(spTOS,a_addr);   
   > #ifdef VM_DEBUG   
   > if (vm_debug) {   
   > fputs(" w1=", vm_out); printarg_w(w1);   
   > fputs(" w2=", vm_out); printarg_w(w2);   
   > fputs(" a_addr=", vm_out); printarg_a_(a_addr);   
   > }   
   > #endif   
   > sp += 3;   
   > {   
   > #line 1815 "prim"   
   > a_addr[0] = w2;   
   > a_addr[1] = w1;   
   > #line 10136 "prim-fast.i"   
   > }   
   >   
   > #ifdef VM_DEBUG   
   > if (vm_debug) {   
   > fputs(" -- ", vm_out); fputc('\n', vm_out);   
   > }   
   > #endif   
   > NEXT_P1;   
   > spTOS = sp[0];   
   > LABEL2(two_store)   
   > NAME1("l2-two_store")   
   > NEXT_P1_5;   
   > LABEL3(two_store)   
   > NAME1("l3-two_store")   
   > DO_GOTO;   
   > }   
   >   
   > There are a lot of macros in this code, and I fear that expanding them   
   > makes the code even less readable, but the essence for the   
   > auto-vectorized part is something like:   
   >   
   > w1 = sp[2];   
   > w2 = sp[1];   
   > a_addr = spTOS;   
   > sp += 3;   
   > a_addr[0] = w2;   
   > a_addr[1] = w1;   
   > spTOS = sp[0];   
   >   
   > My guess is that in your code the compiler expected that sp[1] might   
   > alias with tmp64[0], and therefore did not vectorize the loads and the   
   > stores, whereas in the Gforth code, the loads both happen first, and   
   > then the two stores, and gcc can vectorize that. I doubt that there   
   > is a big benefit from that, though.   
      
   Yes, that was it. Changing to:
      
    UNS64 *tmp64 = (UNS64*)TOP;   
    UNS64 d0=sp[0];   
    UNS64 d1=sp[1];   
    tmp64[0] = d0;   
    tmp64[1] = d1;   
    TOP = sp[2];   
    sp += 3;   
      
   made the compiler (clang-21 in this case) generate the expected code.
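
   For anyone who wants to try this outside the interpreter, here is a
   minimal sketch of the two shapes as standalone functions. The function
   names and the explicit sp/top parameters are made up for illustration;
   in the real code TOP and sp are VM registers. Whether a particular
   compiler merges the two stores has to be checked in the generated
   assembly, so treat this as a test case rather than a guarantee.

```c
#include <stdint.h>

typedef uint64_t UNS64;

/* Shape 1: load, store, load, store.
   tmp64 and sp have the same element type, so the compiler must assume
   that the store to tmp64[0] might change sp[1]. The second load then
   cannot be moved above the first store. */
UNS64 *two_store_interleaved(UNS64 *sp, UNS64 *top)
{
    UNS64 *tmp64 = (UNS64 *)(uintptr_t)*top;
    tmp64[0] = sp[0];
    tmp64[1] = sp[1];
    *top = sp[2];
    return sp + 3;
}

/* Shape 2: both loads first, then both stores.
   The loads are complete before any store happens, so the two adjacent
   stores can be combined without any aliasing question. */
UNS64 *two_store_loads_first(UNS64 *sp, UNS64 *top)
{
    UNS64 *tmp64 = (UNS64 *)(uintptr_t)*top;
    UNS64 d0 = sp[0];
    UNS64 d1 = sp[1];
    tmp64[0] = d0;
    tmp64[1] = d1;
    *top = sp[2];
    return sp + 3;
}
```

   The two functions give the same result as long as the destination does
   not overlap the stack cells being read. If it does overlap, they can
   differ, which is exactly why the compiler may not turn the first shape
   into the second on its own.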
      
      
   >   
   > >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
   >   
   > I'll have to remember the aligned attribute for future games with gcc   
   > explicit vectorization.   
      
   Without that, it will generate opcodes that need 16-byte alignment.
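
   A minimal sketch of how such a typedef can be used for an explicit
   two-cell copy with the gcc/clang vector extension. The function name
   two_store_vec is hypothetical, and the may_alias attribute is an
   addition in this sketch (not part of the typedef above) so that the
   vector access may legally overlap plain UNS64 objects. On x86-64 one
   would expect an unaligned move such as movups here, but that is worth
   confirming in the compiler output.

```c
#include <stdint.h>

typedef uint64_t UNS64;

/* vector_size(16): two 64-bit lanes.
   aligned(8):      only promise cell alignment, not 16-byte alignment.
   may_alias:       assumption of this sketch, see the text above. */
typedef UNS64 v2u64 __attribute__((vector_size(16)))
                    __attribute__((aligned(8)))
                    __attribute__((may_alias));

/* Copy two adjacent cells from src to dst with one vector load and one
   vector store. Both pointers only need 8-byte alignment. */
void two_store_vec(UNS64 *dst, const UNS64 *src)
{
    *(v2u64 *)dst = *(const v2u64 *)src;
}
```

   Calling it with a destination that is 8-byte but not 16-byte aligned
   (for example the second element of a 16-byte aligned array) is the
   case that would fault with the aligned opcodes.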
      
   BR   
   Peter   
      
   > - anton   
      