From: anton@mips.complang.tuwien.ac.at   
      
   peter writes:   
   >On Sat, 24 Jan 2026 16:47:16 GMT   
   >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
   >> I have now also tried it with gcc-14.2, and that produces better code.   
   >> Results from a Xeon E-2388G (Rocket Lake):   
   >>   
   >>  sieve bubble matrix    fib    fft  gcc options
   >>  0.032  0.032  0.015  0.037  0.014  -O2
   >>  0.035  0.032  0.015  0.037  0.014  -O3 -fno-tree-vectorize (gforth default)
   >>  0.033  0.034  0.016  0.032  0.014  -O3 (with auto vectorization)
   >>   
   >> The code for ROT and 2SWAP does not use auto-vectorization, and the   
   >> code for 2! uses auto-vectorization in a way that reduces the   
   >> instruction count:   
   >>   
   >> -O3 (auto-vectorized)       -O3 -fno-tree-vectorize
   >> add    $0x8,%rbx            add    $0x8,%rbx
   >> movq   0x8(%r13),%xmm0      mov    0x10(%r13),%rax
   >> add    $0x18,%r13           mov    0x8(%r13),%rdx
   >> movhps -0x8(%r13),%xmm0     add    $0x18,%r13
   >> movups %xmm0,(%r8)          mov    %rdx,(%r8)
   >> mov    0x0(%r13),%r8        mov    %rax,0x8(%r8)
   >> mov    (%rbx),%rax          mov    0x0(%r13),%r8
   >> jmp    *%rax                mov    (%rbx),%rax
   >>                             jmp    *%rax
   >>   
   >> And the common tail with all these move instructions is gone.   
   >>   
   >> - anton   
   >   
   >What does your C code look like? I could not get clang or gcc to
   >auto-vectorize with my existing code
   >   
   > UNS64 *tmp64 = (UNS64*)TOP;   
   > tmp64[0] = sp[0];   
   > tmp64[1] = sp[1];   
   > TOP = sp[2];   
   > sp += 3;   
      
   Gforth's source code for 2! is:   
      
   2! ( w1 w2 a_addr -- ) core two_store   
   ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""   
   a_addr[0] = w2;   
   a_addr[1] = w1;   
      
   A generator produces the following from that, which is passed to gcc:   
      
   LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */   
   /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */   
   NAME("2!")   
   ip += 1;   
   LABEL1(two_store)   
   {   
   DEF_CA   
   MAYBE_UNUSED Cell w1;   
   MAYBE_UNUSED Cell w2;   
   MAYBE_UNUSED Cell * a_addr;   
   NEXT_P0;   
   vm_Cell2w(sp[2],w1);   
   vm_Cell2w(sp[1],w2);   
   vm_Cell2a_(spTOS,a_addr);   
   #ifdef VM_DEBUG   
   if (vm_debug) {   
   fputs(" w1=", vm_out); printarg_w(w1);   
   fputs(" w2=", vm_out); printarg_w(w2);   
   fputs(" a_addr=", vm_out); printarg_a_(a_addr);   
   }   
   #endif   
   sp += 3;   
   {   
   #line 1815 "prim"   
   a_addr[0] = w2;   
   a_addr[1] = w1;   
   #line 10136 "prim-fast.i"   
   }   
      
   #ifdef VM_DEBUG   
   if (vm_debug) {   
   fputs(" -- ", vm_out); fputc('\n', vm_out);   
   }   
   #endif   
   NEXT_P1;   
   spTOS = sp[0];   
   LABEL2(two_store)   
   NAME1("l2-two_store")   
   NEXT_P1_5;   
   LABEL3(two_store)   
   NAME1("l3-two_store")   
   DO_GOTO;   
   }   
      
   There are a lot of macros in this code, and I fear that expanding them   
   makes the code even less readable, but the essence for the   
   auto-vectorized part is something like:   
      
   w1 = sp[2];   
   w2 = sp[1];   
   a_addr = spTOS;   
   sp += 3;   
   a_addr[0] = w2;   
   a_addr[1] = w1;   
   spTOS = sp[0];   
      
   My guess is that in your code the compiler had to assume that the
   store to tmp64[0] might alias with the later load from sp[1], and
   therefore did not vectorize the loads and the stores, whereas in the
   Gforth code both loads happen first, followed by the two stores, and
   gcc can vectorize that. I doubt that there is a big benefit from
   that, though.
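
The ordering difference described above can be sketched in plain C. This is only an illustration, not Gforth or peter's actual code: the type name `Cell` and all function names are made up here, and whether a given gcc or clang version vectorizes either variant is not guaranteed. What the sketch does show is why the compiler may not simply reorder the interleaved version: if the destination overlaps the source, the two orderings give different results.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Cell;  /* hypothetical stand-in for the VM's cell type */

/* Interleaved order, as in the quoted snippet: the store to dst[0]
   comes before the load of sp[1], so the compiler must allow for
   dst[0] and sp[1] being the same memory. */
Cell *store2_interleaved(Cell *sp, Cell *dst)
{
    dst[0] = sp[0];
    dst[1] = sp[1];
    return sp + 2;
}

/* Loads first, then stores, as in the simplified Gforth sequence:
   both values are in locals before anything is written. */
Cell *store2_loads_first(Cell *sp, Cell *dst)
{
    Cell lo = sp[0];
    Cell hi = sp[1];
    dst[0] = lo;
    dst[1] = hi;
    return sp + 2;
}

/* With disjoint source and destination both variants agree. */
void check_disjoint(void)
{
    Cell src[2] = { 7, 9 }, d1[2] = { 0, 0 }, d2[2] = { 0, 0 };
    store2_interleaved(src, d1);
    store2_loads_first(src, d2);
    assert(d1[0] == d2[0] && d1[1] == d2[1]);
}
```

Calling both functions with `dst == sp + 1` on a buffer holding 1, 2, 3 leaves 1, 1, 1 for the interleaved variant and 1, 1, 2 for the loads-first variant, which is the case the compiler has to preserve.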
      
   >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
      
   I'll have to remember the aligned attribute for future games with gcc   
   explicit vectorization.   
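
For reference, a minimal sketch of how such a typedef might be used with the GCC vector extensions (also accepted by clang). The definition of `UNS64`, the helper names, and the extra `may_alias` attribute are assumptions added for this example and are not part of the quoted code. The `aligned(8)` part lowers the type's alignment from the default 16 bytes, so the store below is valid for any 8-byte-aligned cell address.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t UNS64;  /* assumed definition of the quoted UNS64 */

/* The quoted typedef, plus may_alias (an addition for this sketch) so
   that UNS64 memory may be accessed through the vector type. */
typedef UNS64 v2u64
    __attribute__((vector_size(16), aligned(8), may_alias));

/* Hypothetical helper: 2!-style store of w2 to addr[0] and w1 to
   addr[1] as one 16-byte store; addr needs only 8-byte alignment. */
void two_store_vec(UNS64 *addr, UNS64 w1, UNS64 w2)
{
    v2u64 v = { w2, w1 };   /* element 0 lands at the lower address */
    *(v2u64 *)addr = v;
}

/* Small self-check on an address that need not be 16-byte aligned. */
void two_store_vec_check(void)
{
    UNS64 buf[3] = { 0, 0, 0 };
    two_store_vec(buf + 1, 10, 20);
    assert(buf[0] == 0 && buf[1] == 20 && buf[2] == 10);
}
```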
      
   - anton   
   --   
   M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
    New standard: https://forth-standard.org/   
   EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/   
      