

   comp.lang.forth      Forth programmers eat a lot of Bratwurst      117,927 messages   


   Message 117,856 of 117,927   
   peter to Anton Ertl   
   Re: C compiler optimization and Forth en   
   27 Jan 26 15:44:55   
   
   From: peter.noreply@tin.it   
      
   On Mon, 26 Jan 2026 19:24:43 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > peter  writes:   
   > >On Sat, 24 Jan 2026 16:47:16 GMT   
   > >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
   > >> I have now also tried it with gcc-14.2, and that produces better code.   
   > >> Results from a Xeon E-2388G (Rocket Lake):   
   > >>   
   > >>  sieve bubble matrix   fib   fft gcc options   
   > >>  0.032  0.032  0.015 0.037 0.014 -O2   
   > >>  0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)   
   > >>  0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)   
   > >>   
   > >> The code for ROT and 2SWAP does not use auto-vectorization, and the   
   > >> code for 2! uses auto-vectorization in a way that reduces the   
   > >> instruction count:   
   > >>   
   > >> -O3 (auto-vectorized)     -O3 -fno-tree-vectorize   
   > >> add    $0x8,%rbx          add $0x8,%rbx   
   > >> movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax   
   > >> add    $0x18,%r13         mov 0x8(%r13),%rdx   
   > >> movhps -0x8(%r13),%xmm0   add $0x18,%r13   
   > >> movups %xmm0,(%r8)        mov %rdx,(%r8)   
   > >> mov    0x0(%r13),%r8      mov %rax,0x8(%r8)   
   > >> mov    (%rbx),%rax        mov 0x0(%r13),%r8   
   > >> jmp    *%rax              mov (%rbx),%rax   
   > >>                           jmp *%rax   
   > >>   
   > >> And the common tail with all these move instructions is gone.   
   > >>   
   > >> - anton   
   > >   
   > >What does your C code look like? I could not get clang or gcc to   
   > >auto-vectorize with my existing code:   
   > >   
   > >        UNS64 *tmp64 = (UNS64*)TOP;   
   > >        tmp64[0] = sp[0];   
   > >        tmp64[1] = sp[1];   
   > >        TOP = sp[2];   
   > >        sp += 3;   
   >   
   > Gforth's source code for 2! is:   
   >   
   > 2!	( w1 w2 a_addr -- )		core	two_store   
   > ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""   
   > a_addr[0] = w2;   
   > a_addr[1] = w1;   
   >   
   > A generator produces the following from that, which is passed to gcc:   
   >   
   > LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1  */   
   > /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */   
   > NAME("2!")   
   > ip += 1;   
   > LABEL1(two_store)   
   > {   
   > DEF_CA   
   > MAYBE_UNUSED Cell w1;   
   > MAYBE_UNUSED Cell w2;   
   > MAYBE_UNUSED Cell * a_addr;   
   > NEXT_P0;   
   > vm_Cell2w(sp[2],w1);   
   > vm_Cell2w(sp[1],w2);   
   > vm_Cell2a_(spTOS,a_addr);   
   > #ifdef VM_DEBUG   
   > if (vm_debug) {   
   > fputs(" w1=", vm_out); printarg_w(w1);   
   > fputs(" w2=", vm_out); printarg_w(w2);   
   > fputs(" a_addr=", vm_out); printarg_a_(a_addr);   
   > }   
   > #endif   
   > sp += 3;   
   > {   
   > #line 1815 "prim"   
   > a_addr[0] = w2;   
   > a_addr[1] = w1;   
   > #line 10136 "prim-fast.i"   
   > }   
   >   
   > #ifdef VM_DEBUG   
   > if (vm_debug) {   
   > fputs(" -- ", vm_out); fputc('\n', vm_out);   
   > }   
   > #endif   
   > NEXT_P1;   
   > spTOS = sp[0];   
   > LABEL2(two_store)   
   > NAME1("l2-two_store")   
   > NEXT_P1_5;   
   > LABEL3(two_store)   
   > NAME1("l3-two_store")   
   > DO_GOTO;   
   > }   
   >   
   > There are a lot of macros in this code, and I fear that expanding them   
   > makes the code even less readable, but the essence for the   
   > auto-vectorized part is something like:   
   >   
   > w1 = sp[2];   
   > w2 = sp[1];   
   > a_addr = spTOS;   
   > sp += 3;   
   > a_addr[0] = w2;   
   > a_addr[1] = w1;   
   > spTOS = sp[0];   
   >   
   > My guess is that in your code the compiler expected that sp[1] might   
   > alias with tmp64[0], and therefore did not vectorize the loads and the   
   > stores, whereas in the Gforth code, the loads both happen first, and   
   > then the two stores, and gcc can vectorize that.  I doubt that there   
   > is a big benefit from that, though.   
      
   Yes, that was it. Changing to:   
      
           UNS64 *tmp64 = (UNS64*)TOP;   
           UNS64 d0=sp[0];   
           UNS64 d1=sp[1];   
           tmp64[0] = d0;   
           tmp64[1] = d1;   
           TOP = sp[2];   
           sp += 3;   
      
   made the compiler (clang-21 in this case) generate the expected code.   
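
To illustrate why the order of loads and stores matters, here is a minimal sketch in plain C. The function names store_pair_interleaved and store_pair_loads_first are hypothetical, and uint64_t stands in for UNS64. It only shows that the two forms give different results when the destination overlaps the stack, which is the case the compiler has to allow for in the interleaved form. Whether a given compiler then merges the loads and stores into vector instructions depends on its version and options.

```c
#include <stdint.h>

/* Interleaved form: the second load comes after the first store, so if
   dst[0] and sp[1] are the same cell, sp[1] must be re-read. */
void store_pair_interleaved(uint64_t *dst, const uint64_t *sp)
{
    dst[0] = sp[0];
    dst[1] = sp[1];
}

/* Loads-first form: both values are read before anything is written,
   so a possible overlap cannot change what gets stored. */
void store_pair_loads_first(uint64_t *dst, const uint64_t *sp)
{
    uint64_t d0 = sp[0];
    uint64_t d1 = sp[1];
    dst[0] = d0;
    dst[1] = d1;
}
```

With a buffer {1, 2, 3}, sp pointing at its start and dst pointing at its second cell, the interleaved form leaves {1, 1, 1} and the loads-first form leaves {1, 1, 2}. The compiler has to preserve that difference unless it can prove the pointers do not overlap.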
      
      
   >   
   > >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));   
   >   
   > I'll have to remember the aligned attribute for future games with gcc   
   > explicit vectorization.   
      
   Without that, it will generate opcodes that need 16-byte alignment.   
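
A rough sketch of the explicit variant, assuming the GCC/Clang vector extension. The function name store_pair_vec is hypothetical, uint64_t stands in for UNS64, and the may_alias attribute is an addition for this sketch (it is not in the typedef quoted above) so that the pointer casts stay safe under strict aliasing. The lowered alignment tells the compiler that the pointers are only guaranteed to be cell-aligned.

```c
#include <stdint.h>

/* Two 64-bit cells as one 16-byte vector, but with only 8-byte alignment
   required. may_alias is an assumption added for this sketch. */
typedef uint64_t v2u64
    __attribute__((vector_size(16), aligned(8), may_alias));

/* Copy two cells from sp to dst with one vector load and one vector store.
   Both pointers only need to be 8-byte aligned. */
void store_pair_vec(uint64_t *dst, const uint64_t *sp)
{
    v2u64 v = *(const v2u64 *)sp;
    *(v2u64 *)dst = v;
}
```

Because the whole vector is loaded before it is stored, this behaves like the loads-first scalar version, also when dst and sp overlap.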
      
   BR   
   Peter   
      
   > - anton   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca