

   comp.lang.forth      Forth programmers eat a lot of Bratwurst      117,927 messages   


   Message 117,852 of 117,927   
   peter to Anton Ertl   
   Re: C compiler optimization and Forth en   
   25 Jan 26 23:31:10   
   
   From: peter.noreply@tin.it   
      
   On Sat, 24 Jan 2026 16:47:16 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   > >Hans Bezemer  writes:   
   > >>I've done my thing, compiled 4tH with optimizations -O3 till -O0.   
   > >>I thought, let's make this simple and execute ALL benchmarks I got. Some   
   > >>of them have become useless, though for the simple reason hardware has   
   > >>become that much better.   
   > >>   
   > >>But still, here it is. Overall, the performance consistently   
   > >>deteriorates, aka -O3 gives the best performance.   
   > >   
   > >Which compiler and which hardware?   
   > >   
   > >For a random program, I would expect higher optimization levels to   
   > >produce faster code.  For a Forth system and these recent gccs, the   
   > >auto-vectorization of adjacent memory accesses may lead to similar   
   > >problems as in the C bubble-sort benchmark.  In Gforth, this actually   
   > >happens unless we disable vectorization (which we normally do), and,   
   > >moreover, with the vectorized code, gcc introduces additional   
   > >inefficiencies (see below).   
   > >   
   > >Here's the output of ./gforth-fast onebench.fs compiled from the   
   > >current development version with gcc-12.2 and running on a Ryzen 5800X   
   > >(numbers are times, lower is better):   
   > >   
   > > sieve bubble matrix   fib   fft gcc options   
   > > 0.025  0.023  0.013 0.033 0.016 -O2   
   > > 0.025  0.023  0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)   
   > > 0.404  0.418  0.377 0.472 0.244 -O3 (with auto vectorization)   
   > > 0.145  0.122  0.124 0.122 0.073 gforth default, using --no-dynamic   
   >   
   > I have now also tried it with gcc-14.2, and that produces better code.   
   > Results from a Xeon E-2388G (Rocket Lake):   
   >   
   >  sieve bubble matrix   fib   fft gcc options   
   >  0.032  0.032  0.015 0.037 0.014 -O2   
   >  0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)   
   >  0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)   
   >   
   > The code for ROT and 2SWAP does not use auto-vectorization, and the   
   > code for 2! uses auto-vectorization in a way that reduces the   
   > instruction count:   
   >   
   > -O3 (auto-vectorized)     -O3 -fno-tree-vectorize   
   > add    $0x8,%rbx          add $0x8,%rbx   
   > movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax   
   > add    $0x18,%r13         mov 0x8(%r13),%rdx   
   > movhps -0x8(%r13),%xmm0   add $0x18,%r13   
   > movups %xmm0,(%r8)        mov %rdx,(%r8)   
   > mov    0x0(%r13),%r8      mov %rax,0x8(%r8)   
   > mov    (%rbx),%rax        mov 0x0(%r13),%r8   
   > jmp    *%rax              mov (%rbx),%rax   
   >                           jmp *%rax   
   >   
   > And the common tail with all these move instructions is gone.   
   >   
   > - anton   
      
   What does your C code look like? I could not get clang or gcc to
   auto-vectorize my existing code:
      
           UNS64 *tmp64 = (UNS64*)TOP;
           tmp64[0] = sp[0];   
           tmp64[1] = sp[1];   
           TOP = sp[2];   
           sp += 3;   
      
      
   In the end I changed my code to tell the compiler that it is a vector with   
      
   typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
      
   and   
           *(v2u64*)TOP = *(v2u64*)sp;   
           TOP=sp[2];   
           sp=sp+3;   
      
   this will produce   
      
   	vmovups	xmm0, xmmword ptr [rdx]   
   	vmovups	xmmword ptr [r8], xmm0   
   	mov	r8, qword ptr [rdx + 16]   
   	add	rdx, 24   
      
   	movzx	r9d, byte ptr [rcx]	// nesting code   
   	inc	rcx   
   	jmp	qword ptr [rax + 8*r9]   
      
   Using memcpy((UNS64*)TOP, (UNS64*)sp, 16); instead also gives the same code!
      
   It looks like this also works on ARM64.
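
   For anyone who wants to compare the compiler output themselves, here is a
   minimal sketch of the three 2! variants side by side. It is not the actual
   interpreter source: the vm_state struct, the function names, and the
   definition of UNS64 as uint64_t are assumptions made for illustration (in
   the real code TOP and sp are presumably locals of the interpreter loop).
   The may_alias attribute is also an addition, included as a precaution
   against strict-aliasing surprises.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>
#include <string.h>

typedef uint64_t UNS64;   /* assumption: one cell is 64 bits */

/* Two cells viewed as one 16-byte vector that is only 8-byte aligned.
   may_alias is an addition here, not part of the typedef in the post. */
typedef UNS64 v2u64
    __attribute__((vector_size(16), aligned(8), may_alias));

static_assert(sizeof(v2u64) == 2 * sizeof(UNS64),
              "vector must cover exactly two cells");

/* Hypothetical stand-in for the interpreter state. */
typedef struct {
    UNS64  top;   /* cached top-of-stack cell (TOP); holds the target address */
    UNS64 *sp;    /* remaining stack cells, sp[0] is the second item */
} vm_state;

/* Variant 1: two separate 64-bit stores. */
void two_store_scalar(vm_state *s)
{
    UNS64 *dst = (UNS64 *)(uintptr_t)s->top;
    dst[0] = s->sp[0];
    dst[1] = s->sp[1];
    s->top = s->sp[2];
    s->sp += 3;
}

/* Variant 2: tell the compiler the pair is a vector. */
void two_store_vector(vm_state *s)
{
    *(v2u64 *)(uintptr_t)s->top = *(v2u64 *)s->sp;
    s->top = s->sp[2];
    s->sp += 3;
}

/* Variant 3: fixed-size memcpy, which compilers typically expand inline. */
void two_store_memcpy(vm_state *s)
{
    memcpy((void *)(uintptr_t)s->top, s->sp, 2 * sizeof(UNS64));
    s->top = s->sp[2];
    s->sp += 3;
}
```

   Compiling this with something like -O3 -S and comparing the three
   function bodies should show whether a given compiler version merges the
   two stores on its own or needs the vector or memcpy hint. All three
   variants have the same effect on memory and on the stack.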
   BR   
   Peter   
      


