From: peter.noreply@tin.it   
      
   On Sat, 24 Jan 2026 16:47:16 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   > >Hans Bezemer writes:   
   > >>I've done my thing, compiled 4tH with optimizations -O3 till -O0.   
   > >>I thought, let's make this simple and execute ALL benchmarks I got. Some   
   > >>of them have become useless, though for the simple reason hardware has   
   > >>become that much better.   
   > >>   
   > >>But still, here it is. Overall, the performance consistently   
   > >>deteriorates, aka -O3 gives the best performance.   
   > >   
   > >Which compiler and which hardware?   
   > >   
   > >For a random program, I would expect higher optimization levels to   
   > >produce faster code. For a Forth system and these recent gccs, the   
   > >auto-vectorization of adjacent memory accesses may lead to similar   
   > >problems as in the C bubble-sort benchmark. In Gforth, this actually   
   > >happens unless we disable vectorization (which we normally do), and,   
   > >moreover, with the vectorized code, gcc introduces additional   
   > >inefficiencies (see below).   
   > >   
   > >Here's the output of ./gforth-fast onebench.fs compiled from the   
   > >current development version with gcc-12.2 and running on a Ryzen 5800X   
   > >(numbers are times, lower is better):   
   > >   
   > > sieve bubble matrix    fib    fft  gcc options   
   > > 0.025  0.023  0.013  0.033  0.016  -O2   
   > > 0.025  0.023  0.013  0.037  0.016  -O3 -fno-tree-vectorize (gforth default)   
   > > 0.404  0.418  0.377  0.472  0.244  -O3 (with auto-vectorization)   
   > > 0.145  0.122  0.124  0.122  0.073  gforth default, using --no-dynamic   
   >   
   > I have now also tried it with gcc-14.2, and that produces better code.   
   > Results from a Xeon E-2388G (Rocket Lake):   
   >   
   > sieve bubble matrix    fib    fft  gcc options   
   > 0.032  0.032  0.015  0.037  0.014  -O2   
   > 0.035  0.032  0.015  0.037  0.014  -O3 -fno-tree-vectorize (gforth default)   
   > 0.033  0.034  0.016  0.032  0.014  -O3 (with auto-vectorization)   
   >   
   > The code for ROT and 2SWAP does not use auto-vectorization, and the   
   > code for 2! uses auto-vectorization in a way that reduces the   
   > instruction count:   
   >   
   > -O3 (auto-vectorized)         -O3 -fno-tree-vectorize   
   > add    $0x8,%rbx              add  $0x8,%rbx   
   > movq   0x8(%r13),%xmm0        mov  0x10(%r13),%rax   
   > add    $0x18,%r13             mov  0x8(%r13),%rdx   
   > movhps -0x8(%r13),%xmm0       add  $0x18,%r13   
   > movups %xmm0,(%r8)            mov  %rdx,(%r8)   
   > mov    0x0(%r13),%r8          mov  %rax,0x8(%r8)   
   > mov    (%rbx),%rax            mov  0x0(%r13),%r8   
   > jmp    *%rax                  mov  (%rbx),%rax   
   >                               jmp  *%rax   
   >   
   > And the common tail with all these move instructions is gone.   
   >   
   > - anton   
      
   What does your C code look like? I could not get clang or gcc to   
   auto-vectorize my existing code:   
      
    UNS64 *tmp64 = (UNS64*)TOP;   
    tmp64[0] = sp[0];   
    tmp64[1] = sp[1];   
    TOP = sp[2];   
    sp += 3;   
      
      
   In the end I changed my code to tell the compiler that it is a vector with   
      
   typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));   
      
   and   
    *(v2u64*)TOP = *(v2u64*)sp;   
    TOP = sp[2];   
    sp += 3;   
      
   This produces:   
      
    vmovups xmm0, xmmword ptr [rdx]   
    vmovups xmmword ptr [r8], xmm0   
    mov r8, qword ptr [rdx + 16]   
    add rdx, 24   
      
    movzx r9d, byte ptr [rcx] // nesting code   
    inc rcx   
    jmp qword ptr [rax + 8*r9]   
      
   But using memcpy((UNS64*)TOP, (UNS64*)sp, 16); also gives the same code!   
      
   Looks like it also works on ARM64.   
   BR   
   Peter   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   