From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >Hans Bezemer writes:   
   >>I've done my thing, compiled 4tH with optimizations -O3 till -O0.   
   >>I thought, let's make this simple and execute ALL benchmarks I got. Some   
   >>of them have become useless, though for the simple reason hardware has   
   >>become that much better.   
   >>   
   >>But still, here it is. Overall, the performance consistently   
   >>deteriorates, aka -O3 gives the best performance.   
   >   
   >Which compiler and which hardware?   
   >   
   >For a random program, I would expect higher optimization levels to   
   >produce faster code. For a Forth system and these recent gccs, the   
   >auto-vectorization of adjacent memory accesses may lead to similar   
   >problems as in the C bubble-sort benchmark. In Gforth, this actually   
   >happens unless we disable vectorization (which we normally do), and,   
   >moreover, with the vectorized code, gcc introduces additional   
   >inefficiencies (see below).   
   >   
   >Here's the output of ./gforth-fast onebench.fs compiled from the   
   >current development version with gcc-12.2 and running on a Ryzen 5800X   
   >(numbers are times, lower is better):   
   >   
   > sieve bubble matrix   fib   fft  gcc options
   > 0.025  0.023  0.013 0.033 0.016  -O2
   > 0.025  0.023  0.013 0.037 0.016  -O3 -fno-tree-vectorize (gforth default)
   > 0.404  0.418  0.377 0.472 0.244  -O3 (with auto vectorization)
   > 0.145  0.122  0.124 0.122 0.073  gforth default, using --no-dynamic
      
   I have now also tried it with gcc-14.2, and that produces better code.   
   Results from a Xeon E-2388G (Rocket Lake):   
      
    sieve bubble matrix   fib   fft  gcc options
    0.032  0.032  0.015 0.037 0.014  -O2
    0.035  0.032  0.015 0.037 0.014  -O3 -fno-tree-vectorize (gforth default)
    0.033  0.034  0.016 0.032 0.014  -O3 (with auto vectorization)
      
   The code for ROT and 2SWAP does not use auto-vectorization, and the   
   code for 2! uses auto-vectorization in a way that reduces the   
   instruction count:   
      
   -O3 (auto-vectorized)        -O3 -fno-tree-vectorize
   add    $0x8,%rbx             add    $0x8,%rbx
   movq   0x8(%r13),%xmm0       mov    0x10(%r13),%rax
   add    $0x18,%r13            mov    0x8(%r13),%rdx
   movhps -0x8(%r13),%xmm0      add    $0x18,%r13
   movups %xmm0,(%r8)           mov    %rdx,(%r8)
   mov    0x0(%r13),%r8         mov    %rax,0x8(%r8)
   mov    (%rbx),%rax           mov    0x0(%r13),%r8
   jmp    *%rax                 mov    (%rbx),%rax
                                jmp    *%rax
      
   And the common tail with all these move instructions is gone.   
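
For readers who want to try this outside of Gforth, here is a minimal C sketch of the kind of code that the 2! listing above comes from. It is not Gforth's actual source: the names Cell, two_store and two_store_demo are made up for illustration, and the stack layout (a-addr at sp[0], with nothing cached in a register) is a simplifying assumption. The point is only the two stores to adjacent cells, which is the pattern that gcc -O3 may combine into one 16-byte store, while -O3 -fno-tree-vectorize keeps two 8-byte stores.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t Cell;   /* one Forth cell; 8 bytes on x86-64 */

/* Hypothetical stand-in for 2! ( x1 x2 a-addr -- ).
   Assumed layout: sp[0] = a-addr, sp[1] = x2, sp[2] = x1.
   Returns the stack pointer after dropping the three cells. */
Cell *two_store(Cell *sp)
{
    Cell *a_addr = (Cell *)sp[0];
    Cell x2 = sp[1];
    Cell x1 = sp[2];

    a_addr[0] = x2;   /* first of two stores to adjacent cells */
    a_addr[1] = x1;   /* candidate for merging into one vector store */
    return sp + 3;
}

/* Small self-check: x2 must land at a-addr, x1 in the next cell. */
int two_store_demo(void)
{
    Cell target[2] = {0, 0};
    Cell stack[4] = {(Cell)target, 22, 11, 99};
    Cell *sp = two_store(stack);

    assert(target[0] == 22 && target[1] == 11);
    return sp == stack + 3;
}
```

Compiling this once with gcc -O3 -S and once with gcc -O3 -fno-tree-vectorize -S and comparing the assembly for two_store should show whether your gcc merges the stores. The result depends on the gcc version and target, so it will not necessarily match the listing above.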
      
   - anton   
   --   
   M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
    New standard: https://forth-standard.org/   
   EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/   
      