comp.lang.forth
Forth programmers eat a lot of Bratwurst
Anton Ertl to Anton Ertl
Re: C compiler optimization and Forth en
24 Jan 26 16:47:16
   From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >Hans Bezemer  writes:   
   >>I've done my thing, compiled 4tH with optimizations -O3 till -O0.   
   >>I thought, let's make this simple and execute ALL benchmarks I got. Some   
   >>of them have become useless, though for the simple reason hardware has   
   >>become that much better.   
   >>   
   >>But still, here it is. Overall, the performance consistently   
   >>deteriorates, aka -O3 gives the best performance.   
   >   
   >Which compiler and which hardware?   
   >   
   >For a random program, I would expect higher optimization levels to   
   >produce faster code.  For a Forth system and these recent gccs, the   
   >auto-vectorization of adjacent memory accesses may lead to similar   
   >problems as in the C bubble-sort benchmark.  In Gforth, this actually   
   >happens unless we disable vectorization (which we normally do), and,   
   >moreover, with the vectorized code, gcc introduces additional   
   >inefficiencies (see below).   
   >   
   >Here's the output of ./gforth-fast onebench.fs compiled from the   
   >current development version with gcc-12.2 and running on a Ryzen 5800X   
   >(numbers are times, lower is better):   
   >   
   > sieve bubble matrix   fib   fft gcc options   
   > 0.025  0.023  0.013 0.033 0.016 -O2   
   > 0.025  0.023  0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)   
   > 0.404  0.418  0.377 0.472 0.244 -O3 (with auto vectorization)   
   > 0.145  0.122  0.124 0.122 0.073 gforth default, using --no-dynamic   
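The quoted text refers to the C bubble-sort benchmark without showing it. The following is only a hedged illustration, not that benchmark's source: a plain C bubble sort whose inner loop loads and conditionally stores the adjacent elements a[j] and a[j+1]. This is the kind of adjacent-access pattern that gcc's SLP vectorizer (enabled by -O3, disabled by -fno-tree-vectorize) may merge into 16-byte vector accesses. A plausible cost, assuming that happens, is that a 16-byte store in one iteration is followed by an overlapping load in the next, which can defeat store-to-load forwarding on x86. The function name bubble_sort is hypothetical.

```c
#include <stddef.h>

/* Hypothetical illustration: plain bubble sort over long values.
   The compare-and-swap reads and writes the adjacent elements a[j]
   and a[j+1]; gcc -O3 may turn these into vector loads/stores. */
void bubble_sort(long *a, size_t n)
{
    for (size_t i = n; i > 1; i--) {           /* a[i..n-1] is already sorted */
        for (size_t j = 0; j + 1 < i; j++) {
            long x = a[j], y = a[j + 1];       /* two adjacent loads */
            if (x > y) {
                a[j] = y;                      /* two adjacent stores, a  */
                a[j + 1] = x;                  /* candidate for SLP merging */
            }
        }
    }
}
```

Comparing the output of gcc -O3 -S and gcc -O3 -fno-tree-vectorize -S for this function is one way to see whether a given gcc version vectorizes the swap.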
      
   I have now also tried it with gcc-14.2, and that produces better code.   
   Results from a Xeon E-2388G (Rocket Lake):   
      
    sieve bubble matrix   fib   fft gcc options   
    0.032  0.032  0.015 0.037 0.014 -O2   
    0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)   
    0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)   
      
   The code for ROT and 2SWAP does not use auto-vectorization, and the   
   code for 2! uses auto-vectorization in a way that reduces the   
   instruction count:   
      
   -O3 (auto-vectorized)     -O3 -fno-tree-vectorize   
   add    $0x8,%rbx          add $0x8,%rbx   
   movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax   
   add    $0x18,%r13         mov 0x8(%r13),%rdx   
   movhps -0x8(%r13),%xmm0   add $0x18,%r13   
   movups %xmm0,(%r8)        mov %rdx,(%r8)   
   mov    0x0(%r13),%r8      mov %rax,0x8(%r8)   
   mov    (%rbx),%rax        mov 0x0(%r13),%r8   
   jmp    *%rax              mov (%rbx),%rax   
                             jmp *%rax   
      
   And the common tail with all these move instructions is gone.   
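To make the listing easier to read, here is a hypothetical C rendering of the 2! primitive ( x1 x2 a-addr -- ), which stores x2 at a-addr and x1 in the next cell. It assumes the register roles visible in the listing: %r13 is the data-stack pointer (sp, growing downward), %r8 caches the top of stack (tos, here a-addr), and %rbx is the threaded-code instruction pointer used by the add/mov/jmp dispatch, which the sketch omits. The names Cell, sp, tos and two_store are illustrative assumptions, not Gforth's actual source. The two loads from adjacent stack cells and the two stores to adjacent memory cells correspond to what gcc-14.2 appears to have combined into the movq/movhps load pair and the single movups store.

```c
#include <stdint.h>

typedef intptr_t Cell;   /* assumption: one Forth cell = one pointer-sized integer */

/* Hypothetical VM registers: sp plays the role of %r13, tos of %r8.
   sp[0] is the (stale) memory slot of the cached top-of-stack item. */
static Cell *sp;
static Cell tos;

/* 2! ( x1 x2 a-addr -- ): tos holds a-addr, sp[1] is x2, sp[2] is x1. */
static void two_store(void)
{
    Cell *addr = (Cell *)tos;
    Cell x2 = sp[1];     /* mov 0x8(%r13),%rdx   (adjacent load)  */
    Cell x1 = sp[2];     /* mov 0x10(%r13),%rax  (adjacent load)  */
    sp += 3;             /* add $0x18,%r13: pop x1, x2 and a-addr */
    addr[0] = x2;        /* mov %rdx,(%r8)       (adjacent store) */
    addr[1] = x1;        /* mov %rax,0x8(%r8)    (adjacent store) */
    tos = sp[0];         /* mov 0x0(%r13),%r8: refill cached TOS  */
}
```

Because both loads happen before both stores, a vectorizer can merge each pair without an aliasing check between addr and sp, which fits the shorter auto-vectorized sequence in the left column.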
      
   - anton   
   --   
   M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
        New standard: https://forth-standard.org/   
   EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/   
      