comp.lang.forth
Anton Ertl to Anton Ertl
Re: Floating point implementations on AM
21 Apr 24 09:12:54
   From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >OTOH, yesterday I saw what gcc did for the inner loop of the bubble   
   >benchmark from the Stanford integer benchmarks:   
   >   
   >        while ( i<top ) {   
   >            if ( sortlist[i] > sortlist[i+1] ) {   
   >                j = sortlist[i];   
   >                sortlist[i] = sortlist[i+1];   
   >                sortlist[i+1] = j;   
   >                };   
   >            i=i+1;   
   >            };   
   >   
   >        top=top-1;   
   >        };   
   >   
   >gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants   
   >to use SIMD instructions:   
   >   
   >    gcc -O1                      gcc -O3   
   >1c: add   $0x4,%rax          c0: movq   (%rax),%xmm0   
   >    cmp   %rsi,%rax              add    $0x1,%edx   
   >    je    35                     pshufd $0xe5,%xmm0,%xmm1   
   >25: mov   (%rax),%edx            movd   %xmm0,%edi   
   >    mov   0x4(%rax),%ecx         movd   %xmm1,%ecx   
   >    cmp   %ecx,%edx              cmp    %ecx,%edi   
   >    jle   1c                     jle    e1   
   >    mov   %ecx,(%rax)            pshufd $0xe1,%xmm0,%xmm0   
   >    mov   %edx,0x4(%rax)         movq   %xmm0,(%rax)   
   >    jmp   1c                 e1: add    $0x4,%rax   
   >35:                              cmp    %r8d,%edx   
   >                                 jl     c0   
   >   
   >The version produced by gcc -O3 is almost three times slower on a   
   >Skylake than the one by gcc -O1 and is actually slower than several   
   >Forth systems, including gforth-fast.  I think that the reason is that   
   >the movq towards the end stores two items, and the movq at the start   
   >of the next iteration loads one of these items, i.e., there is partial   
   >overlap between the store and the load.  In this case the hardware   
   >takes a slow path, which means that the slowdown is much bigger than   
   >the instruction count suggests.   
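
   The access pattern described in the quoted text can be sketched in C.
   This is only an illustration, not the code gcc generates: the names
   pass_scalar and pass_pairwise are made up here, the indexing is
   0-based instead of the benchmark's, and memcpy stands in for the
   8-byte movq load and store. A compiler is free to turn the memcpy
   calls into different instructions, so this shows the intended memory
   accesses, not a guaranteed reproduction of the slowdown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar pass, in the style of the gcc -O1 code: two 4-byte loads per
   iteration, and two 4-byte stores only when a swap is needed. */
static void pass_scalar(int32_t *a, int n)
{
    assert(n >= 0);
    for (int i = 0; i + 1 < n; i++) {
        if (a[i] > a[i + 1]) {
            int32_t t = a[i];
            a[i] = a[i + 1];
            a[i + 1] = t;
        }
    }
}

/* Pairwise pass, in the style of the gcc -O3 code: one 8-byte load of
   a[i] and a[i+1], and one 8-byte store of the swapped pair.  After a
   swap, the next iteration's 8-byte load starts 4 bytes into the bytes
   just stored, so the load only partially overlaps the store. */
static void pass_pairwise(int32_t *a, int n)
{
    assert(n >= 0);
    for (int i = 0; i + 1 < n; i++) {
        int32_t pair[2];
        memcpy(pair, &a[i], sizeof pair);           /* 8-byte load  */
        if (pair[0] > pair[1]) {
            int32_t swapped[2] = { pair[1], pair[0] };
            memcpy(&a[i], swapped, sizeof swapped); /* 8-byte store */
        }
    }
}
```

   Both functions perform one bubble pass and leave the array in the
   same state; they differ only in the width of the memory accesses.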
      
   I was curious if a more recent Intel core had improved on that (and   
   maybe such a more recent Intel core was targeted by the "optimization"   
   that caused the slowdown), so I measured it on a P-core of a Core   
   i3-1315U.  The results are as follows:   
      
       O1/bubble      O3/bubble   
     424,248,952  2,061,809,866      cpu_core/cycles/   
   1,536,825,253  1,986,035,580      cpu_core/instructions/   
      
   So, a slowdown by more than a factor of 4 (about 4.9) on this microarchitecture.   
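
   For reference, the factor and the instructions per cycle (IPC) can be
   worked out from the counter values above. The following is a small
   sketch; the constant and function names are made up for this
   illustration, and the numbers are copied from the table.

```c
#include <assert.h>

/* Counter values copied from the measurements above. */
static const double O1_CYCLES       =  424248952.0;
static const double O3_CYCLES       = 2061809866.0;
static const double O1_INSTRUCTIONS = 1536825253.0;
static const double O3_INSTRUCTIONS = 1986035580.0;

/* Quotient a/b with a guard against a zero or negative denominator. */
static double ratio(double a, double b)
{
    assert(b > 0.0);
    return a / b;
}

/* How many times more cycles the O3 version needs: about 4.86. */
static double cycle_slowdown(void)
{
    return ratio(O3_CYCLES, O1_CYCLES);
}

/* How many times more instructions the O3 version executes: about 1.29. */
static double instruction_growth(void)
{
    return ratio(O3_INSTRUCTIONS, O1_INSTRUCTIONS);
}

/* Instructions per cycle: about 3.62 for O1 and about 0.96 for O3. */
static double ipc(double instructions, double cycles)
{
    return ratio(instructions, cycles);
}
```

   So the O3 version executes about 29% more instructions, but the bulk
   of the slowdown comes from the drop in IPC.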
      
   The differences in the topdown analysis are also interesting:   
      
   O1   
    1,177,188,340 cpu_core/topdown-retiring/       # 46.1% Retiring   
      279,332,826 cpu_core/topdown-bad-spec/       # 10.9% Bad Speculation   
      778,141,445 cpu_core/topdown-fe-bound/       # 30.5% Frontend Bound   
      319,237,516 cpu_core/topdown-be-bound/       # 12.5% Backend Bound   
                0 cpu_core/topdown-heavy-ops/      #  0.0% Heavy Operations   
      269,356,654 cpu_core/topdown-br-mispredict/  # 10.5% Branch Mispredict   
      269,356,654 cpu_core/topdown-fetch-lat/      # 10.5% Fetch Latency   
       59,857,034 cpu_core/topdown-mem-bound/      #  2.3% Memory Bound   
      
   O3   
    1,599,831,263 cpu_core/topdown-retiring/       # 12.9% Retiring   
      630,236,558 cpu_core/topdown-bad-spec/       #  5.1% Bad Speculation   
      533,277,087 cpu_core/topdown-fe-bound/       #  4.3% Frontend Bound   
    9,598,987,583 cpu_core/topdown-be-bound/       # 77.6% Backend Bound   
          280,169 cpu_core/topdown-heavy-ops/      #  0.0% Heavy Operations   
      630,236,558 cpu_core/topdown-br-mispredict/  #  5.1% Branch Mispredict   
      193,918,941 cpu_core/topdown-fetch-lat/      #  1.6% Fetch Latency   
    5,623,649,291 cpu_core/topdown-mem-bound/      # 45.5% Memory Bound   
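
   The percentages in these listings can be recomputed from the raw
   counts: the four top-level categories (retiring, bad-spec, fe-bound,
   be-bound) add up to the total, and each percentage is a share of that
   sum. A minimal sketch, with struct and function names invented for
   this illustration and the counts copied from the listings above:

```c
#include <assert.h>

/* The four top-level topdown categories of one run. */
struct topdown {
    double retiring;
    double bad_spec;
    double fe_bound;
    double be_bound;
};

/* Counts copied from the O1 and O3 listings above. */
static const struct topdown O1 = {
    1177188340.0, 279332826.0, 778141445.0, 319237516.0
};
static const struct topdown O3 = {
    1599831263.0, 630236558.0, 533277087.0, 9598987583.0
};

/* Sum of the four top-level categories. */
static double total_slots(struct topdown t)
{
    return t.retiring + t.bad_spec + t.fe_bound + t.be_bound;
}

/* Share of one count in the total, in percent. */
static double share_percent(double part, struct topdown t)
{
    double total = total_slots(t);
    assert(total > 0.0);
    return 100.0 * part / total;
}
```

   With these inputs, share_percent(O3.be_bound, O3) comes out at about
   77.6 and share_percent(O1.retiring, O1) at about 46.1, matching the
   listings. Dividing total_slots by the cycle counts measured earlier
   gives about 6 per cycle for both runs; reading that as the pipeline
   width is my assumption, not something the measurements state.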
      
   - anton   
   --   
   M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
        New standard: https://forth-standard.org/   
      EuroForth 2023: https://euro.theforth.net/2023   
      