From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >OTOH, yesterday I saw what gcc did for the inner loop of the bubble   
   >benchmark from the Stanford integer benchmarks:   
   >   
   > while ( i   
   > if ( sortlist[i] > sortlist[i+1] ) {   
   > j = sortlist[i];   
   > sortlist[i] = sortlist[i+1];   
   > sortlist[i+1] = j;   
   > };   
   > i=i+1;   
   > };   
   >   
   > top=top-1;   
   > };   
   >   
   >gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
   >to use SIMD instructions:   
   >   
   >       gcc -O1                        gcc -O3
   >1c:    add    $0x4,%rax          c0:  movq   (%rax),%xmm0
   >       cmp    %rsi,%rax               add    $0x1,%edx
   >       je     35                      pshufd $0xe5,%xmm0,%xmm1
   >25:    mov    (%rax),%edx             movd   %xmm0,%edi
   >       mov    0x4(%rax),%ecx          movd   %xmm1,%ecx
   >       cmp    %ecx,%edx               cmp    %ecx,%edi
   >       jle    1c                      jle    e1
   >       mov    %ecx,(%rax)             pshufd $0xe1,%xmm0,%xmm0
   >       mov    %edx,0x4(%rax)          movq   %xmm0,(%rax)
   >       jmp    1c                 e1:  add    $0x4,%rax
   >35:                                   cmp    %r8d,%edx
   >                                      jl     c0
   >   
   >The version produced by gcc -O3 is almost three times slower on a   
   >Skylake than the one by gcc -O1 and is actually slower than several   
   >Forth systems, including gforth-fast. I think that the reason is that   
   >the movq towards the end stores two items, and the movq at the start   
   >of the next iteration loads one of these items, i.e., there is partial
   >overlap between the store and the load. In this case the hardware   
   >takes a slow path, which means that the slowdown is much bigger than   
   >the instruction count suggests.   
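
The access pattern described in the quoted paragraph can be sketched in C
as follows. This is only a minimal illustration, not the benchmark source
and not gcc's output: the function names pass_scalar and pass_paired are
hypothetical, and whether a given compiler actually turns the memcpy calls
into single 8-byte loads and stores depends on the compiler and flags.
pass_paired stores a[i] and a[i+1] together, and the next iteration loads
a[i+1] and a[i+2] together, so that load overlaps only the upper half of
the preceding store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar pass: separate 4-byte loads and 4-byte stores,
   in the spirit of the -O1 code. */
static void pass_scalar(int32_t *a, int n) {
    assert(a != NULL);
    for (int i = 0; i + 1 < n; i++) {
        if (a[i] > a[i + 1]) {
            int32_t t = a[i];
            a[i] = a[i + 1];
            a[i + 1] = t;
        }
    }
}

/* Paired pass: one 8-byte load per step and, on a swap, one 8-byte
   store, in the spirit of the -O3 code. The load in step i+1 covers
   a[i+1] and a[i+2], i.e. it partially overlaps the store of step i. */
static void pass_paired(int32_t *a, int n) {
    assert(a != NULL);
    for (int i = 0; i + 1 < n; i++) {
        int32_t pair[2];
        memcpy(pair, &a[i], sizeof pair);           /* load a[i], a[i+1] */
        if (pair[0] > pair[1]) {
            int32_t swapped[2] = { pair[1], pair[0] };
            memcpy(&a[i], swapped, sizeof swapped); /* store both items */
        }
    }
}
```

Both functions perform the same bubble pass and leave the array in the
same state; they differ only in the width of the memory accesses, which
is the property the explanation above depends on.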
      
   I was curious if a more recent Intel core had improved on that (and   
   maybe such a more recent Intel core was targeted by the "optimization"   
   that caused the slowdown), so I measured it on a P-core of a Core   
   i3-1315U. The results are as follows:   
      
        O1/bubble      O3/bubble
      424,248,952  2,061,809,866  cpu_core/cycles/
    1,536,825,253  1,986,035,580  cpu_core/instructions/
      
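   So the -O3 code takes more than a factor of 4 longer (about 4.9 times
   the cycles) on this microarchitecture, although it executes only about
   1.3 times as many instructions.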
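
The arithmetic behind that comparison can be written out as a small C
sketch. The helper names ipc and slowdown are hypothetical; the inputs
are the counter values from the table above. With those values the O1
code runs at roughly 3.6 instructions per cycle and the O3 code at
roughly 0.96, and the cycle ratio is about 4.9.

```c
#include <assert.h>

/* Instructions per cycle from two raw counter values. */
static double ipc(double instructions, double cycles) {
    assert(cycles > 0.0);
    return instructions / cycles;
}

/* Ratio of the slower run's cycle count to the faster run's. */
static double slowdown(double slow_cycles, double fast_cycles) {
    assert(fast_cycles > 0.0);
    return slow_cycles / fast_cycles;
}
```

For example, ipc(1536825253.0, 424248952.0) gives the O1 figure and
slowdown(2061809866.0, 424248952.0) gives the cycle ratio.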
      
   The differences in the topdown analysis are also interesting:   
      
   O1
    1,177,188,340  cpu_core/topdown-retiring/       #  46.1% Retiring
      279,332,826  cpu_core/topdown-bad-spec/       #  10.9% Bad Speculation
      778,141,445  cpu_core/topdown-fe-bound/       #  30.5% Frontend Bound
      319,237,516  cpu_core/topdown-be-bound/       #  12.5% Backend Bound
                0  cpu_core/topdown-heavy-ops/      #   0.0% Heavy Operations
      269,356,654  cpu_core/topdown-br-mispredict/  #  10.5% Branch Mispredict
      269,356,654  cpu_core/topdown-fetch-lat/      #  10.5% Fetch Latency
       59,857,034  cpu_core/topdown-mem-bound/      #   2.3% Memory Bound
      
   O3
    1,599,831,263  cpu_core/topdown-retiring/       #  12.9% Retiring
      630,236,558  cpu_core/topdown-bad-spec/       #   5.1% Bad Speculation
      533,277,087  cpu_core/topdown-fe-bound/       #   4.3% Frontend Bound
    9,598,987,583  cpu_core/topdown-be-bound/       #  77.6% Backend Bound
          280,169  cpu_core/topdown-heavy-ops/      #   0.0% Heavy Operations
      630,236,558  cpu_core/topdown-br-mispredict/  #   5.1% Branch Mispredict
      193,918,941  cpu_core/topdown-fetch-lat/      #   1.6% Fetch Latency
    5,623,649,291  cpu_core/topdown-mem-bound/      #  45.5% Memory Bound
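
The percentages in these listings are consistent with each count being
divided by the sum of the four top-level categories (retiring, bad
speculation, frontend bound, backend bound). That reading is an
assumption checked only against the numbers shown here, and the struct
and function names below are hypothetical. A minimal C sketch:

```c
#include <assert.h>

/* The four top-level topdown counts of one run (hypothetical struct). */
struct topdown {
    double retiring, bad_spec, fe_bound, be_bound;
};

/* Sum of the four top-level categories. */
static double topdown_total(const struct topdown *t) {
    return t->retiring + t->bad_spec + t->fe_bound + t->be_bound;
}

/* Percentage share of one count relative to the top-level total. */
static double share_pct(double count, const struct topdown *t) {
    double total = topdown_total(t);
    assert(total > 0.0);
    return 100.0 * count / total;
}
```

Filled with the O3 counts, share_pct reproduces the 77.6% backend-bound
and 45.5% memory-bound figures. The backend-bound count itself grows by
about a factor of 30 from O1 to O3, while the retiring count grows by
less than a factor of 1.4.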
      
   - anton   
   --   
   M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
    New standard: https://forth-standard.org/   
    EuroForth 2023: https://euro.theforth.net/2023   
      