From: krishna.myneni@ccreweb.org   
      
   On 7/14/24 13:32, Krishna Myneni wrote:   
   > On 7/14/24 07:20, Krishna Myneni wrote:   
   >> On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:   
   >>> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,   
   >>> Anton Ertl wrote:   
   >>>    
   >>>>   
   >>>> In any case, if you are a system implementor, you may want to check   
   >>>> your DOES> implementation with a microbenchmark that stores into the   
   >>>> does-defined word in a case where that word is not inlined.   
   >>>   
   >>> Is that equally valid for indirect threaded code?   
   >>> In indirect threaded code the instruction and data cache   
   >>> are more separated, e.g. in a simple Forth all the low level   
   >>> code could fit in the I-cache, if I'm not mistaken.   
   >>>   
   >>   
   >>   
   >> Let's check. In kForth-64, an indirect threaded code system,   
   >>   
   >> .s   
   >>    
   >> ok   
   >> f.s   
   >> fs:    
   >> ok   
   >> ms@ b4 ms@ swap - .   
   >> 4274 ok   
   >> ms@ b5 ms@ swap - .   
   >> 3648 ok   
   >>   
   >> So b5 appears to be more efficient than b4 (the version with DOES>).
   >>   
   >> --   
   >> Krishna   
   >>   
   >> === begin code ===   
   >> 50000000 constant iterations   
   >>   
   >> : faccum create 1 floats allot? 0.0e f!   
   >> does> dup f@ f+ fdup f! ;   
   >>   
   >> : faccum-part2 ( F: r1 -- r2 ) ( a -- )   
   >> dup f@ f+ fdup f! ;   
   >>   
   >> faccum x4 2.0e x4 fdrop   
   >> faccum y4 -4.0e y4 fdrop   
   >>   
   >> : b4 0.0e iterations 0 do x4 y4 loop ;   
   >> : b5 0.0e iterations 0 do   
   >> [ ' x4 >body ] literal faccum-part2   
   >> [ ' y4 >body ] literal faccum-part2   
   >> loop ;   
   >> === end code ===   
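(A portability note: ALLOT? in the code above is kForth-specific; it allots the given number of bytes and returns the starting address of the allotted region. On a standard ANS Forth system, an untested sketch of an equivalent FACCUM would capture HERE before ALLOT:)

```forth
\ Sketch of FACCUM for standard ANS Forth systems, which lack
\ kForth's ALLOT? ( n -- a ).  HERE is captured before ALLOT so
\ the new data field can be initialized to 0.0e.
: faccum ( "name" -- )
  create here 1 floats allot 0.0e f!
  does> ( F: r1 -- r2 ) ( a -- ) dup f@ f+ fdup f! ;
```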
   >>   
   >>   
   >>   
   >>   
   >   
   > Using perf to obtain the microbenchmarks for B4 and B5,   
   >   
   > B4   
   >   
   > $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   > L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   > -e "include does-microbench.4th b4 f. cr bye"   
   > -inf   
   > Goodbye.   
   >   
   > Performance counter stats for 'kforth64 -e include does-microbench.4th   
   > b4 f. cr bye':   
   >   
   > 14_381_951_937 cycles:u
   > 26_206_810_946 instructions:u    # 1.82 insn per cycle
   > 58_563 L1-dcache-load-misses:u   
   > 14_742 L1-icache-load-misses:u   
   > 100_122_231 branch-misses:u   
   >   
   > 4.501011307 seconds time elapsed   
   >   
   > 4.477172000 seconds user   
   > 0.003967000 seconds sys   
   >   
   >   
   > B5   
   >   
   > $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   > L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   > -e "include does-microbench.4th b5 f. cr bye"   
   > -inf   
   > Goodbye.   
   >   
   > Performance counter stats for 'kforth64 -e include does-microbench.4th   
   > b5 f. cr bye':   
   >   
   > 11_529_644_734 cycles:u
   > 18_906_809_683 instructions:u    # 1.64 insn per cycle
   > 59_605 L1-dcache-load-misses:u   
   > 21_531 L1-icache-load-misses:u   
   > 100_109_360 branch-misses:u   
   >   
   > 3.616353010 seconds time elapsed   
   >   
   > 3.600206000 seconds user   
   > 0.004639000 seconds sys   
   >   
   >   
   > It appears that the cache misses are fairly small for both b4 and b5,
   > but the branch misses are very high on this system: about 100 million,
   > roughly one mispredict per defined-word call (50M iterations x 2 calls).
   >   
      
      
   The prior micro-benchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
   On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch
   misses were far fewer -- tens of thousands rather than ~100 million.
      
   B4   
   $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   -e "include faccum.4th b4 f. cr bye"   
   0   
   Goodbye.   
      
    Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr   
   bye':   
      
    7_847_499_582 cycles:u   
      
    26_206_205_780 instructions:u # 3.34 insn per cycle   
      
    67_785 L1-dcache-load-misses:u   
      
    65_391 L1-icache-load-misses:u   
      
    38_308 branch-misses:u   
      
      
    2.014078890 seconds time elapsed   
      
    2.010013000 seconds user   
    0.000999000 seconds sys   
      
   B5   
   $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   -e "include faccum.4th b5 f. cr bye"   
   0   
   Goodbye.   
      
    Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr   
   bye':   
      
    5_314_718_609 cycles:u   
      
    18_906_206_178 instructions:u # 3.56 insn per cycle   
      
    64_150 L1-dcache-load-misses:u   
      
    44_818 L1-icache-load-misses:u   
      
    29_941 branch-misses:u   
      
      
    1.372367863 seconds time elapsed   
      
    1.367289000 seconds user   
    0.002989000 seconds sys   
      
      
   The efficiency difference is due almost entirely to the number of
   instructions executed: about 26.2 billion for B4 versus 18.9 billion
   for B5, at nearly the same instructions-per-cycle rate.
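For a rough per-call figure: each benchmark makes 100 million defined-word calls (50 million iterations, two calls each), so the i5-8400 instruction counts above, ignoring shared loop and startup overhead, work out to about:

```forth
\ Per-call instruction estimates from the i5-8400 counts above
\ (50,000,000 iterations x 2 calls = 100,000,000 calls per run)
26206205780 100000000 / .   \ ~262 instructions/call via DOES>  (b4)
18906206178 100000000 / .   \ ~189 instructions/call via >BODY  (b5)
```

i.e. roughly 73 fewer instructions per call for the >BODY version.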
      
   --   
   KM   
      