From: krishna.myneni@ccreweb.org   
      
   On 7/14/24 07:20, Krishna Myneni wrote:   
   > On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:   
   >> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,   
   >> Anton Ertl wrote:   
   >>    
   >>>   
   >>> In any case, if you are a system implementor, you may want to check   
   >>> your DOES> implementation with a microbenchmark that stores into the   
   >>> does-defined word in a case where that word is not inlined.   
   >>   
   >> Is that equally valid for indirect threaded code?   
   >> In indirect threaded code the instruction and data cache   
   >> are more separated, e.g. in a simple Forth all the low level   
   >> code could fit in the I-cache, if I'm not mistaken.   
   >>   
   >   
   >   
   > Let's check. In kForth-64, an indirect threaded code system,   
   >   
   > .s   
   >    
   > ok   
   > f.s   
   > fs:    
   > ok   
   > ms@ b4 ms@ swap - .   
   > 4274 ok   
   > ms@ b5 ms@ swap - .   
   > 3648 ok   
   >   
   > So b5 appears to be more efficient than b4 (the version with DOES>).
   >   
   > --   
   > Krishna   
   >   
   > === begin code ===   
   > 50000000 constant iterations   
   >   
   > : faccum create 1 floats allot? 0.0e f!   
   > does> dup f@ f+ fdup f! ;   
   >   
   > : faccum-part2 ( F: r1 -- r2 ) ( a -- )   
   > dup f@ f+ fdup f! ;   
   >   
   > faccum x4 2.0e x4 fdrop   
   > faccum y4 -4.0e y4 fdrop   
   >   
   > : b4 0.0e iterations 0 do x4 y4 loop ;   
   > : b5 0.0e iterations 0 do   
   > [ ' x4 >body ] literal faccum-part2   
   > [ ' y4 >body ] literal faccum-part2   
   > loop ;   
   > === end code ===   
   >   
   >   
   >   
   >   
      
   Using perf to collect counter statistics for the B4 and B5 microbenchmarks:
      
   B4   
      
   $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
       -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
       kforth64 -e "include does-microbench.4th b4 f. cr bye"
   -inf
   Goodbye.

    Performance counter stats for 'kforth64 -e include does-microbench.4th b4 f. cr bye':

    14_381_951_937      cycles:u
    26_206_810_946      instructions:u            # 1.82 insn per cycle
            58_563      L1-dcache-load-misses:u
            14_742      L1-icache-load-misses:u
       100_122_231      branch-misses:u

       4.501011307 seconds time elapsed

       4.477172000 seconds user
       0.003967000 seconds sys
      
      
   B5   
      
   $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
       -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
       kforth64 -e "include does-microbench.4th b5 f. cr bye"
   -inf
   Goodbye.

    Performance counter stats for 'kforth64 -e include does-microbench.4th b5 f. cr bye':

    11_529_644_734      cycles:u
    18_906_809_683      instructions:u            # 1.64 insn per cycle
            59_605      L1-dcache-load-misses:u
            21_531      L1-icache-load-misses:u
       100_109_360      branch-misses:u

       3.616353010 seconds time elapsed

       3.600206000 seconds user
       0.004639000 seconds sys
      
      
   The cache misses are small for both b4 and b5 (tens of thousands over
   50 million iterations), but the branch misses are very high on my
   system: about 100 million in each case, or roughly two per loop
   iteration.
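
   As a rough cross-check, the perf counts above can be normalized by the
   50000000 loop iterations. This is only a sketch in plain shell with
   awk: the helper name per_iter is made up for illustration, and it
   assumes that interpreter startup and loading does-microbench.4th are
   negligible next to the loop itself.

```shell
#!/bin/sh
# Per-iteration figures derived from the perf counts quoted above.
# Assumption: startup and file-loading cost is negligible compared
# with the 50 million loop iterations, so it is not subtracted.
iterations=50000000

# per_iter LABEL B4_COUNT B5_COUNT  (hypothetical helper, not part of perf)
# Prints each count divided by the iteration count, plus the difference.
per_iter() {
    awk -v n="$iterations" -v label="$1" -v b4="$2" -v b5="$3" \
        'BEGIN { printf "%-14s b4 %7.2f  b5 %7.2f  diff %7.2f\n",
                 label, b4 / n, b5 / n, (b4 - b5) / n }'
}

per_iter cycles        14381951937 11529644734
per_iter instructions  26206810946 18906809683
per_iter branch-misses 100122231   100109360
```

   By this arithmetic b4 spends about 146 more native instructions (and
   about 57 more cycles) per iteration than b5, i.e. roughly 73
   instructions per call of a DOES>-defined word.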
      
   --   
   Krishna   
      