From: krishna.myneni@ccreweb.org   
      
   On 7/14/24 13:32, Krishna Myneni wrote:   
   > On 7/14/24 07:20, Krishna Myneni wrote:   
   >> On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:   
   >>> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,   
   >>> Anton Ertl wrote:   
   >>>    
   >>>>   
   >>>> In any case, if you are a system implementor, you may want to check   
   >>>> your DOES> implementation with a microbenchmark that stores into the   
   >>>> does-defined word in a case where that word is not inlined.   
   >>>   
   >>> Is that equally valid for indirect threaded code?   
   >>> In indirect threaded code the instruction and data cache   
   >>> are more separated, e.g. in a simple Forth all the low level   
   >>> code could fit in the I-cache, if I'm not mistaken.   
   >>>   
   >>   
   >>   
   >> Let's check. In kForth-64, an indirect threaded code system,   
   >>   
   >> .s   
   >>    
   >> ok   
   >> f.s   
   >> fs:    
   >> ok   
   >> ms@ b4 ms@ swap - .   
   >> 4274 ok   
   >> ms@ b5 ms@ swap - .   
   >> 3648 ok   
   >>   
   >> So b5 appears to be more efficient than b4 (the version with DOES>).
   >>   
   >> --   
   >> Krishna   
   >>   
   >> === begin code ===   
   >> 50000000 constant iterations   
   >>   
   >> : faccum create 1 floats allot? 0.0e f!   
   >> does> dup f@ f+ fdup f! ;   
   >>   
   >> : faccum-part2 ( F: r1 -- r2 ) ( a -- )   
   >> dup f@ f+ fdup f! ;   
   >>   
   >> faccum x4 2.0e x4 fdrop   
   >> faccum y4 -4.0e y4 fdrop   
   >>   
   >> : b4 0.0e iterations 0 do x4 y4 loop ;   
   >> : b5 0.0e iterations 0 do   
   >> [ ' x4 >body ] literal faccum-part2   
   >> [ ' y4 >body ] literal faccum-part2   
   >> loop ;   
   >> === end code ===   
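(A portability note: ALLOT? in the code above is kForth-specific; it allots the given number of bytes and returns the starting address of the allotted region. On a standard ANS Forth system, an untested sketch of an equivalent FACCUM would capture HERE before ALLOT:)

```forth
\ Sketch of FACCUM for standard ANS Forth systems, which lack
\ kForth's ALLOT? ( n -- a ).  HERE is captured before ALLOT so
\ the new data field can be initialized to 0.0e.
: faccum ( "name" -- )
  create here 1 floats allot 0.0e f!
  does> ( F: r1 -- r2 ) ( a -- ) dup f@ f+ fdup f! ;
```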
   >>   
   >>   
   >>   
   >>   
   >   
   > Using perf to obtain the microbenchmarks for B4 and B5,   
   >   
   > B4   
   >   
   > $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   > L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   > -e "include does-microbench.4th b4 f. cr bye"   
   > -inf   
   > Goodbye.   
   >   
   > Performance counter stats for 'kforth64 -e include does-microbench.4th   
   > b4 f. cr bye':   
   >   
   > 14_381_951_937 cycles:u
   > 26_206_810_946 instructions:u    # 1.82 insn per cycle
   > 58_563 L1-dcache-load-misses:u   
   > 14_742 L1-icache-load-misses:u   
   > 100_122_231 branch-misses:u   
   >   
   > 4.501011307 seconds time elapsed   
   >   
   > 4.477172000 seconds user   
   > 0.003967000 seconds sys   
   >   
   >   
   > B5   
   >   
   > $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   > L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   > -e "include does-microbench.4th b5 f. cr bye"   
   > -inf   
   > Goodbye.   
   >   
   > Performance counter stats for 'kforth64 -e include does-microbench.4th   
   > b5 f. cr bye':   
   >   
   > 11_529_644_734 cycles:u
   > 18_906_809_683 instructions:u    # 1.64 insn per cycle
   > 59_605 L1-dcache-load-misses:u   
   > 21_531 L1-icache-load-misses:u   
   > 100_109_360 branch-misses:u   
   >   
   > 3.616353010 seconds time elapsed   
   >   
   > 3.600206000 seconds user   
   > 0.004639000 seconds sys   
   >   
   >   
   > It appears that the cache misses are fairly small for both b4 and b5,
   > but the branch misses are very high on this system: about 100 million,
   > roughly one mispredict per defined-word call (50M iterations x 2 calls).
   >   
      
      
   The prior micro-benchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
   On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch
   misses were far fewer -- tens of thousands rather than ~100 million.
      
   B4   
   $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   -e "include faccum.4th b4 f. cr bye"   
   0   
   Goodbye.   
      
    Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr   
   bye':   
      
    7_847_499_582 cycles:u   
      
    26_206_205_780 instructions:u # 3.34 insn per cycle   
      
    67_785 L1-dcache-load-misses:u   
      
    65_391 L1-icache-load-misses:u   
      
    38_308 branch-misses:u   
      
      
    2.014078890 seconds time elapsed   
      
    2.010013000 seconds user   
    0.000999000 seconds sys   
      
   B5   
   $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e   
   L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64   
   -e "include faccum.4th b5 f. cr bye"   
   0   
   Goodbye.   
      
    Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr   
   bye':   
      
    5_314_718_609 cycles:u   
      
    18_906_206_178 instructions:u # 3.56 insn per cycle   
      
    64_150 L1-dcache-load-misses:u   
      
    44_818 L1-icache-load-misses:u   
      
    29_941 branch-misses:u   
      
      
    1.372367863 seconds time elapsed   
      
    1.367289000 seconds user   
    0.002989000 seconds sys   
      
      
   The efficiency difference is due almost entirely to the number of
   instructions executed: about 26.2 billion for B4 versus 18.9 billion
   for B5, at nearly the same instructions-per-cycle rate.
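For a rough per-call figure: each benchmark makes 100 million defined-word calls (50 million iterations, two calls each), so the i5-8400 instruction counts above, ignoring shared loop and startup overhead, work out to about:

```forth
\ Per-call instruction estimates from the i5-8400 counts above
\ (50,000,000 iterations x 2 calls = 100,000,000 calls per run)
26206205780 100000000 / .   \ ~262 instructions/call via DOES>  (b4)
18906206178 100000000 / .   \ ~189 instructions/call via >BODY  (b5)
```

i.e. roughly 73 fewer instructions per call for the >BODY version.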
      
   --   
   KM   
      