comp.lang.forth
Anton Ertl to Anton Ertl
Re: Implementing DOES>: How not to do it
13 Jul 24 15:31:38
   From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >At least one Forth system implements DOES> inefficiently, but I   
   >suspect that it's not alone in that.   
      
   And indeed, a second system has the same problem. It shows up more
   rarely there, because this system normally inlines DOES>-defined words,
   but when it does not, it performs badly.

   Here's a microbenchmark where the second system does not inline the
   DOES>-defined word:
      
   50000000 constant iterations   
   : faccum   
       create 0e f,   
     does> ( r1 -- r2 )   
       dup f@ f+ fdup f! ;   
      
   : faccum-part2 ( r1 addr -- r2 )   
       dup f@ f+ fdup f! ;   
      
   faccum x4 \ 2e x4 fdrop   
   faccum y4 \ -4e y4 fdrop   
      
   : b4 0e iterations 0 do x4 y4 loop ;   
   : b5 0e iterations 0 do   
           [ ' x4 >body ] literal faccum-part2   
           [ ' y4 >body ] literal faccum-part2   
        loop ;   
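
To make the relationship between B4 and B5 concrete, here is a minimal sketch (not part of the benchmark) showing that calling a FACCUM-defined word and passing its body address to FACCUM-PART2 perform the same update. The names acc1 and acc2 are made up for illustration. Like the benchmark, the sketch assumes a system with a separate floating-point stack, and the exact format printed by F. varies between systems.

```forth
\ Sketch: a DOES>-defined accumulator and its hand-expanded equivalent.
: faccum ( "name" -- )
    create 0e f,
  does> ( F: r1 -- r2 )
    dup f@ f+ fdup f! ;          \ add stored value to r1, store and keep the sum

: faccum-part2 ( addr -- ) ( F: r1 -- r2 )
    dup f@ f+ fdup f! ;          \ same code, body address passed explicitly

faccum acc1                      \ hypothetical accumulator names
faccum acc2

2e acc1 fdrop                        \ acc1 now holds 2
3e acc1 f. cr                        \ prints the running sum, 5
2e ' acc2 >body faccum-part2 fdrop   \ same update through the body address
3e ' acc2 >body faccum-part2 f. cr   \ prints the running sum, 5
```

B5 uses `[ ' x4 >body ] literal` to compile the body address as a literal, so the loop does the same work as B4 without going through the DOES>-defined word itself.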
      
      
   First, let's see what the Forth systems do by themselves (the B4
   microbenchmark). The numbers are from a Skylake; instead of naming the
   two Forth systems with inefficient DOES> implementations, I call them
   A and B.
      
   [~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
   -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
   /gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
   0.   
      
    Performance counter stats for '/home/anton/gforth/gforth-fast -e include   
   does-microbench.fs b4 f. cr bye':   
      
          948_628_907      cycles:u   
        3_695_796_028      instructions:u            #    3.90  insn per cycle   
            1_154_670      L1-dcache-load-misses   
              198_627      L1-icache-load-misses   
              306_507      branch-misses   
      
          0.245984689 seconds time elapsed   
      
          0.244894000 seconds user   
          0.000000000 seconds sys   
      
      
   [~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
   -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
   A "include does-microbench.fs b4 f. cr bye"
   0.00000000   
      
      
    Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':   
      
       38_769_505_700      cycles:u   
        1_704_476_397      instructions:u            #    0.04  insn per cycle   
          178_288_238      L1-dcache-load-misses   
          250_454_606      L1-icache-load-misses   
          100_090_310      branch-misses   
      
          9.719803719 seconds time elapsed   
      
          9.715343000 seconds user   
          0.000000000 seconds sys   
      
      
   [~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
   -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
   B "include does-microbench.fs b4 f. cr bye"
      
   Including does-microbench.fs0.   
      
      
    Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':   
      
       39_200_313_445      cycles:u   
        1_413_936_888      instructions:u            #    0.04  insn per cycle   
          150_445_572      L1-dcache-load-misses   
          209_127_540      L1-icache-load-misses   
          100_128_427      branch-misses   
      
          9.822342252 seconds time elapsed   
      
          9.817016000 seconds user   
          0.000000000 seconds sys   
      
   So both A and B fall into the cache-ping-pong and the return-stack
   misprediction pitfalls in this case, resulting in a slowdown by a
   factor of about 40 compared to Gforth.
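
Dividing the B4 counters above by the 50,000,000 loop iterations makes the size of the problem easier to see. The following is only a back-of-the-envelope sketch; the helper names iters and per-iter are made up, and it assumes a system with a separate floating-point stack.

```forth
\ Per-iteration figures for B4, from the perf counters quoted above.
50000000e fconstant iters                  \ loop iterations in the benchmark
: per-iter ( F: count -- ) iters f/ f. ;   \ print count per loop iteration

948628907e   per-iter cr   \ Gforth cycles: about 19
38769505700e per-iter cr   \ A cycles: about 775
39200313445e per-iter cr   \ B cycles: about 784
100090310e   per-iter cr   \ A branch misses: about 2
100128427e   per-iter cr   \ B branch misses: about 2
```

About two branch mispredictions per iteration is consistent with one misprediction for each of the two DOES>-defined words (x4 and y4) called in the loop body.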
      
   Let's see how it works if we use the code I suggest (simulated in B5):   
      
   [~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
   -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
   /gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
   0.   
      
    Performance counter stats for '/home/anton/gforth/gforth-fast -e include   
   does-microbench.fs b5 f. cr bye':   
      
          943_277_009      cycles:u   
        3_295_795_332      instructions:u            #    3.49  insn per cycle   
            1_147_107      L1-dcache-load-misses   
              198_364      L1-icache-load-misses   
              295_186      branch-misses   
      
          0.247765182 seconds time elapsed   
      
          0.242645000 seconds user   
          0.004044000 seconds sys   
      
      
   [~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
   -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
   A "include does-microbench.fs b5 f. cr bye"
   0.00000000   
      
      
    Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':   
      
       23_587_381_659      cycles:u   
        1_604_475_561      instructions:u            #    0.07  insn per cycle   
          100_111_296      L1-dcache-load-misses   
          100_502_420      L1-icache-load-misses   
               77_126      branch-misses   
      
          6.055177414 seconds time elapsed   
      
          6.055288000 seconds user   
          0.000000000 seconds sys   
      
      
   [~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
   -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
   B "include does-microbench.fs b5 f. cr bye"
      
   Including does-microbench.fs0.   
      
    Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':   
      
          949_044_323      cycles:u   
        1_313_933_897      instructions:u            #    1.38  insn per cycle   
              246_252      L1-dcache-load-misses   
              105_517      L1-icache-load-misses   
               61_449      branch-misses   
      
          0.239750023 seconds time elapsed   
      
          0.239811000 seconds user   
          0.000000000 seconds sys   
      
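   This solves both problems for B. A no longer suffers from the branch
   mispredictions (77_126 instead of 100_090_310), but it still suffers
   from cache ping-pong, with about 100M L1 data-cache and 100M L1
   instruction-cache misses; I suspect that this is because there is not
   enough distance between the modified data and FACCUM-PART2 (or, less
   likely, not enough distance between the modified data and the loop in
   B5).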
      
   In any case, if you are a system implementor, you may want to check   
   your DOES> implementation with a microbenchmark that stores into the   
   does-defined word in a case where that word is not inlined.   
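
For systems without the floating-point wordset, a check of this kind might look like the following integer variant. This is only a sketch under assumptions: the names accum, xi, yi and bi are made up, it is not the benchmark measured above, and whether a given system inlines xi and yi is system-specific.

```forth
\ Hypothetical integer variant of the FACCUM microbenchmark.
50000000 constant iterations

: accum ( "name" -- )
    create 0 ,
  does> ( n1 -- n2 )
    tuck @ + dup rot ! ;    \ add stored value to n1, store and return the sum

accum xi
accum yi

\ Each iteration stores into the bodies of XI and YI.
: bi ( -- n ) 0 iterations 0 do xi yi loop ;

bi . cr
```

Run it under a cycle counter, for example with a perf stat invocation like those used in this post, and compare the cycles per iteration against a system known to handle DOES> well.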
      
   - anton   
   --   
   M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
        New standard: https://forth-standard.org/   
      EuroForth 2024: https://euro.theforth.net   
      