Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.forth    |    Forth programmers eat a lot of Bratwurst    |    117,927 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 116,613 of 117,927    |
|    Anton Ertl to All    |
|    Implementing DOES>: How not to do it (an    |
|    11 Jul 24 14:06:02    |
      From: anton@mips.complang.tuwien.ac.at

      At least one Forth system implements DOES> inefficiently, but I
      suspect that it's not alone in that. I have reported the issue to the
      system vendor, and hopefully they will fix it. Here I discuss the
      issue so that other system implementors can avoid these pitfalls. I
      do not name the system here to protect the guilty, but make no
      additional effort at anonymizing it.

      Here's a microbenchmark that puts a spotlight on the issue. I
      started investigating the issue because the system performed badly on
      an application benchmark that calls DOES>-defined words a lot, so this
      microbenchmark is not just contrived.

      50000000 constant iterations

      : d1 create 0 , does> ;
      : d1-part2 ;

      d1 x1
      d1 y1

      : b1 iterations 0 do x1 dup ! y1 dup ! loop ;
      : c1 iterations 0 do x1 drop y1 drop loop ;

      : b3
        iterations 0 do
          [ ' x1 >body ] literal d1-part2 dup !
          [ ' y1 >body ] literal d1-part2 dup !
        loop ;
      : c3
        iterations 0 do
          [ ' x1 >body ] literal d1-part2 drop
          [ ' y1 >body ] literal d1-part2 drop
        loop ;

      B1 and C1 use the DOES>-defined words in the way the system implements
      them, while B3 and C3 use them in the way the system should implement
      them (when COMPILE,ing the xt of X1 and Y1): push the address of the
      body on the data stack as a literal and then call the code behind
      DOES> (or, in this case, a colon definition that contains the code).
      Let's see the performance (all numbers from a Tiger Lake), first for
      C3/C1:

      perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses \
        -e L1-icache-load-misses -e branch-misses \
        sf64 "include does-microbench.fs c3 bye"

      Performance counter stats for 'sf64 include does-microbench.fs c3 bye':

         402963130      cycles:u
         804105655      instructions:u            # 2.00 insn per cycle
             83766      L1-dcache-load-misses
           1750283      L1-icache-load-misses
             36403      branch-misses

       0.114868603 seconds time elapsed

      The code for the X1 part of the loop here is:

      44EB26   -8 [RBP] RBP LEA              488D6DF8
      44EB2A   RBX 0 [RBP] MOV              48895D00
      44EB2E   44E970 # EBX MOV             BB70E94400
      44EB33   44E94D ( d1-part2 ) CALL     E815FEFFFF
      44EB38   0 [RBP] RBX MOV              488B5D00
      44EB3C   8 [RBP] RBP LEA              488D6D08

      and D1-PART2 is:

      44E94D   RET                          C3

      This loop takes 8 cycles per iteration (about half of that for each
      simulated DOES>-defined word). Now for C1:

        3502930384      cycles:u
         903847649      instructions:u            # 0.26 insn per cycle
             93579      L1-dcache-load-misses
           1813286      L1-icache-load-misses
         100033846      branch-misses

       0.846190766 seconds time elapsed

      This loop takes 70 cycles per iteration (i.e., almost 9 times slower)
      and has one branch misprediction per DOES>-defined word.
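      The per-iteration figures quoted above follow directly from the
      counters divided by the 50,000,000 loop iterations. A quick sketch of
      that arithmetic (counter values copied from the perf output above;
      this check is not part of the original post):

```python
ITERATIONS = 50_000_000  # from "50000000 constant iterations"

# cycles:u and branch-misses from the C3 and C1 runs above
c3_cycles = 402_963_130
c1_cycles = 3_502_930_384
c1_branch_misses = 100_033_846

print(round(c3_cycles / ITERATIONS))         # -> 8 cycles/iteration for C3
print(round(c1_cycles / ITERATIONS))         # -> 70 cycles/iteration for C1
print(round(c1_branch_misses / ITERATIONS))  # -> 2 branch misses/iteration
```

      Each loop iteration executes two DOES>-defined words (X1 and Y1),
      which is why 2 branch misses per iteration corresponds to one
      misprediction per word.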
      Let's see why. The code in the loop for the X1 part is:

      44EA42   44E96B ( x1 ) CALL           E824FFFFFF
      44EA47   0 [RBP] RBX MOV              488B5D00
      44EA4B   8 [RBP] RBP LEA              488D6D08

      It calls X1:

      44E96B   44E927 ( d1 +1C ) CALL       E8B7FFFFFF

      which in turn calls this code:

      44E927   -8 [RBP] RBP LEA             488D6DF8
      44E92B   RBX 0 [RBP] MOV              48895D00
      44E92F   RBX POP                      5B
      44E930   RET                          C3

      Here the return address of the call at 44E96B is popped and used as
      the body address of X1. However, the hardware return stack predicts
      that the following RET returns to that return address, which is a
      misprediction, because the RET actually returns to the return address
      of the outer call (44EA47). You can see here that mispredictions are
      expensive.

      Let's turn to B1. The code for the X1 part of the loop is:

      44E9CE   44E96B ( x1 ) CALL           E898FFFFFF
      44E9D3   -8 [RBP] RBP LEA             488D6DF8
      44E9D7   RBX 0 [RBP] MOV              48895D00
      44E9DB   0 [RBP] RAX MOV              488B4500
      44E9DF   RAX 0 [RBX] MOV              488903
      44E9E2   8 [RBP] RBX MOV              488B5D08
      44E9E6   10 [RBP] RBP LEA             488D6D10

      The results are:

       44758764805      cycles:u
        1303768694      instructions:u            # 0.03 insn per cycle
         150154275      L1-dcache-load-misses
         202142121      L1-icache-load-misses
         100051173      branch-misses

      10.699859443 seconds time elapsed
      10.696626000 seconds user
       0.004000000 seconds sys

      So in addition to the branch mispredictions that we also saw in C1, we
      see 3 D-cache misses per iteration and 4 I-cache misses per iteration,
      resulting in 895 cycles/iteration. A part of this cache ping-pong
      occurs because the call instruction at 44E96B is in the same cache
      line as the stored-to data at 44E970: the store wants the cache line
      in the D-cache, while executing the call wants it in the I-cache.
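      The B1 per-iteration numbers divide out the same way (counter values
      copied from the perf output above; this check is not part of the
      original post):

```python
ITERATIONS = 50_000_000

b1_cycles   = 44_758_764_805  # cycles:u for the B1 run above
b1_d_misses = 150_154_275     # L1-dcache-load-misses
b1_i_misses = 202_142_121     # L1-icache-load-misses

print(round(b1_cycles / ITERATIONS))    # -> 895 cycles/iteration
print(round(b1_d_misses / ITERATIONS))  # -> 3 D-cache misses/iteration
print(round(b1_i_misses / ITERATIONS))  # -> 4 I-cache misses/iteration
```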
      This is an issue I first pointed out here in 1995, and Forth systems
      still have not fixed it completely 29 years later.

      Let's see if the B3 variant escapes this problem:

       20590473606      cycles:u
        1204106872      instructions:u            # 0.06 insn per cycle
          50145367      L1-dcache-load-misses
         101982668      L1-icache-load-misses
             49127      branch-misses

       4.932825183 seconds time elapsed
       4.926391000 seconds user
       0.003998000 seconds sys

      It's better, but there is still one D-cache miss and 2 I-cache misses
      per iteration. It seems that the distance between the D1-PART2 code
      and the X1 data is not enough to avoid the issue completely (but this
      is just a guess).

      In any case, having the call right before the data and executing it on
      every call to a DOES>-defined word is responsible for 2 I-cache misses
      and 2 D-cache misses per iteration, so it should be avoided by
      generating code like that in C3 and B3.

      For dealing with the remaining cache-consistency problems, most Forth
      systems have chosen to put padding between the code and the data, and
      to increase the padding in response to slowness reports. Another
      alternative is to put the code in a memory region separate from the
      data.

      - anton
      --
      M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
      comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
      New standard: https://forth-standard.org/
      EuroForth 2024: https://euro.theforth.net

      --- SoupGate-Win32 v1.05
       * Origin: you cannot sedate... all the things you hate (1:229/2)
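      The B3 counters in the message divide out the same way, and the
      difference against B1 matches the "2 I-cache misses and 2 D-cache
      misses" attributed to the call-before-data scheme (counter values
      copied from the post; this check is not part of the original message):

```python
ITERATIONS = 50_000_000

# B3 counters from the post
b3_cycles   = 20_590_473_606
b3_d_misses = 50_145_367
b3_i_misses = 101_982_668

# B1 counters from the post, for the difference
b1_d_misses = 150_154_275
b1_i_misses = 202_142_121

print(round(b3_cycles / ITERATIONS))    # -> 412 cycles/iteration
print(round(b3_d_misses / ITERATIONS))  # -> 1 D-cache miss/iteration
print(round(b3_i_misses / ITERATIONS))  # -> 2 I-cache misses/iteration

# B1 minus B3: the cost of the call-right-before-the-data scheme
print(round((b1_d_misses - b3_d_misses) / ITERATIONS))  # -> 2
print(round((b1_i_misses - b3_i_misses) / ITERATIONS))  # -> 2
```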
(c) 1994, bbs@darkrealms.ca