Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.forth    |    Forth programmers eat a lot of Bratwurst    |    117,927 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 116,613 of 117,927    |
|    Anton Ertl to All    |
|    Implementing DOES>: How not to do it (an    |
|    11 Jul 24 14:06:02    |
      From: anton@mips.complang.tuwien.ac.at

      At least one Forth system implements DOES> inefficiently, but I
      suspect that it's not alone in that. I have reported the issue to the
      system vendor, and hopefully they will fix it. Here I discuss the
      issue so that other system implementors can avoid these pitfalls. I
      do not name the system here to protect the guilty, but make no
      additional effort at anonymizing it.

      Here's a microbenchmark that puts a spotlight on the issue. I
      started investigating the issue because the system performed badly on
      an application benchmark that calls DOES>-defined words a lot, so this
      microbenchmark is not just contrived.

      50000000 constant iterations

      : d1 create 0 , does> ;
      : d1-part2 ;

      d1 x1
      d1 y1

      : b1 iterations 0 do x1 dup ! y1 dup ! loop ;
      : c1 iterations 0 do x1 drop y1 drop loop ;

      : b3
        iterations 0 do
          [ ' x1 >body ] literal d1-part2 dup !
          [ ' y1 >body ] literal d1-part2 dup !
        loop ;
      : c3
        iterations 0 do
          [ ' x1 >body ] literal d1-part2 drop
          [ ' y1 >body ] literal d1-part2 drop
        loop ;

      B1 and C1 use the DOES>-defined words in the way the system implements
      them, while B3 and C3 use them in the way the system should implement
      them (when COMPILE,ing the xt of X1 and Y1): push the address of the
      body on the data stack as a literal and then call the code behind
      DOES> (or, in this case, a colon definition that contains the code).
      Let's see the performance (all numbers from a Tiger Lake), first for
      C3/C1:

      perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses \
        -e L1-icache-load-misses -e branch-misses \
        sf64 "include does-microbench.fs c3 bye"

      Performance counter stats for 'sf64 include does-microbench.fs c3 bye':

         402963130      cycles:u
         804105655      instructions:u            # 2.00 insn per cycle
             83766      L1-dcache-load-misses
           1750283      L1-icache-load-misses
             36403      branch-misses

       0.114868603 seconds time elapsed

      The code for the X1 part of the loop here is:

      44EB26   -8 [RBP] RBP LEA              488D6DF8
      44EB2A   RBX 0 [RBP] MOV              48895D00
      44EB2E   44E970 # EBX MOV             BB70E94400
      44EB33   44E94D ( d1-part2 ) CALL     E815FEFFFF
      44EB38   0 [RBP] RBX MOV              488B5D00
      44EB3C   8 [RBP] RBP LEA              488D6D08

      and D1-PART2 is:

      44E94D   RET                          C3

      This loop takes 8 cycles per iteration (about half of that for each
      simulated DOES>-defined word). Now for C1:

        3502930384      cycles:u
         903847649      instructions:u            # 0.26 insn per cycle
             93579      L1-dcache-load-misses
           1813286      L1-icache-load-misses
         100033846      branch-misses

       0.846190766 seconds time elapsed

      This loop takes 70 cycles per iteration (i.e., almost 9 times slower)
      and has one branch misprediction per DOES>-defined word.
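      The per-iteration figures quoted above follow directly from the
      counters divided by the 50,000,000 loop iterations. A quick sketch of
      that arithmetic (counter values copied from the perf output above;
      this check is not part of the original post):

```python
ITERATIONS = 50_000_000  # from "50000000 constant iterations"

# cycles:u and branch-misses from the C3 and C1 runs above
c3_cycles = 402_963_130
c1_cycles = 3_502_930_384
c1_branch_misses = 100_033_846

print(round(c3_cycles / ITERATIONS))         # -> 8 cycles/iteration for C3
print(round(c1_cycles / ITERATIONS))         # -> 70 cycles/iteration for C1
print(round(c1_branch_misses / ITERATIONS))  # -> 2 branch misses/iteration
```

      Each loop iteration executes two DOES>-defined words (X1 and Y1),
      which is why 2 branch misses per iteration corresponds to one
      misprediction per word.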
      Let's see why. The code in the loop for the X1 part is:

      44EA42   44E96B ( x1 ) CALL           E824FFFFFF
      44EA47   0 [RBP] RBX MOV              488B5D00
      44EA4B   8 [RBP] RBP LEA              488D6D08

      It calls X1:

      44E96B   44E927 ( d1 +1C ) CALL       E8B7FFFFFF

      which in turn calls this code:

      44E927   -8 [RBP] RBP LEA             488D6DF8
      44E92B   RBX 0 [RBP] MOV              48895D00
      44E92F   RBX POP                      5B
      44E930   RET                          C3

      Here the return address of the call at 44E96B is popped and used as
      the body address of X1. However, the hardware return stack predicts
      that the following RET returns to that return address, which is a
      misprediction, because the RET actually returns to the return address
      of the outer call (44EA47). You can see here that mispredictions are
      expensive.

      Let's turn to B1. The code for the X1 part of the loop is:

      44E9CE   44E96B ( x1 ) CALL           E898FFFFFF
      44E9D3   -8 [RBP] RBP LEA             488D6DF8
      44E9D7   RBX 0 [RBP] MOV              48895D00
      44E9DB   0 [RBP] RAX MOV              488B4500
      44E9DF   RAX 0 [RBX] MOV              488903
      44E9E2   8 [RBP] RBX MOV              488B5D08
      44E9E6   10 [RBP] RBP LEA             488D6D10

      The results are:

       44758764805      cycles:u
        1303768694      instructions:u            # 0.03 insn per cycle
         150154275      L1-dcache-load-misses
         202142121      L1-icache-load-misses
         100051173      branch-misses

      10.699859443 seconds time elapsed
      10.696626000 seconds user
       0.004000000 seconds sys

      So in addition to the branch mispredictions that we also saw in C1, we
      see 3 D-cache misses per iteration and 4 I-cache misses per iteration,
      resulting in 895 cycles/iteration. A part of this cache ping-pong
      occurs because the call instruction at 44E96B is in the same cache
      line as the stored-to data at 44E970: the store wants the cache line
      in the D-cache, while executing the call wants it in the I-cache.
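      The B1 per-iteration numbers divide out the same way (counter values
      copied from the perf output above; this check is not part of the
      original post):

```python
ITERATIONS = 50_000_000

b1_cycles   = 44_758_764_805  # cycles:u for the B1 run above
b1_d_misses = 150_154_275     # L1-dcache-load-misses
b1_i_misses = 202_142_121     # L1-icache-load-misses

print(round(b1_cycles / ITERATIONS))    # -> 895 cycles/iteration
print(round(b1_d_misses / ITERATIONS))  # -> 3 D-cache misses/iteration
print(round(b1_i_misses / ITERATIONS))  # -> 4 I-cache misses/iteration
```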
      This is an issue I first pointed out here in 1995, and Forth systems
      still have not fixed it completely 29 years later.

      Let's see if the B3 variant escapes this problem:

       20590473606      cycles:u
        1204106872      instructions:u            # 0.06 insn per cycle
          50145367      L1-dcache-load-misses
         101982668      L1-icache-load-misses
             49127      branch-misses

       4.932825183 seconds time elapsed
       4.926391000 seconds user
       0.003998000 seconds sys

      It's better, but there is still one D-cache miss and 2 I-cache misses
      per iteration. It seems that the distance between the D1-PART2 code
      and the X1 data is not enough to avoid the issue completely (but this
      is just a guess).

      In any case, having the call right before the data and executing it on
      every call to a DOES>-defined word is responsible for 2 I-cache misses
      and 2 D-cache misses per iteration, so it should be avoided by
      generating code like that in C3 and B3.

      For dealing with the remaining cache-consistency problems, most Forth
      systems have chosen to put padding between the code and the data, and
      to increase the padding in response to slowness reports. Another
      alternative is to put the code in a memory region separate from the
      data.

      - anton
      --
      M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
      comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
      New standard: https://forth-standard.org/
      EuroForth 2024: https://euro.theforth.net

      --- SoupGate-Win32 v1.05
       * Origin: you cannot sedate... all the things you hate (1:229/2)
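      The B3 counters in the message divide out the same way, and the
      difference against B1 matches the "2 I-cache misses and 2 D-cache
      misses" attributed to the call-before-data scheme (counter values
      copied from the post; this check is not part of the original message):

```python
ITERATIONS = 50_000_000

# B3 counters from the post
b3_cycles   = 20_590_473_606
b3_d_misses = 50_145_367
b3_i_misses = 101_982_668

# B1 counters from the post, for the difference
b1_d_misses = 150_154_275
b1_i_misses = 202_142_121

print(round(b3_cycles / ITERATIONS))    # -> 412 cycles/iteration
print(round(b3_d_misses / ITERATIONS))  # -> 1 D-cache miss/iteration
print(round(b3_i_misses / ITERATIONS))  # -> 2 I-cache misses/iteration

# B1 minus B3: the cost of the call-right-before-the-data scheme
print(round((b1_d_misses - b3_d_misses) / ITERATIONS))  # -> 2
print(round((b1_i_misses - b3_i_misses) / ITERATIONS))  # -> 2
```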
(c) 1994, bbs@darkrealms.ca