|    comp.lang.forth    |    Forth programmers eat a lot of Bratwurst    |    117,927 messages    |
|    Message 116,618 of 117,927    |
|    Anton Ertl to Anton Ertl    |
|    Re: Implementing DOES>: How not to do it    |
|    13 Jul 24 15:31:38    |
From: anton@mips.complang.tuwien.ac.at

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>At least one Forth system implements DOES> inefficiently, but I
>suspect that it's not alone in that.

And indeed, a second system has the same problem; it shows up more
rarely, because normally this system inlines does>-defined words, but
when it does not, it performs badly.

Here's a microbenchmark where the second system does not inline the
does-defined word:

50000000 constant iterations
: faccum
  create 0e f,
does> ( r1 -- r2 )
  dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
  dup f@ f+ fdup f! ;

faccum x4 \ 2e x4 fdrop
faccum y4 \ -4e y4 fdrop

: b4 0e iterations 0 do x4 y4 loop ;
: b5 0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;

First, let's see what the Forth systems do by themselves (the B4
microbenchmark); numbers from a Skylake; I have replaced the names of
the Forth systems with inefficient DOES> implementations with A and B.

[~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses /gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
0.

 Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b4 f. cr bye':

       948_628_907      cycles:u
     3_695_796_028      instructions:u           # 3.90 insn per cycle
         1_154_670      L1-dcache-load-misses
           198_627      L1-icache-load-misses
           306_507      branch-misses

       0.245984689 seconds time elapsed

       0.244894000 seconds user
       0.000000000 seconds sys

[~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b4 f. cr bye"
0.00000000

 Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':

    38_769_505_700      cycles:u
     1_704_476_397      instructions:u           # 0.04 insn per cycle
       178_288_238      L1-dcache-load-misses
       250_454_606      L1-icache-load-misses
       100_090_310      branch-misses

       9.719803719 seconds time elapsed

       9.715343000 seconds user
       0.000000000 seconds sys

[~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b4 f. cr bye"

Including does-microbench.fs0.

 Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':

    39_200_313_445      cycles:u
     1_413_936_888      instructions:u           # 0.04 insn per cycle
       150_445_572      L1-dcache-load-misses
       209_127_540      L1-icache-load-misses
       100_128_427      branch-misses

       9.822342252 seconds time elapsed

       9.817016000 seconds user
       0.000000000 seconds sys

So both A and B fall into the cache-ping-pong and the return stack
misprediction pitfalls in this case, resulting in a factor 40 slowdown
compared to Gforth.

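As a rough cross-check of the B4 numbers, the counters can be normalized per call: B4 runs 50000000 iterations with two calls each (x4 and y4), so 100 million calls in total. The following is a minimal sketch in Forth; the word names CALLS, PER-CALL and RATIO are made up for this illustration and are not part of does-microbench.fs.

```forth
\ Sketch: normalize the perf counters per call of a does>-defined word.
\ CALLS, PER-CALL and RATIO are hypothetical helper names.
50000000 constant iterations
iterations 2* constant calls     \ x4 and y4 each run once per iteration

: per-call ( F: r1 -- r2 )  calls s>d d>f f/ ;  \ counter value / number of calls
: ratio    ( F: r1 r2 -- r3 )  f/ ;             \ how many times larger r1 is than r2

38769505700e per-call f. cr            \ A: cycles per call
948628907e   per-call f. cr            \ Gforth: cycles per call
100090310e   per-call f. cr            \ A: branch misses per call
250454606e   per-call f. cr            \ A: L1 I-cache misses per call
38769505700e 948628907e ratio f. cr    \ A vs. Gforth, cycles
39200313445e 948628907e ratio f. cr    \ B vs. Gforth, cycles
```

For A this works out to about 388 cycles, 1.0 branch misses and 2.5 L1 I-cache misses per call, against about 9.5 cycles per call for Gforth. The cycle ratios are about 40.9 (A) and 41.3 (B), in line with the factor 40 quoted above, and roughly one branch miss per call fits the return stack misprediction mentioned there.
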
Let's see how it works if we use the code I suggest (simulated in B5):

[~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses /gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
0.

 Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b5 f. cr bye':

       943_277_009      cycles:u
     3_295_795_332      instructions:u           # 3.49 insn per cycle
         1_147_107      L1-dcache-load-misses
           198_364      L1-icache-load-misses
           295_186      branch-misses

       0.247765182 seconds time elapsed

       0.242645000 seconds user
       0.004044000 seconds sys

[~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b5 f. cr bye"
0.00000000

 Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':

    23_587_381_659      cycles:u
     1_604_475_561      instructions:u           # 0.07 insn per cycle
       100_111_296      L1-dcache-load-misses
       100_502_420      L1-icache-load-misses
            77_126      branch-misses

       6.055177414 seconds time elapsed

       6.055288000 seconds user
       0.000000000 seconds sys

[~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b5 f. cr bye"

Including does-microbench.fs0.

 Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':

       949_044_323      cycles:u
     1_313_933_897      instructions:u           # 1.38 insn per cycle
           246_252      L1-dcache-load-misses
           105_517      L1-icache-load-misses
            61_449      branch-misses

       0.239750023 seconds time elapsed

       0.239811000 seconds user
       0.000000000 seconds sys

This solves both problems for B, but A still suffers from
cache ping-pong; I suspect that this is because there is not enough
distance between the modified data and FACCUM-PART2 (or, less likely,
not enough distance between the modified data and the loop in B5).

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
   EuroForth 2024: https://euro.theforth.net
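
One way to probe the distance hypothesis from the post would be a variant of B5 with padding on both sides of the accumulator data. This is only a sketch: PAD-BYTES, its value of 4096 and the word B6 are assumptions made for illustration, and it presumes a system where ALLOT advances the same dictionary region that holds compiled code, which may not be the case for A or B.

```forth
\ Sketch: B5 with padding around the modified data.
\ PAD-BYTES and B6 are hypothetical; 4096 is an arbitrary choice.
50000000 constant iterations
4096 constant pad-bytes

: faccum
  create 0e f,
does> ( r1 -- r2 )
  dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
  dup f@ f+ fdup f! ;

pad-bytes allot     \ distance between FACCUM-PART2 and the data
faccum x4
faccum y4
pad-bytes allot     \ distance between the data and the loop in B6

: b6 0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;
```

Running `b6 f. cr` under the same perf stat invocation as B5 and comparing the L1 miss counters would show whether the extra distance makes a difference on a given system.
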