Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.forth    |    Forth programmers eat a lot of Bratwurst    |    117,927 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 116,608 of 117,927    |
|    Anton Ertl to Anton Ertl    |
|    Re: Revisiting Gilbreath's sieve    |
|    09 Jul 24 10:12:40    |
      From: anton@mips.complang.tuwien.ac.at              anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:       >So sieve.fs works slightly better on VFX64. Let's see if I can do better:       >       >DECIMAL       >CREATE FLAGS 8190 ALLOT       >variable eflag       >       >: PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG @ FLAGS       > DO I C@       > IF DUP I + DUP EFLAG @ <       > IF EFLAG @ SWAP       > begin ( prime limit index )       > 0 over c! 2 pick + 2dup u<=       > until       > 2drop       > ELSE DROP THEN SWAP 1+ SWAP       > THEN 2 +       > LOOP DROP ;       >       >The part about EFLAGS being a variable is the main change between       >sieve.fs and siev.fs and has nothing to do with the inner loop. The       >inner loop is now decompiled as:       >       >MOV Byte 0 [RBX], # 00       >ADD RBX, [RBP+08]       >CMP RBX, [RBP]       >JB/NAE 0050C3C4       >       >That's smaller by one instruction, but that probably does not help on       >most modern CPUs which can do 5 instructions/cycle, but only one taken       >branch per cycle.              By unrolling the loop by a factor of 2 we can convert every second       branch in the inner loop into a not-taken branch:              : PRIMES ( -- n ) FLAGS 8190 1 FILL 0 3 EFLAG @ FLAGS        DO I C@        IF DUP I + DUP EFLAG @ <        IF EFLAG @ SWAP        begin ( prime limit index )        0 over c! 2 pick + 2dup u> while        0 over c! 2 pick + 2dup u<= until then        2drop        ELSE DROP THEN SWAP 1+ SWAP        THEN 2 +        LOOP DROP ;              >Invocations:       >       >perf stat vfx64 "include gilbreath-works.4th : bench 10000 0 do do-prime drop       loop ; bench bye"       >perf stat vfx64 "include /nfs/nfstmp/anton/gforth-amd64/siev.fs flags 8190 +       eflag ! : bench 10000 0 do primes drop loop ; bench bye"       >perf stat vfx64 "include sieve-ae.fs flags 8190 + eflag ! : bench 10000 0 do       primes drop loop ; bench bye"              VFX64 Results (on a Rocket Lake):       Cycles Instructions Program       907282940 1956567063 gilbreath-works.4th       689872097 1837620737 siev.fs       724491376 1579623569 sieve-ae.fs before unrolling       633416709 1579649480 sieve-ae.fs with unrolling              development gforth-fast results on a Rocket Lake:        Cycles Instructions Program       1660796257 5123079957 gilbreath-works.4th       1465616414 4020741453 siev.fs       1357163016 4313608319 sieve-ae.fs before unrolling       1258628834 4465785819 sieve-ae.fs with unrolling              So, with unrolling by a factor of 2 the new code is faster than the       +LOOP-based code on both VFX64 and gforth-fast.              - anton       --       M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html       comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html        New standard: https://forth-standard.org/        EuroForth 2024: https://euro.theforth.net              --- SoupGate-Win32 v1.05        * Origin: you cannot sedate... all the things you hate (1:229/2)    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca