home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.forth      Forth programmers eat a lot of Bratwurst      117,927 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 116,608 of 117,927   
   Anton Ertl to Anton Ertl   
   Re: Revisiting Gilbreath's sieve   
   09 Jul 24 10:12:40   
   
   From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >So sieve.fs works slightly better on VFX64.  Let's see if I can do better:   
   >   
   >DECIMAL   
   >CREATE FLAGS 8190 ALLOT   
   >variable eflag   
   >   
   >: PRIMES  ( -- n )  FLAGS 8190 1 FILL  0 3  EFLAG @ FLAGS   
   >  DO   I C@   
   >       IF  DUP I + DUP EFLAG @ <   
   >           IF    EFLAG @ SWAP   
   >               begin ( prime limit index )   
   >                   0 over c! 2 pick + 2dup u<=   
   >               until   
   >               2drop   
   >           ELSE  DROP  THEN  SWAP 1+ SWAP   
   >           THEN  2 +   
   >       LOOP  DROP ;   
   >   
   >The part about EFLAGS being a variable is the main change between   
   >sieve.fs and siev.fs and has nothing to do with the inner loop.  The   
   >inner loop is now decompiled as:   
   >   
   >MOV     Byte 0 [RBX], # 00   
   >ADD     RBX, [RBP+08]   
   >CMP     RBX, [RBP]   
   >JB/NAE  0050C3C4   
   >   
   >That's smaller by one instruction, but that probably does not help on   
   >most modern CPUs which can do 5 instructions/cycle, but only one taken   
   >branch per cycle.   
      
   By unrolling the loop by a factor of 2 we can convert every second   
   branch in the inner loop into a not-taken branch:   
      
   : PRIMES  ( -- n )  FLAGS 8190 1 FILL  0 3  EFLAG @ FLAGS   
     DO   I C@   
          IF  DUP I + DUP EFLAG @ <   
              IF    EFLAG @ SWAP   
                  begin ( prime limit index )   
                      0 over c! 2 pick + 2dup u> while   
                          0 over c! 2 pick + 2dup u<= until then   
                  2drop   
              ELSE  DROP  THEN  SWAP 1+ SWAP   
              THEN  2 +   
          LOOP  DROP ;   
      
   >Invocations:   
   >   
   >perf stat vfx64 "include gilbreath-works.4th : bench 10000 0 do do-prime drop   
   loop ; bench bye"   
   >perf stat vfx64 "include /nfs/nfstmp/anton/gforth-amd64/siev.fs flags 8190 +   
   eflag ! : bench 10000 0 do primes drop loop ; bench bye"   
   >perf stat vfx64 "include sieve-ae.fs flags 8190 + eflag ! : bench 10000 0 do   
   primes drop loop ; bench bye"   
      
   VFX64 Results (on a Rocket Lake):   
   Cycles    Instructions   Program   
   907282940  1956567063    gilbreath-works.4th   
   689872097  1837620737    siev.fs   
   724491376  1579623569    sieve-ae.fs before unrolling   
   633416709  1579649480    sieve-ae.fs with unrolling   
      
   development gforth-fast results on a Rocket Lake:   
     Cycles   Instructions   Program   
   1660796257  5123079957    gilbreath-works.4th   
   1465616414  4020741453    siev.fs   
   1357163016  4313608319    sieve-ae.fs before unrolling   
   1258628834  4465785819    sieve-ae.fs with unrolling   
      
   So, with unrolling by a factor of 2 the new code is faster than the   
   +LOOP-based code on both VFX64 and gforth-fast.   
      
   - anton   
   --   
   M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
        New standard: https://forth-standard.org/   
      EuroForth 2024: https://euro.theforth.net   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca