comp.lang.forth
Anton Ertl to All
Optimizing #S
23 Nov 25 09:36:31
   From: anton@mips.complang.tuwien.ac.at   
      
   I recently improved #S with a separate loop when the high cell of the
   input number is 0:   
      
   : #s      ( ud -- 0 0 ) \ core  number-sign-s   
       dup if   
           begin   
               #   
           dup 0= until   
       then   
       drop begin   
           base @ u/mod swap digit hold   
       dup 0= until   
       0 ;   
      
   This gives a nice speedup for fillseq.4th   
   <2025Nov22.185430@mips.complang.tuwien.ac.at>.  I have now   
   special-cased the second loop for base #10:   
      
   : #s      ( ud -- 0 0 ) \ core  number-sign-s
       \G Used between @code{<<#} and @code{#>}.  Prepend all digits of   
       \G @var{ud} to the pictured numeric output string.  @code{#s} will   
       \G convert at least one digit. Therefore, if @var{ud} is 0,   
       \G @code{#s} will prepend a ``0'' to the pictured numeric output   
       \G string.   
       dup if   
           begin   
               #   
           dup 0= until   
       then   
       drop   
       base @ #10 = if   
           begin   
               #10 u/mod swap '0' + hold   
           dup 0= until   
       else   
           begin   
               base @ u/mod swap digit hold   
           dup 0= until   
       then   
       0 ;   
      
   This provides another nice speedup (see below).   
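
   For readers who want to try the structure outside Gforth, the
   following is a minimal sketch, not the Gforth source.  It assumes a
   Forth-2012 system (for the #10 and '0' literals).  The definitions
   of u/mod and digit are hypothetical portable stand-ins for the
   Gforth words of the same names, and my-#s and ud>string are
   hypothetical names chosen so that the system's own #s is left
   untouched.

```forth
\ Hypothetical portable stand-ins for the Gforth words u/mod and digit.
: u/mod ( u1 u2 -- urem uquot )  0 swap um/mod ;
: digit ( u -- char )  dup 9 > 7 and + '0' + ;

\ Same structure as the definition above, under a different name.
: my-#s ( ud -- 0 0 )
    dup if                       \ high cell non-zero: double-cell loop
        begin  #  dup 0= until
    then
    drop                         \ high cell is 0 here; keep the low cell
    base @ #10 = if
        begin  #10 u/mod swap '0' + hold  dup 0= until
    else
        begin  base @ u/mod swap digit hold  dup 0= until
    then
    0 ;

: ud>string ( ud -- c-addr u )  <# my-#s #> ;

123 0 ud>string type              \ prints 123
hex ff 0 ud>string type decimal   \ prints FF
```

   The second loop always runs at least once, so an input of 0 still
   produces a single "0", as the documentation comment requires.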
      
   I have also tried using a special primitive #10u/mod, but on   
   Rocket Lake it caused a slowdown.  Gcc selected code that used
   multiplication instead of division and replaced the mod part not with   
   multiplication and subtraction, but with several instructions, so the   
   end result consumes more instructions.  And on CPUs like Rocket Lake   
   with fast division, it also consumes more cycles.  Given that recent   
   AMD CPUs also have fast division, I removed #10u/mod again.  My guess   
   is that gcc generated this code for Skylake and earlier Intel CPUs   
   where division was slow.   
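
   To illustrate what "multiplication instead of division" means, here
   is a sketch of the textbook reciprocal-multiplication technique in
   Forth.  It is not the code gcc generated and not the removed
   primitive: it assumes 64-bit cells, the names u/10 and
   10u/mod-sketch are hypothetical, and it computes the remainder by
   multiplication and subtraction, which is the variant gcc did not
   choose.

```forth
\ Assumes 64-bit cells.  $CCCCCCCCCCCCCCCD is ceil(2^67/10), so the
\ high cell of the product, shifted right by 3, is floor(u/10).
: u/10 ( u -- uquot )  $CCCCCCCCCCCCCCCD um* nip 3 rshift ;

\ Remainder recovered as u - 10*uquot.
: 10u/mod-sketch ( u -- urem uquot )  dup u/10 tuck #10 * - swap ;

1234 10u/mod-sketch . .   \ prints 123 4
```

   Whether this beats a plain division depends on the CPU: the
   measurements below show the multiplication-based primitive losing
   on a CPU with fast division.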
      
         old #S      #S opt1      #S opt2        worse
       one loop    two loops   + #10 loop   + #10u/mod
    3245_981222  2690_088360  2422_977895  2492_586635 cycles   
   11679_661274  9813_132978  8564_869788  8909_131947 instructions   
    1391_034028  1204_585688  1086_707686  1086_667791 branches   
       1_521428     1_520834     1_516859     1_515857 branch-misses   
            0.4          3.3          0.4          0.4 % tma_backend_bound   
            3.9          3.9          3.5          3.5 % tma_bad_speculation   
           24.6         19.5         25.4         25.8 % tma_frontend_bound   
           71.1         73.3         70.7         70.4 % tma_retiring   
      
   - anton   
   --   
   M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
        New standard: https://forth-standard.org/   
   EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html   
   EuroForth 2025 registration: https://euro.theforth.net/   
      