|    comp.lang.forth    |    Forth programmers eat a lot of Bratwurst    |    117,927 messages    |
|    Anton Ertl to All    |
|    Optimizing #S    |
|    23 Nov 25 09:36:31    |
   
   From: anton@mips.complang.tuwien.ac.at   
      
   I recently improved #S with a separate loop for when the high cell of the
   input number is 0:
      
   : #s ( ud -- 0 0 ) \ core number-sign-s
       dup if
           begin
               #
           dup 0= until
       then
       drop begin
           base @ u/mod swap digit hold
       dup 0= until
       0 ;
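
   For context, a minimal usage sketch in standard Forth. The helper
   name u>string is made up for illustration and is not part of the
   post. It converts a single-cell unsigned number, so the high cell
   handed to #S is 0 and the double-cell loop is skipped:

```forth
\ Hypothetical helper, not from the post: convert an unsigned
\ single-cell number to a string in the current BASE.
: u>string ( u -- c-addr len )
  0            \ extend u to an unsigned double; the high cell is 0
  <# #s #> ;   \ #s prepends all digits, #> returns the string

decimal 12345 u>string type   \ prints 12345
```

   The returned string lives in the pictured numeric output buffer,
   so it should be used or copied before the next conversion.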
      
   This gives a nice speedup for fillseq.4th   
   <2025Nov22.185430@mips.complang.tuwien.ac.at>. I have now   
   special-cased the second loop for base #10:   
      
   : #s ( ud -- 0 0 ) \ core number-sign-s
       \G Used between @code{<<#} and @code{#>}. Prepend all digits of
       \G @var{ud} to the pictured numeric output string. @code{#s} will
       \G convert at least one digit. Therefore, if @var{ud} is 0,
       \G @code{#s} will prepend a ``0'' to the pictured numeric output
       \G string.
       dup if
           begin
               #
           dup 0= until
       then
       drop
       base @ #10 = if
           begin
               #10 u/mod swap '0' + hold
           dup 0= until
       else
           begin
               base @ u/mod swap digit hold
           dup 0= until
       then
       0 ;
      
   This provides another nice speedup (see below).   
      
   I have also tried using a special primitive #10u/mod, but on   
   Rocket Lake it caused a slowdown. Gcc selected code that used
   multiplication instead of division and replaced the mod part not with   
   multiplication and subtraction, but with several instructions, so the   
   end result consumes more instructions. And on CPUs like Rocket Lake   
   with fast division, it also consumes more cycles. Given that recent   
   AMD CPUs also have fast division, I removed #10u/mod again. My guess   
   is that gcc generated this code for Skylake and earlier Intel CPUs   
   where division was slow.   
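
   The multiply-instead-of-divide transformation mentioned above can
   be sketched in Forth itself. This is the standard
   reciprocal-multiplication technique for an unsigned divide by 10,
   not the instruction sequence gcc actually emitted (the post does
   not show it). The names u/10 and u/mod10 are hypothetical, and
   64-bit cells are assumed:

```forth
\ Sketch only: assumes 64-bit cells. Both names are hypothetical.
\ floor(u/10) = (u * ceil(2^67/10)) >> 67, i.e. the high cell of
\ the 128-bit product, shifted right by 3.
: u/10 ( u -- uquot )
  $CCCCCCCCCCCCCCCD um*   \ 128-bit product ( lo hi )
  nip 3 rshift ;          \ keep the high cell, then divide by 8

\ Remainder by one multiplication and one subtraction. The result
\ order (remainder below quotient) follows how the post uses u/mod.
: u/mod10 ( u -- urem uquot )
  dup u/10 tuck    \ ( uquot u uquot )
  #10 * - swap ;   \ urem = u - 10*uquot

decimal 1234 u/mod10 . .   \ prints 123 4
```

   Whether this beats a hardware divide depends on the CPU, which is
   the point of the measurements in the post.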
      
         old #S       #S opt1       #S opt2         worse
       one loop     two loops    + #10 loop    + #10u/mod
    3245_981222   2690_088360   2422_977895   2492_586635  cycles
   11679_661274   9813_132978   8564_869788   8909_131947  instructions
    1391_034028   1204_585688   1086_707686   1086_667791  branches
       1_521428      1_520834      1_516859      1_515857  branch-misses
            0.4           3.3           0.4           0.4  % tma_backend_bound
            3.9           3.9           3.5           3.5  % tma_bad_speculation
           24.6          19.5          25.4          25.8  % tma_frontend_bound
           71.1          73.3          70.7          70.4  % tma_retiring
      
   - anton   
   --   
   M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
    New standard: https://forth-standard.org/   
   EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html   
   EuroForth 2025 registration: https://euro.theforth.net/   
      