comp.lang.forth
Anton Ertl to dxf
Re: Summary: Forth systems where do/?do
12 Mar 24 11:41:15
   From: anton@mips.complang.tuwien.ac.at   
      
   dxf  writes:   
   >On 10/03/2024 5:09 am, Anton Ertl wrote:   
   >It's difficult to imagine under   
   >what circumstances a loop address on the stack is faster, but it suggests   
   >one is starting from an inefficient or compromised base.   
      
   The starting point is gforth-fast from June 2023.  Here's an example.   
   The inner loop of the siev benchmark is:   
      
   0 i c! dup +loop   
      
   The following shows the threaded code intermixed with the native code:   
      
             loop-back address in ...   
   ... threaded code         ... return stack   
   lit    1->2               lit    1->2   
   #0                        #0   
     mov     r15,[r14]         mov     r15,[r14]   
     add     r14,$10           add     r14,$10   
   i    2->3                 i    2->3   
     mov     r9,[rbx]          mov     r9,[rbx]   
     add     r14,$08           add     r14,$08   
   c!    3->1                c!    3->1   
     mov     [r9],r15lb        mov     [r9],r15lb   
     add     r14,$08           add     r14,$08   
   dup    1->2               dup    1->2   
     mov     r15,r8            mov     r15,r8   
     add     r14,$08           add     r14,$08   
   (+loop)    2->1           (+loop)-rstack    2->1   
      
     mov     rax,[rbx]         mov     rdx,[rbx]   
     mov     rsi,[r14]         mov     rsi,$10[rbx]   
     lea     r10,$08[r14]      mov     rax,rdx   
     mov     rdx,rax           sub     rax,$08[rbx]   
     sub     rdx,$08[rbx]      add     rdx,r15   
     add     rax,r15           lea     rcx,[r15][rax]   
     lea     rcx,[r15][rdx]    xor     rcx,rax   
     xor     rcx,rdx           xor     rax,r15   
     xor     rdx,r15           test    rcx,rax   
     test    rcx,rdx           js      $7F22DC4C075F   
     js      $7F860CE101F1     mov     r14,rsi   
     mov     [rbx],rax         mov     [rbx],rdx   
     mov     rcx,[rsi]         add     r14,$08   
     lea     r14,$08[rsi]      mov     rcx,-$08[r14]   
     jmp     ecx               jmp     ecx   
      
   On Zen3 (Ryzen 5800X) and Tiger Lake (Core i5-1135G7) the return stack   
   variant is faster by a factor >2; we also see speedups on other   
   processors, but they are smaller.  Where do these speedups come from?   
      
   If you look at the updates to r14, which holds the virtual-machine
   instruction pointer, they are as follows:
      
             loop-back address in ...   
   ... threaded code         ... return stack   
     add     r14,$10           add     r14,$10   
     add     r14,$08           add     r14,$08   
     add     r14,$08           add     r14,$08   
     add     r14,$08           add     r14,$08   
     mov     rsi,[r14]         mov     rsi,$10[rbx]   
     lea     r14,$08[rsi]      mov     r14,rsi   
                               add     r14,$08   
      
   The crucial difference is that in the left column there is an unbroken   
   dependence chain from the r14 at the end of the previous iteration to   
   the r14 at the end of the present iteration; this dependence chain has   
   a latency of 9 cycles per iteration on Zen3, so with enough
   iterations the loop cannot run faster than 9 cycles per iteration.
      
   In the right column r14 at the end of one iteration does not depend on   
   r14 at the end of the previous iteration, because the dependence chain   
   starts from the instruction "mov rsi,$10[rbx]".  This means that the   
   loop can be executed faster; on Zen3 and on Tiger Lake that
   speedup happens to be more than a factor of 2.
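
   As a rough illustration of the argument, the r14 updates of both
   columns can be fed into a tiny dataflow model.  This is only a
   sketch: the latencies (1 cycle for add/lea/mov, 4 cycles for a load)
   are assumptions chosen to match the 9 cycles stated for Zen3, the
   names LAT, LEFT, RIGHT and carried_latency are made up for this
   example, and the model ignores execution resources and every
   dependence other than the one through r14.

```python
# Assumed latencies in cycles; illustrative values, not measurements.
LAT = {"add": 1, "lea": 1, "mov": 1, "load": 4}

# Each entry is (kind, destination register, source register).
# Only the instructions that update r14 (via rsi) are modelled.
LEFT = [("add", "r14", "r14")] * 4 + [
    ("load", "rsi", "r14"),   # mov rsi,[r14]
    ("lea", "r14", "rsi"),    # lea r14,$08[rsi]
]
RIGHT = [("add", "r14", "r14")] * 4 + [
    ("load", "rsi", "rbx"),   # mov rsi,$10[rbx]
    ("mov", "r14", "rsi"),    # mov r14,rsi
    ("add", "r14", "r14"),    # add r14,$08
]

def carried_latency(body, iterations=100):
    """Cycles by which the r14 result is delayed per extra iteration."""
    ready = {"r14": 0, "rsi": 0, "rbx": 0}  # rbx does not change in the loop
    ends = []
    for _ in range(iterations):
        for kind, dst, src in body:
            # A result is ready once its source is ready plus the latency.
            ready[dst] = ready[src] + LAT[kind]
        ends.append(ready["r14"])
    return ends[-1] - ends[-2]

print(carried_latency(LEFT), carried_latency(RIGHT))
```

   Under these assumptions the left column gives 9 cycles per iteration
   through r14 and the right column gives 0.  The 0 does not mean the
   loop is free; it only means that r14 no longer imposes a lower
   bound, so other dependences and resource limits determine the
   speed.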
      
   - anton   
   --   
   M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html   
   comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html   
        New standard: https://forth-standard.org/   
      EuroForth 2023: https://euro.theforth.net/2023   
      