... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.forth
Forth programmers eat a lot of Bratwurst
117,927 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 117,105 of 117,927
Anton Ertl to Anton Ertl
Stack caching (: Stack vs stackless oper
01 Mar 25 08:18:06
   From: anton@mips.complang.tuwien.ac.at   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >see-code exchange        see-code exchange4     see-code exchange2   
   >over    1->2             dup    1->2            dup >r     1->1   
   >  mov     r15,$08[r12]     mov     r15,r8       >r    1->1   
   >@    2->2                @    2->2                mov     -$08[r13],r8   
   >  mov     r15,[r15]        mov     r15,[r15]      sub     r13,$08   
   >swap    2->3             rot    2->3            @    1->1   
   >  add     r12,$08          mov     r9,$08[r12]    mov     r8,[r8]   
   >  mov     r9,r8            add     r12,$08      over    1->2   
   >  mov     r8,[r12]       !@    3->2               mov     r15,$08[r12]   
   >!@    3->2                 mov     rax,r15      @    2->2   
   >  mov     rax,r15          mov     r15,[r9]       mov     r15,[r15]   
   >  mov     r15,[r9]         mov     [r9],rax     r>    2->3   
   >  mov     [r9],rax       swap    2->3             mov     r9,$00[r13]   
   >swap    2->3               add     r12,$08        add     r13,$08   
   >  add     r12,$08          mov     r9,r8        !    3->1   
   >  mov     r9,r8            mov     r8,[r12]       mov     [r9],r15   
   >  mov     r8,[r12]       !    3->1              swap    1->2   
   >!    3->1                  mov     [r9],r15       mov     r15,$08[r12]   
   >  mov     [r9],r15       ;s    1->1               add     r12,$08   
   >;s    1->1                 mov     rbx,$00[r13] !    2->0   
   >  mov     rbx,$00[r13]     add     r13,$08        mov     [r15],r8   
   >  add     r13,$08          mov     rax,[rbx]    ;s    0->1   
   >  mov     rax,[rbx]        jmp     eax            mov     r8,$08[r12]   
   >  jmp     eax                                     add     r12,$08   
   >                                                  mov     rbx,$00[r13]   
   >                                                  add     r13,$08   
   >                                                  mov     rax,[rbx]   
   >                                                  jmp     eax   
      
   The difference between exchange and exchange4 shows how stack caching   
   can have a hard-to-predict effect.  Gforth searches for the shortest   
   path through the available stack-cache states, where shortness is   
   defined by the native-code length.  E.g., it starts with state 1, and   
   from there it can use any of the dup variants starting in state 1, or   
   first transition to another state and use a dup variant starting from   
   there.   
      
   For SWAP and ROT gforth-fast has the following variants:   
      
   primitive in-out #     code bytes   
   swap       1-1  132  len= 4+ 13+ 3   
   swap       2-2   37  len= 4+  9+ 3   
   swap       3-3    4  len= 4+  9+ 3   
   swap       0-2    8  len= 4+ 14+ 3   
   swap       1-2   82  len= 4+  9+ 3   
   swap       2-1   74  len= 4+  8+ 3   
   swap       2-3   30  len= 4+ 11+ 3   
   swap       3-2    3  len= 4+ 11+ 3   
   swap       2-0   20  len= 4+ 13+ 3   
   rot        1-1   46  len= 4+ 23+ 3   
   rot        3-3    6  len= 4+ 12+ 3   
   rot        3-1   24  len= 4+ 13+ 3   
   rot        2-3   15  len= 4+  9+ 3   
   rot        1-3   17  len= 4+ 17+ 3   
   rot        0-3    1  len= 4+ 19+ 3   
      
   You get these data (in a rawer form) with   
      
   gforth-fast --print-prim -e bye |& grep ^swap   
   gforth-fast --print-prim -e bye |& grep ^rot   
      
   The in column is the stack-cache state on entering the word, the out   
   column is the stack-cache state on leaving the word.  The # column   
   shows how many times this variant of the primitive is used (static   
   counts).  The code length colum shows three parts, the middle of which   
   is the part that's copied to dynamic superinstructions like the ones   
   shown above, and this length is what is used in the search for the   
   shortest path.   
      
   In EXCHANGE4, the shortest variant of ROT is used: ROT 2->3; and the   
   primitives of the other variants are selected to also result in the   
   shortest overall code.   
      
   In EXCHANGE and EXCHANGE4, SWAP 2->3 is not the shortest variant of   
   SWAP, not even the shortest variant starting from state 2, but ending   
   in state 3 allows to use cheap variants of further primitives such as   
   !@ and !, resulting in the overall shortest code for this sequence.   
   In EXCHANGE2, we see the selection of a shorter version of SWAP, but   
   one of the costs is that ;s becomes longer (but in this case the   
   overall savings from using a shorter version of SWAP and shorter   
   versions of earlier instructions make up for that).   
      
   Why am I looking at this?  For stack-shuffling primitives like SWAP   
   and ROT, it's not obvious which variant is how long and which variant   
   should be selected.   
      
   These stack-shiffling words therefore are also good candidates for   
   performing stack-cache state transitions that would otherwise require   
   inserting extra transition code:   
      
   E.g., EXCHANGE and its variants consume two stack items, but need to   
   start in stack-cache state 1 and end in stack-cache state 1 (gforth is   
   currently not smart enough to deal with other states at basic-block   
   boundaries), so not everything can be done in the stack cache; the   
   stack pointer needs to be increased by two cells, and there need to be   
   accesses to the memory part of the stack for two stack items.   
      
   In EXCHANGE, the adjustment by one cell and memory access for one   
   stack item is done in the first SWAP 2->3, and another one in the   
   second one.  In EXCHANGE4, ROT 2->3 and SWAP 2->3 perform these tasks.   
   In EXCHANGE2, the SWAP 1->3 does it for one cell, and the stack-cache   
   state transition 0->1 in the first two instructions of ;s do it for   
   the other cell (gforth-fast actually does not have a built-in variant   
   S 0->1 and the code shown as ;S 0->1 by SEE-CODE is actually composed   
   of a transition 0->1 and the ;S 1->1 variant).   
      
   I wanted to know how often which variant of these stack-shuffling   
   primitives is used, and how this relates to their length.  One   
   interesting result is that ROT 1->3 is used relatively frequently   
   despite having relatively long code.  Apparently the code that comes   
   before these 17 instances of ROT benefits a lot from being in   
   stack-cache state 1 and this amortizes the longer code for the ROT   
   1->3 compared to ROT 2->3.   
      
   Another interesting result is the low usage of SWAP 3->2 compared to   
   SWAP 2->3.  This may say something about how SWAP is used in Forth   
   programs.  Or it may be an artifact of tie-breaking: If two paths have   
   the same length, one is chosen rather arbitrarily, but consistently,   
   and this may make one variant appear more useful than merited by the   
   benefit that the existence of the variant has on code length.   
      
   For those interested, here's the code for the various variants shown   
   above:   
      
   r12: stack pointer   
    r8: stack cache register a (tos in state 1, 2nd in state 2, 3rd in state 3)   
   r15: stack-cache register b                 (tos in state 2, 2nd in state 3)   
    r9: stack-cache register c                                 (tos in state 3)   
      
   swap 1-1   
   559E7F769425:   mov     rax,$08[r12]   
   559E7F76942A:   mov     $08[r12],r8   
      
   [continued in next message]   
      
   --- SoupGate-DOS v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]