From: anton@mips.complang.tuwien.ac.at   
      
   minforth writes:   
   >Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:   
   >> The problem with 3DUP is that it is actually used in context.   
   >> What is the data that is going to 3DUP ped? In my view this   
   >> amounts to double use of data that is in registers (32 in the riscv)   
   >> anyway, after an optimiser does his thing.   
   >>   
   >   
   >Code inlining will mend it.   
      
   Inlining is important for Forth, but it does not make what has been   
   called an "analytical optimizer" unnecessary; on the contraray,   
   inlining increases the benefit we get from the analytical optimizer.   
   E.g., let's consider   
      
   : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;   
   : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;   
   : 3dup.3 {: a b c :} a b c a b c ;   
   : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;   
      
   : foo.1 3dup.1 + ! ;   
   : foo.2 3dup.2 + ! ;   
   : foo.3 3dup.3 + ! ;   
   : foo.4 3dup.4 + ! ;   
      
   The result produced by VFX64 is:   
      
   foo.1 foo.2 foo.3 foo.4   
   PUSH EBX MOV EDX, EBX CALL 3DUP.3 MOV EDX, EBX   
   MOV EBX, [ESP] ADD EBX, [EBP] ADD EBX, [EBP] ADD EBX, [EBP]   
   POP EDX MOV ECX, [EBP+04] MOV EDX, [EBP+04] MOV ECX, [EBP+04]   
   ADD EDX, [EBP] MOV 0 [EBX], ECX MOV 0 [EBX], EDX MOV 0 [EBX], ECX   
   MOV ECX, [EBP+04] MOV EBX, EDX MOV EBX, [EBP+08] MOV EBX, EDX   
   MOV 0 [EDX], ECX NEXT, LEA EBP, [EBP+0C] NEXT,   
   NEXT, NEXT,   
      
   VFX is only analytical about the data stack, and as a consequence, the   
   implementations of 3dup that only use the data stack work best. When   
   the return stack is used, as in 3dup.1/foo.1, VFX produces   
   instructions (PUSH for >R, MOV ..., [ESP] for R@ and POP for R>) for   
   the return-stack operations. When locals are used, VFX actually   
   disables inlining and just calls 3DUP.3.   
      
   Other Forth systems make too little use of inlining, and I have to   
   resort to macros to simulate it. We cannot use proper macros for   
   3dup.3 (the locals-using variant), so I used EVALUATE-based macros;   
   this is just for experimental use, not for production, don't do this   
   at home:-)   
      
   Let's see what VFX64 produces for FOO.3 with this:   
      
   FOO.3   
   ( 080C0C50 8BD4 ) MOV EDX, ESP   
   ( 080C0C52 FF7504 ) PUSH [EBP+04]   
   ( 080C0C55 FF7500 ) PUSH [EBP]   
   ( 080C0C58 53 ) PUSH EBX   
   ( 080C0C59 52 ) PUSH EDX   
   ( 080C0C5A 57 ) PUSH EDI   
   ( 080C0C5B 8BFC ) MOV EDI, ESP   
   ( 080C0C5D 81EC00000000 ) SUB ESP, 00000000   
   ( 080C0C63 8B5D08 ) MOV EBX, [EBP+08]   
   ( 080C0C66 8D6D0C ) LEA EBP, [EBP+0C]   
   ( 080C0C69 8B5708 ) MOV EDX, [EDI+08]   
   ( 080C0C6C 03570C ) ADD EDX, [EDI+0C]   
   ( 080C0C6F 8B4F08 ) MOV ECX, [EDI+08]   
   ( 080C0C72 8B470C ) MOV EAX, [EDI+0C]   
   ( 080C0C75 8D6DEC ) LEA EBP, [EBP+-14]   
   ( 080C0C78 894D04 ) MOV [EBP+04], ECX   
   ( 080C0C7B 894508 ) MOV [EBP+08], EAX   
   ( 080C0C7E 8B4F10 ) MOV ECX, [EDI+10]   
   ( 080C0C81 894D0C ) MOV [EBP+0C], ECX   
   ( 080C0C84 895D10 ) MOV [EBP+10], EBX   
   ( 080C0C87 8BDA ) MOV EBX, EDX   
   ( 080C0C89 8B5710 ) MOV EDX, [EDI+10]   
   ( 080C0C8C 895500 ) MOV [EBP], EDX   
   ( 080C0C8F 8B5500 ) MOV EDX, [EBP]   
   ( 080C0C92 8913 ) MOV 0 [EBX], EDX   
   ( 080C0C94 8B5D04 ) MOV EBX, [EBP+04]   
   ( 080C0C97 8D6D08 ) LEA EBP, [EBP+08]   
   ( 080C0C9A 8B6704 ) MOV ESP, [EDI+04]   
   ( 080C0C9D 8B3F ) MOV EDI, 0 [EDI]   
   ( 080C0C9F C3 ) NEXT,   
      
   So inlining did not mend that.   
      
   Here's what lxf produces:   
      
   foo.1 foo.2 foo.3 foo.4   
   mov eax , ebx mov eax , ebx mov eax , ebx mov eax , ebx   
   add eax , [ebp] add eax , [ebp] add eax , [ebp] add eax , [ebp]   
   mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h]   
   mov [eax] , ecx mov [eax] , ecx mov [eax] , ecx mov [eax] , ecx   
   ret near ret near ret near ret near   
      
   So, because lxf is analytical about the return stack (and, through   
   that, about locals), inlining produces the same very good code in all   
   these cases.   
      
   You may notice that lxf produces a register-register move less than   
   VFX does for FOO.2/FOO.4. That's because VFX decided to modify the   
   TOS register (and has to restore it later), whereas lxf decided to   
   modify a copy of that register. One would have to make additional   
   observations to determine if lxf was just lucky here or if it   
   consistently makes the right decision in such cases.   
      
   And here's the code that gforth-fast (which does not have an   
   analytical optimizer) produces:   
      
   foo.1 foo.2 foo.3 foo.4   
   >r 1->0 third 1->1 >l >l 1->1 dup 1->1   
    mov -8[r14],r13 mov [r10],r13 >l 1->1 mov [r10],r13   
    sub r14,$08 sub r10,$08 mov -$08[rbp],r13 sub r10,$08   
   2dup 0->2 mov r13,$18[r10] mov rdx,$08[r10] 2over 1->3   
    mov r13,$10[r10] third 1->2 mov rax,rbp mov r15,$18[r10]   
    mov r15,$08[r10] mov r15,$10[r10] add r10,$10 mov r9,$10[r10]   
   i 2->3 third 2->3 lea rbp,-$10[rbp] rot 3->3   
    mov r9,[r14] mov r9,$08[r10] mov -$10[rax],rdx mov rax,r13   
   -rot 3->2 + 3->2 mov r13,[r10] mov r13,r15   
    mov [r10],r9 add r15,r9 >l @local0 1->1 mov r15,r9   
    sub r10,$08 ! 2->0 @local0 1->1 mov r9,rax   
   r> 2->3 mov [r15],r13 mov rax,rbp + 3->2   
    mov r9,[r14] ;s 0->1 lea rbp,-$08[rbp] add r15,r9   
    add r14,$08 mov r13,$08[r10] mov -$08[rax],r13 ! 2->0   
   + 3->2 add r10,$08 @local1 1->2 mov [r15],r13   
    add r15,r9 mov rbx,[r14] mov r15,$08[rbp] ;s 0->1   
   ! 2->0 add r14,$08 @local2 2->3 mov r13,$08[r10]   
    mov [r15],r13 mov rax,[rbx] mov r9,$10[rbp] add r10,$08   
   s 0->1 jmp eax @local0 3->1 mov rbx,[r14]   
    mov r13,$08[r10] mov -$10[r10],r9 add r14,$08   
    add r10,$08 sub r10,$18 mov rax,[rbx]   
    mov rbx,[r14] mov $10[r10],r15 jmp eax   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|