From: krishna.myneni@ccreweb.org   
      
   On 4/14/24 10:19, Anton Ertl wrote:   
   > Krishna Myneni writes:   
   >> dx/dt = sigma*(y - x)   
   >> dy/dt = x*(rho - z) - y
   >> dz/dt = x*y - beta*z   
   >>   
   >> where sigma, rho, and beta are constant parameters.   
   >>   
   >> Let's say we want to write a word DERIVS which computes and stores the   
   >> derivatives, given the instantaneous values of x, y, z. This is the   
   >> basis for any numerical code which solves the trajectory in time,   
   >> starting from an initial condition.   
   >>   
   >> DERIVS ( F: x y z -- )   
   >>   
   >> Hence, we want to place some values x, y, and z onto the fp stack and   
   >> compute the three derivatives. Ideally these three values remain on the   
   >> fp stack and don't need to be fetched from memory constantly until the   
   >> three derivatives are computed, especially if one is using the hardware   
   >> fp stack. We allow the constant parameters to be fetched from memory and   
   >> the results of the derivative computation to be stored to memory so they   
   >> don't overflow the stack. This should be doable with the 8-element   
   >> hardware fp stack.   
   >   
   > I have adapted your Forth code:   
   >   
   > [UNDEFINED] F2OVER [IF]   
   > : f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )   
   > 3 fpick 3 fpick ;   
   > [THEN]   
   >   
   > 16.0e0 fconstant sigma   
   > 45.92e0 fconstant rho   
   > 4.0e0 fconstant beta   
   >   
   > fvariable dx/dt   
   > fvariable dy/dt   
   > fvariable dz/dt   
   >   
   > : derivs ( F: x y z -- )
   >   fdup f2over          \ F: x y z z x y
   >   f- sigma f* fnegate  \ F: x y z z sigma*(y-x)
   >   dx/dt f!             \ F: x y z z
   >   rho fover f-         \ F: x y z z rho-z
   >   4 fpick f*           \ F: x y z z x*(rho-z)
   >   3 fpick f-           \ F: x y z z x*(rho-z)-y
   >   dy/dt f!             \ F: x y z z
   >   fdrop                \ F: x y z
   >   beta f* fnegate      \ F: x y -beta*z
   >   frot frot f* f+      \ F: x*y-beta*z
   >   dz/dt f!
   > ;
   >   
   > 0.1e 0.6e 4.0e derivs   
   > dx/dt f@ f. cr \ 8.   
   > dy/dt f@ f. cr \ 3.592   
   > dz/dt f@ f. cr \ -15.94   
   >   
   > In particular, I eliminated the additional memory accesses to DZ/DT.   
   >   
      
   Nice. FROT FROT is expensive on a memory-based FP stack unless the
   compiler optimizes it, but with the FPU stack it's probably very
   fast. I see that VFX Forth and iforth use a series of FXCH instructions
   to implement FROT FROT.
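
For a memory-based FP stack, here is a minimal sketch of the DZ/DT tail that uses FSWAP FROT in place of FROT FROT and drops the FNEGATE. The helper name DZ-TAIL is made up for illustration, and the guarded definitions of BETA and DZ/DT are only there so the fragment loads on its own. I have not measured whether it is actually faster on any system.

```forth
\ Sketch only: same result as the last two lines of DERIVS.
\ Assumes the F: x y z state that DERIVS has after its FDROP.
[UNDEFINED] beta  [IF] 4.0e0 fconstant beta [THEN]
[UNDEFINED] dz/dt [IF] fvariable dz/dt      [THEN]

: dz-tail ( F: x y z -- )   \ hypothetical helper name
   beta f*        \ F: x y beta*z
   fswap frot     \ F: beta*z y x
   f*             \ F: beta*z x*y
   fswap f-       \ F: x*y-beta*z
   dz/dt f! ;

0.1e 0.6e 4.0e dz-tail
dz/dt f@ f. cr    \ compare with the dz/dt value in the quoted test case
```

The subtraction order is arranged so that F- yields x*y - beta*z directly, which is why no FNEGATE is needed.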
      
   > SwiftForth, VFX and iforth produce the expected results for your test   
   > case. The code is:   
   >   
   > SwiftForth 4.0.0-RC87   VFX Forth 64 5.43          iforth-5.1-mini
   > ST(0) FLD               FLD ST                     fld ST(0)
   > 44E8BC ( f2over ) CALL  CALL 0050A080 F2OVER       fld [r13 0 +] tbyte
   > ST(0) ST(1) FSUBP       FSUBP ST(1), ST            fxch ST(1)
   > 44E8FB ( sigma ) CALL   CALL 0050A2BB SIGMA        fld [r13 #16 +] tby
   > ST(0) ST(1) FMULP       FMULP ST(1), ST            lea r13, [r13 #32 +]
   > FCHS                    FCHS                       fxch ST(3)
   > -8 [RBP] RBP LEA        FSTP TBYTE FFF9CFE8 [RIP]  fxch ST(1)
   > RBX 0 [RBP] MOV         CALL 0050A2FB RHO          fld ST(3)
   > 4C508 [RDI] RBX LEA     FLD ST(1)                  fld ST(3)
   > 0 [RBX] TBYTE FSTP      FSUBP ST(1), ST            fsubp ST(1), ST
   > 0 [RBP] RBX MOV         LEA RBP, [RBP+-08]         fld $101BC720 tbyte
   > 8 [RBP] RBP LEA         MOV [RBP], RBX             fmulp ST(1), ST
   > 44E923 ( rho ) CALL     MOV EBX, # 00000004        fchs
   > ST(1) FLD               CALL 005030C0 FPICK        fstp $10226470 tbyte
   > ST(0) ST(1) FSUBP       FMULP ST(1), ST            fld $101BC710 tbyte
   > -8 [RBP] RBP LEA        LEA RBP, [RBP+-08]         fld ST(1)
   > RBX 0 [RBP] MOV         MOV [RBP], RBX             fsubp ST(1), ST
   > 4 # EBX MOV             MOV EBX, # 00000003        fld ST(4)
   > 43C901 ( FPICK ) CALL   CALL 005030C0 FPICK        fmulp ST(1), ST
   > ST(0) ST(1) FMULP       FSUBP ST(1), ST            fld ST(3)
   > -8 [RBP] RBP LEA        FSTP TBYTE FFF9CFC1 [RIP]  fsubp ST(1), ST
   > RBX 0 [RBP] MOV         FSTP ST                    fstp $10226490 tbyte
   > 3 # EBX MOV             CALL 0050A33B BETA         ffreep ST(0)
   > 43C901 ( FPICK ) CALL   FMULP ST(1), ST            fld $101BC700 tbyte
   > ST(0) ST(1) FSUBP       FCHS                       fmulp ST(1), ST
   > -8 [RBP] RBP LEA        FXCH ST(1)                 fchs
   > RBX 0 [RBP] MOV         FXCH ST(2)                 fxch ST(1)
   > 4C530 [RDI] RBX LEA     FXCH ST(1)                 fxch ST(2)
   > 0 [RBX] TBYTE FSTP      FXCH ST(2)                 fxch ST(1)
   > 0 [RBP] RBX MOV         FMULP ST(1), ST            fxch ST(2)
   > 8 [RBP] RBP LEA         FADDP ST(1), ST            fmulp ST(1), ST
   > ST(0) FSTP              FSTP TBYTE FFF9CFB4 [RIP]  fxch ST(1)
   > 44E94B ( beta ) CALL    RET/NEXT                   fpopswap,
   > ST(0) ST(1) FMULP                                  faddp ST(1), ST
   > FCHS                                               fstp $102264B0 tbyte
   > 43C807 ( FROT ) CALL                               ;
   > 43C807 ( FROT ) CALL
   > ST(0) ST(1) FMULP
   > ST(0) ST(1) FADDP
   > -8 [RBP] RBP LEA
   > RBX 0 [RBP] MOV
   > 4C558 [RDI] RBX LEA
   > 0 [RBX] TBYTE FSTP
   > 0 [RBP] RBX MOV
   > 8 [RBP] RBP LEA
   > RET
   >   
   > FPICK is apparently implemented on SwiftForth and VFX through an   
   > indirect branch that branches to one of 8 variants of "FLD ST(...)",   
   > while iForth manages to resolve this during compilation.   
   >   
      
   Good to see that x, y, z are not repeatedly fetched from memory.   
      
   For this example, the hardware FPU stack is sufficient. But it's easy
   to see that the benefits of a hardware-only stack would diminish quickly
   if the problem grew even slightly, and then the programmer (or
   compiler) would have to keep careful track of how many FPU registers
   are in use.
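
One way a programmer might do that bookkeeping at run time is sketched below. It assumes that the system's FDEPTH reports the depth of the hardware stack, which I have not checked on any of the systems above. FP-BUDGET and ?FP-ROOM are made-up names for illustration.

```forth
\ Sketch only: abort before a word runs if it would exceed the fp stack.
8 constant fp-budget          \ hypothetical: size of the hardware fp stack

: ?fp-room ( n -- )           \ n = extra fp items the caller will push
   fdepth + fp-budget >
   abort" fp stack budget exceeded" ;
```

For DERIVS one could put `3 ?fp-room` at the start of the definition, since its deepest point (x y z z x y) sits three items above its three inputs.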
      
   > I have also looked at VFX 5.11 which uses XMM registers instead of the   
   > FP stack, but it does not inline FP operations, so you mostly see a long   
   > sequence of calls.   
   >   
      
   --   
   Krishna   
      