comp.lang.forth
Anton Ertl to Stephen Pelc
Re: Avoid treating the stack as an array
15 Sep 24 16:16:34
   From: anton@mips.complang.tuwien.ac.at   
      
   Stephen Pelc writes:
   >On 14 Sep 2024 at 08:19:52 CEST, "Anton Ertl" wrote:
   >   
   >> locals    stack   
   >> 401       336   gforth-fast (AMD64)   
   >> 179       132   lxf 1.6-982-823 (IA-32)   
   >> 182       119   VFX FX Forth for Linux IA32 Version: 4.72 (IA-32)   
   >> 241       159   VFX Forth 64 5.43 (AMD64)   
   >> 163       175   iforth-5.1 mini (AMD64)   
   >   
   >There are design decisions within locals that can impact optimisation.   
   >The design of locals in VFX was influenced by Don Colburn's Forth's   
   >and by a desire to use locals to simplify source code when interfacing   
   >to a host operating system. Many operating systems return data   
   >to the caller by passing the address of a variable/buffer as an input   
   >parameter. Locals that can have an accessible address make such   
   >code much easier to read and write.   
      
   Gforth has had variable-flavoured locals from the start, and   
   implemented VFX's local-buffer syntax some time ago without problems,   
   so Gforth's design decisions are obviously compatible with these   
   requirements.   
      
   Now Gforth's numbers above are the worst of all Forth systems, so why   
   would Gforth be relevant?  The native code for locals by iForth seems   
   to be very much in the same spirit: A separate locals stack, and   
   locals are accessed relative to the locals-stack pointer; and iForth   
   has the best locals code size of all (but looking at the VFX code, my
   guess is that in the present case this is mainly because iForth uses
   RSP for the data stack and some other register for the return
   stack).  Actually, even with your approach of keeping the locals on
   the return stack, and having a separate locals-frame pointer, I don't   
   see why the locals code should be worse.  But looking at the start of   
   the VFX64 code for VICHECK1, there is a bit of superfluous work:   
      
   : VICHECK1 {: pindex paddr -- pindex' paddr :} \ Checks for valid index   
   \ paddr is the address of the data, the first cell of which contains   
   \ the array size   
       pindex 0 paddr @ WITHIN IF \ Index is valid   
      
   VICHECK1   
   ( 0050A460    488BD4 )                MOV     RDX, RSP   
   ( 0050A463    48FF7500 )              PUSH    QWORD [RBP]   
   ( 0050A467    53 )                    PUSH    RBX   
   ( 0050A468    52 )                    PUSH    RDX   
   ( 0050A469    57 )                    PUSH    RDI   
   ( 0050A46A    488BFC )                MOV     RDI, RSP   
   ( 0050A46D    4881EC00000000 )        SUB     RSP, # 00000000   
   ( 0050A474    488B5D08 )              MOV     RBX, [RBP+08]   
   ( 0050A478    488D6D10 )              LEA     RBP, [RBP+10]   
   ( 0050A47C    488B5710 )              MOV     RDX, [RDI+10]   
   ( 0050A480    488B12 )                MOV     RDX, 0 [RDX]   
   ( 0050A483    B900000000 )            MOV     ECX, # 00000000   
   ( 0050A488    482BD1 )                SUB     RDX, RCX   
   ( 0050A48B    488B4718 )              MOV     RAX, [RDI+18]   
   ( 0050A48F    482BC1 )                SUB     RAX, RCX   
   ( 0050A492    483BC2 )                CMP     RAX, RDX   
   ( 0050A495    0F8319000000 )          JNB/AE  0050A4B4   
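
   For reference, the SUB/SUB/CMP/JNB sequence from 0050A488 to 0050A495
   matches the usual way of implementing WITHIN with a single unsigned
   comparison.  A minimal sketch in standard Forth follows; the name
   MY-WITHIN is hypothetical and is used only to avoid redefining the
   system's own WITHIN:

```forth
\ Sketch only: MY-WITHIN is a hypothetical name, not a VFX word.
\ flag is true if low <= test < high (for low <= high),
\ using one unsigned compare on the two differences.
: MY-WITHIN ( test low high -- flag )
   OVER - >R      \ R: high-low
   - R> U< ;      \ test-low  high-low  U<
```

   With low = 0, as in VICHECK1, a negative pindex becomes a large
   unsigned number and therefore fails the check as well.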
      
   It's not clear to me why you push so much on the return stack at the   
   start, instead of just the two values pindex and paddr (which you do   
   in 0050A463 and 0050A467).  Ok, you also push the old locals-frame pointer
   RDI in 0050A469, which is a result of having the locals on the return   
   stack instead of in a separate stack, but why push the old return   
   stack pointer?  You know the size of your locals, just adjust RSP by   
   that much in the end.   
      
   The instruction at 0050A46D seems superfluous.  My guess is that it's   
   there for the possible | part in the locals definition.   
      
   The next two instructions refill the TOS register RBX and adjust the   
   data stack pointer RBP.  That completes the code for the locals   
   definition.  From then on locals are loaded from memory, as
   in iForth.  Let's also inspect the end:
      
           0 paddr \ Use zeroth index   
       THEN ;   
      
   ( 0050A535    488D6DF0 )              LEA     RBP, [RBP+-10]   
   ( 0050A539    48C7450000000000 )      MOV     QWord [RBP], # 00000000   
   ( 0050A541    48895D08 )              MOV     [RBP+08], RBX   
   ( 0050A545    488B5F10 )              MOV     RBX, [RDI+10]   
   ( 0050A549    488B6708 )              MOV     RSP, [RDI+08]   
   ( 0050A54D    488B3F )                MOV     RDI, 0 [RDI]   
   ( 0050A550    C3 )                    RET/NEXT   
      
   The THEN is right before 0050A549.  The code before THEN pushes 0 and paddr   
   on the data stack, and stores the former TOS in memory before loading   
   the new TOS.  The three instructions after the THEN restore the return   
   stack and locals-frame pointer and return.   
      
   So there is a little bit that can be done without much effort, but not   
   much.   
      
   I always thought that a separate locals stack is a thing I did in   
   Gforth out of laziness, and pay for it by having to maintain a
   separate stack pointer, but it turns out that with locals on the   
   return stack, you still need an extra register for locals in memory,   
   and you spend additional overhead.   
      
   >In the last   
   >decade or so there has been very little customer demand for   
   >faster code.   
      
   See below.   
      
   >However, higher level source code has been much   
   >in demand. An example is Nick Nelson's value flavoured structures,   
   >which are of particular merit when converting code from 32 bit to   
   >64 bit host Forths.   
      
   Gforth has worked on 64-bit hosts since early 1996, and I found that
   Forth code tends to have fewer portability problems between 32-bit and
   64-bit platforms than C code.  That's not just my code: the
   applications in appbench and many others are also quite portable.
      
   A major merit for value-flavoured structures is that you can change   
   the field size (e.g., from 1 byte to 2 bytes or vice versa) without
   changing all the code accessing those fields.  That's independent of   
   cell size.   
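
   To illustrate the point, here is a minimal sketch in standard Forth.
   The defining words BYTE-FIELD: and PAIR-FIELD: and the field names
   are hypothetical; they are not the VFX or Gforth syntax.  The sketch
   covers only the fetch side (no TO support) and assumes that one
   char is one byte:

```forth
\ Hypothetical sketch, not the VFX or Gforth field-defining words.
\ Each field word fetches its value itself, so callers never
\ write C@ or a 16-bit fetch at the point of use.
: BYTE-FIELD: ( offset "name" -- offset' )   \ 1-byte field
   CREATE DUP , 1 CHARS +
   DOES> ( struct -- u ) @ + C@ ;

: LE16@ ( addr -- u )   \ 16-bit little-endian fetch from two chars
   DUP C@ SWAP CHAR+ C@ 8 LSHIFT OR ;

: PAIR-FIELD: ( offset "name" -- offset' )   \ 2-byte field
   CREATE DUP , 2 CHARS +
   DOES> ( struct -- u ) @ + LE16@ ;

0
  BYTE-FIELD: item.kind
  BYTE-FIELD: item.count   \ switch to PAIR-FIELD: to widen this field
CONSTANT /item

\ Access code names no field size, so it survives the change above.
: TOTAL ( struct -- n )  DUP item.kind SWAP item.count + ;

CREATE buf /item ALLOT
3 buf C!  7 buf CHAR+ C!
buf TOTAL .
```

   With address-flavoured fields, TOTAL would contain C@ after each
   field name, and each of those would have to be edited when the field
   is widened.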
      
   >Just because many of the Forth applications visible to the Forth   
   >community now run on CPUs with 16 or 32 address registers   
   >does not mean that all systems can implement the compiler   
   >techniques required for high-performance locals.   
      
   It's obvious that hardly any Forth system implements register   
   allocation of locals, with the exception being lxf, which uses an   
   architecture with 8 general-purpose registers (address registers   
   recall bad memories from the 68000 days); and for lxf, register   
   allocation is limited to basic blocks or less.   
      
   >I can buy a lot of CPU cycles for the cost of one day of programmer   
   >time.   
      
   Some guy called Stephen Pelc (must be a different one) recently
   posted:
      
   |We (MPE) converted much of our TCP/IP stack not to use locals. This   
   |was mostly on ARM7 devices, but the figures for other 32 bit CPUs of   
   |the period (say 15 years ago) were similar. Code density improved by   
   |about 25% and performance by about 50%.   
      
   How much time did that conversion cost?  And this Stephen Pelc   
      
   [continued in next message]   
      