From: anton@mips.complang.tuwien.ac.at   
      
   Stephen Pelc writes:   
   >On 14 Sep 2024 at 08:19:52 CEST, "Anton Ertl" wrote:   
   >   
   >> locals stack
   >>    401   336 gforth-fast (AMD64)
   >>    179   132 lxf 1.6-982-823 (IA-32)
   >>    182   119 VFX FX Forth for Linux IA32 Version: 4.72 (IA-32)
   >>    241   159 VFX Forth 64 5.43 (AMD64)
   >>    163   175 iforth-5.1 mini (AMD64)
   >   
   >There are design decisions within locals that can impact optimisation.   
   >The design of locals in VFX was influenced by Don Colburn's Forth's   
   >and by a desire to use locals to simplify source code when interfacing   
   >to a host operating system. Many operating systems return data   
   >to the caller by passing the address of a variable/buffer as an input   
   >parameter. Locals that can have an accessible address make such   
   >code much easier to read and write.   
      
   Gforth has had variable-flavoured locals from the start, and   
   implemented VFX's local-buffer syntax some time ago without problems,   
   so Gforth's design decisions are obviously compatible with these   
   requirements.   
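
   To illustrate what a variable-flavoured local buys you when a callee
   returns data through an address, here is a minimal sketch in Gforth's
   own locals syntax. GET-ANSWER is a hypothetical stand-in for an
   OS call with an "out" parameter; it is not taken from any real API,
   and the W^ specifier is Gforth-specific rather than standard.

```forth
\ Hypothetical callee: stores its result at the address it is given,
\ the way many operating-system calls return data.
: get-answer ( addr -- )  42 swap ! ;

: use-answer ( -- n )
  0 { w^ result }     \ variable-flavoured local: RESULT pushes its address
  result get-answer   \ pass that address as the "out" parameter
  result @ ;          \ read back what the callee stored
```

   USE-ANSWER leaves 42 on the data stack; no global variable or
   explicit buffer allocation is needed.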
      
   Now Gforth's numbers above are the worst of all Forth systems, so why   
   would Gforth be relevant? The native code for locals by iForth seems   
   to be very much in the same spirit: A separate locals stack, and   
   locals are accessed relative to the locals-stack pointer; and iForth   
   has the best locals code size of all (but looking at the VFX code, my
   guess is that, in the present case, this is mainly because iForth
   uses RSP for the data stack and some other register for the return
   stack). Actually, even with your approach of keeping the locals on
   the return stack, and having a separate locals-frame pointer, I don't   
   see why the locals code should be worse. But looking at the start of   
   the VFX64 code for VICHECK1, there is a bit of superfluous work:   
      
   : VICHECK1 {: pindex paddr -- pindex' paddr :} \ Checks for valid index   
   \ paddr is the address of the data, the first cell of which contains   
   \ the array size   
    pindex 0 paddr @ WITHIN IF \ Index is valid   
      
   VICHECK1   
   ( 0050A460 488BD4 ) MOV RDX, RSP   
   ( 0050A463 48FF7500 ) PUSH QWORD [RBP]   
   ( 0050A467 53 ) PUSH RBX   
   ( 0050A468 52 ) PUSH RDX   
   ( 0050A469 57 ) PUSH RDI   
   ( 0050A46A 488BFC ) MOV RDI, RSP   
   ( 0050A46D 4881EC00000000 ) SUB RSP, # 00000000   
   ( 0050A474 488B5D08 ) MOV RBX, [RBP+08]   
   ( 0050A478 488D6D10 ) LEA RBP, [RBP+10]   
   ( 0050A47C 488B5710 ) MOV RDX, [RDI+10]   
   ( 0050A480 488B12 ) MOV RDX, 0 [RDX]   
   ( 0050A483 B900000000 ) MOV ECX, # 00000000   
   ( 0050A488 482BD1 ) SUB RDX, RCX   
   ( 0050A48B 488B4718 ) MOV RAX, [RDI+18]   
   ( 0050A48F 482BC1 ) SUB RAX, RCX   
   ( 0050A492 483BC2 ) CMP RAX, RDX   
   ( 0050A495 0F8319000000 ) JNB/AE 0050A4B4   
      
   It's not clear to me why you push so much on the return stack at the   
   start, instead of just the two values pindex and paddr (which you do   
   in 0050A463 and 0050A467). OK, you also push the old locals-frame
   pointer RDI in 0050A469, which is a result of having the locals on the return
   stack instead of in a separate stack, but why push the old return   
   stack pointer? You know the size of your locals, just adjust RSP by   
   that much in the end.   
      
   The instruction at 0050A46D seems superfluous. My guess is that it's   
   there for the possible | part in the locals definition.   
      
   The next two instructions refill the TOS register RBX and adjust the   
   data stack pointer RBP. That completes the code for the locals   
   definition. From then on locals are loaded from memory, as   
   in iForth. Let's also inspect the end:
      
    0 paddr \ Use zeroth index   
    THEN ;   
      
   ( 0050A535 488D6DF0 ) LEA RBP, [RBP+-10]   
   ( 0050A539 48C7450000000000 ) MOV QWord [RBP], # 00000000   
   ( 0050A541 48895D08 ) MOV [RBP+08], RBX   
   ( 0050A545 488B5F10 ) MOV RBX, [RDI+10]   
   ( 0050A549 488B6708 ) MOV RSP, [RDI+08]   
   ( 0050A54D 488B3F ) MOV RDI, 0 [RDI]   
   ( 0050A550 C3 ) RET/NEXT   
      
   The THEN is right before 0050A549. The code before THEN pushes 0 and paddr   
   on the data stack, and stores the former TOS in memory before loading   
   the new TOS. The three instructions after the THEN restore the return   
   stack and locals-frame pointer and return.   
      
   So there is a little bit that can be done without much effort, but not   
   much.   
      
   I always thought that a separate locals stack is a thing I did in   
   Gforth out of laziness, and pay for it by having to maintain a
   separate stack pointer, but it turns out that with locals on the   
   return stack, you still need an extra register for locals in memory,   
   and you spend additional overhead.   
      
   >In the last   
   >decade or so there has been very little customer demand for   
   >faster code.   
      
   See below.   
      
   >However, higher level source code has been much   
   >in demand. An example is Nick Nelson's value flavoured structures,   
   >which are of particular merit when converting code from 32 bit to   
   >64 bit host Forths.   
      
   Gforth has worked on 64-bit hosts since early 1996, and I found that   
   Forth code tends to have fewer portability problems between 32-bit and   
   64-bit platforms than C code, and that's not just my code:
   applications in appbench and many others are also quite portable.   
      
   A major merit of value-flavoured structures is that you can change
   the field size (e.g., from 1 byte to 2 bytes or vice versa) without
   changing all the code accessing those fields. That's independent of   
   cell size.   
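
   The point about field sizes can be sketched in standard Forth-2012
   by hiding the access width behind hand-written accessors. This only
   emulates the idea; it is not Nick Nelson's implementation, and the
   names ITEM, ITEM.COUNT-RAW, ITEM.COUNT, ITEM.COUNT! and MY-ITEM are
   made up for the example.

```forth
\ Sketch only: a getter/setter pair that hides the field width.
begin-structure item
  cfield: item.count-raw      \ 1-byte field; widen it here if needed
end-structure

\ Only these two words know the width of the field.
: item.count  ( addr -- u )  item.count-raw c@ ;
: item.count! ( u addr -- )  item.count-raw c! ;

create my-item item allot
7 my-item item.count!
my-item item.count .          \ client code never mentions the width
```

   If the field later grows to two bytes, only the structure definition
   and the two accessors change; every use of ITEM.COUNT and ITEM.COUNT!
   stays as it is, on 32-bit and 64-bit systems alike.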
      
   >Just because many of the Forth applications visible to the Forth   
   >community now run on CPUs with 16 or 32 address registers   
   >does not mean that all systems can implement the compiler   
   >techniques required for high-performance locals.   
      
   It's obvious that hardly any Forth system implements register   
   allocation of locals, with the exception being lxf, which uses an   
   architecture with 8 general-purpose registers (address registers   
   recall bad memories from the 68000 days); and for lxf, register   
   allocation is limited to basic blocks or less.   
      
   >I can buy a lot of CPU cycles for the cost of one day of programmer   
   >time.   
      
   Some guy called Stephen Pelc (must be a different one) recently posted:
      
   |We (MPE) converted much of our TCP/IP stack not to use locals. This   
   |was mostly on ARM7 devices, but the figures for other 32 bit CPUs of   
   |the period (say 15 years ago) were similar. Code density improved by   
   |about 25% and performance by about 50%.   
      
   How much time did that conversion cost? And this Stephen Pelc   
      
   [continued in next message]   
      