Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,773 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Stack management strategies (1/2)    |
|    10 Jan 19 00:03:53    |
   
   From: schornak@nospicedham.web.de   
      
   Terje Mathisen wrote:   
      
      
   > Bernhard Schornak wrote:   
   >> It depends. I am talking about improvements regarding one or
   >> two cache lines; you're talking about shuffling around blocks
   >> of 4 KiB at a time. Obviously, one of us missed the point... ;)
   >   
   > You seem to be missing the main point of a cache: that it should
   > avoid accesses to secondary caches and RAM, right?
   >   
   > For a local stack which will over time use and reuse pretty much
   > the same few cache lines over and over, those cache lines will
   > _never_ be flushed out of $L1. If they are, then you are moving so
   > much memory around that the call/return overhead becomes negligible.
      
      
   No. Only updates of invalidated cache lines (simply put: all
   cache lines with changed content) are of concern. As I told you
   in *each* reply (and repeat now): we are talking about MOVing or
   PUSHing at most 14 registers (and maybe a few variables) to the
   stack.
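   To make the comparison concrete, here is a minimal sketch of the
   two prologue/epilogue styles under discussion (register set and
   frame size are illustrative, not taken from my measured code):

           # PUSH-based: every PUSH implicitly updates rSP
           pushq   %rbx
           pushq   %rbp
           pushq   %r12

           # ... function body ...

           popq    %r12
           popq    %rbp
           popq    %rbx
           ret

           # MOV-based ('PUSH-free'): one explicit rSP update, then
           # independent stores without a chained rSP dependency
           subq    $24, %rsp
           movq    %rbx,  0(%rsp)
           movq    %rbp,  8(%rsp)
           movq    %r12, 16(%rsp)

           # ... function body ...

           movq     0(%rsp), %rbx
           movq     8(%rsp), %rbp
           movq    16(%rsp), %r12
           addq    $24, %rsp
           ret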
      
   A 'local stack' is not static. With each called function, a new
   cache line might be assigned, while the old cache line might be
   flushed once it has gone unused for so long that the replacement
   policy frees that line for incoming requests. Since neither of
   us knows what the called function (and the functions called by
   that function, and so on) executes, we cannot assume that the
   cache line assigned to a specific stack location never changes.
   Moreover, cache assignment will change (at the latest!) when your
   time slice expires and the OS switches to the next task. To get
   this straight: I make no claims about a specific cache line, I
   just make use of an existing mechanism that updates cache lines
   automatically.
      
      
   >> (And: I never mentioned 'memory transfer rates' anywhere. That's
   >> completely beside the point. It's about the execution time of a
   >> WC cycle versus the execution time of a PUSH sequence. This will
   >> not exceed 112 bytes (2 cache lines!), assuming rAX and rSP are
   >> never saved on the stack.)
   >   
   > There ARE no 'WC cycles' between CPU and $L1: When a given cache
   > line is already resident in the L1 cache, then there is absolutely
   > no reason to use Write Combining transfers.
   >
   > WC is only ever a win when you are writing to non-owned cache lines
   > and the sum of writes will fill one or more entire cache lines: In
   > these cases using WC can avoid the need for an initial load of the
   > cache line ("read for ownership").
      
      
   Cache lines are updated *frequently* - whenever their content
   has changed. All changes are written back to RAM when a cache
   line is flushed, but the cache line itself must be kept up to
   date over the entire 'lifetime' of the cached data. Each write
   to a cache line invalidates its previous content, and it is
   faster to group multiple writes than to let successive
   invalidations cause chains of timeouts before the store logic
   can retire each write. That is why 'write combining' buffers
   were introduced. WC is triggered automatically, so you cannot
   prevent it from being performed whenever you write data to
   memory (unless you run your own OS, of course).
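   For completeness: the one way user code can *explicitly* request
   write-combining stores is the non-temporal MOVNTI instruction. A
   sketch - destination and value registers chosen arbitrarily - that
   fills one complete 64-byte line, so the WC buffer can be flushed
   as a single burst:

           # %rdi = destination (assumed 64-byte aligned), %rax = value
           movnti  %rax,  0(%rdi)
           movnti  %rax,  8(%rdi)
           movnti  %rax, 16(%rdi)
           movnti  %rax, 24(%rdi)
           movnti  %rax, 32(%rdi)
           movnti  %rax, 40(%rdi)
           movnti  %rax, 48(%rdi)
           movnti  %rax, 56(%rdi)
           sfence          # order the WC stores before later accesses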
      
   The MOESI protocol doesn't force the processor to flush invalid
   cache lines, but hierarchically lower storage units (L2, L3 and
   RAM) may be updated to the current state whenever the
   corresponding cache line changes its content (coherency).
      
   What I have said up to now was based on *physically verifiable*
   code supporting my method. There must be a reason why the
   PUSH-free version is *reproducibly* faster. If my explanation is
   wrong in your eyes, then you should present a more plausible
   explanation of why it *is* faster to use MOV on all AMD
   processors. I have been researching stack management since 2004
   and can provide data for processors since the Athlon. I just
   never felt the urgent need to publish them outside my ID paper
   (where other things are as important as WC, anyway).
      
   You might now argue that my measurements are bogus, but if they
   were - and if your claims about WC above were true - there should
   be no difference between the two probed methods. Unfortunately,
   results differ for all probed processor generations, suggesting
   that my amateurish explanation is more accurate than the
   (partially unrelated) assumptions you have posted so far.
      
   Interestingly, the probed samples correlate perfectly with what
   I calculated using the timings from AMD's optimisation guides.
   That might be another random hit, but: how plausible is it that
   several random events all hit the bull's eye each time you
   perform a test?
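   For anyone who wants to reproduce such probes, a typical measuring
   skeleton looks roughly like the following (CPUID serialises the
   pipeline around RDTSC; exact fencing choices vary between processor
   generations, and the sequence under test is a placeholder):

           xorl    %eax, %eax
           cpuid                   # serialise before reading the TSC
           rdtsc                   # EDX:EAX = time-stamp counter
           shlq    $32, %rdx
           orq     %rdx, %rax
           movq    %rax, %rsi      # start time

           # ... sequence under test (e.g. a prologue/epilogue pair) ...

           xorl    %eax, %eax
           cpuid
           rdtsc
           shlq    $32, %rdx
           orq     %rdx, %rax
           subq    %rsi, %rax      # %rax = elapsed TSC ticks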
      
      
   >> Just imagine how far it might influence the overall performance
   >> of a program with thousands of functions performing thousands of
   >> function calls while running. And then tell me again that it does
   >> not matter whether a prologue and epilogue were executed in 10
   >> rather than 20 clocks per call. (Maybe I misunderstood you here,
   >> but this is how I 'interpreted' your replies.) As a matter of
   >> fact, it does matter if you save 10 clocks per call. If you allow
   >> shoddy work to be 'performed' at the beginning and end of all
   >> functions, you do not have to worry about optimisations for the
   >> remaining parts.
   >   
   > If your functions are so small that the call/return overhead is at
   > all significant, then you are doing something wrong: Tiny functions
   > typically don't need to save any registers at all, or just one or
   > two, and if you spend so little time in them that the overhead is
   > noticeable, then you should probably inline the logic and avoid the
   > call itself.
      
      
   I do know 'inlining' as a crutch for HLL programmers, but I have
   never heard of 'inlining' assembler code in a program written in
   pure assembler. That sounds more like a 'macro' (something I do
   not use) - but then you would have written 'macro', not 'inline'.
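   For readers following along: what an assembler offers in place of
   compiler 'inlining' is indeed a macro - the body is expanded at
   each use instead of paying for CALL/RET. A toy example in GAS
   syntax (names invented):

           .macro  CLAMP255 reg            # saturate \reg at 255
           cmpq    $255, \reg
           jbe     1f
           movq    $255, \reg
   1:
           .endm

           CLAMP255 %rax   # expands in place: no CALL, no RET,
                           # no stack traffic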
      
   And you still didn't answer my question: why should anyone care
   about optimising a function if its very beginning and end are not
   worth optimising? Anyone trying to speed up a car will never say
   "The tires do not matter!". How can you claim to be concerned
   about optimisation if you refuse to optimise the entire code from
   bottom to top? If you prefer to copy slow but well-established
   standard templates rather than write new, optimised code, you do
   not really care about optimisation.
      
   You talk about randomly applied local improvements; I talk about
   globally applied optimisation, where stack management is just a
   tiny part of a full-sized puzzle. A car is more than just its
   tires (or chassis or engine), but the tires contribute their tiny
   part to making the entire car a 'monster'. A program is more than
   some decent stack management (or a codec or encryption), but each
   part
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca