|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,774 of 4,675    |
|    Terje Mathisen to Bernhard Schornak    |
|    Re: Stack management strategies (1/2)    |
|    10 Jan 19 15:22:48    |
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Bernhard Schornak wrote:   
   > Terje Mathisen wrote:   
   >   
   >   
   >> Bernhard Schornak wrote:   
   >>> It depends. I am talking about improvements regarding one or two   
   >>> cache lines, you're talking about shuffling around blocks with 4   
   >>> KiBiByte at a time. Obviously, one of us missed the point... ;)   
   >>   
   >> You seem to be missing the main point of a cache: It should avoid
   >> accesses to secondary caches and RAM, right?
   >>   
   >> For a local stack which will over time use and reuse pretty much the   
   >> same few cache lines over and over, those cache lines will _never_ be   
   >> flushed out of $L1. If they are, then you are moving so much memory   
   >> around that the call/return overhead becomes negligible.   
   >   
   >   
   > No. Only updates of invalidated cache lines (simply spoken: all   
   > cache lines with changed content) are of concern. As I told you   
   > in *each* reply (and repeat it now...): We talk about MOVing or   
      
   You keep repeating stuff, this does not make it more true.   
      
   > PUSHing at most 14 registers (and maybe a few variables) to the   
   > stack.   
      
   If you need 14 local registers in a function, then that function must be
   taking 100 to 1000+ clock cycles anyway; otherwise you are doing
   something non-optimal.
      
   I am NOT saying that you cannot make save/restore of registers faster on   
   several CPU models by having explicit MOVs instead of a bunch of   
   PUSH/POP, rather the opposite: This is indeed likely to make this   
   particular operation, in isolation, faster.   
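   To make the comparison concrete, here is a sketch of the two styles as a
   GNU C (x86-64) micro-kernel. The function bodies and the two-register
   save area are invented for illustration only, not anyone's measured code:

   ```c
   /* Hypothetical comparison of PUSH/POP vs MOV-based register save/restore.
      Both kernels just save rbx/r12, compute x+1, and restore. */
   #include <inttypes.h>
   #include <stdio.h>

   /* PUSH/POP style: each PUSH both stores and adjusts RSP. */
   static uint64_t save_push(uint64_t x) {
       uint64_t r;
       __asm__ volatile(
           "push %%rbx\n\t"
           "push %%r12\n\t"
           "mov  %1, %%rbx\n\t"
           "lea  1(%%rbx), %0\n\t"
           "pop  %%r12\n\t"
           "pop  %%rbx\n\t"
           : "=r"(r) : "r"(x) : "rbx", "r12", "memory");
       return r;
   }

   /* MOV style: one RSP adjustment, then independent MOVs that do not
      serialize on the stack-pointer update. */
   static uint64_t save_mov(uint64_t x) {
       uint64_t r;
       __asm__ volatile(
           "sub  $16, %%rsp\n\t"
           "mov  %%rbx, (%%rsp)\n\t"
           "mov  %%r12, 8(%%rsp)\n\t"
           "mov  %1, %%rbx\n\t"
           "lea  1(%%rbx), %0\n\t"
           "mov  (%%rsp), %%rbx\n\t"
           "mov  8(%%rsp), %%r12\n\t"
           "add  $16, %%rsp\n\t"
           : "=r"(r) : "r"(x) : "rbx", "r12", "memory");
       return r;
   }

   int main(void) {
       printf("%llu %llu\n",
              (unsigned long long)save_push(41),
              (unsigned long long)save_mov(41));
       return 0;
   }
   ```

   Both variants compute the same result; the point under discussion is
   only which save/restore sequence retires faster on a given core.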
      
   >   
   > A 'local stack' is not static. With each called function, a new   
   > cache line might be assigned, while the old cache line might be   
   > flushed if the time since the last access moved the priority to   
   > the limit where that line is freed for incoming requests. As we   
   > both do not know what the called function (and functions called   
   > by that function, and so on) executes, we can't assume that the   
   > cache line assigned to a specific stack location never changes.   
   > Moreover, cache assignment will change (at the latest!) if your   
   > time slice expires and the OS switches to the next task. To get   
   > this straight: I don't make claims about a specific cache line,   
   > I just make use of an existing automatism updating cache lines.   
      
   Bernhard, here you are in fact wrong: When a given stack location is   
   mapped to a given cache line, then it is extremely likely that it will   
   stay mapped to the same line for a _very_ long time, simply because you   
   are actively using that stack area for local vars and function   
   parameters. This is exactly how a cache is supposed to work, i.e.   
   significantly reduce the amount of traffic going all the way to RAM.   
      
   On a typical x86 CPU the $L1 is 8-way set associative, which means that
   you need heavy usage of the other 7 ways in the same set before the
   stack area becomes an eviction victim (and is written back to
   $L2/$L3/RAM).
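   To put numbers on that, here is a back-of-envelope sketch. All the cache
   parameters (32 KiB, 8-way, 64-byte lines) and the stack pointer value are
   assumptions, picked as typical values for illustration:

   ```c
   /* Which L1D set does a stack address map to?  Assumes a typical
      32 KiB, 8-way, 64-byte-line L1D: 32768 / (8 * 64) = 64 sets,
      so the set index is address bits 6..11. */
   #include <stdint.h>
   #include <stdio.h>

   #define LINE  64
   #define WAYS  8
   #define L1SZ  (32 * 1024)
   #define SETS  (L1SZ / (WAYS * LINE))      /* = 64 */

   static unsigned l1_set(uintptr_t a) { return (a / LINE) % SETS; }

   int main(void)
   {
       uintptr_t rsp = 0x7ffe12345680u;      /* hypothetical stack pointer */

       /* The same stack slot always indexes the same set, and another
          line competes with it only if it lands in that set too.  With
          8 ways you need 8 *other* hot lines in this one set before the
          stack line can be evicted. */
       printf("%d sets, rsp -> set %u\n", SETS, l1_set(rsp));
       return 0;
   }
   ```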
      
   Secondly:   
      
   Please reread all documentation on how WC works: It is _only_ ever
   applicable when you want to write data to a fresh cache line and fill it
   completely, so that you can avoid an initial read-for-ownership bus
   transfer, which is unneeded since all the old data in the line is about
   to be overwritten.
      
   When you have done so then you can come back and explain how WC is   
   important for stack data.   
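   For reference, the one pattern where WC genuinely pays off looks like
   this. A hypothetical buffer fill using SSE2 streaming stores; the buffer
   size and fill value are made up for illustration:

   ```c
   /* The WC sweet spot: writing whole fresh cache lines, so the initial
      read-for-ownership can be skipped entirely. */
   #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
   #include <stdio.h>
   #include <stdlib.h>

   int main(void)
   {
       enum { N = 1024 };                    /* 16 full 64-byte lines */
       char *p = aligned_alloc(64, N);       /* line-aligned buffer */
       if (!p) return 1;

       __m128i v = _mm_set1_epi32(42);

       /* Four 16-byte streaming stores fill each 64-byte line; the WC
          buffer collects them and writes the whole line out without
          first reading its old contents. */
       for (int i = 0; i < N; i += 16)
           _mm_stream_si128((__m128i *)(p + i), v);
       _mm_sfence();                         /* drain the WC buffers */

       printf("%d\n", ((int *)p)[0]);
       free(p);
       return 0;
   }
   ```

   Note what this pattern is not: repeated read-modify-write traffic to
   stack lines that are already resident in $L1.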
      
   If you can actually do this then I'll thank you for teaching me   
   something new and very unexpected (which is the best kind!), otherwise   
   please try to be more polite.   
      
   Terje   
      
   >   
   >   
   >>> (And: I never mentioned 'memory transfer rates' anywhere. That's
   >>> completely beside the point. It's about the execution time of a
   >>> WC cycle versus the execution time of a PUSH sequence. This will
   >>> not exceed 112 bytes (2 cache lines!), assuming rAX and rSP never
   >>> are saved on the stack.)
   >>   
   >> There IS no 'WC cycle' between CPU and $L1: When a given cache line
   >> is already resident in the L1 cache, then there is absolutely no
   >> reason to use Write Combining transfers.
   >>   
   >> WC is only ever a win when you are writing to non-owned cache lines   
   >> and the sum of writes will fill one or more entire cache lines: In   
   >> this case, using WC can avoid the need for an initial load of the
   >> cache line ("read for ownership").   
   >   
   >   
   > Cache lines *frequently* are updated whenever their content has   
   > changed. All changes are written back to RAM if a cache line is   
   > flushed, but the cache line itself must be kept up to date over   
   > the entire 'lifetime' of the cached data. Each write to a cache   
   > line invalidates its content and it is faster to group multiple   
   > writes to avoid successive invalidations causing chains of many   
   > timeouts to let the storage logic retire from the write. That's   
   > why 'write combining' buffering was introduced. WC is triggered   
   > automatically, so you cannot prevent it (if you do not run your   
   > own OS, of course) from being performed whenever you write data   
   > to memory.   
   >   
   > The MOESI protocol doesn't force the processor to flush invalid   
   > cache lines, but hierarchically lower storage units (L2, L3 and   
   > RAM) may be updated to the current state when the corresponding   
   > cache line changed its content (coherency).   
   >   
   > What I have said up to now was based on *physically verifiable*
   > code supporting my method. There must be a reason why the PUSH-free
   > version is *reproducibly* faster. If my explanation is wrong in
   > your eyes, then you should present a more plausible explanation
   > why it *is* faster to use MOV on all AMD processors. I have
   > researched stack management since 2004 and can provide data for
   > processors since the Athlon. I just never felt the urgent need to
   > publish them outside my ID paper (where other things are as
   > important as WC, anyway).
   >   
   > You now might argue that my measurements are bogus, but if they
   > were, there should be no difference between the two probed methods
   > if your above claims about WC were true. Unfortunately, results
   > differ for all probed processor generations, suggesting that my
   > amateurish explanation is more accurate than the (partially
   > unrelated) assumptions you posted until now.
   >   
   > Interestingly, the probed samples perfectly correlate with what   
   > I calculated using the timings from AMD's optimisation guides -   
   > might be another random hit, but: How plausible is it to assume   
   > that several random events all hit the bull's eye each time you   
   > perform a test?   
   >   
   >   
   >>> Just imagine, how far it might influence the overall performance   
   >>> of a program with thousands of functions performing thousands of   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|
(c) 1994, bbs@darkrealms.ca