
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 3,771 of 4,675   
   Terje Mathisen to Bernhard Schornak   
   Re: Stack management strategies   
   08 Jan 19 13:21:04   
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Bernhard Schornak wrote:   
   > It depends. I am talking about improvements regarding one or two   
   > cache lines, you're talking about shuffling around blocks with 4   
   > KiBiByte at a time. Obviously, one of us missed the point... ;)   
      
   You seem to be missing the main point of a cache: it should avoid
   accesses to secondary caches and RAM, right?
      
   For a local stack which will over time use and reuse pretty much the   
   same few cache lines over and over, those cache lines will _never_ be   
   flushed out of $L1. If they are, then you are moving so much memory   
   around that the call/return overhead becomes negligible.   
   >   
   > (And: I never mentioned 'memory transfer rates' anywhere. That's   
   > completely besides the point. It's about the execution time of a   
   > WC cycle versus the execution time of a PUSH sequence. This will   
   > not exceed 112 byte (2 cache lines!), assuming rAX and rSP never   
   > are saved on the stack.)   
      
   There IS no 'WC cycle' between the CPU and $L1: when a given cache line
   is already resident in the L1 cache, there is absolutely no reason to
   use Write Combining transfers.
      
   WC is only ever a win when you are writing to non-owned cache lines and
   the sum of the writes will fill one or more entire cache lines: in
   these cases WC avoids the need for an initial load of the cache line
   ("read for ownership").
   >   
   > Just imagine, how far it might influence the overall performance   
   > of a program with thousands of functions performing thousands of   
   > function calls while running. And then tell me again it does not   
   > matter if a prologue and epilogue was executed in 10 rather than   
   > 20 clocks per call. (Maybe I misunderstood you here, but this is   
   > how I 'interpreted' your replies.) As a matter of fact - it does   
   > matter, if you save 10 clocks per call. If you allow shoddy work   
   > to be 'performed' at the beginning and end of all functions, you do   
   > not have to worry about optimisations for the remaining parts.   
      
   If your functions are so small that the call/return overhead is at all   
   significant, then you are doing something wrong: Tiny functions   
   typically don't need to save any registers at all, or just one or two,   
   and if you spend so little time in them that the overhead is noticeable,   
   then you should probably inline the logic and avoid the call itself.   
      
   Terje   
   PS. I have been invited to work for nearly all the x86 CPU   
   manufacturers; I am currently working pro bono for a startup that is   
   trying to build a register-less CPU architecture (millcomputing.com). I worked   
   on the Quake asm code and on the very first sw DVD player. I also wrote   
   the fastest Ogg Vorbis (open source sound codec) decoder in the world.   
   During the AES process I (with three other guys) tripled the speed of   
   one of the candidate algorithms.   
      
   What have you done?   
      
   --   
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca