
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 3,765 of 4,675   
   Terje Mathisen to Bernhard Schornak   
   Re: Stack management strategies   
   07 Jan 19 08:46:28   
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Bernhard Schornak wrote:   
   > Terje Mathisen wrote:   
   >> Bernhard Schornak wrote:   
   >>> And, since I research this in depth for more than a decade   
   >>> now, I do know (and can prove it experimentally) that this   
   >>> kind of stack management is faster than abusing rBP. As it   
   >>   
   >> This is where you are wrong!   
   >   
   > Am I? Implies you are an entity empowered to determine what *has   
   > to be* wrong and what *has to be* right...   
      
   Sorry, we're getting into an argument because English isn't the native   
   language of either of us. :-(   
      
   The statement above was meant to be taken together with the following:   
   >   
   >   
   >> Not specifically, i.e. using a single ESP update followed by MOV is   
   >> probably faster on many cpus, but not in general:   
      
   I.e. I _accept_ that you have been measuring speedups!   
      
   Further down it does seem like you have been doing these
   measurements as isolated microbenchmarks, not by modifying gcc/clang to
   generate different function prolog/epilog code for a big application.
      
   What I tried to say is that this type of optimization (which is very
   similar to the memcpy()/memmove() optimizations that have happened
   over the last 10-15 years) is quite often worthwhile when a single
   instance is called many times from many locations, but that it is far
   harder to come up with something which also improves inline code, i.e.
   when those big prolog/epilog sequences are repeated across thousands
   of functions.
      
      
   >>   
   >> Simply because long series of PUSH/POP are so common in both compiler   
   >> and assembler code, cpu architects have a lot of good reasons for   
   >> trying to make this sort of code significantly faster, and they almost   
   >> certainly will do so. (The most obvious hw optimization is to regard   
   >> multiple sequential PUSH or POP operations as a single macro op, this   
   >> is effectively the same as using explicit MOV operations while   
   >> avoiding the code size impact.)   
   >   
   >   
   > Have a look at AMD and iNTEL optimisation guides regarding write   
   > combining (WC). Both tell you to prefer full 64 byte writes when   
   > you want to gain immense improvements. The WC logic is triggered   
   > by consecutive writes to ascending memory locations. When you're   
   > forcing a WC cycle via   
      
   This is yet another temporary target! I.e. likely to change between cpu   
   generations.   
      
   If you push enough data on the stack that you have to worry about actual   
   memory transfer rates, instead of just the L1 cache, then you probably   
   have more problems. WC logic is only important when you stream data   
   between memory buffers, in which case you should probably use even wider   
   (non-temporal) vector stores instead.   
      
   At one point in time there was a micro benchmark operating on big arrays   
   where the fastest implementation (on an AMD?) was to first prefetch   
   about 4KB of data into L1 cache by actually loading one byte from each   
   cache line (no prefetch hint operations!), then do the desired   
   operations in place reading and writing the same 4KB, before finally   
   using NT stores to copy the 4KB out to the target buffer.   
      
   Each line's worth of data was touched 6 times instead of just 2, but
   the final code ran 2-3 times faster. The key was that this code was
   completely RAM limited so you could do arbitrary amounts of work inside   
   the cpu as long as the RAM chips were accessed in optimal patterns.   
      
   OK?   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca