|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,772 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Stack management strategies    |
|    08 Jan 19 08:49:42    |
From: schornak@nospicedham.web.de

Terje Mathisen wrote:

> Bernhard Schornak wrote:
>> Terje Mathisen wrote:
>>> Bernhard Schornak wrote:
>>>> And, since I have been researching this in depth for more than
>>>> a decade now, I do know (and can prove it experimentally) that
>>>> this kind of stack management is faster than abusing rBP. As it
>>>
>>> This is where you are wrong!
>>
>> Am I? That implies you are an entity empowered to determine what
>> *has to be* wrong and what *has to be* right...
>
> Sorry, we're getting into an argument because English isn't the
> native language of either of us. :-(

I'm just pi**ed by suggestive statements with no real content.
It's not language-dependent, and I tend to accentuate things... ;)

(Therefore, I can understand if you're pi**ed by my reply, and I
apologise if my wording insulted you. Please note that I use the
twinkling eye as a marker not to take the preceding sentence(s)
too seriously.)

> The statement above was meant to be taken together with the
> following:
>>
>>> Not specifically, i.e. using a single ESP update followed by MOV
>>> is probably faster on many cpus, but not in general:
>
> I.e. I _accept_ that you have been measuring speedups!

I did. It would have been stupid to post faked claims, knowing
that everyone can easily verify them with simple testing tools.

> Further down here it does seem like you have been doing these
> measurements as isolated micro benchmarks, not by modifying
> gcc/clang to generate different function prolog/epilog code for
> a big application.

I never had the idea to modify GCC. It is way too complex, and I
want to write my own stuff in this lifetime.
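For readers skimming the archive, the two prologue/epilogue styles under
discussion can be sketched as follows. This is an illustration added for
clarity, not code from either poster; NASM-style x86-64 syntax, and the
register choice and frame size are arbitrary assumptions.

```asm
; Style 1: classic PUSH/POP prologue and epilogue.
; Compact encoding, but every PUSH/POP implicitly updates rSP.
func_push:
        push    rbp
        push    rbx
        push    r12
        ; ... function body ...
        pop     r12
        pop     rbx
        pop     rbp
        ret

; Style 2: a single rSP adjustment followed by plain MOVs.
; One stack-pointer update; the stores go to consecutive,
; ascending addresses.
func_mov:
        sub     rsp, 24
        mov     [rsp],      rbp
        mov     [rsp + 8],  rbx
        mov     [rsp + 16], r12
        ; ... function body ...
        mov     rbp, [rsp]
        mov     rbx, [rsp + 8]
        mov     r12, [rsp + 16]
        add     rsp, 24
        ret
```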
Moreover, hacking GCC would not help to get rid of the dirty
programming practices of all major operating systems - therefore,
I went the other way and developed a bunch of libraries following
my ruleset. I admit my programs are tiny compared to fully blown
'application suites' with sizes inflated into the gibibyte range,
but: less might be more in many cases.

> What I tried to say is that this type of optimization (which is
> very similar to the memcpy() / memmove() optimizations that have
> happened over the last 10-15 years) is quite often worthwhile
> when a single instance is called many times from many locations,
> but that it is far harder to come up with something which also
> improves inline code, i.e. when those big prolog/epilog
> sequences are repeated over 1000's of functions.

The conceptual design of "Intelligent Design" is based on recent
processor technology, beginning with AMD's Athlon. The faster the
executed parts, the faster the application using them. The slower
the individual functions, the slower the entire program.

>>> Simply because long series of PUSH/POP are so common in both
>>> compiler and assembler code, cpu architects have a lot of good
>>> reasons for trying to make this sort of code significantly
>>> faster, and they almost certainly will do so. (The most obvious
>>> hw optimization is to regard multiple sequential PUSH or POP
>>> operations as a single macro op; this is effectively the same
>>> as using explicit MOV operations while avoiding the code size
>>> impact.)
>>
>> Have a look at the AMD and iNTEL optimisation guides regarding
>> write combining (WC). Both tell you to prefer full 64-byte
>> writes when you want to gain immense improvements.
>> The WC logic is triggered by consecutive writes to ascending
>> memory locations. When you're forcing a WC cycle via
>
> This is yet another temporary target! I.e. likely to change
> between cpu generations.

Yes. Probably the reason why iNTEL finally jumped on this train,
too. Once established, improvements don't 'vanish'. They might be
replaced by better ones, but: my libraries aren't static and are
updated to the available technologies whenever they change.

> If you push enough data on the stack that you have to worry
> about actual memory transfer rates, instead of just the L1
> cache, then you probably have bigger problems. WC logic is only
> important when you stream data between memory buffers, in which
> case you should probably use even wider (non-temporal) vector
> stores instead.
>
> At one point in time there was a micro benchmark operating on
> big arrays where the fastest implementation (on an AMD?) was to
> first prefetch about 4KB of data into L1 cache by actually
> loading one byte from each cache line (no prefetch hint
> operations!), then do the desired operations in place, reading
> and writing the same 4KB, before finally using NT stores to
> copy the 4KB out to the target buffer.
>
> Each line worth of data was touched 6 times instead of just 2,
> but the final code ran 2-3 times faster. The key was that this
> code was completely RAM limited, so you could do arbitrary
> amounts of work inside the cpu as long as the RAM chips were
> accessed in optimal patterns.
>
> OK?

It depends. I am talking about improvements regarding one or two
cache lines; you're talking about shuffling around blocks of 4
KiB at a time. Obviously, one of us missed the point...
;)

(And: I never mentioned 'memory transfer rates' anywhere. That's
completely beside the point. It's about the execution time of a
WC cycle versus the execution time of a PUSH sequence. This will
not exceed 112 bytes (2 cache lines!), assuming rAX and rSP are
never saved on the stack.)

Just imagine how far it might influence the overall performance
of a program with thousands of functions performing thousands of
function calls while running. And then tell me again that it does
not matter whether a prologue and epilogue execute in 10 rather
than 20 clocks per call. (Maybe I misunderstood you here, but
this is how I 'interpreted' your replies.) As a matter of fact -
it does matter if you save 10 clocks per call. If you allow
shoddy work to be 'performed' at the beginning and end of every
function, you do not have to worry about optimisations for the
remaining parts.

Greetings from Augsburg

Bernhard Schornak

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
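The 4 KiB pattern Terje describes in the quoted text can be sketched
roughly like this. It is a reconstruction added for illustration, not
the original benchmark code; NASM-style x86-64 syntax, and `src`/`dst`
are hypothetical 16-byte-aligned 4 KiB buffers.

```asm
; Pass 1: pull 4 KiB into L1 by loading one byte per 64-byte
; cache line - a real load, no PREFETCH hint instructions.
        mov     rcx, 4096 / 64
        lea     rsi, [src]
warm:   mov     al, [rsi]
        add     rsi, 64
        dec     rcx
        jnz     warm

; Pass 2: do the desired operations in place on the cached 4 KiB.
;         ... (omitted) ...

; Pass 3: stream the result to the target buffer with NT stores,
; bypassing the caches on the write side.
        mov     rcx, 4096 / 16
        lea     rsi, [src]
        lea     rdi, [dst]
copy:   movdqa  xmm0, [rsi]        ; read from L1
        movntdq [rdi], xmm0        ; non-temporal store to dst
        add     rsi, 16
        add     rdi, 16
        dec     rcx
        jnz     copy
        sfence                     ; order the NT stores before reuse
```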
(c) 1994, bbs@darkrealms.ca