Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,765 of 4,675    |
|    Terje Mathisen to Bernhard Schornak    |
|    Re: Stack management strategies    |
|    07 Jan 19 08:46:28    |
From: terje.mathisen@nospicedham.tmsw.no

Bernhard Schornak wrote:
> Terje Mathisen wrote:
>> Bernhard Schornak wrote:
>>> And, since I research this in depth for more than a decade
>>> now, I do know (and can prove it experimentally) that this
>>> kind of stack management is faster than abusing rBP. As it
>>
>> This is where you are wrong!
>
> Am I? Implies you are an entity empowered to determine what *has
> to be* wrong and what *has to be* right...

Sorry, we're getting into an argument because English isn't the native
language of either of us. :-(

The statement above was meant to be taken together with the following:

>> Not specifically, i.e. using a single ESP update followed by MOV is
>> probably faster on many cpus, but not in general:

I.e. I _accept_ that you have been measuring speedups!

Further down it does seem like you have been doing these measurements
as isolated micro benchmarks, not by modifying gcc/clang to generate
different function prolog/epilog code for a big application.

What I tried to say is that this type of optimization (which is very
similar to the memcpy()/memmove() optimizations that have happened over
the last 10-15 years) is quite often worthwhile when a single instance
is called many times from many locations, but that it is far harder to
come up with something which also improves inline code, i.e. when those
big prolog/epilog sequences are repeated over 1000's of functions.

>> Simply because long series of PUSH/POP are so common in both compiler
>> and assembler code, cpu architects have a lot of good reasons for
>> trying to make this sort of code significantly faster, and they
>> almost certainly will do so. (The most obvious hw optimization is to
>> regard multiple sequential PUSH or POP operations as a single macro
>> op; this is effectively the same as using explicit MOV operations
>> while avoiding the code size impact.)
>
> Have a look at AMD and iNTEL optimisation guides regarding write
> combining (WC). Both tell you to prefer full 64 byte writes when
> you want to gain immense improvements. The WC logic is triggered
> by consecutive writes to ascending memory locations. When you're
> forcing a WC cycle via

This is yet another temporary target, i.e. likely to change between cpu
generations.

If you push enough data on the stack that you have to worry about actual
memory transfer rates, instead of just the L1 cache, then you probably
have bigger problems. WC logic only matters when you stream data between
memory buffers, in which case you should probably use even wider
(non-temporal) vector stores instead.

At one point there was a micro benchmark operating on big arrays where
the fastest implementation (on an AMD?) was to first prefetch about 4KB
of data into the L1 cache by actually loading one byte from each cache
line (no prefetch hint operations!), then do the desired operations in
place, reading and writing the same 4KB, before finally using NT stores
to copy the 4KB out to the target buffer.

Each line's worth of data was touched 6 times instead of just 2, but the
final code ran 2-3 times faster. The key was that this code was
completely RAM limited, so you could do arbitrary amounts of work inside
the cpu as long as the RAM chips were accessed in optimal patterns.

OK?

Terje

--
-
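[For reference, the two prolog/epilog styles under discussion can be
sketched roughly as follows. A minimal NASM-style illustration, not code
from either poster; register choice and frame size are arbitrary:]

```asm
; Style 1: classic PUSH/POP prolog and epilog
func_push:
        push    rbp
        push    rbx
        push    r12
        push    r13
        ; ... function body ...
        pop     r13
        pop     r12
        pop     rbx
        pop     rbp
        ret

; Style 2: a single rSP adjustment plus plain MOVs (the approach
; Schornak measured), trading code size for one stack-pointer update
func_mov:
        sub     rsp, 32
        mov     [rsp],    rbp
        mov     [rsp+8],  rbx
        mov     [rsp+16], r12
        mov     [rsp+24], r13
        ; ... function body ...
        mov     rbp, [rsp]
        mov     rbx, [rsp+8]
        mov     r12, [rsp+16]
        mov     r13, [rsp+24]
        add     rsp, 32
        ret
```

[Which style wins depends on the cpu generation — which is exactly
Terje's point about hardware fusing sequential PUSH/POPs into macro
ops.]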
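[The touch-then-NT-store pattern Terje describes might look roughly
like this. A hypothetical NASM-style sketch, not the original benchmark
code; it assumes rsi = source, rdi = destination, rcx = byte count, a
multiple of 64 and 16-byte aligned:]

```asm
; Step 1: pull the whole block into L1 with ordinary loads,
; one byte per 64-byte cache line (no prefetch hints)
        xor     rax, rax
.touch: movzx   edx, byte [rsi+rax]
        add     rax, 64
        cmp     rax, rcx
        jb      .touch

; Step 2: transform the block in place (reads and writes hit L1)
; ... desired operations on [rsi .. rsi+rcx) ...

; Step 3: stream the result to the target buffer with non-temporal
; stores, so the output never pollutes the caches
        xor     rax, rax
.copy:  movdqa  xmm0, [rsi+rax]
        movntdq [rdi+rax], xmm0
        add     rax, 16
        cmp     rax, rcx
        jb      .copy
        sfence                  ; order the NT stores before later reads
```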
(c) 1994, bbs@darkrealms.ca