|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,772 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Stack management strategies    |
|    08 Jan 19 08:49:42    |
From: schornak@nospicedham.web.de

Terje Mathisen wrote:

> Bernhard Schornak wrote:
>> Terje Mathisen wrote:
>>> Bernhard Schornak wrote:
>>>> And, since I have been researching this in depth for more than
>>>> a decade now, I do know (and can prove it experimentally) that
>>>> this kind of stack management is faster than abusing rBP. As it
>>>
>>> This is where you are wrong!
>>
>> Am I? That implies you are an entity empowered to determine what
>> *has to be* wrong and what *has to be* right...
>
> Sorry, we're getting into an argument because English isn't the
> native language of either of us. :-(

I'm just pi**ed by suggestive statements with no real content.
It's not language-dependent, and I tend to accentuate things... ;)

(Therefore, I can understand if you're pi**ed by my reply, and I
apologise if my wording insulted you. Please note that I use the
twinkling eye as a marker not to take the preceding sentence(s)
too seriously.)

> The statement above was meant to be taken together with the
> following:
>>
>>> Not specifically, i.e. using a single ESP update followed by MOV
>>> is probably faster on many cpus, but not in general:
>
> I.e. I _accept_ that you have been measuring speedups!

I did. It would have been stupid to post faked claims, knowing
that everyone can easily verify them with simple testing tools.

> Further down here it does seem like you have been doing these
> measurements as isolated micro benchmarks, not by modifying
> gcc/clang to generate different function prolog/epilog code for
> a big application.

I never had the idea to modify GCC. It is way too complex, and I
want to write my own stuff in this lifetime.
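For readers skimming the archive, the two prologue/epilogue styles under
discussion can be sketched as follows. This is an illustration added for
clarity, not code from either poster; NASM-style x86-64 syntax, and the
register choice and frame size are arbitrary assumptions.

```asm
; Style 1: classic PUSH/POP prologue and epilogue.
; Compact encoding, but every PUSH/POP implicitly updates rSP.
func_push:
        push    rbp
        push    rbx
        push    r12
        ; ... function body ...
        pop     r12
        pop     rbx
        pop     rbp
        ret

; Style 2: a single rSP adjustment followed by plain MOVs.
; One stack-pointer update; the stores go to consecutive,
; ascending addresses.
func_mov:
        sub     rsp, 24
        mov     [rsp],      rbp
        mov     [rsp + 8],  rbx
        mov     [rsp + 16], r12
        ; ... function body ...
        mov     rbp, [rsp]
        mov     rbx, [rsp + 8]
        mov     r12, [rsp + 16]
        add     rsp, 24
        ret
```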
Moreover, hacking GCC would not help to get rid of the dirty
programming practices of all major operating systems - therefore,
I went the other way and developed a bunch of libraries following
my ruleset. I admit my programs are tiny compared to fully blown
'application suites' with sizes inflated into the gibibyte range,
but: less might be more in many cases.

> What I tried to say is that this type of optimization (which is
> very similar to the memcpy() / memmove() optimizations that have
> happened over the last 10-15 years) is quite often worthwhile
> when a single instance is called many times from many locations,
> but that it is far harder to come up with something which also
> improves inline code, i.e. when those big prolog/epilog
> sequences are repeated over 1000's of functions.

The conceptual design of "Intelligent Design" is based on recent
processor technology, beginning with AMD's Athlon. The faster the
executed parts, the faster the application using them. The slower
the individual functions, the slower the entire program.

>>> Simply because long series of PUSH/POP are so common in both
>>> compiler and assembler code, cpu architects have a lot of good
>>> reasons for trying to make this sort of code significantly
>>> faster, and they almost certainly will do so. (The most obvious
>>> hw optimization is to regard multiple sequential PUSH or POP
>>> operations as a single macro op; this is effectively the same
>>> as using explicit MOV operations while avoiding the code size
>>> impact.)
>>
>> Have a look at the AMD and iNTEL optimisation guides regarding
>> write combining (WC). Both tell you to prefer full 64-byte
>> writes when you want to gain immense improvements.
>> The WC logic is triggered by consecutive writes to ascending
>> memory locations. When you're forcing a WC cycle via
>
> This is yet another temporary target! I.e. likely to change
> between cpu generations.

Yes. Probably the reason why iNTEL finally jumped on this train,
too. Once established, improvements don't 'vanish'. They might be
replaced by better ones, but: my libraries aren't static and are
updated to the available technologies whenever they change.

> If you push enough data on the stack that you have to worry
> about actual memory transfer rates, instead of just the L1
> cache, then you probably have bigger problems. WC logic is only
> important when you stream data between memory buffers, in which
> case you should probably use even wider (non-temporal) vector
> stores instead.
>
> At one point in time there was a micro benchmark operating on
> big arrays where the fastest implementation (on an AMD?) was to
> first prefetch about 4KB of data into L1 cache by actually
> loading one byte from each cache line (no prefetch hint
> operations!), then do the desired operations in place, reading
> and writing the same 4KB, before finally using NT stores to
> copy the 4KB out to the target buffer.
>
> Each line worth of data was touched 6 times instead of just 2,
> but the final code ran 2-3 times faster. The key was that this
> code was completely RAM limited, so you could do arbitrary
> amounts of work inside the cpu as long as the RAM chips were
> accessed in optimal patterns.
>
> OK?

It depends. I am talking about improvements regarding one or two
cache lines; you're talking about shuffling around blocks of 4
KiB at a time. Obviously, one of us missed the point...
;)

(And: I never mentioned 'memory transfer rates' anywhere. That's
completely beside the point. It's about the execution time of a
WC cycle versus the execution time of a PUSH sequence. This will
not exceed 112 bytes (2 cache lines!), assuming rAX and rSP are
never saved on the stack.)

Just imagine how far it might influence the overall performance
of a program with thousands of functions performing thousands of
function calls while running. And then tell me again that it does
not matter whether a prologue and epilogue execute in 10 rather
than 20 clocks per call. (Maybe I misunderstood you here, but
this is how I 'interpreted' your replies.) As a matter of fact -
it does matter if you save 10 clocks per call. If you allow
shoddy work to be 'performed' at the beginning and end of every
function, you do not have to worry about optimisations for the
remaining parts.

Greetings from Augsburg

Bernhard Schornak

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
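The 4 KiB pattern Terje describes in the quoted text can be sketched
roughly like this. It is a reconstruction added for illustration, not
the original benchmark code; NASM-style x86-64 syntax, and `src`/`dst`
are hypothetical 16-byte-aligned 4 KiB buffers.

```asm
; Pass 1: pull 4 KiB into L1 by loading one byte per 64-byte
; cache line - a real load, no PREFETCH hint instructions.
        mov     rcx, 4096 / 64
        lea     rsi, [src]
warm:   mov     al, [rsi]
        add     rsi, 64
        dec     rcx
        jnz     warm

; Pass 2: do the desired operations in place on the cached 4 KiB.
;         ... (omitted) ...

; Pass 3: stream the result to the target buffer with NT stores,
; bypassing the caches on the write side.
        mov     rcx, 4096 / 16
        lea     rsi, [src]
        lea     rdi, [dst]
copy:   movdqa  xmm0, [rsi]        ; read from L1
        movntdq [rdi], xmm0        ; non-temporal store to dst
        add     rsi, 16
        add     rdi, 16
        dec     rcx
        jnz     copy
        sfence                     ; order the NT stores before reuse
```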
(c) 1994, bbs@darkrealms.ca