Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.c    |    Meh, in C you gotta define EVERYTHING    |    243,242 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 241,726 of 243,242    |
|    David Brown to bart    |
|    Re: New and improved version of cdecl    |
|    31 Oct 25 13:10:38    |
From: david.brown@hesbynett.no

On 31/10/2025 00:23, bart wrote:
> People into compilers are obsessed with optimisation. It can be a
> necessity for languages that generate lots of redundant code that needs
> to be cleaned up, but not so much for C.
>
> Typical differences of between -O0 and -O2 compiled code can be 2:1.
>
> However even the most terrible native code will be a magnitude faster
> than interpreted code.
>

You live in a world of x86 (with brief visits to 64-bit ARM). You used
to work with smaller processors and lower level code, but seem to have
forgotten that long ago.

A prime characteristic of modern x86 processors is that they are
extremely good at running extremely bad code. They are targeted at
systems where being able to run old binaries is essential. A great deal
of the hardware in an x86 cpu core is there to handle poorly optimised
code - lots of jumps and function calls get predicted and speculated,
data that is pushed onto and pulled off the stack gets all kinds of fast
paths and short-circuits, and so on. And then there is the memory - if
code has to wait for data from ram, the cpu can happily execute hundreds
of cycles of unnecessary unoptimised code without making any difference
to the final speed.

Big ARM processors - such as on Pi's - have the same effects, though to
a somewhat lesser extent.

A prime characteristic of user programs on PC's and other "big" systems
is that a lot of the time is spent doing things other than running the
user code - file I/O, screen display, OS calls, or code in static
libraries, DLLs (or SOs), etc. That stuff is completely unaffected by
the efficiency of the user code - that's why interpreted or VM code is
fast enough for a very wide range of use-cases.
And if you are working with Windows systems with an MS DLL for the C
runtime library (as used by some C toolchains on Windows, but not all),
then you can get more distortions. If you have a call to memcpy that
uses an external DLL, that is going to take perhaps 500 clock cycles
even for a small fixed size of memcpy (assuming all code and data is in
cache). The user code for the call might be 10 cycles or 20 cycles
depending on the optimisation - compiler optimisation makes no
measurable difference here. But if the toolchain uses a static library
for memcpy and can optimise locally to replace the call, the static call
to general memcpy code might take 200 cycles while the local code takes
10 cycles. Suddenly the difference between optimising and
non-optimising is huge.

Then there is the type of code you are dealing with. Some code is very
cpu intensive and can benefit from optimisations, other code is not.

And optimisation is not just a matter of choosing -O0 or -O2 flags. It
can mean thought and changes in the source code (some standard C
changes, like use of "restrict" parameters, some compiler-specific
changes like gcc attributes or builtins, and some target-specific ones
like organising data to fit cache usage). And it can mean careful flag
choices - different specific optimisations suitable for the code at
hand, and target-related flags for enabling more target features. I am
entirely confident that you have done none of these things when
testing. That's not necessarily a bad thing in itself, when looking at
widely portable source compiled to generic binaries, but it gives a very
unrealistic picture of compiler optimisations and what can be achieved
by someone who knows how to work with their compiler.
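As a sketch of what that local replacement looks like in practice, take a
small fixed-size memcpy (the struct and function names here are made up
for illustration):

```c
#include <string.h>

struct point { double x, y; };

/* With optimisation enabled, gcc and clang typically replace this
   fixed-size memcpy with a couple of register moves - no library or
   DLL call at all.  Without that local replacement, the same source
   pays for a full call into a general-purpose memcpy routine. */
void copy_point(struct point *dst, const struct point *src)
{
    memcpy(dst, src, sizeof *src);
}
```

Compile with "gcc -O2 -S" and look at the assembly if you want to see
whether your toolchain makes the call or replaces it inline.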
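For a concrete example of the kind of standard C source change I mean,
here is "restrict" on a trivial loop (the function name is just for
illustration):

```c
/* Declaring the pointers "restrict" promises the compiler that dst
   and src never overlap.  That lets -O2/-O3 vectorise this loop,
   instead of conservatively reloading src[i] on every iteration in
   case a store through dst changed it. */
void scale(float *restrict dst, const float *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```

Drop the "restrict" qualifiers and compare the generated code - on most
targets the difference is easy to see.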
All this conspires to give you this 2:1 ratio that you regularly state
for the difference between optimised code and unoptimised code - gcc -O2
and gcc -O0.

In reality, people can often achieve far greater ratios for the type of
code where performance matters and where it is achievable. Someone
working on game engines on an x86 would probably expect at least 10
times difference between the flags they use, and no optimisation flags.
For the targets I use, which are (generally) not super-scalar,
out-of-order, etc., five to ten times difference is not uncommon. And
when you throw C++ or other modern languages into the mix (remember, gcc
and clang/llvm are not simple C compilers), the benefits of inlining and
other inter-procedural optimisations can easily be an order of
magnitude. (This is one reason why gcc and clang enable a number of
optimisations, including at least inlining of functions marked
appropriately, even with no optimisation flags specified.)

You can continue to believe that high-end toolchains are no more than
twice as good as your own compiler or tcc, if you like. (And if they
give you all the performance and features that you need, fine.) Those of
us who want more from our tools, and know how to get it, know better.

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca