Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.c    |    Meh, in C you gotta define EVERYTHING    |    243,242 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 241,726 of 243,242    |
|    David Brown to bart    |
|    Re: New and improved version of cdecl    |
|    31 Oct 25 13:10:38    |
From: david.brown@hesbynett.no

On 31/10/2025 00:23, bart wrote:
> People into compilers are obsessed with optimisation. It can be a
> necessity for languages that generate lots of redundant code that needs
> to be cleaned up, but not so much for C.
>
> Typical differences of between -O0 and -O2 compiled code can be 2:1.
>
> However even the most terrible native code will be a magnitude faster
> than interpreted code.
>

You live in a world of x86 (with brief visits to 64-bit ARM). You used
to work with smaller processors and lower level code, but seem to have
forgotten that long ago.

A prime characteristic of modern x86 processors is that they are
extremely good at running extremely bad code. They are targeted at
systems where being able to run old binaries is essential. A great deal
of the hardware in an x86 cpu core is there to handle poorly optimised
code - lots of jumps and function calls get predicted and speculated,
data that is pushed onto and pulled off the stack gets all kinds of fast
paths and short-circuits, and so on. And then there is the memory - if
code has to wait for data from ram, the cpu can happily execute hundreds
of cycles of unnecessary unoptimised code without making any difference
to the final speed.

Big ARM processors - such as on Pi's - have the same effects, though to
a somewhat lesser extent.

A prime characteristic of user programs on PC's and other "big" systems
is that a lot of the time is spent doing things other than running the
user code - file I/O, screen display, OS calls, or code in static
libraries, DLLs (or SOs), etc. That stuff is completely unaffected by
the efficiency of the user code - that's why interpreted or VM code is
fast enough for a very wide range of use-cases.
And if you are working with Windows systems with an MS DLL for the C
runtime library (as used by some C toolchains on Windows, but not all),
then you can get more distortions. If you have a call to memcpy that
uses an external DLL, that is going to take perhaps 500 clock cycles
even for a small fixed size of memcpy (assuming all code and data is in
cache). The user code for the call might be 10 cycles or 20 cycles
depending on the optimisation - compiler optimisation makes no
measurable difference here. But if the toolchain uses a static library
for memcpy and can optimise locally to replace the call, the static call
to general memcpy code might take 200 cycles while the local code takes
10 cycles. Suddenly the difference between optimising and
non-optimising is huge.

Then there is the type of code you are dealing with. Some code is very
cpu intensive and can benefit from optimisations, other code is not.

And optimisation is not just a matter of choosing -O0 or -O2 flags. It
can mean thought and changes in the source code (some standard C
changes, like use of "restrict" parameters, some compiler-specific
changes like gcc attributes or builtins, and some target-specific ones
like organising data to fit cache usage). And it can mean careful flag
choices - different specific optimisations suitable for the code at
hand, and target-related flags for enabling more target features. I am
entirely confident that you have done none of these things when
testing. That's not necessarily a bad thing in itself, when looking at
widely portable source compiled to generic binaries, but it gives a very
unrealistic picture of compiler optimisations and what can be achieved
by someone who knows how to work with their compiler.
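As a sketch of what that local replacement looks like in practice, take a
small fixed-size memcpy (the struct and function names here are made up
for illustration):

```c
#include <string.h>

struct point { double x, y; };

/* With optimisation enabled, gcc and clang typically replace this
   fixed-size memcpy with a couple of register moves - no library or
   DLL call at all.  Without that local replacement, the same source
   pays for a full call into a general-purpose memcpy routine. */
void copy_point(struct point *dst, const struct point *src)
{
    memcpy(dst, src, sizeof *src);
}
```

Compile with "gcc -O2 -S" and look at the assembly if you want to see
whether your toolchain makes the call or replaces it inline.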
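For a concrete example of the kind of standard C source change I mean,
here is "restrict" on a trivial loop (the function name is just for
illustration):

```c
/* Declaring the pointers "restrict" promises the compiler that dst
   and src never overlap.  That lets -O2/-O3 vectorise this loop,
   instead of conservatively reloading src[i] on every iteration in
   case a store through dst changed it. */
void scale(float *restrict dst, const float *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```

Drop the "restrict" qualifiers and compare the generated code - on most
targets the difference is easy to see.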
All this conspires to give you this 2:1 ratio that you regularly state
for the difference between optimised code and unoptimised code - gcc -O2
and gcc -O0.

In reality, people can often achieve far greater ratios for the type of
code where performance matters and where it is achievable. Someone
working on game engines on an x86 would probably expect at least 10
times difference between the flags they use, and no optimisation flags.
For the targets I use, which are (generally) not super-scalar,
out-of-order, etc., five to ten times difference is not uncommon. And
when you throw C++ or other modern languages into the mix (remember, gcc
and clang/llvm are not simple C compilers), the benefits of inlining and
other inter-procedural optimisations can easily be an order of
magnitude. (This is one reason why gcc and clang enable a number of
optimisations, including at least inlining of functions marked
appropriately, even with no optimisation flags specified.)

You can continue to believe that high-end toolchains are no more than
twice as good as your own compiler or tcc, if you like. (And if they
give you all the performance and features that you need, fine.) Those of
us who want more from our tools, and know how to get it, know better.

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca