Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.c    |    Meh, in C you gotta define EVERYTHING    |    243,242 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 241,730 of 243,242    |
|    bart to David Brown    |
|    Re: New and improved version of cdecl (1    |
|    31 Oct 25 16:34:22    |
From: bc@freeuk.com

On 31/10/2025 12:10, David Brown wrote:
> On 31/10/2025 00:23, bart wrote:
>
>> People into compilers are obsessed with optimisation. It can be a
>> necessity for languages that generate lots of redundant code that
>> needs to be cleaned up, but not so much for C.
>>
>> Typical differences between -O0 and -O2 compiled code can be 2:1.
>>
>> However, even the most terrible native code will be an order of
>> magnitude faster than interpreted code.
>>
>
> You live in a world of x86 (with brief visits to 64-bit ARM). You used
> to work with smaller processors and lower-level code, but seem to have
> forgotten that long ago.
>
> A prime characteristic of modern x86 processors is that they are
> extremely good at running extremely bad code.

Yes. And? That means compilers don't need to be so clever!

> They are targeted at systems where being able to run old binaries is
> essential. A great deal of the hardware in an x86 cpu core is there to
> handle poorly optimised code - lots of jumps and function calls get
> predicted and speculated, data that is pushed onto and pulled off the
> stack gets all kinds of fast paths and short-circuits, and so on. And
> then there is the memory - if code has to wait for data from RAM, the
> cpu can happily execute hundreds of cycles of unnecessary unoptimised
> code without making any difference to the final speed.
>
> Big ARM processors - such as on Pi's - have the same effects, though
> to a somewhat lesser extent.
>
> A prime characteristic of user programs on PCs and other "big" systems
> is that a lot of the time is spent doing things other than running the
> user code - file I/O, screen display, OS calls, or code in static
> libraries, DLLs (or SOs), etc.
> That stuff is completely unaffected by the efficiency of the user
> code - that's why interpreted or VM code is fast enough for a very
> wide range of use-cases.

Yes. That's why interpreted/dynamic languages (the two usually go
together) are viable.

When I first introduced interpreted scripting to my apps (35 years
ago), I had a rough guideline that an interpreted version of a task
should ideally be no worse than half the speed of 100% native code.

My everyday text editor is interpreted, and I routinely edit 1-million-
line files without noticing any lag.

> And if you are working with Windows systems with an MS DLL for the C
> runtime library (as used by some C toolchains on Windows, but not
> all), then you can get more distortions. If you have a call to memcpy
> that uses an external DLL, that is going to take perhaps 500 clock
> cycles even for a small fixed size of memcpy (assuming all code and
> data is in cache). The user code for the call might be 10 cycles or
> 20 cycles depending on the optimisation - compiler optimisation makes
> no measurable difference here. But if the toolchain uses a static
> library for memcpy and can optimise locally to replace the call, the
> static call to general memcpy code might take 200 cycles while the
> local code takes 10 cycles. Suddenly the difference between
> optimising and non-optimising is huge.

(My language has a 'clear' operator. Inline code is then generated for
fixed-size objects.)

> Then there is the type of code you are dealing with. Some code is
> very cpu intensive and can benefit from optimisations, other code is
> not.
>
> And optimisation is not just a matter of choosing -O0 or -O2 flags.
To me, 'compiler' optimisation means getting my program faster /without
changing the source/. All I want to do is either enable or disable the
option.

A lot of my optimisations are to do with design choices in my language,
special features it might provide, and design choices in the
application.

Anything that can be done in the compiler is a bonus, but I don't rely
on it (other than the special case of generated C, see below).

> It can mean thought and changes in the source code (some standard C
> changes, like use of "restrict" parameters, some compiler-specific
> changes like gcc attributes or builtins, and some target-specific
> changes like organising data to fit cache usage).
>
> And it can mean careful flag choices - different specific
> optimisations suitable for the code at hand, and target-related flags
> for enabling more target features.

It sounds like a lot of work. I used to just use inline assembly and be
done with it!

> I am entirely confident that you have done none of these things when
> testing. That's not necessarily a bad thing in itself, when looking
> at widely portable source compiled to generic binaries, but it gives
> a very unrealistic picture of compiler optimisations and what can be
> achieved by someone who knows how to work with their compiler.
>
> All this conspires to give you the 2:1 ratio that you regularly state
> for the difference between optimised code and unoptimised code -
> gcc -O2 and gcc -O0.

If I'm giving figures that compare gcc -O0 to gcc -O2, then clearly
everything else must remain the same. Otherwise, why not compare two
entirely different algorithms while we're about it?
Anyway, I assume all the things you've mentioned have been incorporated
into the A68G makefiles, and it's still a pretty slow interpreter!
(Although the advanced features of the language probably don't help.)

However, one thing I did try the other day was to take the generated
makefile and change the -O2 flag to -O0. Building was a little faster
(60s instead of 90s), but my benchmark ran in 13s instead of 5s, so
2.6:1.

You seem to be suggesting the difference should be greater, but this is
someone else's codebase, and someone else's set of compiler flags,
other than the choice of -O0/-O2.

So, while I understand what you're saying, it doesn't apply if you are
building, running and measuring an existing codebase created by someone
else.

I *am* seeing figures of 2:1, or sometimes 3:1 or 4:1; the latter
usually when someone is trying to be too clever with intensive use of
macros that hide too many nested functions, so that inlining is needed
to get respectable speed.

> In reality, people can often achieve far greater ratios for the type
> of code where performance matters and where it is achievable. Someone
> working on game engines on an x86 would probably expect at least 10
> times difference between the flags they use and no optimisation
> flags. For the targets I use, which are (generally) not super-scalar,
> out-of-order, etc., five to ten times difference is not uncommon.

For the /applications/ I write (not silly benchmarks), and for x64, 2:1

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca