[   home   |   bbs   |   files   |   messages   ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 242,202 of 243,242   
   David Brown to Michael S   
   Re: _BitInt(N)   
   28 Nov 25 09:46:56   
   
   From: david.brown@hesbynett.no   
      
   On 27/11/2025 23:15, Michael S wrote:   
   > On Thu, 27 Nov 2025 21:15:53 +0100   
   > David Brown  wrote:   
   >   
   >> On 27/11/2025 15:02, Michael S wrote:   
   >>> On Thu, 27 Nov 2025 14:02:38 +0100   
   >>> David Brown  wrote:   
   >>>   
   >>   
   >>>   
   >>> MSVC compilers compile your code and produce a correct result,   
   >>> but the code looks less nice:   
   >>> 0000000000000000 :   
   >>>      0:   f2 0f 11 44 24 08       movsd  %xmm0,0x8(%rsp)   
   >>>      6:   48 8b 44 24 08          mov    0x8(%rsp),%rax   
   >>>      b:   48 c1 e8 34             shr    $0x34,%rax   
   >>>      f:   25 ff 07 00 00          and    $0x7ff,%eax   
   >>>     14:   c3                      ret   
   >>>   
   >>> Although on old AMD processors it is likely faster than the nicer   
   >>> code generated by gcc and clang. On newer processors the gcc code   
   >>> is likely a bit better, but the difference is unlikely to be   
   >>> detected by simple measurements.   
   >>   
   >> I think it is unlikely that this version - moving from xmm0 to rax   
   >> via memory instead of directly - is faster on any processor.  But I   
   >> fully agree that it is unlikely to be a measurable difference in   
   >> practice.   
   >   
   > I wonder how you have the nerve "to think" about things that you   
   > have absolutely no idea about?   
      
   I think about many things - and these are things I /do/ know about.  But   
   I don't know all the details, and am happy to be corrected and learn more.   
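   
   (The source of the function under discussion isn't quoted in this part   
   of the thread, but disassembly of the shape shown above - shift right   
   by 0x34 = 52, then mask with 0x7ff - is what extracting the biased   
   exponent field of an IEEE 754 double produces. A plausible   
   reconstruction; the function name is mine, not from the thread:)   

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction of the function behind the quoted
   disassembly: extract the 11-bit biased exponent field of an
   IEEE 754 double.  "shr $0x34" is a shift by 52 bits; "$0x7ff"
   masks the 11 exponent bits. */
unsigned exponent_field(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* well-defined type pun in C */
    return (unsigned)((bits >> 52) & 0x7ff);
}
```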
      
   >   
   > Instead of "thinking" you could just as well open the Optimization   
   > Reference Manuals for the AMD Bulldozer family or for Bobcat, or   
   > read Agner Fog's instruction tables. A move from XMM to GPR on these   
   > processors is very slow: 8 clocks on BD, 7 on BbC.   
   >   
      
   Okay.  But storing data to memory from xmm0 is also going to be slow,   
   and loading it into rax from memory is going to be slow.  I am not an   
   expert on the x86 world or on reading Fog's tables, but it looks to   
   me as if, on a Bulldozer, storing from xmm0 to memory has a latency   
   of 6 cycles and reading the memory into rax has a latency of 4   
   cycles.  That adds up to more than the 8 cycles for the direct   
   register transfer, and I expect (but do not claim to know for sure!)   
   that the dependency limits the scope for pipeline overlap - decode   
   and address calculations can be done, but the data can't be fetched   
   until the previous store is complete.   
      
   So all in all, my estimate was, I think, quite reasonable.  There may   
   be unusual circumstances on particular cores where the instruction   
   scheduling and pipelining, combined with the stack engine, make that   
   sequence faster than the single register move.   
      
   I've now had a short look at the relevant table from Fog's site.  My   
   conclusion from that is that the register move - though surprisingly   
   slow - is probably marginally faster than passing it through memory.   
   Perhaps if I spend enough time studying the details, I might find out   
   more and discover that I was wrong.  But that would be an extraordinary   
   effort to learn about a meaningless little detail of a long-gone processor.   
      
   I am also fairly confident that the function as a whole will be   
   faster with the register move, since you will get better overlap and   
   superscalar execution across the call and return sequence when the   
   instructions in the middle don't access the stack.   
      
   Out of curiosity, I compiled the code with gcc and "-march=bdver1",   
   which I believe is the correct flag for that processor.  It generated   
   the register move version, but with a "vmovq" instruction instead of   
   "movq".  I don't know if there is any difference there - x86   
   instruction naming seems to have a certain degree of variance.   
   (gcc's models of scheduling, pipelining and timing for processors are   
   far from perfect, but the gcc folks do study Agner Fog's publications   
   as well as having contributors from AMD and Intel.)   
      
   More interesting, however, was that with "-march=bdver2" (up to   
   "-march=bdver4") gcc changed the "shr / and" sequence to a single   
   "bextr" instruction.  I didn't see that with other -march choices.   
   It seems the two-instruction shift-and-mask is faster than a single   
   bit-extract instruction on most x86 processors.   
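   
   (For reference, "bextr" performs exactly that shift-and-mask in one   
   instruction: a bit-field extract given a start bit and a length.   
   A portable C model of the extraction, sketched here so it doesn't   
   require BMI1 hardware - on real BMI1 parts the _bextr_u64() intrinsic   
   maps to the instruction itself:)   

```c
#include <stdint.h>

/* Portable model of the BEXTR bit-field extract: take "len" bits of
   "src" starting at bit "start".  The bdver2 codegen discussed above
   would use it with start = 52, len = 11, replacing "shr / and". */
uint64_t bextr_model(uint64_t src, unsigned start, unsigned len)
{
    if (start >= 64 || len == 0)
        return 0;
    uint64_t mask = (len >= 64) ? ~UINT64_C(0)
                                : (UINT64_C(1) << len) - 1;
    return (src >> start) & mask;
}
```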
      
   All in all, it is a lesson on how small details of architectures can   
   make a difference.   
      
   > BTW, AMD K8 has the opposite problem. Move from XMM to GPR is reasonably   
   > fast, but move from GPR to XMM is painfully slow.   
   >   
   > On the other hand, moves "via memory" are reasonably fast on these   
   > CPUs (except, maybe, Bobcat? I am not sure about it), because the   
   > data does not really travel through memory or through the cache.   
   > Load-store forwarding picks the data up directly from the store   
   > queue.   
   >   
      
   Yes, and there can be even more specialised short-cuts for stack data.   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca