From: already5chosen@yahoo.com   
      
   On Fri, 28 Nov 2025 09:46:56 +0100   
   David Brown wrote:   
      
   > On 27/11/2025 23:15, Michael S wrote:   
   > > On Thu, 27 Nov 2025 21:15:53 +0100   
   > > David Brown wrote:   
   > >    
   > >> On 27/11/2025 15:02, Michael S wrote:    
   > >>> On Thu, 27 Nov 2025 14:02:38 +0100   
   > >>> David Brown wrote:   
   > >>>    
   > >>    
   > >>>   
   > >>> MSVC compilers compile your code and produce correct result, but   
   > >>> the code   
   > >>> looks less nice:   
   > >>> 0000000000000000 :   
   > >>> 0: f2 0f 11 44 24 08 movsd %xmm0,0x8(%rsp)   
   > >>> 6: 48 8b 44 24 08 mov 0x8(%rsp),%rax   
   > >>> b: 48 c1 e8 34 shr $0x34,%rax   
   > >>> f: 25 ff 07 00 00 and $0x7ff,%eax   
   > >>> 14: c3 ret   
   > >>>   
   > >>> Although on old AMD processors it is likely faster than nicer code   
   > >>> generated by gcc and clang. On newer processor gcc code is likely   
   > >>> a bit better, but the difference is unlikely to be detected by   
   > >>> simple measurements.    
   > >>   
   > >> I think it is unlikely that this version - moving from xmm0 to rax   
   > >> via memory instead of directly - is faster on any processor. But I   
   > >> fully agree that it is unlikely to be a measurable difference in   
   > >> practice.    
   > >    
   > > I wonder, how do you have a nerve "to think" about things that you   
   > > have absolutely no idea about?    
   >    
   > I think about many things - and these are things I /do/ know about.   
   > But I don't know all the details, and am happy to be corrected and   
   > learn more.   
   >    
   > >    
   > > Instead of "thinking" you could just as well open Optimization   
   > > Reference manuals of AMD Bulldozer family or of Bobcat. Or to read   
   > > Agner Fog's instruction tables. Move from XMM to GPR on these   
   > > processors is very slow: 8 clocks on BD, 7 on BbC.   
   > >    
   >    
   > Okay. But storing data to memory from xmm0 is also going to be slow,    
   > and loading it to rax from memory is going to be slow. I am not an    
   > expert at the x86 world or reading Fog's tables, but it looks to me   
   > that on a Bulldozer, storing from xmm0 to memory has a latency of 6   
   > cycles and reading the memory into rax has a latency of 4 cycles.   
   > That adds up to more than the 8 cycles for the direct register   
   > transfer, and I expect (but do not claim to know for sure!) that the   
   > dependency limits the scope for pipeline overlap - decode and address   
   > calculations can be done, but the data can't be fetched until the   
   > previous store is complete.   
   >    
   > So all in all, my estimate was, I think, quite reasonable. There may   
   > be unusual circumstances on particular cores if the instruction   
   > scheduling and pipelining, combined with the stack engine, make that   
   > sequence faster than the single register move.   
   >    
      
   It seems you are correct in this particular case.
   Latency tables, especially those measured by software rather than
   supplied by the designer, are problematic for moves between registers
   of different types, for memory stores of all types, and even for
   memory loads, with the exception of a memory load into a GPR. Agner
   explains why they are problematic in the preface to his tables. In
   short, there is no direct way to measure these things in isolation,
   so one has to measure the latency of a sequence of instructions and
   then apply either guesswork or the manufacturer's docs to somehow
   divide the combined latency into individual parts.
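
   For reference, the function whose code generation is being compared
   presumably looks something like the sketch below (the original source
   is not quoted in this message, so the name and exact form are my
   assumptions): extract the 11-bit biased exponent field of a double.
   The shr $0x34 is the >> 52, and the and $0x7ff is the exponent mask.

   ```c
   #include <stdint.h>
   #include <string.h>
   #include <stdio.h>

   /* Sketch (name assumed): return the 11-bit biased exponent of a double.
      memcpy is the well-defined way to type-pun in C; compilers turn it
      into either a movq between register files or a store/load pair. */
   static unsigned get_exponent(double d)
   {
       uint64_t bits;
       memcpy(&bits, &d, sizeof bits);
       return (unsigned)((bits >> 52) & 0x7ff);  /* shr $0x34; and $0x7ff */
   }

   int main(void)
   {
       printf("%u %u %u\n", get_exponent(1.0), get_exponent(2.0),
              get_exponent(0.0));
       return 0;
   }
   ```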
      
   So, the best way is to go by the vendor's recommendations in the
   Optimization Reference Manual.
   There are no relevant recommendations for K8, unfortunately. I suspect
   that all methods are slow there.
   For Bobcat, there should be recommendations, but I don't have them and
   am too lazy to look for them.
      
   For Family 10h (Barcelona and derivatives):   
   "When moving data from a GPR to an MMX or XMM register, use separate   
   store and load instructions to move the data first from the source   
   register to a temporary location in memory and then from memory into   
   the destination register, taking the memory latency into account when   
   scheduling both stages of the load-store sequence.   
      
   When moving data from an MMX or XMM register to a general-purpose   
   register, use the MOVD instruction.   
      
   Whenever possible, use loads and stores of the same data length. (See
   5.3, "Store-to-Load Forwarding Restrictions" on page 74 for more
   information.)"
      
   For Family 15h (Bulldozer and derivatives):
   "When moving data from a GPR to an XMM register, use separate store and   
   load instructions to move the data first from the source register to a   
   temporary location in memory and then from memory into the destination   
   register, taking the memory latency into account when scheduling both   
   stages of the load-store sequence.    
      
   When moving data from an XMM register to a general-purpose register,   
   use the VMOVD instruction.    
      
   Whenever possible, use loads and stores of the same data length. (See
   6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
   information.)"
      
   So, for both families, the vendor recommends a register move in the
   SIMD-to-GPR direction and a store/load sequence in the GPR-to-SIMD
   direction.
   The suspect point here is the specific mention of the VEX-encoded form
   (VMOVD) in the case of BD. It could mean that the "legacy"
   (SSE-encoded) form is slower, or it could mean nothing. I suspect the
   latter.
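
   At the C level, both directions are just type punning via memcpy; it
   is the compiler that chooses between a direct (v)movq/(v)movd between
   register files and a store/load pair, per the family-specific
   recommendations quoted above. A minimal sketch (function names are
   mine, not from the thread):

   ```c
   #include <stdint.h>
   #include <string.h>
   #include <stdio.h>

   /* XMM -> GPR direction: vendor recommends a direct register move
      ((V)MOVD/(V)MOVQ) on families 10h and 15h. */
   static uint64_t double_to_bits(double d)
   {
       uint64_t u;
       memcpy(&u, &d, sizeof u);
       return u;
   }

   /* GPR -> XMM direction: vendor recommends a store/load pair through
      a temporary memory location on these families. */
   static double bits_to_double(uint64_t u)
   {
       double d;
       memcpy(&d, &u, sizeof d);
       return d;
   }

   int main(void)
   {
       uint64_t one = 0x3FF0000000000000ULL;  /* bit pattern of 1.0 */
       printf("%g %llx\n", bits_to_double(one),
              (unsigned long long)double_to_bits(1.0));
       return 0;
   }
   ```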
      
   > I've now had a short look at the relevant table from Fog's site. My    
   > conclusion from that is that the register move - though surprisingly    
   > slow - is probably marginally faster than passing it through memory.    
   > Perhaps if I spend enough time studying the details, I might find out    
   > more and discover that I was wrong. But that would be an   
   > extraordinary effort to learn about a meaningless little detail of a   
   > long-gone processor.   
   >    
   > I am also fairly confident that the function as a whole will be   
   > faster with the register move since you will get better overlap and    
   > superscaling with the call and return sequence when the instructions   
   > in the middle don't access the stack.   
   >    
   > Of curiosity, I compiled the code with gcc and "-march=bdver1", which   
   > I believe is the correct flag for that processor. It generated the    
   > register move version, but with a "vmovq" instruction instead of   
   > "movq". I don't know if there is any difference there - x86   
   > instruction naming seems to have a certain degree of variance.   
   > (gcc's models of scheduling, pipelining and timing for processors is   
   > far from perfect, but the gcc folks do study Agner Fog's publications   
   > as well as having contributors from AMD and Intel.)   
   >    
   > More interesting, however, was that with "-march=bdver2" (up to   
   > bdver4) gcc changed the "shr / and" sequence to a single "bextr"   
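
   For reference, bextr (BMI1, which bdver2/Piledriver introduced)
   extracts a bit field in one instruction: start position in bits 7:0
   of the control operand, field length in bits 15:8. A C model of what
   it computes here (my sketch, matching the shr $0x34 / and $0x7ff pair
   with control 0x0b34):

   ```c
   #include <stdint.h>
   #include <stdio.h>

   /* C model of BEXTR: take `len` bits of `src` starting at bit `start`.
      Replaces the two-instruction shr/and sequence with one operation. */
   static uint64_t bextr_model(uint64_t src, unsigned start, unsigned len)
   {
       if (start >= 64 || len == 0)
           return 0;
       uint64_t shifted = src >> start;
       return (len >= 64) ? shifted : shifted & ((1ULL << len) - 1);
   }

   int main(void)
   {
       uint64_t bits = 0x3FF0000000000000ULL;  /* bit pattern of 1.0 */
       /* start = 0x34 (52), len = 0x0b (11), i.e. control word 0x0b34 */
       printf("%llu\n", (unsigned long long)bextr_model(bits, 0x34, 0x0b));
       return 0;
   }
   ```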
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   