From: david.brown@hesbynett.no   
      
   On 28/11/2025 12:12, Michael S wrote:   
   > On Fri, 28 Nov 2025 09:46:56 +0100   
   > David Brown wrote:   
   >   
   >> On 27/11/2025 23:15, Michael S wrote:   
   >>> On Thu, 27 Nov 2025 21:15:53 +0100   
   >>> David Brown wrote:   
   >>>   
   >>>> On 27/11/2025 15:02, Michael S wrote:   
   >>>>> On Thu, 27 Nov 2025 14:02:38 +0100   
   >>>>> David Brown wrote:   
   >>>>>   
   >>>>   
   >>>>>   
   >>>>> MSVC compilers compile your code and produce a correct result,
   >>>>> but the code looks less nice:
   >>>>> 0000000000000000 :
   >>>>>    0:  f2 0f 11 44 24 08   movsd  %xmm0,0x8(%rsp)
   >>>>>    6:  48 8b 44 24 08      mov    0x8(%rsp),%rax
   >>>>>    b:  48 c1 e8 34         shr    $0x34,%rax
   >>>>>    f:  25 ff 07 00 00      and    $0x7ff,%eax
   >>>>>   14:  c3                  ret
   >>>>>   
   >>>>> Although on old AMD processors it is likely faster than the nicer
   >>>>> code generated by gcc and clang. On newer processors the gcc code
   >>>>> is likely a bit better, but the difference is unlikely to be
   >>>>> detected by simple measurements.
   >>>>   
   >>>> I think it is unlikely that this version - moving from xmm0 to rax   
   >>>> via memory instead of directly - is faster on any processor. But I   
   >>>> fully agree that it is unlikely to be a measurable difference in   
   >>>> practice.   
   >>>   
   >>> I wonder how you have the nerve "to think" about things that you
   >>> have absolutely no idea about?
   >>   
   >> I think about many things - and these are things I /do/ know about.   
   >> But I don't know all the details, and am happy to be corrected and   
   >> learn more.   
   >>   
   >>>   
   >>> Instead of "thinking" you could just as well open the Optimization
   >>> Reference manuals of the AMD Bulldozer family or of Bobcat. Or read
   >>> Agner Fog's instruction tables. A move from XMM to GPR on these
   >>> processors is very slow: 8 clocks on BD, 7 on BbC.
   >>>   
   >>   
   >> Okay. But storing data to memory from xmm0 is also going to be slow,   
   >> and loading it to rax from memory is going to be slow. I am no
   >> expert on the x86 world or at reading Fog's tables, but it looks to me
   >> that on a Bulldozer, storing from xmm0 to memory has a latency of 6   
   >> cycles and reading the memory into rax has a latency of 4 cycles.   
   >> That adds up to more than the 8 cycles for the direct register   
   >> transfer, and I expect (but do not claim to know for sure!) that the   
   >> dependency limits the scope for pipeline overlap - decode and address   
   >> calculations can be done, but the data can't be fetched until the   
   >> previous store is complete.   
   >>   
   >> So all in all, my estimate was, I think, quite reasonable. There may   
   >> be unusual circumstances on particular cores if the instruction   
   >> scheduling and pipelining, combined with the stack engine, make that   
   >> sequence faster than the single register move.   
   >>   
   >   
   > It seems you are correct in this particular case.
   > Latency tables, esp. those that are measured by software rather
   > than supplied by the designer, are problematic in the case of moves
   > between registers of different types, memory stores of all types, and
   > even memory loads, with the exception of memory loads into a GPR.
   > Agner explains why they are problematic in the preface to his tables.
   > In short, there is no direct way to measure these things in isolation,
   > so one has to measure the latency of a sequence of instructions and
   > then apply either guesswork or the manufacturer's docs to somehow
   > divide the combined latency into individual parts.
   >   
      
   Well, if even Agner thinks it is difficult, then I don't feel bad for   
   having trouble!   
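   
   Michael's point about why software-measured latencies involve guesswork
   can be sketched with a dependent-chain timing loop. This is my own
   illustration, not code from the thread, and the intrinsic-to-instruction
   mapping noted in the comments is only what compilers typically emit:
   
   ```c
   #include <stdio.h>
   #include <stdint.h>
   #include <x86intrin.h>   /* __rdtsc and the SSE2 move intrinsics */
   
   /* Time a dependent chain of GPR -> XMM -> GPR round trips.  Each
      iteration feeds the previous result into the next move, so the loop
      time divided by the iteration count approximates the *combined*
      latency of the two moves; splitting that into two individual
      latencies is exactly the guesswork step described above.  (An
      optimising compiler may fold the identity round trip away, so build
      with -O0 for a meaningful number.) */
   double roundtrip_cycles(int iters) {
       uint64_t x = 1;
       uint64_t t0 = __rdtsc();
       for (int i = 0; i < iters; i++) {
           __m128i v = _mm_cvtsi64_si128((long long)x); /* GPR -> XMM (movq) */
           x = (uint64_t)_mm_cvtsi128_si64(v);          /* XMM -> GPR (movq) */
       }
       uint64_t t1 = __rdtsc();
       return x == 1 ? (double)(t1 - t0) / iters : -1.0; /* -1: chain broken */
   }
   
   int main(void) {
       printf("~%.1f cycles per GPR->XMM->GPR round trip\n",
              roundtrip_cycles(1 << 20));
       return 0;
   }
   ```
   
   Only the per-round-trip figure is observable here; how many of those
   cycles belong to each direction is what the tables have to guess at.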
      
   > So, the best way is to go by the vendor's recommendations in the Opt.
   > Reference Manual.
   > There are no relevant recommendations for K8, unfortunately. I suspect
   > that all methods are slow there.
   > For Bobcat, there should be recommendations, but I don't have them and
   > am too lazy to look for them.
   >   
      
   Fair enough. It is not information that is likely to be useful to   
   anyone here, so it's all for fun and interest. I certainly wouldn't   
   want you to spend effort finding out the details just for me.   
      
   > For Family 10h (Barcelona and derivatives):   
   > "When moving data from a GPR to an MMX or XMM register, use separate   
   > store and load instructions to move the data first from the source   
   > register to a temporary location in memory and then from memory into   
   > the destination register, taking the memory latency into account when   
   > scheduling both stages of the load-store sequence.   
   >   
   > When moving data from an MMX or XMM register to a general-purpose   
   > register, use the MOVD instruction.   
   >   
   > Whenever possible, use loads and stores of the same data length. (See   
   > 5.3, “Store-to-Load Forwarding Restrictions” on page 74 for more
   > information.)"   
      
   How much does advice like this take into account surrounding code?   
   That's what makes generating optimal code /really/ hard. And it means   
   micro-optimising a short instruction sequence can be ineffective for   
   real-world code. After all, no one is actually interested in minimising   
   the number of nanoseconds it takes to extract the exponent of a floating   
   point number - the speed only matters if you are doing lots of these,   
   probably in a big loop with data moving into and out of memory all the time.   
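   
   For reference, the C source under discussion is not quoted in this part
   of the thread, so the following is a reconstruction of the kind of
   function that produces the MSVC disassembly quoted above (the function
   name is illustrative):
   
   ```c
   #include <stdint.h>
   #include <string.h>
   
   /* Extract the biased exponent field, bits 62..52, of an IEEE-754
      double.  The shift by 52 (0x34) and the 0x7ff mask match the
      shr/and pair in the disassembly. */
   unsigned get_exponent(double d) {
       uint64_t bits;
       memcpy(&bits, &d, sizeof bits);   /* well-defined bit copy */
       return (unsigned)((bits >> 52) & 0x7ff);
   }
   ```
   
   The `memcpy` is the portable way to read the double's bit pattern; it is
   precisely that double-to-integer move that the compilers implement
   either as a direct movd/movq or as the store/load pair shown above.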
      
   This stuff was all /so/ much easier when we used PICs and AVRs...
      
   >   
   > For Family 15h (Bulldozer and derivatives):
   > "When moving data from a GPR to an XMM register, use separate store and   
   > load instructions to move the data first from the source register to a   
   > temporary location in memory and then from memory into the destination   
   > register, taking the memory latency into account when scheduling both   
   > stages of the load-store sequence.   
   >   
   > When moving data from an XMM register to a general-purpose register,   
   > use the VMOVD instruction.   
   >   
   > Whenever possible, use loads and stores of the same data length. (See   
   > 6.3, “Store-to-Load Forwarding Restrictions” on page 98 for more
   > information.)"   
   >   
   > So, for both families, the vendor recommends a register move in the
   > direction from SIMD to GPR, and a store/load sequence in the direction
   > from GPR to SIMD.
   > The suspect point here is the specific mention of the VEX-encoded form
   > (VMOVD) in the case of BD. It can mean that the "legacy" (SSE-encoded)
   > form is slower, or it can mean nothing. I suspect the latter.
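   
   From C, both transfer directions come down to one intrinsic each, and
   which instruction sequence the manuals' advice implies is then the
   compiler's choice for the target CPU. A minimal sketch (function names
   are mine, not from the thread):
   
   ```c
   #include <stdint.h>
   #include <emmintrin.h>   /* SSE2: _mm_cvtsi128_si64, _mm_cvtsi64_si128 */
   
   /* XMM -> GPR: the direction where the quoted manuals recommend the
      plain register move (MOVD/MOVQ, or VMOVD/VMOVQ when VEX-encoded). */
   static inline uint64_t xmm_to_gpr(__m128i v) {
       return (uint64_t)_mm_cvtsi128_si64(v);
   }
   
   /* GPR -> XMM: the direction where the Family 10h/15h manuals recommend
      a store/load pair through memory; in C one writes the intrinsic and
      the compiler picks the actual sequence. */
   static inline __m128i gpr_to_xmm(uint64_t x) {
       return _mm_cvtsi64_si128((long long)x);
   }
   ```
   
   A round trip through both helpers must return the original value, which
   is also how a dependent measurement chain of the kind Agner uses is
   built.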
   >   
   >> I've now had a short look at the relevant table from Fog's site. My   
   >> conclusion from that is that the register move - though surprisingly   
   >> slow - is probably marginally faster than passing it through memory.   
   >> Perhaps if I spend enough time studying the details, I might find out   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   