comp.arch
Paul Clayton to MitchAlsup
Re: store to wide load forwarding
08 Feb 26 15:11:44
   From: paaronclayton@gmail.com   
      
   On 2/4/26 8:48 PM, MitchAlsup wrote:   
   >   
   > anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >   
   >> I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)   
   >> from gcc's auto-vectorization for the bubble-sort benchmark of John   
   >> Hennessy's collection of small integer benchmarks.   
   >   
   > Have you considered how hard it is to track part of a register's   
   > contents {RATHER THAN "In Its Entirety"} ??   
      
   If half-register concatenation is not supported, I think one   
   could avoid a significant amount of complexity. Even supporting   
   concatenation only for stores might be similar to ARM's store   
   pair support. If a RAT is used, one would double the number of   
   entries. If full-sized values are allocated two distinct half-   
   sized physical registers (to reduce physical size), they would   
   be similar to paired register operations. If half-sized values   
   are allocated to full-sized registers, a little simplification   
   might be possible. Allocating and freeing twice as many physical   
   registers per cycle seems rather painful.   
      
   With full-sized registers, the lower of a pair of RAT entries
   could be used for both full-sized and lower-half operands. A
   concatenation error could then be detected when a full-sized
   operand's lower-half RAT entry does not match its upper-half
   entry (or when the upper-half entry is not zero, depending on
   whether one writes full values by replication or by "zero
   extension").
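
   As a rough illustration of the replication variant, here is a
   minimal Python sketch. The class and method names (PairedRAT,
   write_full, read_full, and so on) are made up for this example,
   and it ignores freeing physical registers entirely; it only shows
   how a mismatch between the two entries flags a full-sized read
   that would need concatenation.

```python
class ConcatenationNeeded(Exception):
    """A full-sized read found halves that live in different physical registers."""


class PairedRAT:
    """Toy rename table with a lower and an upper entry per architectural register.

    Assumption: half-sized values are allocated full-sized physical registers,
    and a full-sized write replicates its tag into both entries.
    """

    def __init__(self, num_arch_regs):
        # Initially each architectural register maps to one physical register.
        self.lower = list(range(num_arch_regs))
        self.upper = list(range(num_arch_regs))
        self.next_phys = num_arch_regs  # no free list in this sketch

    def _alloc(self):
        tag = self.next_phys
        self.next_phys += 1
        return tag

    def write_full(self, reg):
        # Replication: both entries point at the same new physical register.
        tag = self._alloc()
        self.lower[reg] = tag
        self.upper[reg] = tag
        return tag

    def write_lower(self, reg):
        tag = self._alloc()
        self.lower[reg] = tag
        return tag

    def write_upper(self, reg):
        tag = self._alloc()
        self.upper[reg] = tag
        return tag

    def read_lower(self, reg):
        return self.lower[reg]

    def read_upper(self, reg):
        return self.upper[reg]

    def read_full(self, reg):
        # The lower entry serves full-sized operands; a mismatch with the
        # upper entry means the value would have to be concatenated.
        if self.lower[reg] != self.upper[reg]:
            raise ConcatenationNeeded(f"r{reg}: halves renamed separately")
        return self.lower[reg]
```

   With the "zero extension" variant, write_full would instead store
   a reserved null tag in the upper entry and read_full would check
   for that tag.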
      
   If the point is to provide more named values, concatenation does   
   not seem to be important.   
      
   > Consider a string of x86 instructions that write to different bytes   
   > of the same register.   
   > a) do you want to blow-up the forwarding path by 4× ??   
   > b) do you want each forwarding path-portion to select between   
   >     4 places in any result ??   
      
   If one limits supported operations to add/subtract and the
   bitwise logical operations, it seems that one could handle
   operations that use only upper halves or only lower halves for
   sources and destinations without much extra logic. Operations
   with upper and lower sources and an upper destination might only
   need to be converted to a shift-by-32-and-operate on the lower
   operand, and perhaps some of the time for carry propagation could
   be used to cover the shift latency? Mixed inputs with a lower
   output could not hide the extra shift latency; either not
   providing such operations or giving them two-cycle latency
   to allow a post-calculation 32-bit shift might be acceptable.
      
   (Variable shift might be allowed if the variable is in a lower   
   half.)   
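
   To make the arithmetic concrete, here is a small Python model of
   the two mixed-half add cases on 64-bit registers. The function
   names are invented for illustration and this says nothing about
   timing; it only shows which shifts are needed and where.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF
HIGH32 = MASK64 ^ MASK32  # mask selecting the upper 32-bit lane


def upper(reg):
    """Upper 32 bits of a 64-bit register value."""
    return (reg >> 32) & MASK32


def lower(reg):
    """Lower 32 bits of a 64-bit register value."""
    return reg & MASK32


def add_upper_mixed(a, b):
    """upper(a) + lower(b), with the result left in the upper lane.

    The lower operand is pre-shifted by 32 (shift-by-32-and-operate).
    The lower lanes of both adder inputs are zero, so no carry enters
    the upper lane from below.
    """
    shifted_b = lower(b) << 32
    return ((a & HIGH32) + shifted_b) & MASK64


def add_lower_mixed(a, b):
    """lower(a) + upper(b), with the result delivered in the lower lane.

    Computed in the upper lane, then shifted down afterwards; this is
    the post-calculation 32-bit shift that might cost a second cycle.
    """
    total = ((lower(a) << 32) + (b & HIGH32)) & MASK64
    return total >> 32
```

   For example, upper(add_upper_mixed(0x00000005FFFFFFFF,
   0x1234567800000003)) is 8: the upper half of the first operand
   (5) plus the lower half of the second (3).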
      
   > My guess is no. Thus, quit using partial registers and get on with life.   
      
   I have certainly not thought deeply about the costs and probably   
   lack sufficient knowledge of hardware design to make reasonable   
   estimates. Yet for z/Architecture IBM chose to support half-sized
   operations to increase the number of independent values
   that can be in registers, admittedly because of the encoding   
   constraint of a legacy architecture. If I recall correctly, AMD   
   also chose to support double-"native" width of SIMD by using   
   twice as many operands, which is a purely microarchitectural   
   choice prioritizing resources for "native" width while   
   supporting instructions with twice the SIMD width.   
      