
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 131,002 of 131,241   
   Paul Clayton to Anton Ertl   
   Re: Accepting the Sense of Some of Mitch   
   08 Feb 26 16:42:56   
   
   From: paaronclayton@gmail.com   
      
   On 12/19/25 12:41 PM, Anton Ertl wrote:   
   > John Savard  writes:   
   >> On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:   
   >>> Or in other words, if you can decode K-instructions per cycle, you'd   
   >>> better be able to execute K-instructions per cycle--or you have a   
   >>> serious blockage in your pipeline.   
   >>   
   >> No.   
   >>   
   >> If you flipped "decode" and "execute" in that sentence above, I would 100%   
   >> agree. And maybe this _is_ just a typo.   
   >>   
   >> But if you actually did mean that sentence exactly as written, I would   
   >> disagree. This is why: I regard executing instructions as 'doing the   
   >> actual work' and decoding instructions as... some unfortunate trivial   
   >> overhead that can't be avoided.   
   >   
   > It does not matter what "the actual work" is and what isn't.  What   
   > matters is how expensive it is to make a particular part wider, and   
   > how paying that cost benefits the IPC.  At every step you add width to   
   > the part with the best benefit/cost ratio.   
   >   
   > And looking at recent cores, we see that, e.g., Skymont can decode   
   > 3x3=9 instructions per cycle, rename 8 per cycle, has 26 ports to   
   > functional units (i.e., can execute 26 uops in one cycle); I don't   
   > know how many instructions it can retire per cycle, but I expect that   
   > it is more than 8 per cycle.   
   >   
   > So the renamer is the bottleneck, and that's also the idea behind   
   > top-down microarchitecture analysis (TMA) for determining how software   
   > interacts with the microarchitecture.  That idea is coming out of   
   > Intel, but if Intel is finding it hard to make wider renamers rather   
   > than wider other parts, I expect that the rest of the industry also   
   > finds that hard (especially for architectures where decoding is   
   > cheaper), and (looking at ARM A64) where instructions with more   
   > demands on the renamer exist.   
      
   It is not clear to me that the renamer is the bottleneck merely
   because it is narrower; wider rename might not increase
   performance that much. The 9-instruction decode is a consequence
   of using three similar 3-wide decoders together with prediction
   of three targets (so taken branches can be handled). If targets
   are predicted once per group of three instructions, it makes
   sense for all the decoders to be the same width. Avoiding 4-wide
   decoders makes sense both to support more taken branches per
   cycle and to avoid the extra instruction-length-determination
   complexity.
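A toy model may make the clustered-decode point concrete. This is a sketch under assumptions of my own (one fetch block per cluster per cycle, a block ending at a predicted-taken branch or at the 3-wide cluster limit); the actual Skymont partitioning is not public, and the stream here is already in fetch order, so redirection to a branch target is not modeled.

```python
# Toy model of clustered decode: three 3-wide clusters, each taking
# one fetch block per cycle.  A block ends at a predicted-taken
# branch or when the cluster width is exhausted.  (Illustrative
# simplification, not the real Skymont mechanism.)

CLUSTERS = 3   # number of decode clusters
WIDTH = 3      # instructions per cluster

def decode_cycle(stream, taken):
    """stream: instruction addresses in fetch order.
    taken: set of addresses predicted as taken branches.
    Returns the per-cluster blocks decoded in one cycle."""
    blocks, i = [], 0
    for _ in range(CLUSTERS):
        block = []
        while i < len(stream) and len(block) < WIDTH:
            insn = stream[i]
            block.append(insn)
            i += 1
            if insn in taken:   # next cluster starts a new block
                break
        if block:
            blocks.append(block)
    return blocks

# Straight-line code: all three clusters fill -> 9 instructions/cycle.
print(decode_cycle(list(range(100, 112)), taken=set()))
# [[100, 101, 102], [103, 104, 105], [106, 107, 108]]

# A taken branch at 104 truncates the second cluster's block.
print(decode_cycle(list(range(100, 112)), taken={104}))
# [[100, 101, 102], [103, 104], [105, 106, 107]]
```

With no taken branches the three clusters together decode nine instructions per cycle; each predicted-taken branch ends a block early, which is why more (narrower) clusters tolerate more taken branches.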
      
   (Since rename is 8 _µops_ and decode is 9 _instructions_, the
   width difference is greater than the raw numbers suggest. I
   think x86 designs still do more instruction cracking than
   instruction fusion.)
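A back-of-envelope calculation shows the effect; the cracking ratio below is an arbitrary illustrative number, not a measured figure for any real core.

```python
# Decode width in instructions vs. rename width in uops.  If cracking
# outweighs fusion, decode's oversupply at the renamer is larger than
# the 9-vs-8 headline suggests.  (crack_ratio is made up.)
decode_insns = 9       # instructions decoded per cycle
rename_uops = 8        # uops renamed per cycle
crack_ratio = 1.25     # assumed uops per instruction after cracking/fusion

uops_from_decode = decode_insns * crack_ratio
print(uops_from_decode)                 # 11.25 uops/cycle offered to rename
print(uops_from_decode / rename_uops)   # 1.40625x oversupply at the renamer
```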
      
   A component can be small because it is expensive to make it
   larger or because there is little benefit in making it larger. I
   would consider the former a bottleneck for the designers, but
   the latter seems to imply that other constraints should be
   loosened before more design effort is applied to that component.
      
   I have not been paying attention to achieved IPC, but my
   impression is that for messy "integer" code achieved IPC was not
   up to 4. While the front end needs to process more instructions
   than are committed, due to branch mispredictions, it seems at
   least plausible that wider rename is not that helpful in
   general.
      
   Of course, supporting bursts makes sense (otherwise neither
   execute nor retirement would be so wide), but a more
   order-constrained front end may not benefit as much from
   enabling bursts of high activity.
      
   With banking and replication, a large number of RAT read and
   write ports could be supported, and if rename-width internal
   forwarding avoids RAT accesses (transferring from the free list
   instead), the port count could be reduced further. Merging
   reused source names could also reduce the RAT port count.
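A minimal sketch of the two port-saving ideas, under simplifying assumptions (no flags, no checkpoints, dests written once per group at the end): sources produced by an earlier µop in the same rename group are bypassed from the group itself and need no RAT read port, and duplicate source names within the group are merged into one read.

```python
# Rename-group internal forwarding plus source-name merging.
# (Sketch only; a real renamer also handles flags, partial registers,
# checkpoint/recovery, etc.)

def rename_group(group, rat, free_list):
    """group: list of (dest, srcs) architectural-register tuples.
    Returns (renamed uops, number of distinct RAT reads used)."""
    in_group = {}      # arch reg -> phys reg written earlier in this group
    rat_reads = set()  # distinct RAT reads (duplicates merged)
    renamed = []
    for dest, srcs in group:
        phys_srcs = []
        for s in srcs:
            if s in in_group:            # forwarded: no RAT read port
                phys_srcs.append(in_group[s])
            else:
                rat_reads.add(s)         # merged: one port per distinct name
                phys_srcs.append(rat[s])
        pd = free_list.pop(0)            # allocate from the free list
        in_group[dest] = pd
        renamed.append((pd, phys_srcs))
    for d, p in in_group.items():        # RAT writes: last writer per reg
        rat[d] = p
    return renamed, len(rat_reads)

rat = {"r1": "p1", "r2": "p2", "r3": "p3"}
group = [("r4", ["r1", "r2"]),   # reads r1, r2 from the RAT
         ("r5", ["r4", "r2"]),   # r4 forwarded, r2 merged
         ("r4", ["r5", "r3"])]   # r5 forwarded, r3 from the RAT
uops, ports = rename_group(group, rat, ["p10", "p11", "p12"])
print(ports)   # 3 RAT read ports instead of 6 naive source reads
```

Here six naive source reads shrink to three RAT reads, and only the last writer of r4 reaches the RAT, halving the write traffic for that name as well.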
      
   I am not suggesting that increasing rename width is trivial; it   
   seems somewhat similar to the dependency tracking for in-order   
   superscalar execution, which is part of what motivated VLIW.   
      
   (In theory, dependent operations could tolerate higher rename
   latency because they cannot execute until at least one cycle
   after the operation providing their source value. I do not
   recall reading any proposal to exploit this, and it would
   increase the complexity of filling the scheduler. Perhaps such
   variable rename latency might be useful in a banked RAT, where
   bank conflicts could delay renaming; preferring non-dependent
   (and older) operations might reduce the impact of bank
   contention.)
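For illustration, a tiny model of where banked-RAT conflicts come from (the bank count, port count, and register-to-bank mapping are all arbitrary choices, not from any real design):

```python
# Bank-conflict count for a banked RAT: registers map to banks by
# number, each bank has a fixed number of read ports per cycle, and
# reads beyond a bank's ports would have to be delayed.

BANKS = 4
PORTS_PER_BANK = 2

def bank_conflicts(src_regs):
    """src_regs: architectural source register numbers read this cycle.
    Returns how many reads overflow their bank's ports."""
    loads = [0] * BANKS
    for r in src_regs:
        loads[r % BANKS] += 1
    return sum(max(0, n - PORTS_PER_BANK) for n in loads)

print(bank_conflicts([1, 5, 9, 2, 3]))   # 1, 5, 9 all hit bank 1 -> 1 conflict
print(bank_conflicts([0, 1, 2, 3]))      # spread across banks -> 0 conflicts
```

The overflowing read is the one a renamer might push to a later cycle, and a dependent operation, which cannot execute immediately anyway, is the cheapest one to push.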
      
   I *suspect* that there is some architectural and
   microarchitectural opportunity to reduce the overhead of
   renaming, checkpointing, and other out-of-order execution
   mechanisms. For example, caching dependency information in a
   decoded instruction cache might be helpful.
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   


