[ home | bbs | files | messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 131,003 of 131,241   
   Paul Clayton to MitchAlsup   
   Re: Variable-length instructions (1/2)   
   08 Feb 26 17:48:13   
   
   From: paaronclayton@gmail.com   
      
   On 2/5/26 4:27 PM, MitchAlsup wrote:   
   >   
   > MitchAlsup  posted:   
   >   
   >> Paul Clayton  posted:   
      
   [snip]   
   >>> LL-op-SC could be recognized as an idiom and avoid bringing data   
   >>> to the core.   
   >>   
   >> Can recognize:   
   >>   
   >>         LDL   Rd,[address]   
   >>         ADD   Rd,Rd,#whatever   
   >>         STC   Rd,[address]   
   >>   
   >> Cannot recognize:   
   >>   
   >>         LDA   R1,[address]   
   >>         CALL  LoadLocked   
   >>         ADD   R2,R2,#whatever   
   >>         CALL  StoreConditional   
      
   When would one want to decouple LL and SC into function calls
   away from the computation? Perhaps for in-place software
   instrumentation?
      
   >>>> Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not   
   >>>> Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}   
   >>>   
   >>> I wonder if there is an issue of communicating intention to the   
   >>> computer. Using atomic-to-memory may be intended to communicate   
   >>> that the operation is expected to be under contention or that   
   >>> moderating the impact under high contention is more important   
   >>> than having a fast "happy path".   
   >>   
   >> There is a speed of light problem here. Communicating across a   
   >> computer  is a microsecond time problem, whereas executing   
   >> instructions is a nanosecond time problem.   
   >>   
   >> And this is exactly where Add-to-Memory gains over Interferable   
   >> ATOMIC events--you only pay the latency once, now while the latency   
   >> is higher than possible with LL-SC, it is WAY LOWER than worst case   
   >> with LL-SC under serious contention.   
      
   Yes. Even a single adder would have higher throughput than ping-
   ponging a cache block. One might even support a three-or-more-
   input, two-or-more-result adder (or perhaps exploit usually-
   smaller addends) to increase throughput, though I suspect there
   would practically never be a case where a simple adder would
   have insufficient throughput.
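   The contrast can be sketched with C11 atomics: the CAS loop below
   is the LL/SC analogue (under contention it can fail and retry,
   ping-ponging the cache line), while fetch-add is the add-to-memory
   analogue (the round-trip latency is paid once). Names here are
   illustrative, not from any particular codebase.

```c
#include <stdatomic.h>

// Illustrative shared counter.
static atomic_long counter = 0;

// CAS loop: the LL/SC analogue. Each failed compare-exchange means
// another core won the race; the line bounces and we retry.
long add_cas(long delta) {
    long old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + delta)) {
        // 'old' was refreshed by the failed CAS; try again.
    }
    return old + delta;
}

// Fetch-add: the atomic-to-memory analogue. Contended updates
// serialize at one adder instead of retrying at each core.
long add_fetch(long delta) {
    return atomic_fetch_add(&counter, delta) + delta;
}
```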
      
   >>> This seems to be similar to branch hints and predication in that   
   >>> urging the computer to handle the task in a specific way may not   
   >>> be optimal for the goal of the user/programmer.   
      
   >> Explain   
      
   Branch hints can be intended to reduce branch predictor aliasing   
   (i.e., assume the static hint is used instead of a dynamic   
   predictor), to provide agree-predictor information, to prefer one path
   even if it is less likely, to provide an initialization of the   
   (per-address component only?) branch predictor, or for some   
   other motive. The interface/architecture might not be specific   
   about how such information will be used, especially if it is a   
   hint (and programmers might disagree about what the best   
   interface would be). If the interface is not very specific, a   
   microarchitecture might violate a programmer's   
   desire/expectation by ignoring the hint or using it in a   
   different way.   
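   One concrete hint interface is GCC/Clang's __builtin_expect, and it
   shows the vagueness: the programmer states the likely value, but
   nothing specifies whether that seeds a static scheme, biases the
   dynamic predictor, or is ignored entirely. A minimal sketch
   (the likely/unlikely macros are the usual convention, not part of
   the compiler):

```c
// Conventional wrappers around the GCC/Clang builtin.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int *p) {
    if (unlikely(p == 0))   // hint: the error path is rare
        return -1;
    return *p + 1;          // hinted-likely path
}
```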
      
   Similarly, predication can be motivated by avoiding fetch
   redirection (initial ARM and My 66000), facilitating constant-
   time execution, avoiding the performance cost of branch
   mispredictions, or perhaps some reason that does not come to
   mind. Predicate prediction would foil constant-time execution
   and might reduce performance (or merely introduce weird
   performance variation). Even the fetch optimization might be
   undone if the hardware discovers that the condition is extremely
   biased and folds out the rarely used instructions, which would
   be good for performance if the bias continues, but if the bias
   changes just frequently enough it could hurt performance.
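   As a concrete instance of the constant-time motivation, the
   branchless select below (a hypothetical helper, not from any real
   library) touches both operands regardless of the condition;
   hardware that predicts the predicate and speculates past the
   masking would undermine exactly this property:

```c
#include <stdint.h>

// Branchless, predication-style select: returns a if cond != 0,
// else b. Both operands are read on every call, so timing does
// not depend on a (possibly secret) condition.
uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (cond != 0);  // all-ones or all-zeros
    return (a & mask) | (b & ~mask);
}
```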
      
   [snip]   
   >> There is no reason not to predict My 66000-style predication,   
   >> nor is there any great desire/need TO predict them, either.   
      
   Except that prediction could violate the time constancy assumed   
   by the programmer.   
      
   >>> If the requesting core has the cache block in exclusive or   
   >>> modified state, remote execution might be less efficient. Yet it   
   >   
   > In My 66000 architecture, an atomic operation on memory will have   
   > the property of being executed where the cache line in a modifiable   
   > state happens to reside. Here you don't want to "bring data in"   
   > nor do you want to "push data out" you want the memory hierarchy   
   > to be perturbed as little as possible WHILE performing the ATOMIC   
   > event.   
      
   Okay, that makes sense. The earlier statement implied (to me)   
   that the operation was always "centralized".   
      
   > The cost is that each cache will have an adder--with the sizes of   
   > cache these days, said adder is a trifling in area and seldom used   
   > so it is low power.   
   >   
   >>> may also be possible that the block is expected to be moved to   
   >>> another core such that this pushing the data manipulation to a   
   >>> "centralized" location would improve performance as was the   
   >>> programmer intent (rather than avoiding contention overhead). (I   
   >>> suppose in My 66000, a programmer would use a push/"prefetch"   
   >>> instruction to move the cache block to a more central location,   
   >>> but even that might be sub-optimal if the hardware can predict   
   >>> the next data consumer such that centrally located would be   
   >>> slower than shifting the data closer to the consumer.)   
   >>   
   >> I have been betting that (in the general case) software will   
   >> remain incapable of doing such predictions, for quite some time.   
      
   A bet with very good odds in general, but I am sure there is
   still more than one "Mel" around who could optimize data
   movement.
      
   In cases like simple pipelines, the data communication pattern
   is obvious. For an embedded system, communicating stores might
   go directly to another core's local memory. With more general-
   purpose compute, thread migration would be more common (even
   with more cores than runnable threads), and abstracting a
   communication interface might be more challenging.
      
   [snip]   
   >>> Clever microarchitecture can make some optimizations sub-optimal   
   >>> as well as cause programmers to ask "why didn't you do it the way   
   >>> I told you to do it?!"   
   >>   
   >> Instruction scheduling is in effect an optimization for one implementation   
   >> that may not hold for other implementations. In Order pipelines need   
   >> instruction scheduling, OoO do not, GBOoO generally perform better if   
   >> /when instructions are not (or lesser) scheduled.   
      
   Even OoO implementations can benefit from facilitating   
   instruction fusion and perhaps even scheduler allocation to   
   reduce communication overhead. A RAT-based renamer might also   
   benefit from having duplicate register names in the same rename   
   chunk. With in-order renaming, in theory a compiler could also   
   optimize RAT bank conflicts. This might not be considered   
   scheduling, but when the instruction stream is serial, placement
   is effectively scheduling.
      
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca