
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 130,985 of 131,241   
   MitchAlsup to All   
   Re: Variable-length instructions   
   05 Feb 26 21:27:51   
   
   From: user5857@newsgrouper.org.invalid   
      
   MitchAlsup  posted:   
      
   >   
   > Paul Clayton  posted:   
   >   
   -----------------   
   > > >> Good point. Humm. Well, I just don't want to see the disassembly of   
   > > >> atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)   
   > > >   
   > > > If you do it LL/SC-style you HAVE to bring data to "this"   
   > > > particular CPU, and that (all by itself) causes n^2 to n^3   
   > > > "bus" traffic under contention. So you DON'T DO IT LIKE THAT.   
   > >   
   > > LL-op-SC could be recognized as an idiom and avoid bringing data   
   > > to the core.   
   >   
   > Can recognize:   
   >   
   >        LDL   Rd,[address]   
   >        ADD   Rd,Rd,#whatever   
   >        STC   Rd,[address]   
   >   
   > Cannot recognize:   
   >   
   >        LDA   R1,[address]   
   >        CALL  LoadLocked   
   >        ADD   R2,R2,#whatever   
   >        CALL  StoreConditional   
   >   
   > > > Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not   
   > > > Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}   
   > >   
   > > I wonder if there is an issue of communicating intention to the   
   > > computer. Using atomic-to-memory may be intended to communicate   
   > > that the operation is expected to be under contention or that   
   > > moderating the impact under high contention is more important   
   > > than having a fast "happy path".   
   >   
   > There is a speed-of-light problem here. Communicating across a   
   > computer is a microsecond-scale problem, whereas executing   
   > instructions is a nanosecond-scale problem.   
   >   
   > And this is exactly where Add-to-Memory gains over interferable   
   > ATOMIC events--you only pay the latency once. While that latency   
   > is higher than what is possible with LL-SC, it is WAY LOWER than   
   > the worst case with LL-SC under serious contention.   
   >   
   > > This seems to be similar to branch hints and predication in that   
   > > urging the computer to handle the task in a specific way may not   
   > > be optimal for the goal of the user/programmer.   
   > Explain   
   > >                                                 A programmer   
   > > might use predication to avoid a branch that is expected to be   
   > > poorly predicted or to have more consistent execution time. The   
   >   
   > My 66000 predication can avoid 2 branches--it operates under the   
   > notion that if FETCH reaches the join point before the condition   
   > is known, then predication is always faster than branching.   
   >   
   > > former could be inappropriate for the computer to obey if the   
   > > branch predictor became effective for that branch. If prediction   
   > > is accurate, predicate prediction could improve performance but   
   > > would break execution time consistency. Even reducing execution   
   > > time when the predicate is known early might go against the   
   > > programmer's intent by leaking information.   
   >   
   > There is no reason not to predict My 66000-style predication,   
   > nor is there any great desire/need TO predict them, either.   
   >   
   > > If the requesting core has the cache block in exclusive or   
   > > modified state, remote execution might be less efficient. Yet it   
      
   In the My 66000 architecture, an atomic operation on memory has the   
   property of being executed wherever the cache line happens to reside   
   in a modifiable state. Here you don't want to "bring data in", nor   
   do you want to "push data out"; you want the memory hierarchy to be   
   perturbed as little as possible WHILE performing the ATOMIC event.   
      
   The cost is that each cache will have an adder--with the sizes of   
   caches these days, said adder is trifling in area and, being   
   seldom used, low power.   
      
   > > may also be possible that the block is expected to be moved to   
   > > another core such that pushing the data manipulation to a   
   > > "centralized" location would improve performance, as was the   
   > > programmer's intent (rather than avoiding contention overhead). (I   
   > > suppose in My 66000, a programmer would use a push/"prefetch"   
   > > instruction to move the cache block to a more central location,   
   > > but even that might be sub-optimal if the hardware can predict   
   > > the next data consumer such that centrally located would be   
   > > slower than shifting the data closer to the consumer.)   
   >   
   > I have been betting that (in the general case) software will   
   > remain incapable of doing such predictions, for quite some time.   
   >   
   > > If the contention is from false sharing (having multiple atomic   
   > > data in a cache block seems to be considered bad programming   
   > > practice, so this should not be common unless cache block size   
   > > grows), hardware could theoretically provide special word caches   
   > > (or "lossy" block compression where part of the block is dropped)   
   > > for moderating the impact of false sharing. This would change the   
   > > optimization preferences for the program (more compact data might   
   > > be preferred if false sharing is less of a problem).   
   > >   
   > > I do not know what the best interface would be, but it seems that   
   > > some care should be taken to account for differing intent when a   
   > > programmer suggests a specific mechanism. This also gets into the   
   > > distinction/spectrum/space between a hint and a directive. Both   
   > > hints and directives can have unexpected performance changes   
   > > under different microarchitectures or different usage.   
   > >   
   > > Clever microarchitecture can make some optimizations sub-optimal   
   > > as well as cause programmers to ask "why didn't you do it the way   
   > > I told you to do it?!"   
   >   
   > Instruction scheduling is in effect an optimization for one   
   > implementation that may not hold for other implementations.   
   > In-order pipelines need instruction scheduling, OoO pipelines do   
   > not, and GBOoO pipelines generally perform better if/when   
   > instructions are not (or less) scheduled.   
   >   
   > Loop unrolling is a case where, if your machine has vVM, the code   
   > runs faster when {not unrolled, executed under vVM} than when   
   > {unrolled and executed with branch prediction}.   
   >   
   > > (I think I may not be communicating well. I am kind of tired   
   > > right now and this topic is complicated.)   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca