From: anton@mips.complang.tuwien.ac.at   
      
   Michael S writes:   
   >On Tue, 30 Dec 2025 17:27:22 GMT   
   >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
   >>   
   >> I did not write anything about the clue of Apple. I don't know much   
   >> about the CPUs by Nvidia and Fujitsu. But if there was significant   
   >> performance to be had by adding a weakly-ordered mode, wouldn't   
   >> especially Fujitsu with its supercomputer target have done it?   
   >>   
   >   
   >Fujitsu had very strong reason to implement TSO on A64FX - source-level   
   >compatibility with SPARC64 VIIIfx and XIfx.   
      
   Apple also had good reason to implement TSO on M1: AMD64->ARM A64   
   binary translation (Rosetta). They chose to add a slower TSO mode to   
   their weak memory system, which is not surprising given that they had   
   a working weak memory system, and it is relatively easy to implement   
   TSO on that (with a performance penalty).   
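
   To make concrete what a TSO mode buys translated code, here is a
   minimal message-passing sketch in C++. It is my own illustration, not
   anything taken from Rosetta or from Apple's hardware; the names
   message_passing_once, payload and ready are hypothetical. On TSO
   hardware the release store and acquire load below need no extra
   ordering instructions, because store-store and load-load order is
   already preserved. On a weakly-ordered ARM A64 core the compiler has
   to emit ordering instructions (e.g. stlr/ldar) to get the same
   guarantee, and a binary translator that cannot tell which AMD64
   accesses matter would have to do that for every access unless the
   hardware offers a TSO mode.

```cpp
#include <atomic>
#include <thread>

// Message-passing litmus test (all names are illustrative).
// The producer writes plain data and then publishes a flag; the
// consumer waits for the flag and then reads the data.
int message_passing_once() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release: the payload store may not be reordered after this store.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Acquire: the payload load may not be reordered before this load.
        while (!ready.load(std::memory_order_acquire)) {
            // spin until the flag is published
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

   With release/acquire the consumer is guaranteed to see the payload
   written before the flag. If both accesses to ready were
   memory_order_relaxed, the C++ memory model would no longer give that
   guarantee, and weakly-ordered hardware could actually expose the
   stale value.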
      
   >I wouldn't be surprised if apart from that they have SPARC->ARM Rosetta   
   >for some customers, but that's relatively minor factor. Supercomputer   
   >users are considered willing to recompile their code. But much less   
   >willing to re-write it.   
      
   While supercomputer users may not be particularly willing to rewrite
   their code, they are much more willing than anyone else, because in
   supercomputing, hardware cost is still higher than software cost.
      
   If there were an easy way to offer "5-10% more performance" to those
   users willing to write or use software written for weak memory models
   by adding a weak memory mode to A64FX, I would be very surprised if
   they had passed it up. So I conclude that it's not easy to turn
   their memory model into a weak one and gain performance.
      
   Concerning their SPARC implementations: The SPARC architecture   
   specifies both TSO and a weak memory model. Does your comment about   
   SPARC64 VIIIfx and XIfx mean that Fujitsu only implemented TSO on   
   those CPUs and that when you asked for the weak mode on those CPUs,   
   you still got TSO? That would be the counterexample that Thomas   
   Koenig asked for.   
      
   >Besides, as I mentioned in my other post, A64fx memory subsystem is slow   
   >(latency-wise, throughput wise it is very good).   
      
   Sounds to me like it is designed for a supercomputer.   
      
   >I don't know what   
   >influence that fact has, but I can hand-wave that it shifts the balance   
   >of cost toward TSO.   
      
   Can you elaborate on that?   
      
   >Also, cache lines are unusually wide (256B), so it   
   >is possible that RFO shortcuts allowed by weaker MOM are less feasible.   
      
   Why should that be?   
      
   A particular aspect here is that RFO is rare in applications with good   
   temporal locality. Supercomputer applications tend to have relatively   
   bad temporal locality and will see RFO more often.   
      
   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
    Mitch Alsup,    
      