From: user5857@newsgrouper.org.invalid   
      
   anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
      
   > MitchAlsup writes:   
   > >   
   > >anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   > >> Memory-ordering shenanigans come from the unholy alliance of   
   > >> cache-coherent multiprocessing and the supercomputer attitude.   
   > >   
   > >And without the SuperComputer attitude, you sell 0 parts.   
   > >{Remember how we talk about performance all the time here ?}   
   >   
   > Wrong. The supercomputer attitude gave us such wonders as IA-64   
   > (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not   
   > only easier to program, but also faster.   
   >   
   > The advocates of weaker memory models justify them by pointing to the   
   > slowness of sequential consistency if one implements it by using   
   > fences on hardware optimized for a weaker memory model. But that's   
   > not the way to implement efficient sequential consistency.   
   >   
   > In an alternate reality where AMD64 did not happen and IA-64 won,   
   > people would justify the IA-64 ISA complexity as necessary for   
   > performance, and claim that the IA-32 hardware in the Itanium   
   > demonstrates the performance superiority of the EPIC approach, just   
   > like they currently justify the performance superiority of weak and   
   > "strong" memory models over sequential consistency.   
   >   
   > If hardware designers put their mind to it, they could make sequential   
   > consistency perform well,   
      
   Depends on your definition of SC and "performs well", but see below:   
      
   > probably better on code that actually   
   > accesses data shared between different threads than weak and "strong"   
   > ordering, because there is no need to slow down the program with   
   > fences and the like in cases where only one thread accesses the data,   
   > and in cases where the data is read by all threads. You will see the   
   > slowdown only in run-time cases when one thread writes and another   
   > reads in temporal proximity. And all the fences etc. that are   
   > inserted just in case would also become fast (noops).   
      
   In the case of My 66000, there is a slightly weak memory model   
   (causal consistency) for accesses to DRAM, Sequential Consistency   
   for ATOMIC stuff and device control registers, and strong ordering   
   for configuration-space accesses--and the programmer does not have   
   to do "jack" to get these orderings: it's all programmed in the PTEs.   
      
   {{There is even a way to make DRAM accesses SC should you want.}}   
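   For concreteness, the classic "store buffer" litmus test is the case where
   SC and the weaker models diverge. A minimal C++ sketch (names and structure
   my own, not from either poster): under seq_cst the outcome r1 == 0 && r2 == 0
   is forbidden; under a weak model it is allowed unless a store->load fence is
   inserted--exactly the fence that becomes a noop on SC hardware.

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};

// One round: thread A stores x then loads y; thread B stores y then
// loads x. Sequential consistency forbids both loads returning 0; a
// weak model permits it unless fences (implied here by seq_cst) are used.
std::pair<int, int> run_once() {
    x.store(0);
    y.store(0);
    int r1 = -1, r2 = -1;
    std::thread a([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join();
    b.join();
    return {r1, r2};
}
```

   With memory_order_relaxed instead, (0, 0) is observable in practice on
   weakly ordered hardware, which is why compilers emit fences for seq_cst
   accesses on those machines.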
      
   > A similar case: Alpha includes a trapb instruction (an exception   
   > fence). Programmers have to insert it after FP instructions to get   
   > precise exceptions. This was justified with performance; i.e., the   
   > theory went: If you compile without trapb, you get performance and   
   > imprecise exceptions, if you compile with trapb, you get slowness and   
   > precise exceptions. I then measured SPEC 95 compiled without and with   
   > trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264   
   > there was hardly any difference; I believe that trapb is a noop on the   
   > 21264. Here's the SPECfp_base95 numbers:   
   >   
   >    with     without   
   >    trapb    trapb   
   >     9.56    11.6     AlphaPC164LX 600MHz 21164A   
   Moderate slowdown.   
   >    19.7     20.0     Compaq XP1000 500MHz 21264   
   The slowdown has disappeared.   
   >   
   > So the machine that needs trapb is much slower even without trapb than   
   > even the with-trapb variant on the machine where trapb is probably a   
   > noop. And lots of implementations of architectures without trapb have   
   > demonstrated since then that you can have high performance and precise   
   > exceptions without trapb.   
   >   
   > >And only after several languages built their own ATOMIC primitives, so   
   > >the programmers could remain ignorant. But this also ties the hands of   
   > >the designers in such a way that performance grows ever more slowly   
   > >with more threads.   
   >   
   > Maybe they could free their hands by designing for a   
   > sequential-consistency interface, just like designing for a simple   
   > sequential-execution model without EPIC features freed their hands to   
   > design microarchitectural features that allowed ordinary code to   
   > utilize wider and wider OoO cores profitably.   
      
   That is not the property I was getting at. The property I was getting   
   at is that the language model for synchronization can use only one   
   memory location {TS, TTS, CAS, DCAS, LL/SC}, and this fundamentally   
   limits the amount of work one can do in a single ATOMIC event, and   
   also fundamentally limits what one can "say" about a concurrent data   
   structure (CDS).   
      
   Given a certain amount of interference, the fewer ATOMIC events one   
   has to perform, the lower the chance of interference and the greater   
   the chance of success. So, if one could move an element of a CDS from   
   one location to another in one ATOMIC event rather than 2 (or 3),   
   the exponent of the synchronization overhead goes down, and one can   
   make statements like "no outside observer can see the CDS without   
   that element present"--which cannot be stated with current models.   
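   To illustrate the point above, here is a sketch of the two-event move with
   today's single-location CAS, using simplified Treiber-style stacks (names
   are mine, and ABA hazards are ignored for brevity). Between the pop and the
   push there is a window in which an observer sees the element in neither
   stack--the state that a hypothetical two-location ATOMIC event would make
   unobservable.

```cpp
#include <atomic>

struct Node { int val; Node* next; };

std::atomic<Node*> src{nullptr}, dst{nullptr};

// Single-location CAS pop: retries until the head swings to head->next.
Node* pop(std::atomic<Node*>& head) {
    Node* n = head.load();
    // compare_exchange_weak reloads n on failure; recheck for null first
    while (n != nullptr && !head.compare_exchange_weak(n, n->next)) {}
    return n;
}

// Single-location CAS push: retries until the head swings to n.
void push(std::atomic<Node*>& head, Node* n) {
    n->next = head.load();
    while (!head.compare_exchange_weak(n->next, n)) {}
}

// Moving one element takes TWO separate ATOMIC events; between them,
// n belongs to neither stack, and any observer can see that state.
void move_one() {
    Node* n = pop(src);
    if (n) push(dst, n);   // <-- window: n is in neither src nor dst
}
```

   A primitive that could CAS both stack heads in one event would close that
   window, which is the "exponent goes down" argument in a nutshell.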
      
   >   
   > - anton   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   