
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 130,424 of 131,241   
   EricP to Anton Ertl   
   Re: Memory ordering (Re: Multi-precision   
   01 Dec 25 14:07:34   
   
   From: ThatWouldBeTelling@thevillage.com   
      
   Anton Ertl wrote:   
   > MitchAlsup  writes:   
   >> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>> Memory-ordering shenanigans come from the unholy alliance of   
   >>> cache-coherent multiprocessing and the supercomputer attitude.   
   >> And without the SuperComputer attitude, you sell 0 parts.   
   >> {Remember how we talk about performance all the time here ?}   
   >   
   > Wrong.  The supercomputer attitude gave us such wonders as IA-64   
   > (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not   
   > only easier to program, but also faster.   
   >   
   > The advocates of weaker memory models justify them by pointing to the   
   > slowness of sequential consistency if one implements it by using   
   > fences on hardware optimized for a weaker memory model.  But that's   
   > not the way to implement efficient sequential consistency.   
   >   
   > In an alternate reality where AMD64 did not happen and IA-64 won,   
   > people would justify the IA-64 ISA complexity as necessary for   
   > performance, and claim that the IA-32 hardware in the Itanium   
   > demonstrates the performance superiority of the EPIC approach, just   
   > like they currently justify the performance superiority of weak and   
   > "strong" memory models over sequential consistency.   
   >   
   > If hardware designers put their mind to it, they could make sequential   
   > consistency perform well, probably better on code that actually   
   > accesses data shared between different threads than weak and "strong"   
   > ordering, because there is no need to slow down the program with   
   > fences and the like in cases where only one thread accesses the data,   
   > and in cases where the data is read by all threads.  You will see the   
   > slowdown only in run-time cases when one thread writes and another   
   > reads in temporal proximity.  And all the fences etc. that are   
   > inserted just in case would also become fast (noops).   
   >   
   > A similar case: Alpha includes a trapb instruction (an exception   
   > fence).  Programmers have to insert it after FP instructions to get   
   > precise exceptions.  This was justified with performance; i.e., the   
   > theory went: If you compile without trapb, you get performance and   
   > imprecise exceptions, if you compile with trapb, you get slowness and   
   > precise exceptions.  I then measured SPEC 95 compiled without and with   
   > trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264   
   > there was hardly any difference; I believe that trapb is a noop on the   
   > 21264.  Here's the SPECfp_base95 numbers:   
   >   
   > with     without   
   > trapb    trapb   
   > 9.56     11.6    AlphaPC164LX 600MHz 21164A   
   > 19.7     20.0    Compaq XP1000 500MHz 21264   
   >   
   > So the machine that needs trapb is much slower even without trapb than   
   > even the with-trapb variant on the machine where trapb is probably a   
   > noop.  And lots of implementations of architectures without trapb have   
   > demonstrated since then that you can have high performance and precise   
   > exceptions without trapb.   
      
   The 21264 Hardware Reference Manual says TRAPB (trap barrier) and
   EXCB (exception barrier, which also orders access to the floating
   point control register) are both NOPs internally: they are discarded
   at decode and don't even take up an instruction slot.
      
   The purpose of the EXCB is to synchronize pipeline access to the   
   floating point control and status register with FP operations.   
   In the worst case this stalls until the pipeline drains.   
      
   I wonder how much logic allowing imprecise exceptions really saved
   in the InO 21064 and 21164? Conversely, how much did it cost to deal
   with the problems those missing interlocks caused?
      
   The cores have multiple parallel pipelines for int, lsq, fadd and fmul.
   Without exception interlocks, each pipeline obeys only the scoreboard
   rules for when to write back its result register: WAW and WAR.
   That allows a younger, faster instruction to finish and write its
   register before an older, slower one. If that older instruction then
   takes an exception and never writes its register, the out-of-order
   register writes become architecturally visible.
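   To make that concrete, here is a toy Python timeline (instruction
   latencies are invented for illustration) showing a younger fmul
   writing its register long before an older fdiv faults:

```python
# Toy illustration of imprecise state: without interlocks, completion
# order is latency order, not program order. Latencies are made up.
program_order = [
    ("fdiv f1,f2 -> f0", 10, "fault"),  # older, slow, will trap
    ("fmul f4,f5 -> f3",  2, "ok"),     # younger, fast
]

# Each pipeline writes back as soon as its op finishes.
completion_order = sorted(program_order, key=lambda e: e[1])

# The fmul's write to f3 lands at cycle 2; the fdiv traps at cycle 10.
# A handler invoked at the trap sees f3 already updated, a register
# state that in-order execution could never have produced.
print(completion_order[0][0])  # the younger fmul completes first
```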
      
   For register file writes to be precise in the presence of exceptions,
   each instruction must look ahead at the state of all older
   instructions *in all pipelines*.
   Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
   A writeback can occur only if there are no WAW or WAR dependencies
   and all older uOps are Resolved_Normal.
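   In software terms the rule is something like this sketch (Uop, Status
   and can_write_back are made-up names, not from any Alpha manual):

```python
# Toy model of the precise-writeback check described above.
from enum import Enum, auto

class Status(Enum):
    UNRESOLVED = auto()
    RESOLVED_NORMAL = auto()
    RESOLVED_EXCEPTION = auto()

class Uop:
    def __init__(self, seq, dest, srcs, status=Status.UNRESOLVED):
        self.seq = seq        # program order (age)
        self.dest = dest      # destination register
        self.srcs = srcs      # source registers
        self.status = status

def can_write_back(uop, in_flight):
    """A uop may write its result register only if no older in-flight
    uop writes the same register (WAW), no older uop still reads it
    (WAR), and every older uop has resolved without an exception."""
    for other in in_flight:
        if other.seq >= uop.seq:
            continue                      # only older uops matter
        if other.dest == uop.dest:        # WAW hazard
            return False
        if uop.dest in other.srcs:        # WAR hazard
            return False
        if other.status != Status.RESOLVED_NORMAL:
            return False                  # precise-exception interlock
    return True
```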
      
   Just off the top of my head: in addition to the normal scoreboard,
   a FIFO buffer with a priority selector could look ahead at all older
   uOps, check their status, and allow or stall writebacks so that the
   registers always appear precise.
   That really doesn't look all that expensive.
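   Off the top of my head it would behave like this toy model (entry
   names and status strings are illustrative, not from any real design):

```python
from collections import deque

def drain_precise(fifo):
    """Toy look-ahead FIFO: entries are (name, status) in age order,
    status one of 'pending', 'ok', 'fault'. Pop writebacks while every
    older entry has completed normally; stall at the first pending
    entry; on a fault, squash everything younger so the register file
    always appears precise."""
    retired, squashed = [], []
    while fifo:
        name, status = fifo[0]
        if status == "pending":
            break                        # stall until the older op resolves
        if status == "fault":
            squashed = [n for n, _ in fifo]
            fifo.clear()                 # younger results never reach the RF
            break
        retired.append(name)
        fifo.popleft()
    return retired, squashed

f = deque([("fadd", "ok"), ("fmul", "pending"), ("ldq", "ok")])
print(drain_precise(f))  # (['fadd'], []) -- the pending fmul stalls ldq
```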
      
   Is there something I missed, or would that FIFO suffice?   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca