   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 129,523 of 131,241   
   Anton Ertl to MitchAlsup   
   Re: VAX (was: Why I've Dropped In)   
   27 Aug 25 17:19:06   
   
   From: anton@mips.complang.tuwien.ac.at   
      
   MitchAlsup  writes:   
   >   
   >anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >> given that microcode no longer made sense for VAX, POLY did not make   
   >> sense for it, either.   
   ...   
   >[...] POLY as an   
   >instruction is bad.   
      
   Exactly.   
      
   >One must remember that VAX was a 5-cycle per instruction machine !!!   
   >(200ns : 1 MIP)   
      
   It's better to forget this misinformation, and instead remember that   
   the VAX has an average CPI of 10.6 (Table 8 of   
   )   
      
   Table 9 of that reference is also interesting:   
      
   CALL/RET instructions take an average 45 cycles, Character
   instructions (I guess this means stuff like EDIT) take an average 117
   cycles, and Decimal instructions take an average 101 cycles.  It seems
   that these instructions all have no special hardware support on the   
   VAX 11/780 and do it all through microcode.  So replacing Character   
   and Decimal instructions with calls to functions on a RISC-VAX could   
   easily outperform the VAX 11/780 even without special hardware   
   support.  Now add decimal support like the HPPA has done or string   
   support like the Alpha has done, and you see even better speed for   
   these instructions.   
      
   For CALL/RET, one might use one of the modern calling conventions.   
   However, this loses some capabilities compared to the VAX.  So one may   
   prefer to keep frame pointers by default and maybe other features that   
   allow, e.g., universal cross-language debugging on the VAX without   
   monstrosities like ELF and DWARF.   
      
   >Pipeline work over 1983-to-current has shown that LD and OPs perform   
   >just as fast as LD+OP. Also, there are ways to perform LD+OP as if it   
   >were LD and OP, and there are way to perform LD and OP as if it were   
   >LD+OP.   
      
   I don't know what you are getting at here.  When implementing the 486,   
   Intel chose the following pipeline:   
      
   Instruction Fetch   
   Instruction Decode   
   Mem1   
   Mem2/OP   
   Writeback   
      
   This meant that load-and-op instructions take 2 cycles (and RMW   
   instructions take three); it gave us the address-generation interlock   
   (op-to-load latency 2), and 3-cycle taken branches.  An alternative   
   would have been:   
      
   Instruction Fetch   
   Instruction Decode   
   Mem1   
   Mem2   
   OP   
   Writeback   
      
   This would have resulted in a max throughput of 1 CPI for sequences of
   load-and-op instructions, but would have resulted in an AGI of 3
   cycles, and 4-cycle taken branches.   
      
   For the Bonnell, Intel chose such a pipeline (IIRC with a third mem
   stage), but the Bonnell has a branch predictor, so the longer branch
   latency usually does not strike.
      
   AFAIK IBM used such a pipeline for some S/360 descendants.   
      
   >Condition codes get hard when DECODE width grows greater than 3.   
      
   And yet the widest implementations (up to 10 wide up to now) are of   
   ISAs that have condition-code registers.  Even particularly nasty ones   
   in the case of AMD64.   
      
   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
     Mitch Alsup,    
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca