From: user5857@newsgrouper.org.invalid   
      
   Stefan Monnier posted:   
      
   > >>Maybe a way to avoid that problem   
   > >>is to make the renaming architectural. I.e. add a "register renaming   
   > >>table" (RRT), and introduce the instruction RENAME which changes   
   > >>that RRT. Whenever an instruction wants to read register Rn, the actual   
   > >>architectural register we'll read is obtained by passing `n` through RRT.   
   > >   
   > > All of that happens with microarchitectural renaming (your RRT is   
   > > called RAT (register alias table), however). Your "RENAME"   
   > > instruction is called "MOV". Why make the RAT architectural?   
   >   
   > Good question. I was just reacting to Mitch who seemed to say that one   
   > of the main problems with a multi-move instruction is that it has too   
   > many outputs and that doesn't fit into the general design, so by making
   > the RRT/RAT architectural it makes the instruction single-output.   
   > I don't know if in practice it would make any difference.   
      
   There is a whole "bunch of things" that are being conflated here.   
   a) single cycle renamers do not do 0-cycle MOVs;   
   b) whereas once the renamer takes 3 cycles, you are pretty much required   
    to perform 0-cycle moves in order to make up for the renaming latency.   
   c) getting register specifiers to the rename ports becomes harder   
    as the number of writes per instruction goes up.   
   d) LDM instructions are a special case because the architectural registers   
    are all sequential, so one can special-case architectural delays while   
    renaming fairly easily.   
   e) the data-path can perform as many MOVs per cycle as it has Function   
    Units--so, you are not buying calculation "slots" when   
    doing 0-cycle MOVs.   
      
   And a lot of this comes down to HOW one renames registers in your   
   implementation.   
      
   A physical register file is essentially the logical outcome when the
   Reorder Buffer becomes big enough that no register reads come from
   the RF and all come from the ROB. At that point you add the
   architectural RF names to the ROB and avoid data movement at retire.
   Mc 88120 had such an organization.
      
   Architectural registers were read by CAM, and each CAM entry had a valid
   bit. There is always a CAM entry holding the Architectural Register Number
   in a valid state. When matched, the CAM selected the register and it was
   read out. In addition, a 3-state "state" was read out, along with the
   Physical Register Name and which Function Unit would deliver the pending
   result.
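To make the CAM read concrete, here is a minimal software sketch. It is not the Mc 88120 logic: the names `Entry`, `State`, `cam_read`, and `cam_rename` are hypothetical, and the three state names are guesses, since only the fact that the state has three values is given above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class State(Enum):
    # Only "3-state" is known; these three names are assumptions.
    WAITING = 0   # producer not yet launched
    PENDING = 1   # result in flight, will be forwarded
    READY = 2     # value is present in the entry


@dataclass
class Entry:
    arch_reg: int                 # CAM tag: architectural register number
    valid: bool                   # True if this is the current mapping
    state: State
    value: Optional[int] = None
    unit: Optional[int] = None    # function unit expected to deliver the result


def cam_read(entries: List[Entry], arch_reg: int) -> Tuple[int, State, Optional[int], Optional[int]]:
    """Match arch_reg against every valid tag; exactly one entry should hit."""
    hits = [i for i, e in enumerate(entries) if e.valid and e.arch_reg == arch_reg]
    if len(hits) != 1:
        raise ValueError(f"expected one valid match for R{arch_reg}, got {len(hits)}")
    i = hits[0]
    e = entries[i]
    # The index i plays the role of the physical register name.
    return i, e.state, e.value, e.unit


def cam_rename(entries: List[Entry], free_slot: int, arch_reg: int, unit: int) -> int:
    """Point arch_reg at free_slot by moving the valid bit; no data is copied."""
    old, _, _, _ = cam_read(entries, arch_reg)
    entries[old].valid = False
    entries[free_slot] = Entry(arch_reg, True, State.PENDING, None, unit)
    return free_slot
```

The point of the sketch is that a rename only moves a valid bit from one entry to another, which is why snapshots of the valid bits are enough to describe the whole mapping.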
      
   All of this was written into the Reservation Station (RS) on Mc 88120.
   If the value was in the pending state, it would be forwarded into the RS.
   If the RS was not launching an instruction, the just-Decoded instruction
   would be launched into Execution (just in case, and checked later).
      
   Each cycle, the valid bits of the CAM were transferred into the History
   Buffer. Two bits were transferred per CAM entry: the valid bit if the
   Decode was backed up (for any reason) and the valid bit if the Decode was
   successful. There was a layer of logic between entries in the History
   Buffer that amalgamated the register status, so that one could retire all
   Decode cycles in a single cycle (catch-up BW).
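A rough sketch of that bookkeeping, with the valid bits held as integer bitmasks (one bit per CAM entry). The class and method names are hypothetical, and the way `retire_through` combines entries is only one plausible reading of "amalgamated", not a description of the actual logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    if_backed_up: int    # valid bits to use if this Decode cycle is undone
    if_successful: int   # valid bits to use if this Decode cycle stands


class HistoryBuffer:
    def __init__(self) -> None:
        self.entries: List[Snapshot] = []

    def record(self, before: int, after: int) -> int:
        """Store both valid-bit vectors for one Decode cycle; return its index."""
        self.entries.append(Snapshot(before, after))
        return len(self.entries) - 1

    def retire_through(self, index: int) -> int:
        """Retire Decode cycles 0..index in one step.

        Returns a mask of CAM entries whose valid bit was cleared by any of
        the retired cycles, i.e. old mappings that could be reused. This
        OR-reduction is an assumed policy for illustration.
        """
        released = 0
        for snap in self.entries[: index + 1]:
            released |= snap.if_backed_up & ~snap.if_successful
        del self.entries[: index + 1]
        return released
```

Because every step is a bitwise AND or OR over per-register bits, retiring several Decode cycles costs no more than retiring one.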
      
   When a branch instruction was Decoded, it was associated with a History
   Buffer index as a checkpoint. Mc88120 read the "instruction
   cache" twice per cycle, once on the predicted direction and once on the backup   
   direction. The backup direction was placed in the recovery buffer with the   
   index of the branch in its RS.   
      
   When a branch instruction was launched, it provided an index into the
   History Buffer, and if the branch had to be backed up, the History Buffer
   could provide the valid bits for the subsequent Decode cycle with 0-delay.
   In order for this to work (0-cycle recovery), the recovery buffer was read
   as the branch was launched, so if the branch was mispredicted, we already
   had instructions from the non-predicted path to feed into Decode.
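The checkpoint-and-recover flow can be sketched as follows. This is a toy model under stated assumptions: `FrontEnd` and its methods are hypothetical names, valid bits are an integer bitmask, and the snapshot restored is the one taken at the branch's own Decode cycle.

```python
from typing import Dict, List


class FrontEnd:
    """Toy front end: valid bits, a history buffer, and a recovery buffer."""

    def __init__(self, valid: int) -> None:
        self.valid = valid                         # current CAM valid bits
        self.history: List[int] = []               # one snapshot per Decode cycle
        self.recovery: Dict[int, List[str]] = {}   # checkpoint -> alternate path

    def decode(self, new_valid: int) -> int:
        """Apply this cycle's renames and snapshot the resulting valid bits."""
        self.valid = new_valid
        self.history.append(new_valid)
        return len(self.history) - 1

    def decode_branch(self, new_valid: int, alternate: List[str]) -> int:
        """Decode a branch; keep the non-predicted path in the recovery buffer."""
        index = self.decode(new_valid)
        self.recovery[index] = alternate
        return index

    def launch_branch(self, index: int, mispredicted: bool) -> List[str]:
        """Resolve a branch; on a mispredict restore state and return the alternate path."""
        alternate = self.recovery.pop(index)
        if not mispredicted:
            return []
        self.valid = self.history[index]           # restore the checkpoint
        del self.history[index + 1:]               # discard younger Decode cycles
        self.recovery = {k: v for k, v in self.recovery.items() if k < index}
        return alternate
```

The alternate-path instructions are already in hand when the branch resolves, so nothing has to be refetched before Decode can continue.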
      
   So, here, Decode, RF read, Rename, Checkpointing, and data-flow forwarding
   were all integrated into a single "resource" with a single coordinating
   sequencer.
      
   Rename: you could say it took 1-cycle, but was commensurate with Decode.   
   Recovery: you could say it took 1-cycle, but was commensurate with Branch   
   resolution.   
   RF Backup: you could say it took 1-cycle, but was commensurate with Branch   
   resolution.   
      
   All of which are a far cry from {3,4,5} cycle Decode+Rename.   
   And {2,3,4} cycle Backup.   
      
   Also note: Due to handling all of this in unary form, we could back up a
   branch AND retire 1 or more Groups in the same cycle with a single OR gate   
   per renamable register.   
      
   Given the 1-cycle Decode->RS and the 0-cycle Branch mispredict recovery   
   we found no "particular" benefit to 0-cycle MOVs.   
      
   All of this was 1991-92. We were even getting 2.2 IPC out of SPECint XLISP,
   averaging 3.1 IPC from SPECint, and getting 5.97 IPC from MATRIX300, all
   without an L2 cache, running Mc 88100 ISA code compiled for the 88100
   without modification, with a 4KB GShare branch predictor and a 16KB DM
   4-banked DCache.
   >   
   >   
   > Stefan   
      