comp.arch
Message 129,651 of 131,241
MitchAlsup to All
Re: A new method for OoO
11 Sep 25 18:48:06
   From: user5857@newsgrouper.org.invalid   
      
   EricP  posted:   
      
   > Thomas Koenig wrote:   
   > > https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc   
   v-core-at-hot-chips-2025/   
   > > has an interesting take on how to do OoO (quite patented,
   > > apparently).  Apparently, they predict how many cycles their   
   > > instructions are going to take, and replay if that doesn't work   
   > > (for example in case of an L1 cache miss).   
   > >   
   > > Sounds interesting, I wonder what people here think of it.   
   >   
   > I searched for "processor" "schedule" "time resource matrix" and got   
   > a hit on a different company's patent for what looks like the same idea.   
   >   
   > Time-resource matrix for a microprocessor with time counter   
   > for statically dispatching instructions   
   > https://patents.google.com/patent/US11829762B2   
   >   
   > It basically puts all the schedule in one HW matrix of time_slots * resources   
   > and scans forward looking for empty slots to allocate to each instruction.   
   > The scheduling is done at Rename and time slots assigned for each resource   
   > needed, source operand read ports, FU's, result buses.   
   > If a load later misses L1 it triggers a replay of all younger instructions.   
   >   
   > They claim it is simpler but I question that.   
      
   Scoreboards are simpler than RS (and smaller too) but come with a   
   10%-odd disadvantage in performance (per frequency). The purported   
   scheme is 7%-odd slower--read into that anything you want.   
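
As a rough illustration of the time-resource matrix EricP describes in the quoted text (one matrix of time_slots * resources, scanned forward for empty slots at Rename), here is a minimal sketch. The class name, the `needs` encoding, and the fixed 16-cycle horizon are assumptions made for illustration and are not taken from the patent or from Condor's design. Replay on an L1 miss is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class TimeResourceMatrix:
    """Toy matrix of time_slots x resources; True marks a reserved cell."""
    resources: list
    horizon: int = 16                      # assumed scheduling window, in cycles
    busy: dict = field(default_factory=dict)

    def __post_init__(self):
        self.busy = {r: [False] * self.horizon for r in self.resources}

    def schedule(self, earliest, needs):
        """needs: list of (resource, offset) pairs relative to the issue cycle.

        Scan forward from `earliest` for the first cycle at which every
        needed cell is free, reserve those cells, and return that cycle.
        """
        for t in range(earliest, self.horizon):
            cells = [(r, t + off) for r, off in needs]
            if all(c < self.horizon and not self.busy[r][c] for r, c in cells):
                for r, c in cells:
                    self.busy[r][c] = True
                return t
        return None  # window is full; a real design would have to stall


# Hypothetical 1-cycle ALU op: read port, then ALU, then result bus.
m = TimeResourceMatrix(["read_port", "alu", "result_bus"])
alu_needs = [("read_port", 0), ("alu", 1), ("result_bus", 2)]
first = m.schedule(0, alu_needs)
second = m.schedule(0, alu_needs)   # collides with `first`, slips one cycle
```

Every instruction has to touch the same structure for every resource it needs, which is the port-scaling concern raised in the next quoted paragraph.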
      
   > Putting all the schedule info in one matrix means that to scale it   
   > requires adding more ports to the matrix. Also different resources   
   > can require different allocation and scheduling algorithms.   
   > Doing all this in one place at the same time gets complicated quickly.   
      
   Scoreboards scale with inst^2+registers^3   
   Stations    scale with inst×FU+RoB   
      
   CDC got away with a Scoreboard because it tracked 3 sets of 8 registers;   
   doing this with 32 uniform registers would be 8× as big !!! and somewhat   
   slower; doing this with 32+32 {int,FP} registers would be 16× worse than
   the 6600; add in SIMD and I don't even know how to calculate it.
      
   > My simulated design intentionally distributed schedulers to each FU's bank   
   > of reservation stations so they all schedule concurrently and each scheduler   
   > algorithm is optimized for its FU.   
      
   Each entire pipelined sequence is optimized for its pipeline::   
      
                 | INT  RS |  INT   | Result|   
                 | MEM  RS |  AGEN  | Cache | LDalgn | result|   
        | DECODE | FMAC RS |  MUX   | MUL   | ADD    | NORM  | Result|   
                 | MISC RS |  stuff | Result|   
                 | BR   RS | Check  | backup|   
      
   > Also a wake-up matrix is not that complicated. I used the write of the   
   > destination Physical Register Number (PRN) as the wake-up signal.   
      
   Agreed: however, I pipelined the result delivery mechanism into 3 stages::   
   {tag, result, exception} with the following timing::   
      
        |   tag   |  tag+1  |  tag+2  |   
                  | result  | result+1| result+2|   
                       | excptn  | excptn+1| excptn+2|   
      
   Tag consists of {pRN, pValid; slot, CKid, cValid}   
   pRN    is the physical Register Number   
   pValid tells if you are writing the pRF   
   slot   is which FU   
   CKid   is which Insert Bundle
   cValid tells if {slot, CKid} is delivering a result   
      
   There is a case where an aRN is written more than once in a single Insert
   Bundle. In these cases, the result is delivered only to RS entries
   waiting on {slot, CKid}. No pRN is assigned to that result, only a
   {slot, CKid}; hence pValid.
      
   There is the case where {slot, CKid} is not delivering a result;   
   hence cValid. This is used for ST instructions to read pRF after   
   all exceptions in the bundle have accrued. This eliminates forwarding   
   on ST.data since all older results have been written into pRF.
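
A minimal sketch of how the two valid bits could gate wake-up, using the tag fields listed above. The `WaitingOperand` class and the rule that an operand waits on either a pRN or a {slot, CKid} pair (never both) are assumptions made for illustration, not a description of the actual hardware.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tag:
    pRN: int       # physical register number
    pValid: bool   # result is being written to the pRF under pRN
    slot: int      # which FU
    CKid: int      # which Insert Bundle
    cValid: bool   # {slot, CKid} is delivering a result


@dataclass
class WaitingOperand:
    # Assumption: an operand waits on exactly one of these two keys.
    pRN: Optional[int] = None
    slot_ck: Optional[Tuple[int, int]] = None

    def wakes_on(self, tag: Tag) -> bool:
        if self.pRN is not None:
            return tag.pValid and tag.pRN == self.pRN
        if self.slot_ck is not None:
            return tag.cValid and (tag.slot, tag.CKid) == self.slot_ck
        return False


# Result with no pRN assigned: visible only via {slot, CKid}.
no_prn = Tag(pRN=0, pValid=False, slot=2, CKid=5, cValid=True)
# Ordinary result written to the pRF.
normal = Tag(pRN=40, pValid=True, slot=1, CKid=5, cValid=True)
# ST: nothing is delivered, so nothing should wake.
store = Tag(pRN=0, pValid=False, slot=3, CKid=5, cValid=False)
```

With pValid clear, a stale pRN field in the tag cannot wake an unrelated operand; with cValid clear, the ST's {slot, CKid} wakes nothing.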
      
   The exception timing allows for direct mapped caches to deliver data   
   while checking for hit, and delivering miss after LD.data. It also   
   allows for instructions like FDIV to deliver a result and then change   
   its mind later. Mc88120 could deliver FDIV at cycle 12 with a 1/128
   chance of improper rounding, re-delivering the correctly rounded
   result in cycle 17. SQRT was similar.
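
To put a number on the FDIV figures above, here is a small worked calculation of the average delivery cycle. It assumes, purely for illustration, that a consumer of the early result pays nothing beyond waiting until cycle 17 when the result is re-delivered; any cost of replaying dependents is ignored.

```python
from fractions import Fraction


def expected_delivery_cycle(early=12, late=17, p_redeliver=Fraction(1, 128)):
    """Average cycle at which the correct result is available."""
    return early * (1 - p_redeliver) + late * p_redeliver


avg = expected_delivery_cycle()   # exact rational value
avg_float = float(avg)            # roughly 12.04 cycles
```

Under that assumption the speculative delivery averages about 12.04 cycles, against 17 if every FDIV waited for the final rounding.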
      
   The only real complication is that 1-cycle instructions have RS broadcast   
   the tag instead of the dedicated FU.   
      
   > Each PRN has a wire that runs to all RS and each operand waiting for
   > that PRN watches that wire for a pulse indicating the write result value   
   > will be forwarded in the next cycle on a dynamically assigned result bus.   
      
   When instructions are written (Insert) into RS, each operand contains
   the slot of the FU which will deliver that result. Thus, the operand
   capture portion only "looks" at one result bus for its data (Mc88120,
   1991).
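
The single-bus capture described above can be sketched as follows. The `Operand` class and the encoding of a bus beat as a (pRN, value) pair are hypothetical; the point is only that the slot recorded at Insert selects one bus, so no comparison against the other buses is needed.

```python
class Operand:
    def __init__(self, slot, pRN):
        self.slot = slot    # FU slot recorded when the instruction was inserted
        self.pRN = pRN      # register this operand is waiting for
        self.value = None

    def capture(self, buses):
        """buses: list indexed by slot; each entry is None or (pRN, value)."""
        beat = buses[self.slot]          # the only bus this operand examines
        if beat is not None and beat[0] == self.pRN:
            self.value = beat[1]
        return self.value is not None


# One cycle's worth of result buses for three FU slots.
buses = [None, (40, 123), (41, 7)]

hit = Operand(slot=1, pRN=40)
miss = Operand(slot=0, pRN=41)   # pRN 41 is on bus 2, which this operand ignores
hit_ok = hit.capture(buses)
miss_ok = miss.capture(buses)
```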
      
   > The RS operand can either save a copy of the value or launch execution   
   > immediately if all resources are available.   
   >   
   > My design appears to be similar to issue logic for   
   > RISC-V Berkeley Out-of-Order Machine (BOOM). As they note, schedulers   
   > are simple and different kinds can be used for different FU.   
   > My ALU used simple round-robin whereas Branch Unit BRU is age ordered.   
   > This is simple to do as each scheduler only looks at its own RS bank.   
      
   I always considered the FU scheduler to be the RS asking "everybody ready?"
   OK, "let's choose the oldest ready instruction !!" That is, each FU has
   a dedicated RS on its front, and a dedicated result bus at its rear.
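
A minimal sketch of that "oldest ready" pick over one FU's RS bank. The `RSEntry` fields and the use of a sequence number as the age are assumptions for illustration; hardware would typically use an age matrix or similar rather than a sort.

```python
from dataclasses import dataclass


@dataclass
class RSEntry:
    age: int       # assumed sequence number; smaller means older
    ready: bool    # all operands captured
    op: str


def pick_oldest_ready(bank):
    """Return the oldest ready entry in this FU's bank, or None."""
    ready = [e for e in bank if e.ready]
    return min(ready, key=lambda e: e.age) if ready else None


bank = [RSEntry(7, True, "add"), RSEntry(3, False, "mul"), RSEntry(5, True, "sub")]
chosen = pick_oldest_ready(bank)   # the age-3 entry is older but not ready
```

Because each scheduler only sees its own bank, a different policy (such as the round-robin EricP mentions for his ALU) can be swapped in per FU without touching the others.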
      
   > https://docs.boom-core.org/en/latest/sections/issue-units.html   
      
   The BOOM front end seems to have a lot more cycles than what is required.   
      
   I am working on a 6-wide GBOoO implementation, and FETCH-PARSE-DECODE-INSERT
   is only 3½ cycles. If RS does not launch an instruction, the decoded
   instruction can begin executing in that 4th cycle {INSERT can be EXECUTE},
   delivering its result in cycle 5.
      