comp.arch
Message 130,587 of 131,241
BGB to MitchAlsup
Re: Variable-length instructions (3/3)
20 Dec 25 02:09:24
   [continued from previous message]   
      
   But, this sort of code tends not to map over to RV-C instructions all   
   that well.   
      
   So, one ends up with separate domains of:   
   Parallel stuff where the superscalar nails it, but almost entirely   
   32-bit ops;   
   Code with a lot of 16-bit ops that wouldn't do so well even if the   
   superscalar worked on RV-C ops.   
      
   > ------------------------------   
   >> Decoding at 2 or 3 wide seems to make the most sense:   
   >>     Gets a nice speedup over 1;   
   >>     Works with in-order.   
   >>   
   >> Here, 3 is slightly better than 2.   
   >> But, getting that much benefit from going any wider than this, is likely   
   >> to require some amount of "heavy lifting".   
   >   
   > Probably not conducive to FPGA implementations due to LUT count and   
   > special memories {predictors, ..., TLBs, staging buffers, ...}   
   >   
      
   Yeah.   
      
   The hardware in my case is still pretty dumb.   
   It leans more on using lots of registers and allocating them round-robin   
   so that ILP is good, because RAW dependencies can really mess up ILP.   
      
   Round-robin register allocation eats a lot of registers though.   
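
   As a rough illustration of the trade-off above, here is a toy in-order
   issue model. Everything in it is an assumption made up for the sketch
   (2-wide issue, a 3-cycle result latency, the register names, the
   load-then-use pattern); it is not a model of my actual core. It only
   shows how spreading temporaries over several registers lets independent
   work sit between a producer and its consumer, at the cost of more live
   registers.

```python
def run_in_order(instrs, width=2, latency=3):
    """Cycles needed to issue `instrs` on a toy in-order machine.

    instrs: list of (dest, srcs) register-name tuples. A result becomes
    readable `latency` cycles after its producer issues. There is no
    renaming and no reordering, so a RAW stall blocks everything behind it.
    """
    ready = {}            # register -> first cycle its value can be read
    cycle, slots = 0, 0   # current issue cycle, slots already used in it
    for dest, srcs in instrs:
        earliest = max((ready.get(s, 0) for s in srcs), default=0)
        if slots == width:       # current cycle is full
            cycle, slots = cycle + 1, 0
        if earliest > cycle:     # RAW stall: wait for the operand
            cycle, slots = earliest, 0
        ready[dest] = cycle + latency
        slots += 1
    return cycle + 1


def single_temp(n):
    """load t; use t; load t; use t; ... (one temporary, reused)."""
    out = []
    for i in range(n):
        out.append(("t", (f"p{i}",)))
        out.append((f"o{i}", ("t",)))
    return out


def round_robin(n, k):
    """Groups of k loads into k distinct temporaries, then the k uses."""
    out = []
    for base in range(0, n, k):
        group = range(base, min(base + k, n))
        out += [(f"t{i % k}", (f"p{i}",)) for i in group]
        out += [(f"o{i}", (f"t{i % k}",)) for i in group]
    return out


if __name__ == "__main__":
    print("single temp:", run_in_order(single_temp(8)))
    print("round robin:", run_in_order(round_robin(8, 4)))
```

   With these made-up parameters the single-temporary version takes 25
   cycles and the 4-register round-robin version takes 10, but it keeps
   four temporaries live at once instead of one.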
      
      
      
   >> So, while a 4 or 5 wide in-order design could be possible, pretty much   
   >> no normal code is going to have enough ILP to make it worthwhile over 2   
   >> or 3.   
   >   
   > 1-wide 0.7 IPC   
   > 2-wide 1.0 IPC gain of 50%   
   > 3-wide 1.4 IPC gain of 40%   
   > 6-wide 2.2 IPC gain of 50% from doubling the width   
   > 10wide 3.2 IPC gain of 50% from almost doubling width   
   >   
      
      
   Depends a lot on the code, but yeah, I have seen enough gains to say that   
   3 is a benefit over 2, but 4 or 5 hits a bottleneck.   
      
   Would need to have wackiness like register renaming and similar.   
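
   For reference, the step-to-step gains implied by the quoted width/IPC
   figures can be recomputed directly. This is just arithmetic on the
   numbers as quoted (the percentages in the quote look rounded); the
   function names are my own.

```python
# (issue width, IPC) pairs exactly as quoted above.
POINTS = [(1, 0.7), (2, 1.0), (3, 1.4), (6, 2.2), (10, 3.2)]


def step_gains(points):
    """Relative IPC gain of each row over the previous row, in percent."""
    return [
        (w1, round(100.0 * (ipc1 / ipc0 - 1.0), 1))
        for (_, ipc0), (w1, ipc1) in zip(points, points[1:])
    ]


def ipc_per_lane(points):
    """IPC divided by issue width: how well each added lane is used."""
    return [(w, round(ipc / w, 2)) for w, ipc in points]


if __name__ == "__main__":
    for width, gain in step_gains(POINTS):
        print(f"{width:>2}-wide: +{gain}% over the previous row")
    for width, util in ipc_per_lane(POINTS):
        print(f"{width:>2}-wide: {util} IPC per lane")
```

   The per-lane figure drops steadily as width goes up, which is the
   diminishing-returns point being made either way.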
      
      
      
   >> Also 2 or 3 works reasonably well with a 96-bit fetch:   
   >   
   > But fetches are 128 bits wide!!! And the average instruction is 35 bits wide.   
   > ------------------------   
      
   In x86, yes, maybe.   
      
      
   For my CPU core, fetch is 96 bits.   
      If you fetch a 96-bit instruction, it only fetches 1 instruction;   
      So, currently superscalar only happens with narrower instructions.   
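
   A minimal sketch of that grouping rule, assuming only 32/64/96-bit
   encodings and that only 32-bit encodings may share a fetch (up to 3 of
   them, which fills the 96 bits). The function name and the example
   stream are made up, and the sketch ignores the dependency checks that
   would also limit co-issue.

```python
FETCH_BITS = 96


def fetch_groups(lengths, max_issue=3):
    """Group consecutive instruction lengths (in bits) into fetch groups.

    Assumed rule: a 64-bit or 96-bit encoding always issues alone, while
    up to `max_issue` consecutive 32-bit encodings can go together.
    """
    groups, cur = [], []
    for n in lengths:
        if n not in (32, 64, 96):
            raise ValueError(f"unsupported instruction length: {n}")
        if n != 32:
            if cur:                 # flush pending 32-bit group first
                groups.append(cur)
                cur = []
            groups.append([n])      # wide encoding goes by itself
        else:
            if len(cur) == max_issue:
                groups.append(cur)
                cur = []
            cur.append(n)
    if cur:
        groups.append(cur)
    return groups


if __name__ == "__main__":
    stream = [32, 32, 32, 64, 32, 32, 96, 32]
    for g in fetch_groups(stream):
        print(g, "uses", sum(g), "of", FETCH_BITS, "bits")
```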
      
      
   For XG3, I am coming out with an average-case instruction size of 32.2 bits.   
      
   For RV64G+JX: 33.92 bits.   
      
   Mostly because RV64G+JX seems to have a larger proportion of Jumbo prefixes.   
      
      
   However, superscalar only on 32-bit encodings works out OK as jumbo   
   prefixed instructions are a relative minority of the total here.   
      
      
   It is more with RV-C, where ~30% or so of the instructions become 16-bit   
   and the 32-bit instructions are misaligned roughly half the time, that   
   superscalar gets wrecked.   
      
   Granted, one can argue that superscalar that doesn't deal with 16-bit   
   ops, misaligned ops, or instruction sequences crossing cache-line   
   boundaries, is maybe kinda lame...   
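
   The "roughly half" figure is about what one would expect from a simple
   model. The sketch below assumes each instruction is independently
   16-bit with probability 0.30 (real code is not this random, and branch
   targets and function alignment are ignored), and counts how often a
   32-bit instruction starts at an address that is 2 mod 4.

```python
import random


def misaligned_fraction(p16=0.30, n=200_000, seed=1):
    """Fraction of 32-bit instructions that start at address 2 (mod 4)
    in a random stream where each instruction is 16-bit with
    probability `p16`."""
    rng = random.Random(seed)
    offset = 0                  # start address modulo 4: always 0 or 2
    wide = misaligned = 0
    for _ in range(n):
        if rng.random() < p16:
            offset = (offset + 2) % 4    # a 16-bit op flips the alignment
        else:
            wide += 1
            misaligned += (offset == 2)  # 32-bit op not on a 4-byte boundary
            # a 32-bit op advances by 4 bytes, so the offset is unchanged
    return misaligned / wide


if __name__ == "__main__":
    print(f"misaligned 32-bit ops: {misaligned_fraction():.3f}")
```

   Each 16-bit op flips the alignment and each 32-bit op preserves it, so
   for any non-trivial mix the two alignments end up about equally likely
   in the long run.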
      
      
      
   >> One trick here could be to precompute a lot of this when fetching cache   
   >> lines, though a full instruction length could not be determined at fetch   
   >> time if the instruction crosses a cache line unless we have also fetched   
   >> the next cache line. Full instruction length could be determined in   
   >> advance (at fetch time) if it always fetches both cache-lines and then   
   >> determines the lengths for one of them before writing to the cache   
   >> (possibly if the next line is fetched, its contents are not written to   
   >> the cache as lengths can't be fully determined yet).   
   >   
   > All of the above was solved in Athlon, and then made 3× smaller in Opteron   
   > at the cost of 1 pipe stage in DECODE.   
      
   Maybe, but I don't know how they implemented it. I am just sorta   
   guessing how it could be implemented assuming I were interested in   
   implementing an x86 core...   
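
   To make the guess a bit more concrete, here is a sketch of marking
   instruction starts when a cache line is filled. The encoding is a toy
   one invented for the example (it is not x86, and not how Athlon or
   Opteron did it): a first byte other than 0x0F gives a length of
   (byte & 3) + 1, while 0x0F is an escape whose length depends on the
   following byte. That is enough to show the case where the length cannot
   be resolved without the next line.

```python
LINE_BYTES = 16
ESCAPE = 0x0F   # hypothetical escape byte: length depends on the next byte


def predecode_line(line, entry=0, nxt=None):
    """Mark instruction start offsets in one cache line.

    line:  bytes of the line being filled
    entry: offset of the first instruction that starts in this line
    nxt:   bytes of the following line, or None if it was not fetched

    Returns (starts, carry). `carry` is the entry offset for the next
    line, or None when the last instruction's length is still unknown.
    """
    starts, pos = [], entry
    while pos < len(line):
        starts.append(pos)
        first = line[pos]
        if first != ESCAPE:
            length = (first & 3) + 1
        else:
            if pos + 1 < len(line):
                second = line[pos + 1]
            elif nxt:
                second = nxt[0]         # peek into the next line
            else:
                return starts, None     # cannot finish without next line
            length = 2 + (second & 3)
        pos += length
    return starts, pos - len(line)


if __name__ == "__main__":
    line = bytes([0x00] * (LINE_BYTES - 1) + [ESCAPE])
    nxt = bytes([0x03] + [0x00] * (LINE_BYTES - 1))
    print(predecode_line(line))            # last length unresolved
    print(predecode_line(line, nxt=nxt))   # resolved using the next line
```

   In this toy case the start marks for the line are always known; what is
   missing without the next line is only where the following line's first
   instruction begins.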
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)