

   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 130,575 of 131,241   
   BGB to Anton Ertl   
   Re: Variable-length instructions (1/2)   
   19 Dec 25 15:07:38   
   
   From: cr88192@gmail.com   
      
   On 12/18/2025 4:25 PM, Anton Ertl wrote:   
   > John Savard  writes:   
   >> Given the great popularity of the RISC architecture, I assumed that one of   
   >> its characteristics, instructions that are all 32 bits in length, produced   
   >> a great increase in efficiency over variable-length instructions.   
   >   
   > Some RISCs have that, some RISCs have two instruction lengths: 16 bits   
   > and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,   
   > but then eliminated in Power), one variant of Berkeley RISC, ARM T32,   
   > RISC-V with the C extension, and probably others.   
   >   
      
   I have come to realize that 32/64 is probably better than 16/32 here,
   primarily in terms of performance, but it also helps with code density
   (a pure 32/64 encoding scheme can beat 16/32 in code density despite
   only having larger instructions available).
      
   One could argue "But MOV is less space efficient"; to which one can
   note that it also makes sense to design the compiler to minimize the
   number of unnecessary MOV instructions and similar (and when using the
   minimal number of register moves, the lack of a small MOV encoding has
   less effect on code density).
      
      
   16/32/64 is also sensible, but the existence of 16-bit ops negatively
   affects encoding space (it is more of a strain to have both 16-bit ops
   and 6-bit register fields; but at least some code can benefit from
   having 64 GPRs).
      
   So, say:   
      16/32: RV64GC (OK code density)   
      16/32/64: RV64GC+JX: Better code density than RV64GC.   
      32/64: RV64G+JX (seemingly slightly beats RV64GC)   
        But, not as much as GC+JX.   
      16/32/64/96: XG1 (still best for code density).
      32/64/96: XG2 and XG3;   
        Also good for code density;   
        Somehow XG3 loses to XG2 despite being nearly 1:1;   
        Though, XG3 has mostly claimed the performance crown.   
      
   Or, descending, code-density:   
      XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G   
   And, performance:   
      XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC   
      
      
   Here, both the 16-bit ops and some missing features (in RV64G and
   RV64GC) negatively affect things.
      
   Where the main things benefiting JX here are:
      Jumbo prefixes, extending Imm12/Disp12 to 33 bits;   
      Indexed Load/Store;   
      Load/Store Pair;   
      Re-adding ADDWU/SUBWU and similar.   
        The Zba instructions also help,
          but Load/Store Pair greatly reduces the effect of Zba.
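   As a rough sketch of the jumbo-prefix idea (the 21-bit prefix payload
   and its position here are my own assumptions for illustration, not the
   actual JX field layout): a prefix carrying extra immediate bits glued
   above a base instruction's Imm12 yields a 33-bit immediate, which is
   then sign-extended for use:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a jumbo prefix extending Imm12 to 33 bits.
 * The 21-bit prefix payload width and placement are assumptions for
 * illustration; the real JX encoding may differ. */
static int64_t compose_imm33(uint32_t prefix_imm21, uint32_t base_imm12)
{
    uint64_t imm33 = ((uint64_t)(prefix_imm21 & 0x1FFFFFu) << 12)
                   | (base_imm12 & 0xFFFu);
    return ((int64_t)(imm33 << 31)) >> 31;  /* sign-extend from bit 32 */
}
```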
      
      
   It would be possible to get better code density than 'C' with some tweaks:   
      Reducing many of the imm/disp fields by 1 bit;   
        Would free up a lot of encoding space.   
        Imm6/Disp6 eats too much encoding space here.   
      Making most of the register fields 4 bits (X8..X23)   
        Can improve hit-rate notably over Reg3.   
      
   But:   
      Main merit of 'C' is compatibility with binaries that use 'C';   
      This merit would be lost by modifying or replacing 'C'.   
      
      
   Had mostly ended up leaving out 96-bit encodings for RV+JX, because
   the encoding scheme kinda ruins it in this case (there is not really a
   good way to fix RISC-V's encodings to make them not suck).
      
   Tempting to consider collapsing the 80+ bit space into 96 or 96/128.   
      
   So, say:   
       xx-xxx-00: 16-bit   
       xx-xxx-01: 16-bit   
       xx-xxx-10: 16-bit   
       xx-xxx-11: 32+ bits   
       x0-111-11: 48 bits   
       01-111-11: 64 bits   
       11-111-11: 80+ bits (uses Func3 for 16-bit count)   
   But, say:   
       11-111-11: 96 bits   
   Or, say:   
       ...-xx0-nnnnn-11-111-11:  96 bits   
       ...-001-nnnnn-11-111-11: 128 bits   
       ...-011-nnnnn-11-111-11: 192 bits (eventual)   
       ...-101-nnnnn-11-111-11: 256 bits (eventual)   
       ...-111-nnnnn-11-111-11: 384 bits (eventual)   
      
   Replacing: 80/96/112/128/144/160/176/192.   
      Don't really need fine grained bucket sizes for large encodings.   
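   For illustration, a length decoder for this collapsed scheme could
   look like the following (assuming the simplified layout above, where
   11-111-11 always means 96 bits, reading the low bits of the first
   16-bit parcel):

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of an instruction, given its first 16-bit parcel.
 * Follows the low-bit scheme sketched above, with the proposed
 * simplification that 11-111-11 always means 96 bits. */
static int insn_len_bytes(uint16_t parcel)
{
    if ((parcel & 0x03) != 0x03) return 2;  /* xx-xxx-00/01/10: 16-bit */
    if ((parcel & 0x1C) != 0x1C) return 4;  /* xx-nnn-11, nnn!=111: 32-bit */
    if ((parcel & 0x20) == 0)    return 6;  /* x0-111-11: 48-bit */
    if ((parcel & 0x40) == 0)    return 8;  /* 01-111-11: 64-bit */
    return 12;                              /* 11-111-11: 96-bit */
}
```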
      
   Where, having a few more bits for 96-bit ops makes them more usable.   
      The main use for 96-bit here being for possible Imm64 ALU encodings.   
      
   At least with my existing JX scheme, there is not enough encoding space   
   to allow for Imm64 ALU encodings (which reduces the usefulness of having   
   96-bit encodings).   
      
      
      
      
   >> Before modern pipelined computers, which have multi-stage pipelines for   
   >> instruction _execution_, a simple form of pipelining was very common -   
   >> usually in the form of a three-stage fetch, decode, and execute pipeline.   
   >> Since the decoding of instructions can be so neatly separated from their   
   >> execution, and thus performed well in advance of it, any overhead   
   >> associated with variable-length instructions becomes irrelevant because it   
   >> essentially takes place very nearly completely in parallel to execution.   
   >   
   > It is certainly possible to decode potential instructions at every   
   > starting position in parallel, and later select the ones that actually   
   > correspond to the end of the previous instruction, but with 16-bit and   
   > 32-bit instructions this potentially doubles the amount of instruction   
   > decoders necessary, plus the circuit for selecting the ones that are   
   > at actual instruction starts.  I guess that this is the reason why ARM   
   > uses an uop cache in cores that can execute ARM T32.  The fact that   
   > more recent ARM A64-only cores have often no uop cache while their   
   > A64+T32 predecessors have had one reinforces this idea.   
   >   
      
   I took the option of not bothering with parallel execution for 16-bit ops.   
      
   This does leave both XG1 and RV64GC (when using Compressed encodings) at   
   a performance disadvantage. But, dealing with superscalar decoding for   
   16-bit ops would add too much cost here.   
      
   For an ISA like RV64GC, it could be possible in theory (if the compiler   
   knows which functions are in the hot and cold paths) to use 16-bit   
   encodings in the cold path but then only 32-bit encodings in the hot   
   path (which also need to be kept 32-bit aligned).   
      
      
   Even if 16-bit ops could be superscalar though, the benefits would be   
   small: Code patterns that favor 16-bit ops also tend to be lower in   
   terms of available ILP.   
      
   Or, the reverse:   
   Patterns that maximize ILP (such as unrolling and modulo-scheduling   
   loops) tend to be hostile to the constraints of 16-bit encoding schemes.   
      
      
   Decoding at 2 or 3 wide seems to make the most sense:   
      Gets a nice speedup over 1;   
      Works with in-order.   
      
   Here, 3 is slightly better than 2.   
   But getting much benefit from going any wider than this is likely to
   require some amount of "heavy lifting".
      
   So, while a 4 or 5 wide in-order design could be possible, pretty much   
   no normal code is going to have enough ILP to make it worthwhile over 2   
   or 3.   
      
   Also 2 or 3 works reasonably well with a 96-bit fetch:   
     Can do 1x: 32/64/96   
     Can do 2x or 3x 32-bit;   
     Could do (potentially) 32+64 or 64+32.   
       64-bit ops being somewhat less common than 32 bit ops.   
       96-bit ops are statistically infrequent.   
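   As a quick sanity check of the combinations above: with 32/64/96-bit
   ops (1, 2, or 3 words), there are exactly four ways to exactly fill a
   3-word fetch (3x32, 32+64, 64+32, 1x96). A small recursive count
   confirms this:

```c
#include <assert.h>

/* Count the ways ops of 1..3 words (32/64/96-bit) can exactly fill
 * a fetch bundle of the given width in 32-bit words. */
static int count_fills(int words_left)
{
    if (words_left == 0)
        return 1;
    int n = 0;
    for (int len = 1; len <= 3 && len <= words_left; len++)
        n += count_fills(words_left - len);
    return n;
}
```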
      
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca