comp.arch
BGB to Anton Ertl
Re: Variable-length instructions (1/2)
29 Dec 25 19:54:41
   From: cr88192@gmail.com   
      
   On 12/29/2025 12:35 PM, Anton Ertl wrote:   
   > EricP  writes:   
   >> Thomas Koenig wrote:   
   >>> Using a primitive Perl script to catch occurrences, on a recent
   >>> My 66000 compiler, of the shape
   >>>   
   >>> 	[op] Ra,Ra,Rb   
   >>> 	[op] Ra,Rb,Ra   
   >>> 	[op] Ra,#n,Ra   
   >>> 	[op] Ra,Ra,#n   
   >>> 	[op] Ra,Rb   
   >>>   
   >>> where |n| < 32, which could be a reasonable approximation of a   
   >>> compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)   
   >>> and 23.9% (GSL) of such instructions.  Potential space savings   
   >>> would be a bit less than half that.   
   >>>   
   >>> Better compression schemes are certainly possible, but I think the   
   >>> disadvantages of having more complex encodings outweigh any   
   >>> potential savings in instruction size.   
   >   
   > The RISC-V people brag about how little their compressed encoding   
   > costs to decode; IIRC it's in the hundreds of something (not sure if   
   > transistors or gates).  Of course, with superscalar decoding the   
   > compressed instruction set costs additional decoders plus logic to   
   > select which decodings do not belong to actual instructions, but   
   > that's true for any 16+32-bit encoding, however simple.   
   >   
      
   It has its own hair:   
   Multiple schemes for encoding immediate values and displacements;   
   Multiple ways to encode register fields;   
   Each type of Load/Store instruction effectively has its own displacement   
   encoding;   
   ...   
      
   I am skeptical of it being the cheapest possible, or the best possible.   
      
   But, not much viable reason to change it either:   
      Main reason to have it is compatibility;   
      Compatibility would be lost with any notable design change.   
      
      
      
   Granted, had noted that the decoders for my own ISA are more expensive   
   and have worse timing at present than the RISC-V decoders, but this   
   includes XG1+XG2+XG3.   
      
   Much simplification and cost reduction would be possible if XG1 and XG2   
   were dropped. It is possible if I did a new core, I might consider   
   making RISC-V the primary ISA with XG3 as the secondary ISA. Would   
   likely keep much of the low-level architecture similar though, with some   
   level of firmware-level wonk.   
      
   Keeping XG3 around does still make some sense:   
      Better performance than RISC-V;   
      Better code density than RISC-V when under similar constraints;   
      Has SIMD that isn't complicated and expensive.   
        If I were to support RV-V,   
           it is likely to be via traps or hot-patching.   
      ...   
      
      
   There are pros and cons to the fact that standard Linux distros seem to
   assume EFI rather than direct hardware control in many cases.
      
   Where, EFI allows providing more abstraction, but couldn't really be fit   
   into a 32K or 48K ROM space. I suspect I would more likely need upwards   
   of 200K to pull this off if EFI were done in ROM, though could probably   
   stick with the existing 32K ROM if its main purpose is to load an image   
   from an SDcard or similar.   
      
   On some boards (such as the Nexys 7) it is possible in theory to load   
   both the CPU's bitstream and possibly the EFI firmware into an on-board   
   QSPI Flash module and leave the SDcard mostly for the OS proper (vs   
   generally having both the bitstream and possible BIOS on the SDcard).   
      
      
   In some ways, firmware can hide that not even all of RV64G is   
   implemented in hardware, because some parts either can't be implemented   
   effectively, or don't make sense from a cost/benefit POV to support   
   natively.   
      
      
   >> So the % numbers you measured might just be coincidence and could be low.   
   >> An ISA with both short 2- and long 3- register formats like RV where there   
   >> is an incentive to do this optimization might provide stats confirmation.   
   >   
   > I have done the following on a RV64GC system with Fedora 33:   
   >   
   > objdump -d /lib64/lp64d/libperl.so.5.32|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
   >   215782 4   
   >   179493 8   
   >   
   > 16-bit instructions are reported as 4 (4 hex digits), 32-bit   
   > instructions are reported as 8.   
   >   
   > If the actual binary /usr/bin/perl is meant, here's the stats for that:   
   >   
   > objdump -d /usr//bin/perl|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
   >      105 4   
   >      167 8   
   >   
   > gnuplot is not installed, and GSL is not installed, either, whatever   
   > it may be.   
   >   
   > Just to widen the basis, here are a few more:   
   >   
   > zstd:   
   >   129569 4   
   >   134985 8   
   >   
   > git:   
   >   305090 4   
   >   274053 8   
   >   
   > /usr/lib64/libc-2.32.so:   
   >   142208 4   
   >   113455 8   
   >   
   > So the percentage of 16-bit instructions is a lot higher than for the   
   > schemes that Thomas Koenig has looked at.   
   >   
      
   In my own testing, was seeing usually around:   
      60% 32-bit   
      40% 16-bit   
   Resulting in typically around a 20% reduction in code size (vs RV64G).   
      
   At least, with a compiler that doesn't specifically tailor its code   
   generation to favor RV-C (and/or code that fits RV-C's patterns).   
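
   As a rough check on the arithmetic, here is a minimal Python sketch.
   It assumes a one-to-one mapping (each 16-bit instruction would
   otherwise have been exactly one 32-bit instruction), which is a
   simplification and not something measured here. Under that assumption
   the size reduction is simply half the 16-bit share, so a 40/60 mix
   gives 20%. The function names are made up for illustration; the counts
   are the ones quoted above.

```python
def share_16bit(n16: int, n32: int) -> float:
    """Fraction of instructions that use the 16-bit encoding."""
    return n16 / (n16 + n32)


def size_reduction(n16: int, n32: int) -> float:
    """Fraction of code bytes saved versus an all-32-bit encoding.

    Assumption (not from the post): every 16-bit instruction would
    otherwise be exactly one 32-bit instruction.
    """
    total = n16 + n32
    if total == 0:
        raise ValueError("no instructions counted")
    mixed_bytes = 2 * n16 + 4 * n32   # actual 16/32 mix
    fixed_bytes = 4 * total           # same instruction count, all 32-bit
    return 1.0 - mixed_bytes / fixed_bytes


# (16-bit count, 32-bit count), as quoted above from the objdump runs.
QUOTED_COUNTS = {
    "libperl.so.5.32": (215782, 179493),
    "zstd": (129569, 134985),
    "git": (305090, 274053),
    "libc-2.32.so": (142208, 113455),
}

if __name__ == "__main__":
    print(f"40/60 mix: {size_reduction(40, 60):.1%} smaller")
    for name, (n16, n32) in QUOTED_COUNTS.items():
        print(f"{name}: {share_16bit(n16, n32):.1%} 16-bit, "
              f"{size_reduction(n16, n32):.1%} smaller")
```

   The same relation is why a scheme that catches around 15% of
   instructions can save at most about half of that in bytes.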
      
      
   One usual downside is that to utilize a 16-bit ISA with a smaller   
   register space, one needs to reuse registers more frequently, which then   
   reduces ILP due to register conflicts. So, smaller code at the expense   
   of worse performance.   
      
      
      
   For XG1, it was possible to tune things for a higher percentage of   
   16-bit ops. Though in this case it meant largely limiting things to the   
   low 16 registers except in "higher register pressure" scenarios, but   
   this negatively affects speed.
      
   So, say (for XG1):   
      R2..R15 only: Used as the default scheme;   
        Had 6 scratch registers, 7 callee save, 3 SPR.   
      R16..R31: Enabled if register pressure exceeds a threshold;   
      R32..R63: Only enabled under very high register pressure.   
      
   The threshold for enabling R16..R31 differed some based on
   optimization level (raised with size optimization, lowered with speed   
   optimization). Threshold for R32..R63 needed to be kept higher, as much   
   of the ISA only natively supported the first 32 GPRs (for other parts of   
   the ISA, using the high 32 registers would require 64-bit encodings).   
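
   The tiered scheme above could be sketched roughly as follows. This is
   only an illustration, not the actual compiler logic: the pressure
   metric (say, a count of simultaneously live values), the threshold
   numbers, and all names are hypothetical. Only the three tiers and the
   direction of the adjustment (raised for size, lowered for speed) come
   from the description above.

```python
from enum import Enum


class Opt(Enum):
    SIZE = "size"
    DEFAULT = "default"
    SPEED = "speed"


# Hypothetical numbers, chosen only to make the sketch runnable.
BASE_MID_THRESHOLD = 12    # above this, R16..R31 become available
HIGH_THRESHOLD = 40        # above this, R32..R63 become available
MID_ADJUST = {Opt.SIZE: +4, Opt.DEFAULT: 0, Opt.SPEED: -4}


def allowed_registers(pressure: int, opt: Opt = Opt.DEFAULT) -> range:
    """Return the GPR numbers the allocator may use for a function.

    `pressure` is an assumed per-function estimate of register pressure.
    """
    mid = BASE_MID_THRESHOLD + MID_ADJUST[opt]
    if pressure > HIGH_THRESHOLD:
        return range(2, 64)    # R2..R63: may need 64-bit encodings
    if pressure > mid:
        return range(2, 32)    # R2..R31
    return range(2, 16)        # R2..R15: favors 16-bit ops


if __name__ == "__main__":
    for opt in Opt:
        regs = allowed_registers(14, opt)
        print(f"{opt.value}: R{regs.start}..R{regs.stop - 1}")
```

   With these made-up thresholds, a function with a pressure of 14 stays
   in R2..R15 under size optimization but gets R2..R31 under speed
   optimization.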
      
      
   As noted, size optimization favors size, performance optimization
   favors performance, and in some places they are at odds with each other.
      
   Ironically, even when the binary is dominated by 16-bit ops, the   
   relative code-size reductions are modest; and a fixed-length 16-bit ISA   
   can actually be worse here than a 16/32 ISA.   
      
      
   Something similar happens with RISC-V, except that, ironically, the
   limitations of RV-C can negatively affect size optimization as well. It
   is almost like there is one place RV-C does well:
      Small leaf functions.   
   Everywhere else, it is weaker.   
      
   And RV-C is basically too weak/limited to be used by itself as a primary   
   ISA (unlike either Thumb, or XG1's 16-bit ops).   
      
      
      
   For XG2, it made sense to use a different scheme:   
      R2..R31: Always enabled by default;   
      R32..R63: Enabled for high register pressure.   
      
      
   [continued in next message]   
      