

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 130,586 of 131,241   
   BGB to MitchAlsup   
   Re: Variable-length instructions (2/3)   
   20 Dec 25 02:09:24   
   
   [continued from previous message]   
      
      XG1 still has the best code density (smallest binaries);   
      XG3 currently has the best performance, but worst code density.   
      
   As noted, Doom ".text" sizes (BGBCC, re-running some of them):   
      XG1      : 276K   
      XG2      : 296K   
      RV64GC+JX: 302K   
      XG3      : 320K   
      RV64G+JX : 340K   
      RV64GC   : 370K   
      RV64G    : 440K   
      
      
   Though, everything is pretty close together here (sizes and rankings   
   tend to jostle around some).   
      
      
      
   Vs:   
     GCC/ELF, RV64GC: 1166K (*)   
     GCC ELF, x86-64:  480K   
     MSVC EXE, X64  :  770K   
      
   (*): The GCC ELF binary seems to contain significant amounts of   
   metadata. More space burned on symbol tables than on the code itself.   
      
      
   >> So, say:   
   >>     16/32: RV64GC (OK code density)   
   >>     16/32/64: RV64GC+JX: Better code density than RV64GC.   
   >>     32/64: RV64G+JX (seemingly slightly beats RV64GC)   
   >>       But, not as much as GC+JX.   
   >>     16/32/64/96: XG1 (still best for code density).
   >>     32/64/96: XG2 and XG3;   
   >>       Also good for code density;   
   >>       Somehow XG3 loses to XG2 despite being nearly 1:1;   
   >>       Though, XG3 has mostly claimed the performance crown.   
   >>   
   >> Or, descending, code-density:   
   >>     XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G   
   >> And, performance:   
   >>     XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC   
   >   
   > Rather than tracking code density--which measures cache performance--
   > I have come to think that counting instructions themselves is the key.
   > If the instruction is present then it has to be executed, if not, then
   > it was free !! in all real senses.   
   >   
   >> Where, both the 16-bit ops, and some lacking features (in RV64G and
   >> RV64GC), negatively affect things.
   >   
   > Like a reasonable OpCode layout.   
   >   
      
   I suspect even without it, they likely still would have turned Imm/Disp   
   fields into confetti.   
      
      
   Does make me half wonder how an ISA would look on average if the design   
   approach consisted of drawing cards and rolling a d20 to assign bits   
   ("roll d20 for where the register field goes" or "roll d20 for which   
   immediate bit comes next").   
      
      
      
   >> Where, the main things that benefit JX here being:   
   >>     Jumbo prefixes, extending Imm12/Disp12 to 33 bits;   
   > I have no prefixes {well CARRY}   
   > ±Imm5, Imm16, Imm32, Imm64, Disp16, Disp32, Disp64   
      
   I used prefix encodings both for my own ISA and extending RISC-V.   
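    As a rough illustration of the mechanism (field widths here are
    assumptions for the sketch, not the actual JX bit layout): a prefix
    carrying 21 extra immediate bits, glued above a base Imm12, yields the
    33-bit immediate mentioned above.

```c
#include <stdint.h>

/* Sketch only: fuse a hypothetical 21-bit jumbo-prefix payload with a base
   instruction's 12-bit immediate field, then sign-extend the 33-bit result.
   Field widths are assumed for illustration, not the real JX encoding. */
static int64_t fuse_imm33(uint32_t prefix21, uint32_t imm12)
{
    uint64_t raw = ((uint64_t)(prefix21 & 0x1FFFFF) << 12) | (imm12 & 0xFFF);
    return ((int64_t)(raw << 31)) >> 31;   /* sign-extend from bit 32 */
}
```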
      
      
   >>     Indexed Load/Store;   
   > check   
      
   I will probably at some point need to switch to the Zilx encodings.   
      
    Probably once it gets official approval as an extension, I will go over
    to it. Though, the original proposal was reduced to only having
    Load-Indexed, as the RISC-V ARC people really dislike Indexed-Store, and
    Indexed-Store also has a lower usage frequency than Indexed-Load.
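    For reference, the address arithmetic an indexed load/store folds into
    one instruction (plain C, nothing ISA-specific):

```c
#include <stdint.h>

/* The address computation an indexed load/store performs in one step:
   base + (index << scale). On plain RV64G this costs a separate SLLI + ADD
   (or a single Zba SHxADD) before the actual load or store. */
static uint64_t indexed_addr(uint64_t base, uint64_t index, unsigned scale)
{
    return base + (index << scale);
}
```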
      
      
   >>     Load/Store Pair;   
   > LDM, STM, ENTER, EXIT, MM, MS   
      
   For JX, only had Load/Store Pair...   
      
      
   >>     Re-adding ADDWU/SUBWU and similar.   
   > {int,float}×{OpCode}×{Byte, Half, Word, DBLE}   
      
   ADDWU/SUBWU are limited in scope.   
      
   In my own ISA, I had the same functionality as ADDU.L and SUBU.L, ...   
      
   They existed in early BitManip but were dropped in favor of further   
   canonizing the use of ADDW (for sign-extending unsigned int) and the   
   mess that is the ".UW" instructions (which are multiplying, much to my   
   annoyance).   
      
   Better IMO to fix things in ways that are "not stupid" vs just throwing   
   more cruft at the problem.   
      
   In my compiler, I was like, "Yeah, no, I am going to do this stuff in a   
   way that isn't stupid".   
      
   But, the rest of the RISC-V community is bent on pushing forward down   
   this path, which means ".UW" cruft (instructions which selectively   
   zero-extend Rs2).   
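    To make the semantics being argued about concrete (RV64 model; ADDWU
    here matches the dropped early-BitManip behavior as I understand it,
    and which source operand ADD.UW extends is per the Zba spec):

```c
#include <stdint.h>

/* 32-bit add variants on RV64, as discussed above:
   ADDW   : add, sign-extend the low 32 bits (the canonical RISC-V op);
   ADDWU  : add, zero-extend the low 32 bits (dropped early-BitManip op);
   ADD.UW : zero-extend one 32-bit source operand, then a full 64-bit add
            (the Zba ".UW" behavior). */
static uint64_t addw  (uint64_t a, uint64_t b)
    { return (uint64_t)(int64_t)(int32_t)(uint32_t)(a + b); }
static uint64_t addwu (uint64_t a, uint64_t b)
    { return (uint32_t)(a + b); }
static uint64_t add_uw(uint64_t base, uint64_t idx)
    { return base + (uint32_t)idx; }
```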
      
      
      
   >>       The Zba instructions also help,   
   >>         but Load/Store pair greatly reduces effect of Zba.   
   >>   
   >>   
   >> It would be possible to get better code density than 'C' with some tweaks:   
   >>     Reducing many of the imm/disp fields by 1 bit;   
   >>       Would free up a lot of encoding space.   
   >>       Imm6/Disp6 eats too much encoding space here.   
   > Which is why ±imm5 works better.
      
   Yes.   
      
   For 16 bit instructions, having a bunch of instructions with 6 bit   
   immediate and displacement values eats too much encoding space.   
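    The back-of-envelope arithmetic: every immediate bit doubles the slice
    of the 16-bit space a form consumes, so shaving Imm6 to Imm5 doubles
    how many distinct forms fit (one 3-bit register field assumed here,
    purely for illustration):

```c
/* Number of distinct 16-bit instruction forms that fit when each form
   carries an immediate of the given width plus one 3-bit register field.
   Illustrative encoding-space arithmetic only. */
static unsigned forms_available(unsigned imm_bits)
{
    return (1u << 16) >> (imm_bits + 3);
}
```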
      
      
   >>     Making most of the register fields 4 bits (X8..X23)   
   >>       Can improve hit-rate notably over Reg3.   
   >>   
   >> But:   
   >>     Main merit of 'C' is compatibility with binaries that use 'C';   
   >>     This merit would be lost by modifying or replacing 'C'.   
   >   
   > I can still fit my entire ISA into the space vacated by C.   
      
      
   Likewise...   
      
   This is where XG3 came from:   
   Drop RISC-V C extension;   
   Awkwardly shove nearly the entirety of XG2 into the hole that was left over;   
   ...   
      
   Well, granted, I couldn't fit the *entirety* of XG2 into the hole:   
   It lost WEX and a few misc features in the process;   
   So, XG3 goes over to using a RISC superscalar approach rather than LIW,   
   but, more or less...   
      
    It kept predication; had I not kept predication, XG3 would have used
    around 1/3 of the encoding space it currently uses.
      
      
      
   > ----------------------   
   >>> It is certainly possible to decode potential instructions at every   
   >>> starting position in parallel, and later select the ones that actually   
   >>> correspond to the end of the previous instruction, but with 16-bit and   
   >>> 32-bit instructions this potentially doubles the amount of instruction   
   >>> decoders necessary, plus the circuit for selecting the ones that are   
   >>> at actual instruction starts.  I guess that this is the reason why ARM   
   >>> uses an uop cache in cores that can execute ARM T32.  The fact that   
   >>> more recent ARM A64-only cores have often no uop cache while their   
   >>> A64+T32 predecessors have had one reinforces this idea.   
   >>>   
   >>   
   >> I took the option of not bothering with parallel execution for 16-bit ops.   
   >   
   > I took the option of not bothering with 16-bit Ops.   
      
   Well, as noted, if I were doing it again, I wouldn't recreate XG1 as it is.   
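    The parallel-decode scheme described in the quote (decode at every
    16-bit boundary, then select the real instruction starts) can be
    modeled like this; hardware would do the per-offset decodes in
    parallel, the serial loop here only models the select step, using the
    RVC length rule (low two bits == 0b11 means a 32-bit instruction):

```c
#include <stdint.h>
#include <stdbool.h>

/* A parcel whose low two bits are 0b11 begins a 32-bit instruction;
   anything else is a 16-bit ('C') instruction. */
static bool starts_32bit(uint16_t parcel) { return (parcel & 3u) == 3u; }

/* Mark which 16-bit offsets in the stream begin an actual instruction.
   Hardware speculatively decodes all offsets at once; this loop models
   only the start-selection chain. */
static void find_starts(const uint16_t *parcels, int n, bool *is_start)
{
    for (int k = 0; k < n; k++) is_start[k] = false;
    for (int i = 0; i < n; )
    {
        is_start[i] = true;
        i += starts_32bit(parcels[i]) ? 2 : 1;
    }
}
```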
      
    As for RISC-V's C extension, the only reason I added it was that it
    seemed basically unavoidable if I wanted binary compatibility with Linux
    binaries (and GCC).
      
   Seemingly, even if configured for RV64G, "GNU glibc" still goes and   
   manages to throw 'C' instructions into the mix.   
      
    With the apparent rise of the RVA23 Profile, I am probably going to need
    to deal somehow with the V extension as well, but my current plan is to
    handle it via traps and hot patching rather than actually supporting V
    in HW.
      
      
      
      
   > -----------------------   
   >> Even if 16-bit ops could be superscalar though, the benefits would be   
   >> small: Code patterns that favor 16-bit ops also tend to be lower in   
   >> terms of available ILP.   
   >   
   > I suspect that argument setup before and result take-down after call   
   > would have quite a bit of parallelism.   
   > I suspect that moving fields around for the next loop iteration would   
   > have significant parallelism.   
      
   Blobs of Loads and Stores are not ILP that my CPU core can use...   
      
   It is mostly limited to ILP involving ALU ops and similar.   
      
      
    A lot of areas dominated by RV-C ops tend to be heavy in RAW
    dependencies, with each instruction depending on the one before it.
    This isn't the sort of code that issues well.
      
   Stuff that gets better ILP tends to look more like:   
   	s1=i0-(i1>>1);		s0=s1+i1;   
   	s3=i2-(i3>>1);		s2=s3+i3;   
   	s5=i4-(i5>>1);		s4=s5+i5;   
   	s7=i6-(i7>>1);		s6=s7+i7;   
   	t1=s0-(s2>>1);		t0=t1+s2;   
   	t3=s1-(s3>>1);		t2=t3+s3;   
   	t5=s4-(s6>>1);		t4=t5+s6;   
   	t7=s5-(s7>>1);		t6=t7+s7;   
   	u1=t0-(t4>>1);		u0=u1+t4;   
   	u3=t1-(t5>>1);		u2=u3+t5;   
   	u5=t2-(t6>>1);		u4=u5+t6;   
   	u7=t3-(t7>>1);		u6=u7+t7;   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca