Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,673 of 131,241    |
|    BGB to Anton Ertl    |
|    Re: Variable-length instructions (2/2)    |
|    29 Dec 25 19:54:41    |
      [continued from previous message]

Despite XG2 having 64 GPRs, always enabling all 64 GPRs had a slight
negative effect on both code density and performance (mostly by making
prologs/epilogs slightly larger on average, and increasing the size of
the average stack frame).

However, always using 32 GPRs was generally better for performance than
using 32 GPRs sparingly.

For XG3, it is basically a similar scheme to RV+JX, namely:
 Low/moderate pressure:
   Assume X and F are split (int/ptr on X side; FPU on F side);
 High pressure:
   Merge the spaces.

Though, for XG3 vs RV+JX, it makes sense to use a lower threshold for
XG3, and a higher threshold for RV+JX. This is because XG3 supports
direct 6-bit register encodings, whereas RV+JX needs to use J21O prefixes.

The split needs to be maintained for plain RV regardless of register
pressure, since plain RV is incapable of handling non-default registers.

Though, BGBCC can treat X5..X7 as "stomp registers" and can use them to
fake instructions using registers across the divide (likewise for F0..F2
on the FPR side). This kinda sucks, but was needed as a consequence of
how BGBCC implemented its RV support.

So, as far as BGBCC's register allocator is concerned, only X8..X31
and F8..F31 actually exist (well, with the added wonk that they are
internally remapped to their "equivalents" in the XG1/XG2 register space).

Decided to leave off mapping tables; maybe an eventual TODO is to
redo the register allocator in a way that is "not stupid".
> Another way to approach this question is to look at the current
> champion of fixed instruction width, ARM A64, consider those
> instructions (and addressing modes) that ARM A64 has and RISC-V does
> not have, and look at how often they are used, and how many RISC-V
> instructions are needed to replace them.
>
> In any case, code density measurements show that both result in
> compact code, with RV64GC having more compact code, and actually
> having the most compact code among the architectures present in all
> rounds of my measurements where RV64GC was present.
>

In my own testing, XG1 can beat RV-C, but they are "close enough".

At this point I am more in favor of mostly avoiding 16-bit ops when
possible, as they have the downside of negatively affecting performance
in many cases (in ways that are inherently unavoidable).

The exception is cases where size optimization is important (like, say,
in the Boot ROM), but then one could just use RV-C as "good enough".

In most other cases, slightly better code density at the expense of some
performance isn't an ideal tradeoff. For programs loaded into RAM,
being a few kB off on the size of ".text" doesn't matter that much.

And, if the goal is "shortest instruction count", 32/64/96 bit is a
better bet. If performance is the goal, minimizing register-use
conflicts will also be a goal, which means prioritizing 64 or so GPRs,
which can't really be used from a 16-bit encoding scheme.

> But code size is not everything. For ARM A64, you pay for it by the
> increased complexity of implementing these instructions (in particular
> the many register ports) and addressing modes.
> For bigger implementations, instruction combining means additional
> front-end effort for RISC-V, and then maybe similar implementation
> effort for the combined instructions as for ARM A64 (but more
> flexibility in selecting which instructions to combine). And, as
> mentioned above, the additional decoding effort.
>

Yeah.

A64 maybe has some issues in the other direction, as some of the
addressing modes are more complicated than ideal.

Things like ALU status flags aren't free either.

...

If I were to try to rank addressing modes in terms of use frequency
(assuming all exist):
 1. [Rb+Disp]
 2. [Rb+Ri*ElemSizedScale]
 3. [Rb+Ri*1]
 4. (Rb)+              //"*ptr++"
 5. [Abs]              //"*((T*)FIXEDADDR)"
 6. [Rb+Ri*Sc+Disp]    //"obj->arr[idx]"
 7. -(Rb)              //"*--ptr"
 8. +(Rb) and/or (Rb)- //"*++ptr" and "*ptr--"
 9. [Abs+Rb*ElemSizedScale]
 10. [Abs+Rb*1]

If "[Rb+Disp]" were subdivided, one would have, say:
 1. [SP+Disp] //Prolog/Epilog/Spill
 2. [Rb]      //"*ptr"
 3. [GP+Disp] //Global Variable
 4. [Rb+Disp] //eg: "obj->field", etc
 5. [TP+Disp] //Context / TLS (much rarer)

Where, in this case, [GP+Disp] has the main special property that Disp
tends to be larger (at least in my compiler) than with most other
registers (mostly because GP is used to access global variables).

If GP+Disp were not used to access globals, this could shift to:
 [PC+Disp] //if using PC-rel for globals
 [Abs]     //if using absolute addressing.

So, a lot depends on the compiler.

Even if supported, and usable by the compiler, auto-increment seems
uncommon. The dominant way it was used (in both SH and BJX1) was to
implement PUSH/POP. If using SP-rel instead, this use pattern mostly
evaporates.
While still potentially used for things like "*ptr++" and similar, these
uses tend to be relatively infrequent compared with all the other places
load/store may appear.

Active usage frequency:
 Boot
  Load , [Rb+Disp]        : 50%
  Store, [Rb+Disp]        : 40%
  Load , [Rb+Ri*ElemScale]:  7%
  Store, [Rb+Ri*ElemScale]:  1%
  Everything Else         :  2%
 Doom
  Load , [Rb+Disp]        : 12%
  Store, [Rb+Disp]        : 11%
  Load , [Rb+Ri*ElemScale]: 67%
  Store, [Rb+Ri*ElemScale]:  9%
  Everything Else         :  1%

> When we look at actual implementations, RISC-V has not reached the
> widths that ARM A64 has reached, but I guess that this is more due to
> the current potential markets for these two architectures than due to
> technical issues. RISC-V seems to be pushing into server space
> lately, so we may see wider implementations in the not-too-far future.
>

Possibly.

It is not particularly hard to go 3-wide or similar on an FPGA with
RISC-V.

The major limitations here are more that:
 Things like register-forwarding cost have non-linear scaling;
 For an in-order machine, usable ILP drops off very rapidly;
 ...

There seems to be a local optimum between 2- and 3-wide.

Say, for example, if one had an in-order machine with 5 ALUs, one would
be hard pressed to find much code that could actually make use of the 5
ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
more often useful for spare register ports and similar (with 3-wide ALU
usage being a minority case).

The exception being the occasional highly unrolled and parallel integer
code.

...

> - anton

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca