Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,673 of 131,241    |
|    BGB to Anton Ertl    |
|    Re: Variable-length instructions (2/2)    |
|    29 Dec 25 19:54:41    |
      [continued from previous message]

Despite XG2 having 64 GPRs, always enabling all 64 GPRs had a slight
negative effect on both code density and performance (mostly by making
prologs/epilogs slightly larger on average, and increasing the size of
the average stack frame).

However, always using 32 GPRs was generally better for performance than
using 32 GPRs sparingly.

For XG3, it is basically a similar scheme to RV+JX, namely:
 Low/moderate pressure:
   Assume X and F are split (int/ptr on X side; FPU on F side);
 High pressure:
   Merge the spaces.

Though, for XG3 vs RV+JX, it makes sense to use a lower threshold for
XG3, and a higher threshold for RV+JX. This is because XG3 supports
direct 6-bit register encodings, whereas RV+JX needs to use J21O prefixes.

The split needs to be maintained for plain RV regardless of register
pressure, since plain RV is incapable of handling non-default registers.

Though, BGBCC can treat X5..X7 as "stomp registers" and can use them to
fake instructions using registers across the divide (likewise for F0..F2
on the FPR side). This kinda sucks, but was needed as a consequence of
how BGBCC implemented its RV support.

So, as far as BGBCC's register allocator is concerned, only X8..X31
and F8..F31 actually exist (well, with the added wonk that they are
internally remapped to their "equivalents" in the XG1/XG2 register space).

Decided to leave off mapping tables; maybe an eventual TODO is to
redo the register allocator in a way that is "not stupid".
> Another way to approach this question is to look at the current
> champion of fixed instruction width, ARM A64, consider those
> instructions (and addressing modes) that ARM A64 has and RISC-V does
> not have, and look at how often they are used, and how many RISC-V
> instructions are needed to replace them.
>
> In any case, code density measurements show that both result in
> compact code, with RV64GC having more compact code, and actually
> having the most compact code among the architectures present in all
> rounds of my measurements where RV64GC was present.
>

In my own testing, XG1 can beat RV-C, but they are "close enough".

At this point I am more in favor of mostly avoiding 16-bit ops when
possible, as they have the downside of negatively affecting performance
in many cases (in ways that are inherently unavoidable).

The exception is cases where size optimization is important (like, say,
in the Boot ROM), but then one could just use RV-C as "good enough".

In most other cases, slightly better code density at the expense of some
performance isn't an ideal tradeoff. For programs loaded into RAM,
being a few kB off on the size of ".text" doesn't matter that much.

And, if the goal is "shortest instruction count", 32/64/96 bit is a
better bet. If performance is the goal, minimizing register-use
conflicts will also be a goal, which means prioritizing 64 or so GPRs,
which can't really be used from a 16-bit encoding scheme.

> But code size is not everything. For ARM A64, you pay for it by the
> increased complexity of implementing these instructions (in particular
> the many register ports) and addressing modes.
> For bigger implementations, instruction combining means additional
> front-end effort for RISC-V, and then maybe similar implementation
> effort for the combined instructions as for ARM A64 (but more
> flexibility in selecting which instructions to combine). And, as
> mentioned above, the additional decoding effort.
>

Yeah.

A64 maybe has some issues in the other direction, as some of the
addressing modes are more complicated than ideal.

Things like ALU status flags aren't free either.

...

If I were to try to rank addressing modes in terms of use frequency
(assuming all exist):
 1. [Rb+Disp]
 2. [Rb+Ri*ElemSizedScale]
 3. [Rb+Ri*1]
 4. (Rb)+              //"*ptr++"
 5. [Abs]              //"*((T*)FIXEDADDR)"
 6. [Rb+Ri*Sc+Disp]    //"obj->arr[idx]"
 7. -(Rb)              //"*--ptr"
 8. +(Rb) and/or (Rb)- //"*++ptr" and "*ptr--"
 9. [Abs+Rb*ElemSizedScale]
 10. [Abs+Rb*1]

If "[Rb+Disp]" were subdivided, one would have, say:
 1. [SP+Disp] //Prolog/Epilog/Spill
 2. [Rb]      //"*ptr"
 3. [GP+Disp] //Global Variable
 4. [Rb+Disp] //eg: "obj->field", etc
 5. [TP+Disp] //Context / TLS (much rarer)

Where, in this case, [GP+Disp] has the main special property that Disp
tends to be larger (at least in my compiler) than with most other
registers (mostly because GP is used to access global variables).

If GP+Disp were not used to access globals, this could shift to:
 [PC+Disp] //if using PC-rel for globals
 [Abs]     //if using absolute addressing.

So, a lot depends on the compiler.

Even if supported, and usable by the compiler, auto-increment seems
uncommon. The dominant way it was used (in both SH and BJX1) was to
implement PUSH/POP. If using SP-rel instead, this use pattern mostly
evaporates.
While still potentially used for things like "*ptr++" and similar, these
uses tend to be relatively infrequent compared with all the other places
load/store may appear.

Active usage frequency:
 Boot
  Load , [Rb+Disp]        : 50%
  Store, [Rb+Disp]        : 40%
  Load , [Rb+Ri*ElemScale]:  7%
  Store, [Rb+Ri*ElemScale]:  1%
  Everything Else         :  2%
 Doom
  Load , [Rb+Disp]        : 12%
  Store, [Rb+Disp]        : 11%
  Load , [Rb+Ri*ElemScale]: 67%
  Store, [Rb+Ri*ElemScale]:  9%
  Everything Else         :  1%

> When we look at actual implementations, RISC-V has not reached the
> widths that ARM A64 has reached, but I guess that this is more due to
> the current potential markets for these two architectures than due to
> technical issues. RISC-V seems to be pushing into server space
> lately, so we may see wider implementations in the not-too-far future.
>

Possibly.

It is not particularly hard to go 3-wide or similar on an FPGA with
RISC-V.

The major limitations here are more that:
 Things like register-forwarding cost have non-linear scaling;
 For an in-order machine, usable ILP drops off very rapidly;
 ...

There seems to be a local optimum between 2- and 3-wide.

Say, for example, if one had an in-order machine with 5 ALUs, one would
be hard pressed to find much code that could actually make use of the 5
ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
more often useful for spare register ports and similar (with 3-wide ALU
usage being a minority case).

The exception being the occasional highly unrolled and parallel integer
code.

...

> - anton

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca