Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 129,597 of 131,241    |
|    BGB to Anton Ertl    |
|    Re: What is more important (2/2)    |
|    05 Sep 25 14:26:28    |
[continued from previous message]

latency, well, your indexed load is still 3 cycles vs 5 cycles, but
still worse than 1 cycle...

And, statistically, indexed loads tend to be far too large a share of
the dynamic instruction mix to justify cheaping out here. Even if static
instruction counts make them seem less relevant, indexed loads also tend
to be more concentrated inside loops (whereas fixed-displacement loads
are more concentrated in prologs and epilogs). If one excludes the
prolog and epilog related loads/stores, the proportion of indexed
load/store goes up significantly.


>> At some scale, smaller code size is beneficial, but once the implementation
>> has a GBOoO µarchitecture, I would think that fewer instructions is better
>> than smaller code--so long as the code size is less than 150% of the smaller
>> AND so long as the ISA does not resort to sequential decode (i.e., VAX).
>
> I don't think that even VAX encoding would be the major problem of the
> VAX these days. There are microop caches and speculative decoders for
> that (although, as EricP points out, the VAX is an especially
> expensive nut to crack for a speculative decoder).
>

Well, if Intel and AMD could make x86 work... yeah...


> In any case, if smaller code size was it, RV64GC would win according
> to my results. However, compilers often generate code that has a
> bigger code size rather than a smaller one (loop unrolling, inlining),
> so code size is not that important in the eyes of the maintainers of
> these compilers.
>

I haven't really tested, but I suspect one could improve over RV64GC
slightly here.

For example:

* 00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
* 01in-nnnn-iiii-0000 LI Imm5s, Rn5
* 10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
* 11mn-nnnn-mmmm-0000 MV Rm5, Rn5

* 0000-nnnn-iiii-0100 ADDW Imm4u, Rn4
* 0001-nnnn-mmmm-0100 SUB Rm4, Rn4
* 0010-nnnn-iiii-0100 ADDW Imm4n, Rn4
* 0011-nnnn-mmmm-0100 MVW Rm4, Rn4 //ADDW Rm, 0, Rn
* 0100-nnnn-mmmm-0100 ADDW Rm4, Rn4
* 0101-nnnn-mmmm-0100 AND Rm4, Rn4
* 0110-nnnn-mmmm-0100 OR Rm4, Rn4
* 0111-nnnn-mmmm-0100 XOR Rm4, Rn4

* 0iii-0nnn-0mmm-1001 ? SLL Rm3, Imm3u, Rn3
* 0iii-0nnn-1mmm-1001 ? SRL Rm3, Imm3u, Rn3
* 0iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3u, Rn3
* 0iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3u, Rn3
* 1iii-0nnn-0mmm-1001 ? AND Rm3, Imm3u, Rn3
* 1iii-0nnn-1mmm-1001 ? SRA Rm3, Imm3u, Rn3
* 1iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3n, Rn3
* 1iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3n, Rn3

* 0ooo-0nnn-0mmm-1101 ? SLL Rm3, Ro3, Rn3
* 0ooo-0nnn-1mmm-1101 ? SRL Rm3, Ro3, Rn3
* 0ooo-1nnn-0mmm-1101 ? AND Rm3, Ro3, Rn3
* 0ooo-1nnn-1mmm-1101 ? SRA Rm3, Ro3, Rn3
* 1ooo-0nnn-0mmm-1101 ? ADD Rm3, Ro3, Rn3
* 1ooo-0nnn-1mmm-1101 ? SUB Rm3, Ro3, Rn3
* 1ooo-1nnn-0mmm-1101 ? ADDW Rm3, Ro3, Rn3
* 1ooo-1nnn-1mmm-1101 ? SUBW Rm3, Ro3, Rn3

* 0ddd-nnnn-mmmm-0001 LW Disp3u(Rm4), Rn4
* 1ddd-nnnn-mmmm-0001 LD Disp3u(Rm4), Rn4
* 0ddd-nnnn-mmmm-0101 SW Rn4, Disp3u(Rm4)
* 1ddd-nnnn-mmmm-0101 SD Rn4, Disp3u(Rm4)

* 00dn-nnnn-dddd-1001 LW Disp5u(SP), Rn5
* 01dn-nnnn-dddd-1001 LD Disp5u(SP), Rn5
* 10dn-nnnn-dddd-1001 SW Rn5, Disp5u(SP)
* 11dn-nnnn-dddd-1001 SD Rn5, Disp5u(SP)

* 00dd-dddd-dddd-1101 J Disp10
* 01dn-nnnn-dddd-1101 LD Disp5u(SP), FRn5
* 10in-nnnn-iiii-1101 LUI Imm5s, Rn5
* 11dn-nnnn-dddd-1101 SD FRn5, Disp5u(SP)

This could achieve a higher average hit-rate than RV-C while *also*
using less encoding space.


Why? Partly because Reg4 (R8..R23) is less useless than Reg3 (R8..R15).

There is less shift range, but shifts are over-represented in RV-C, and
the shifts that are present have a very low hit rate because they tend
not to match the patterns that exist in the compiler output (unlike ADD,
shifts are far more likely to have different source and destination
registers).


The 3R/3RI instructions would still be limited to the "kinda useless"
3-bit registers, but this still isn't exactly worse than what is already
the case for RV-C (even if they still have a poor hit rate).

I left out things like ADDI16SP and ADDI4SPN and similar, as these
aren't frequent enough to be relevant here (nor do existing instances of
"ADD SP, Imm, Rn" tend to hit within the limitations of "ADDI4SPN", as
it is still borderline useless in BGBCC in this case, *1).


*1: The only time Reg3 has an OK hit rate is in leaf functions, and
there seems to be a strong negative correlation between leaf functions
and stack arrays.
Also, at best, the underlying instruction tends to have a low hit-rate
because, when a stack array is used semi-frequently, BGBCC tends to end
up loading the address into a register and leaving it there for multiple
uses (and, due to "quirks", if you access a local array in an inner
loop, it will tend to end up in the fixed-assignment case, in which case
the array address is loaded into a register one-off in the prolog). The
ADDI4SPN instruction only really makes sense if one assumes that stack
arrays are very frequent (in leaf functions?) and/or that the compiler
tends to load the address of the array into a scratch register and then
immediately discard it again (neither of which seems true in my case).

ADDI16SP would be relevant for prologs and epilogs, but has a
statistical incidence too low to really justify a 16 bit encoding (in
many cases it would only occur twice per function or so, which is,
statistically, a fairly low incidence rate).

...


Though, that said, RVC in BGBCC still does seem to be semi-effective
despite its limitations.


> I also often see code produced with more (dynamic) instructions than
> necessary. So the number of instructions is apparently not that
> important, either.
>

Yeah, probably true.

Often it seems better to try to minimize instruction-instruction
dependency chains.


> - anton

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
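
To make the bit layouts in the listing above more concrete, here is a
minimal Python sketch of a decoder for the groups whose low nibble is
0000, 0100, 0001 or 0101. It is only an illustration, not anything taken
from BGBCC. Several things are assumptions rather than statements from
the post: patterns are read MSB-first (bit 15 on the left), the lone
i/m/n bit in the 5-bit forms is taken as the high bit of its field,
Imm4n is treated as one-extended (field value minus 16), displacements
are printed as raw field values since no scaling is given, and the names
decode16, sext and reg4 are made up for this example. The Reg4 mapping
to R8..R23 is from the post. The 1001 and 1101 groups are skipped,
because the "?" entries and the SP-relative/J/LUI entries share those
low nibbles in the listing.

```python
def sext(value, bits):
    """Sign-extend a `bits`-wide field to a Python int."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def reg4(field):
    """Map a 4-bit register field to R8..R23 (range given in the post)."""
    return f"R{8 + field}"


# Sub-opcodes (bits 15:12) of the low-nibble-0100 group, in listing order.
ALU4_OPS = ["ADDW", "SUB", "ADDW", "MVW", "ADDW", "AND", "OR", "XOR"]


def decode16(word):
    """Disassemble one 16-bit word; returns None for anything not handled."""
    low = word & 0xF
    hi4 = (word >> 12) & 0xF
    f1 = (word >> 8) & 0xF  # bits 11:8, the "nnnn" field
    f2 = (word >> 4) & 0xF  # bits 7:4, the "iiii"/"mmmm" field

    if low == 0b0000:
        if word == 0:
            return "TRAP"  # "ADD 0, R0" is reserved as TRAP in the listing
        op = (word >> 14) & 3
        # Assumption: the isolated bit is the high bit of the 5-bit field.
        rn = (((word >> 12) & 1) << 4) | f1
        src = (((word >> 13) & 1) << 4) | f2
        if op == 0:
            return f"ADD {sext(src, 5)}, R{rn}"
        if op == 1:
            return f"LI {sext(src, 5)}, R{rn}"
        if op == 2:
            return f"ADD R{src}, R{rn}"
        return f"MV R{src}, R{rn}"

    if low == 0b0100:
        if hi4 > 7:
            return None  # upper half of this group is not assigned
        if hi4 == 0:
            operand = str(f2)       # Imm4u: zero-extended
        elif hi4 == 2:
            operand = str(f2 - 16)  # Imm4n: assumed one-extended
        else:
            operand = reg4(f2)
        return f"{ALU4_OPS[hi4]} {operand}, {reg4(f1)}"

    if low in (0b0001, 0b0101):
        width = "D" if word & 0x8000 else "W"
        disp = (word >> 12) & 7  # raw Disp3u; scaling is not specified
        mem = f"{disp}({reg4(f2)})"
        if low == 0b0001:
            return f"L{width} {mem}, {reg4(f1)}"
        return f"S{width} {reg4(f1)}, {mem}"

    return None  # 1001/1101 groups and everything else are left out


if __name__ == "__main__":
    for w in (0x0000, 0x25F0, 0xE310, 0x7F24, 0xB101):
        print(f"{w:04X}: {decode16(w)}")
```

Note that with this layout every group can be told apart from the low
nibble alone, and the Reg4 forms reach R8..R23 by adding a fixed offset
of 8 to the field.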
(c) 1994, bbs@darkrealms.ca