comp.arch
BGB to Anton Ertl
Re: What is more important (2/2)
05 Sep 25 14:26:28
   [continued from previous message]   
      
   latency, well, your indexed load is still 3 cycles vs 5 cycles, but   
   still worse than 1 cycle...   
      
   And, statistically, indexed loads make up far too large a share of the
   dynamic instruction mix to justify cheaping out here. Even if static
   instruction counts make them seem less relevant, indexed loads also tend   
   to be more concentrated inside loops (whereas fixed-displacement loads   
   are more concentrated in prologs and epilogs). If one excludes the   
   prolog and epilog related loads/stores, the proportion of indexed   
   load/store goes up significantly.   
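
   As a rough illustration of the static vs. dynamic point, here is a
   minimal sketch. All counts and trip counts are made up for the example
   (they are not measurements from BGBCC or any real program); the only
   thing it shows is how weighting by execution count shifts the share of
   indexed accesses.

```python
# Hypothetical function layout; every number here is invented for illustration.
# "disp"    = static count of fixed-displacement loads/stores in the region
# "indexed" = static count of indexed loads/stores in the region
# "runs"    = how many times the region executes
REGIONS = {
    "prolog":    {"disp": 8, "indexed": 0, "runs": 1},
    "loop_body": {"disp": 1, "indexed": 3, "runs": 1000},
    "epilog":    {"disp": 8, "indexed": 0, "runs": 1},
}

def indexed_share(regions, dynamic):
    """Fraction of loads/stores that are indexed, static or run-weighted."""
    disp = indexed = 0
    for r in regions.values():
        weight = r["runs"] if dynamic else 1
        disp += r["disp"] * weight
        indexed += r["indexed"] * weight
    return indexed / (disp + indexed)

# Same static view, but with prolog/epilog accesses left out.
body_only = {k: v for k, v in REGIONS.items() if k not in ("prolog", "epilog")}

static_share = indexed_share(REGIONS, dynamic=False)
dynamic_share = indexed_share(REGIONS, dynamic=True)
body_share = indexed_share(body_only, dynamic=False)

print(f"static, whole function:   {static_share:.1%}")
print(f"dynamic, whole function:  {dynamic_share:.1%}")
print(f"static, prolog/epilog cut: {body_share:.1%}")
```

   With these invented numbers the indexed share is small in the static
   count and dominant in the dynamic count, which is the shape of the
   argument above.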
      
      
      
   >> At some scale, smaller code size is beneficial, but once the implementation   
   >> has a GBOoO µarchitecture, I would think that fewer instructions is better   
   >> than smaller code--so long as the code size is less than 150% of the smaller   
   >> AND so long as the ISA does not resort to sequential decode (i.e., VAX).   
   >   
   > I don't think that even VAX encoding would be the major problem of the   
   > VAX these days.  There are microop caches and speculative decoders for   
   > that (although, as EricP points out, the VAX is an especially   
   > expensive nut to crack for a speculative decoder).   
   >   
      
   Well, if Intel and AMD could make x86 work... yeah...   
      
      
   > In any case, if smaller code size was it, RV64GC would win according   
   > to my results.  However, compilers often generate code that has a   
   > bigger code size rather than a smaller one (loop unrolling, inlining),   
   > so code size is not that important in the eyes of the maintainers of   
   > these compilers.   
   >   
      
   I haven't really tested, but I suspect one could improve over RV64GC   
   slightly here.   
      
      
   For example:   
      
   * 00in-nnnn-iiii-0000  ADD		Imm5s, Rn5  //"ADD 0, R0" = TRAP   
   * 01in-nnnn-iiii-0000  LI		Imm5s, Rn5   
   * 10mn-nnnn-mmmm-0000  ADD		Rm5, Rn5   
   * 11mn-nnnn-mmmm-0000  MV		Rm5, Rn5   
      
   * 0000-nnnn-iiii-0100  ADDW		Imm4u, Rn4   
   * 0001-nnnn-mmmm-0100  SUB		Rm4, Rn4   
   * 0010-nnnn-iiii-0100  ADDW		Imm4n, Rn4
   * 0011-nnnn-mmmm-0100  MVW		Rm4, Rn4 //ADDW  Rm, 0, Rn   
   * 0100-nnnn-mmmm-0100  ADDW		Rm4, Rn4   
   * 0101-nnnn-mmmm-0100  AND		Rm4, Rn4   
   * 0110-nnnn-mmmm-0100  OR		Rm4, Rn4   
   * 0111-nnnn-mmmm-0100  XOR		Rm4, Rn4   
      
   * 0iii-0nnn-0mmm-1001 ? SLL		Rm3, Imm3u, Rn3   
   * 0iii-0nnn-1mmm-1001 ? SRL		Rm3, Imm3u, Rn3   
   * 0iii-1nnn-0mmm-1001 ? ADD		Rm3, Imm3u, Rn3   
   * 0iii-1nnn-1mmm-1001 ? ADDW		Rm3, Imm3u, Rn3   
   * 1iii-0nnn-0mmm-1001 ? AND		Rm3, Imm3u, Rn3   
   * 1iii-0nnn-1mmm-1001 ? SRA		Rm3, Imm3u, Rn3   
   * 1iii-1nnn-0mmm-1001 ? ADD		Rm3, Imm3n, Rn3   
   * 1iii-1nnn-1mmm-1001 ? ADDW		Rm3, Imm3n, Rn3   
      
   * 0ooo-0nnn-0mmm-1101 ? SLL		Rm3, Ro3, Rn3   
   * 0ooo-0nnn-1mmm-1101 ? SRL		Rm3, Ro3, Rn3   
   * 0ooo-1nnn-0mmm-1101 ? AND		Rm3, Ro3, Rn3   
   * 0ooo-1nnn-1mmm-1101 ? SRA		Rm3, Ro3, Rn3   
   * 1ooo-0nnn-0mmm-1101 ? ADD		Rm3, Ro3, Rn3   
   * 1ooo-0nnn-1mmm-1101 ? SUB		Rm3, Ro3, Rn3   
   * 1ooo-1nnn-0mmm-1101 ? ADDW		Rm3, Ro3, Rn3   
   * 1ooo-1nnn-1mmm-1101 ? SUBW		Rm3, Ro3, Rn3   
      
   * 0ddd-nnnn-mmmm-0001  LW		Disp3u(Rm4), Rn4   
   * 1ddd-nnnn-mmmm-0001  LD		Disp3u(Rm4), Rn4   
   * 0ddd-nnnn-mmmm-0101  SW		Rn4, Disp3u(Rm4)   
   * 1ddd-nnnn-mmmm-0101  SD		Rn4, Disp3u(Rm4)   
      
   * 00dn-nnnn-dddd-1001  LW		Disp5u(SP), Rn5   
   * 01dn-nnnn-dddd-1001  LD		Disp5u(SP), Rn5   
   * 10dn-nnnn-dddd-1001  SW		Rn5, Disp5u(SP)   
   * 11dn-nnnn-dddd-1001  SD		Rn5, Disp5u(SP)   
      
   * 00dd-dddd-dddd-1101  J		Disp10   
   * 01dn-nnnn-dddd-1101  LD		Disp5u(SP), FRn5   
   * 10in-nnnn-iiii-1101  LUI		Imm5s, Rn5   
   * 11dn-nnnn-dddd-1101  SD		FRn5, Disp5u(SP)   
      
   Could achieve a higher average hit-rate than RV-C while *also* using   
   less encoding space.   
      
      
   Why? Partly because Reg4 (R8..R23) is less useless than Reg3 (R8..R15).   
      
   Less shift range, but shifts are over-represented in RV-C, and the
   shifts that are present have a very low hit rate because they tend not
   to match the patterns that exist in the compiler output (unlike ADD,
   shifts are far more likely to have different source and destination
   registers).
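
   A minimal sketch of what "hit rate" means here, assuming the register
   windows given above (Reg3 = R8..R15, Reg4 = R8..R23). The sample
   instructions are invented, and the predicates are simplified: a
   2-operand form is taken to need the destination equal to the first
   source, ignoring that a commutative op could swap its sources.

```python
REG3 = range(8, 16)   # R8..R15
REG4 = range(8, 24)   # R8..R23

def fits_2r(rd, rs1, rs2, window):
    """2-operand 'OP Rm, Rn': dest must equal rs1, both regs in the window."""
    return rd == rs1 and rd in window and rs2 in window

def fits_3r(rd, rs1, rs2, window):
    """3-operand 'OP Rm, Ro, Rn': all three regs must be in the window."""
    return all(r in window for r in (rd, rs1, rs2))

def hit_rate(insns, predicate):
    """Fraction of (rd, rs1, rs2) triples the compressed form can encode."""
    hits = sum(1 for insn in insns if predicate(*insn))
    return hits / len(insns)

# Invented sample of (rd, rs1, rs2) register triples, not compiler output.
SAMPLE = [
    (10, 10, 11),  # dest == source, low registers
    (18, 18, 20),  # dest == source, outside Reg3 but inside Reg4
    (12, 13, 9),   # distinct dest and source (typical of shifts)
    (22, 22, 17),  # dest == source, outside Reg3 but inside Reg4
    (5, 5, 10),    # dest outside both windows
]

rate_2r_reg3 = hit_rate(SAMPLE, lambda d, a, b: fits_2r(d, a, b, REG3))
rate_2r_reg4 = hit_rate(SAMPLE, lambda d, a, b: fits_2r(d, a, b, REG4))
rate_3r_reg3 = hit_rate(SAMPLE, lambda d, a, b: fits_3r(d, a, b, REG3))

print(rate_2r_reg3, rate_2r_reg4, rate_3r_reg3)
```

   Running the same predicates over real compiler output, rather than an
   invented sample, is what would actually settle the Reg3 vs. Reg4
   question.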
      
      
   The 3R/3RI instructions would still be limited to the "kinda useless"   
   3-bit registers, but this still isn't exactly worse than what is already   
   the case for RV-C (even if they still have a poor hit rate).   
      
   I left out ADDI16SP, ADDI4SPN, and similar, as these aren't frequent
   enough to be relevant here. Existing instances of "ADD SP, Imm, Rn" also
   tend not to fit within the limitations of ADDI4SPN, so it is still
   borderline useless in BGBCC in this case (*1).
      
      
   *1: The only times Reg3 has an OK hit rate is in leaf functions, and   
   there seems to be a strong negative correlation between leaf functions   
   and stack arrays. Also at best, the underlying instruction tends to have   
   a low hit-rate as, when a stack array is used semi-frequently, BGBCC   
   tends to end up loading the address into a register and leaving it there   
   for multiple uses (and, due to "quirks", if you access a local array in   
   an inner loop, it will tend to end up in the fixed-assignment case, in   
   which case the array address is loaded into a register one-off in the   
   prolog). The ADDI4SPN instruction only really makes sense if one assumes
   that stack arrays are very frequent (in leaf functions?) and/or that
   the compiler tends to load the address of the array into a scratch
   register and then immediately discard it again (neither of which seems
   true in my case).
      
   ADDI16SP would be relevant for prologs and epilogs, but occurs too
   rarely to really justify a 16-bit encoding (in many cases it would only
   occur twice per function or so, which is a fairly low incidence rate).
      
   ...   
      
      
   Though, that said, RVC in BGBCC still does seem to be semi-effective   
   despite its limitations.   
      
      
      
   > I also often see code produced with more (dynamic) instructions than   
   > necessary.  So the number of instructions is apparently not that   
   > important, either.   
   >   
      
   Yeah, probably true.   
      
   Often it seems better to try to minimize instruction-instruction   
   dependency chains.   
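
   To make the dependency-chain point concrete, a small sketch that
   measures the longest chain in a straight-line instruction sequence.
   It assumes unit latency per instruction and only tracks
   read-after-write dependencies; the register names and both sequences
   are invented for the example.

```python
def chain_depth(insns):
    """Longest read-after-write chain in a list of (dest, src1, src2).

    Instructions are in program order; each one counts as one unit of
    latency (an assumption). Sources never written in the list have depth 0.
    """
    depth = {}
    longest = 0
    for dest, *srcs in insns:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        longest = max(longest, d)
    return longest

# Both sequences compute a+b+c+d with three ADDs.
serial = [
    ("t0", "a", "b"),
    ("t1", "t0", "c"),
    ("t2", "t1", "d"),
]
tree = [
    ("t0", "a", "b"),
    ("t1", "c", "d"),
    ("t2", "t0", "t1"),
]

print(len(serial), chain_depth(serial))
print(len(tree), chain_depth(tree))
```

   The instruction count is the same in both cases; only the chain length
   differs, which is the quantity a wide machine is sensitive to.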
      
      
   > - anton   
      