From: cr88192@gmail.com   
      
   On 9/5/2025 10:03 AM, Anton Ertl wrote:   
   > MitchAlsup writes:   
   >>   
   >> MitchAlsup posted:   
   >>   
   >>>   
   >>> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>>> #include <stddef.h>
   >>>>
   >>>> long arrays(long *v, size_t n)
   >>>> {
   >>>>   long i, r;
   >>>>   for (i=0, r=0; i<n; i++)
   >>>>     r+=v[i];
   >>>>   return r;
   >>>> }
   >>>>   
   >>>> long a, b, c, d;   
   >>>>   
   >>>> void globals(void)   
   >>>> {   
   >>>> a = 0x1234567890abcdefL;   
   >>>> b = 0xcdef1234567890abL;   
   >>>> c = 0x567890abcdef1234L;   
   >>>> d = 0x5678901234abcdefL;   
   >>>> }   
   >>>   
   >>>> So, the overall sizes (including data size for globals() on RV64GC) are:   
   >>>>  Bytes              Architecture   Instructions
   >>>>  arrays globals                    arrays globals
   >>>>  28     66 (34+32)  RV64GC         12     9
   >>>>  27     69          AMD64          11     9
   >>>>  44     84          ARM A64        11     22
   >>>   32     68          My 66000        8     5
   >>   
   >> In light of the above, what do people think is more important, small   
   >> code size or fewer instructions ??   
   >   
   > Performance from a given chip area.   
   >   
   > The RISC-V people argue that they can combine instructions with a few   
   > transistors. But, OTOH, they have 16-bit and 32-bit wide   
   > instructions, which means that a part of the decoder results will be   
   > thrown away, increasing the decode cost for a given number of average   
   > decoded instructions per cycle. Plus, they need more decoded   
   > instructions per cycle for a given amount of performance.   
   >   
   > Intel and AMD demonstrate that you can get high performance even with   
   > an instruction set that is even worse for decoding, but that's not cheap.   
   >   
   > ARM A64 goes the other way: Fixed-width instructions ensure that all   
   > decoding on correctly predicted paths is actually useful.   
   >   
   > However, it pays for this in other ways: Instructions like load pair   
   > with auto-increment need to write 3 registers, and the write port   
   > arbitration certainly has a hardware cost. However, such an   
   > instruction would need two loads and an add if expressed in RISC-V; if   
   > RISC-V combines these instructions, it has the same write-port   
   > arbitration problem. If it does not combine at least the loads, it   
   > will tend to perform worse with the same number of load/store units.   
   >   
   > So it's a balancing game: If you lose some weight here, do you need to   
   > add the same, more, or less weight elsewhere to compensate for the   
   > effects elsewhere?   
   >   
      
   It is tradeoffs...   
      
   Load/Store Pair helps, and isn't too bad if one already has the register   
   ports (if it is at least a 2-wide superscalar, you can afford it with   
   little additional cost).   
      
   Auto-increment slightly helps with code density, but is a net negative   
   in other ways. Depending on implementation, some of its more obvious   
   use-cases (such as behaving like a PUSH/POP) may end up slower than   
   using separate instructions.   
      
      
   Say, the most obvious way to implement auto-increment in my case would
   likely be to have the instruction decode as if there were an implicit
   ADD being executed in parallel.
      
   Say:   
    MOV.Q (R10)+, R18   
    MOV.Q R19, -(R11)   
   Behaving as:   
    ADD 8, R10 | MOV.Q (R10), R18   
    ADD -8, R11 | MOV.Q R19, -8(R11)   
   So far, so good... Both execute with a 1-cycle latency, but...   
    MOV.Q R18, -(R2)   
    MOV.Q R19, -(R2)   
    MOV.Q R20, -(R2)   
    MOV.Q R21, -(R2)   
   Would take 8 cycles rather than 4 (due to R2 dependencies).   
      
   Vs:   
    MOV.Q R18, -8(R2) //*1   
    MOV.Q R19, -16(R2)   
    MOV.Q R20, -24(R2)   
    MOV.Q R21, -32(R2)   
    ADD -32, R2   
   Needing 5 cycles (or, maybe 4, if the superscalar logic is clever and   
   can run the ADD in parallel with the final MOV.Q).   
      
   *1: Where "-8(R2)" and "(R2, -8)" are analogous as far as BGBCC's ASM
   parser is concerned, but the former is more traditional, so I figured I
   would use it here.
      
      
   Likewise, in C if you write:   
    v0=*cs++;   
    v1=*cs++;   
   If it were compiled as auto-increment loads, it could also end up
   slower than a Load+Load+ADD sequence (for the same reason).
      
   But, what about:   
    v0=*cs++;   
    //... something else unrelated to cs (or v0).   
   Well, then the ADD gets executed in parallel with whatever follows, so
   it may still work out to a 1-cycle latency in this case.
      
      
   And, a feature is not particularly compelling when its main obvious use
   cases would see little or no performance gain (or would actually end up
   slower than what one does in its absence).

   It only really works if one has a 1-cycle ADD; otherwise, seemingly the
   only real advantage of auto-increment is making the binaries slightly
   smaller.
      
      
   Wouldn't take much to re-add it though; as noted, the ancestor of the
   current backend was written for an ISA that did have auto-increment. I
   just sort of ended up dropping it as it wasn't really worth it. Not
   only was it not particularly effective, it also tended to be a lot
   further down the ranking in terms of usage frequency of addressing
   modes. If one excludes using it for PUSH/POP, its usage frequency
   basically falls to "hardly anything". Otherwise, you can basically
   count how many times you see "*ptr++" or similar in C; that is about
   all it would ever be used for, and even in C it is relatively
   infrequent.
      
      
      
      
   But, yeah, as noted, the major areas where RISC-V tends to lose out
   IMHO are:
    Lack of Indexed Load/Store;   
    Crappy handling of large constants and lack of large immediate values.   
      
   I had noted before, that the specific combination of adding these features:   
    Indexed Load/Store;   
    Load/Store Pair;   
    Jumbo Prefixes.   
   This combination improves code density over plain RV64G/RV64GC, and
   also gains a roughly 40-50% speedup in programs like Doom.
      
   All while not significantly increasing logic cost over what one would
   already need for a 2-wide machine. It could make sense to skip them for
   a 1-wide machine, but then you don't really care too much about
   performance if going 1-wide.
      
      
   Then again, Indexed Load/Store, due to a "register port issue" for   
   Indexed Store, does give a performance advantage to going 3 wide over 2   
   wide even if the 3rd lane is rarely used otherwise.   
      
      
   Though, one could argue that the relative delta (of assuming these
   features, over plain RV64GC) is slightly less if one assumes a CPU with
   1-cycle latency on ALU instructions and similar. But this is still kind
   of weak IMO: ideally, the latency cost of ADD and similar should affect
   everything equally, and the fact that 2-cycle ADDs and shifts
   disproportionately hurt RV64G/RV64GC performance is not to RV64G's
   merit.
      
   Well, and Zba helps, but not fully. If SHnADD still has a 2c
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   