From: cr88192@gmail.com   
      
   On 9/5/2025 10:03 AM, Anton Ertl wrote:   
   > MitchAlsup writes:   
   >>   
   >> MitchAlsup posted:   
   >>   
   >>>   
   >>> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>>> #include <stddef.h>
   >>>>
   >>>> long arrays(long *v, size_t n)
   >>>> {
   >>>>   long i, r;
   >>>>   for (i=0, r=0; i<n; i++)
   >>>>     r+=v[i];
   >>>>   return r;
   >>>> }
   >>>>   
   >>>> long a, b, c, d;   
   >>>>   
   >>>> void globals(void)   
   >>>> {   
   >>>> a = 0x1234567890abcdefL;   
   >>>> b = 0xcdef1234567890abL;   
   >>>> c = 0x567890abcdef1234L;   
   >>>> d = 0x5678901234abcdefL;   
   >>>> }   
   >>>   
   >>>> So, the overall sizes (including data size for globals() on RV64GC) are:   
   >>>>  Bytes              Architecture   Instructions
   >>>>  arrays globals                    arrays globals
   >>>>  28     66 (34+32)  RV64GC         12     9
   >>>>  27     69          AMD64          11     9
   >>>>  44     84          ARM A64        11     22
   >>>   32     68          My 66000        8     5
   >>   
   >> In light of the above, what do people think is more important, small   
   >> code size or fewer instructions ??   
   >   
   > Performance from a given chip area.   
   >   
   > The RISC-V people argue that they can combine instructions with a few   
   > transistors. But, OTOH, they have 16-bit and 32-bit wide   
   > instructions, which means that a part of the decoder results will be   
   > thrown away, increasing the decode cost for a given number of average   
   > decoded instructions per cycle. Plus, they need more decoded   
   > instructions per cycle for a given amount of performance.   
   >   
   > Intel and AMD demonstrate that you can get high performance even with   
   > an instruction set that is even worse for decoding, but that's not cheap.   
   >   
   > ARM A64 goes the other way: Fixed-width instructions ensure that all   
   > decoding on correctly predicted paths is actually useful.   
   >   
   > However, it pays for this in other ways: Instructions like load pair   
   > with auto-increment need to write 3 registers, and the write port   
   > arbitration certainly has a hardware cost. However, such an   
   > instruction would need two loads and an add if expressed in RISC-V; if   
   > RISC-V combines these instructions, it has the same write-port   
   > arbitration problem. If it does not combine at least the loads, it   
   > will tend to perform worse with the same number of load/store units.   
   >   
   > So it's a balancing game: If you lose some weight here, do you need to   
   > add the same, more, or less weight elsewhere to compensate for the   
   > effects elsewhere?   
   >   
      
   It is tradeoffs...   
      
   Load/Store Pair helps, and isn't too bad if one already has the register   
   ports (if it is at least a 2-wide superscalar, you can afford it with   
   little additional cost).   
      
   Auto-increment slightly helps with code density, but is a net negative   
   in other ways. Depending on implementation, some of its more obvious   
   use-cases (such as behaving like a PUSH/POP) may end up slower than   
   using separate instructions.   
      
      
   Say, the most obvious way to implement auto-increment in my case would
   likely be to have the instruction decode as if there were an implicit
   ADD being executed in parallel.
      
   Say:   
    MOV.Q (R10)+, R18   
    MOV.Q R19, -(R11)   
   Behaving as:   
    ADD 8, R10 | MOV.Q (R10), R18   
    ADD -8, R11 | MOV.Q R19, -8(R11)   
   So far, so good... Both execute with a 1-cycle latency, but...   
    MOV.Q R18, -(R2)   
    MOV.Q R19, -(R2)   
    MOV.Q R20, -(R2)   
    MOV.Q R21, -(R2)   
   Would take 8 cycles rather than 4 (due to R2 dependencies).   
      
   Vs:   
    MOV.Q R18, -8(R2) //*1   
    MOV.Q R19, -16(R2)   
    MOV.Q R20, -24(R2)   
    MOV.Q R21, -32(R2)   
    ADD -32, R2   
   Needing 5 cycles (or, maybe 4, if the superscalar logic is clever and   
   can run the ADD in parallel with the final MOV.Q).   
      
   *1: Where "-8(R2)" and "(R2, -8)" are analogous as far as BGBCC's ASM
   parser is concerned, but the former is more traditional, so I figured I
   would use it here.
      
      
   Likewise, in C if you write:   
    v0=*cs++;   
    v1=*cs++;   
   If it were compiled as auto-increment loads, it could also end up
   slower than a Load+Load+ADD sequence (for the same reason).
      
   But, what about:   
    v0=*cs++;   
    //... something else unrelated to cs (or v0).   
   Well, then the ADD gets executed in parallel with whatever follows, so
   it may still work out to a 1-cycle latency in this case.
      
      
   And, a feature is not particularly compelling when its main obvious use
   cases would see little or no performance gain (or would actually end up
   slower than what one does in its absence).

   It only really works if one has a 1-cycle ADD; otherwise, seemingly the
   only real advantage of auto-increment is making the binaries slightly
   smaller.
      
      
   Wouldn't take much to re-add it though; as noted, the ancestor of the
   current backend was written for an ISA that did have auto-increment. I
   just sort of ended up dropping it as it wasn't really worth it. Not
   only was it not particularly effective, it also tended to be a lot
   further down the ranking in terms of usage frequency of addressing
   modes. If one excludes using it for PUSH/POP, its usage frequency
   basically falls to "hardly anything". Otherwise, you can basically
   count how many times you see "*ptr++" or similar in C; that is about
   all it would ever be used for, and even in C it is relatively
   infrequent.
      
      
      
      
   But, yeah, as noted, the major areas where RISC-V tends to lose out
   IMHO are:
    Lack of Indexed Load/Store;   
    Crappy handling of large constants and lack of large immediate values.   
      
   I had noted before, that the specific combination of adding these features:   
    Indexed Load/Store;   
    Load/Store Pair;   
    Jumbo Prefixes.   
   This combination improves code density over plain RV64G/RV64GC, and
   also gains a roughly 40-50% speedup in programs like Doom.
      
   All while not significantly increasing logic cost over what one would
   already need for a 2-wide machine. It could make sense to skip them for
   a 1-wide machine, but then you don't really care too much about
   performance if going 1-wide.
      
      
   Then again, Indexed Load/Store, due to a "register port issue" for   
   Indexed Store, does give a performance advantage to going 3 wide over 2   
   wide even if the 3rd lane is rarely used otherwise.   
      
      
   Though, one could argue that the relative delta (of assuming these
   features, over plain RV64GC) is slightly less if one assumes a CPU with
   1-cycle latency on ALU instructions and similar. But this is still kind
   of weak IMO: ideally, the latency cost of ADD and similar should affect
   everything equally, and the fact that 2-cycle ADDs and shifts
   disproportionately hurt RV64G/RV64GC performance is not to RV64G's
   merit.
      
   Well, and Zba helps, but not fully. If SHnADD still has a 2c
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   