From: anton@mips.complang.tuwien.ac.at   
      
   BGB writes:   
   >On 12/29/2025 12:35 PM, Anton Ertl wrote:   
   [...]   
   >One usual downside is that to utilize a 16-bit ISA with a smaller   
   >register space, one needs to reuse registers more frequently, which then   
   >reduces ILP due to register conflicts. So, smaller code at the expense   
   >of worse performance.   
      
   For designs like RISC-V C and Thumb2, there is always the option to   
   use the uncompressed instruction. So you may tune your RISC-V   
   compiler to prefer registers r8-r15 for those pseudo-registers that   
   occur in instructions where such a register allocation may lead to a   
   compressed encoding.   
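   A minimal sketch of that tuning idea, not taken from any real
   compiler: the 3-bit register fields of the RVC formats address
   x8-x15, so a toy allocator can hand those registers to the
   pseudo-registers that appear most often in compressible
   instructions. The function name, the "compressible uses" metric and
   the fallback pool x16-x31 are assumptions made for illustration;
   liveness, interference and spilling are ignored.

```python
# Registers reachable through the 3-bit register fields of RVC: x8..x15.
RVC_REGS = list(range(8, 16))
# Fallback pool for this toy model (an arbitrary assumption).
OTHER_REGS = list(range(16, 32))


def assign_registers(compressible_uses):
    """Map each pseudo-register to an x-register number.

    compressible_uses: dict of pseudo-register name -> number of
    instructions in which an x8..x15 assignment could allow a 16-bit
    encoding (a hypothetical metric the compiler would have to compute).
    """
    # Most compressible uses first; ties broken by name for determinism.
    order = sorted(compressible_uses,
                   key=lambda p: (-compressible_uses[p], p))
    pool = RVC_REGS + OTHER_REGS
    if len(order) > len(pool):
        raise ValueError("toy model does not spill")
    return {p: pool[i] for i, p in enumerate(order)}


if __name__ == "__main__":
    uses = {"v%d" % i: i for i in range(10)}
    for name, reg in sorted(assign_registers(uses).items()):
        print(name, "-> x%d" % reg)
```

   Pseudo-registers that do not fit into x8-x15 simply get the
   uncompressed encoding; nothing forces more frequent register reuse.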
      
   Write-after-read and write-after-write do not reduce the IPC of OoO   
   implementations. On the contrary, write-after-read may be beneficial   
   by releasing the old physical register for the register name. And   
   designing a compressed CPU instruction set for in-order processing is   
   not a good idea for general-purpose computing.   
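   To illustrate why WAR and WAW do not matter after renaming, here is
   a toy renamer, assuming an unbounded supply of physical registers
   and instructions of the form (dest, src1, src2). It does not model
   the free list, so the release of the old physical register
   mentioned above is not shown; the function name and the instruction
   format are made up for this sketch.

```python
def rename(instrs, num_arch_regs=4):
    """Rename architectural registers to physical registers.

    instrs: list of (dest, src1, src2) architectural register numbers.
    Returns a list of (pdest, psrc1, psrc2) physical register numbers.
    """
    # Initially architectural register i lives in physical register i.
    table = list(range(num_arch_regs))
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instrs:
        # Look up the sources before remapping the destination, so an
        # instruction like r1 = r1 + r2 reads the old r1.
        p1, p2 = table[src1], table[src2]
        # Every write gets a fresh physical register.
        table[dest] = next_phys
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed


if __name__ == "__main__":
    program = [
        (1, 2, 3),  # r1 = r2 op r3
        (2, 0, 0),  # r2 = r0 op r0   (WAR on r2 against the first)
        (1, 2, 2),  # r1 = r2 op r2   (WAW on r1, RAW on r2)
    ]
    for ins in rename(program):
        print("p%d = p%d op p%d" % ins)
```

   After renaming, all three destinations are different physical
   registers, so only the true (read-after-write) dependence of the
   third instruction on the second remains.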
      
   >Things like ALU status flags aren't free either.   
      
   Yes, they cost their own renaming resources.   
      
   >Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.   
   >   
   >Major limitations here being more:   
   > Things like register forwarding cost have non-linear scaling;   
   > For an in-order machine, usable ILP drops off very rapidly;   
   > ...   
      
   ILP is a property of a program. I assume that what you mean is that   
   the IPC benefits of more width have quickly diminishing returns on   
   in-order machines.   
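   The distinction can be made concrete with a toy model of in-order
   issue. The assumptions are mine: unit latency, a result is usable
   in the cycle after issue, no limits on functional units, and issue
   stops at the first instruction whose operands are not ready. The
   program is two interleaved dependence chains, so its ILP is 2
   regardless of the machine.

```python
def in_order_cycles(deps, width):
    """Cycles needed to issue a program on a toy in-order machine.

    deps[i]: indices of earlier instructions whose results
    instruction i needs. width: maximum instructions issued per cycle.
    """
    issue_cycle = []  # issue_cycle[i] = cycle in which i was issued
    cycle = 0
    i = 0
    while i < len(deps):
        issued = 0
        # Issue consecutive instructions until the width is used up or
        # an instruction needs a result produced in this same cycle.
        while (i < len(deps) and issued < width
               and all(issue_cycle[d] < cycle for d in deps[i])):
            issue_cycle.append(cycle)
            i += 1
            issued += 1
        cycle += 1
    return cycle


if __name__ == "__main__":
    # Two interleaved chains: 0->2->4->6 and 1->3->5->7.
    program = [[], [], [0], [1], [2], [3], [4], [5]]
    for width in (1, 2, 3, 4):
        cycles = in_order_cycles(program, width)
        print("width", width, "IPC", len(program) / cycles)
```

   In this model the IPC stops improving once the width reaches the
   ILP of the program; a real in-order core additionally loses IPC to
   longer latencies and cache misses.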
      
   >There seems to be a local optimum between 2 and 3.   
   >   
   >   
   >Say, for example, if one had an in-order machine with 5 ALUs, one would   
   >be hard pressed to find much code that could actually make use of the 5   
   >ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is   
   >more often useful for spare register ports and similar (with 3-wide ALU   
   >being a minority case)   
      
   We have some interesting case studies: The Alpha 21164(a) and the ARM   
   Cortex-A53 and A55. They are all in-order designs, their numbers of   
   functional units are pretty similar, and, in particular, they all have   
   2 integer ALUs. But the 21164 can decode and execute 4 instructions   
   per cycle, while the Cortex-A53 and A55 are only two-wide. My guess   
   is that this is due to the decoding cost of ARM A32/T32 and A64   
   (decoders for two instruction sets, one of which has 16-bit and 32-bit   
   instructions).   
      
   The Cortex-A55 was succeeded by the A510, which is three-wide, and   
   that was succeeded by the A520, which is three-wide with two ALUs and   
   supports only ARM A64.   
      
   Widening the A510, which still supports both instruction sets, is   
   (weak) counterevidence for my theory about why A53/A55 are only   
   two-wide at decoding. The fact that the A520 returns to two integer   
   ALUs indicates that the third integer ALU provides little IPC benefit   
   in an in-order design.   
      
   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
    Mitch Alsup,    
      