From: already5chosen@yahoo.com   
      
   On Tue, 30 Dec 2025 07:36:44 GMT   
   anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:   
      
   > BGB writes:   
   > >On 12/29/2025 12:35 PM, Anton Ertl wrote:   
   > [...]   
   > >One usual downside is that to utilize a 16-bit ISA with a smaller   
   > >register space, one needs to reuse registers more frequently, which   
   > >then reduces ILP due to register conflicts. So, smaller code at the   
   > >expense of worse performance.   
   >   
   > For designs like RISC-V C and Thumb2, there is always the option to   
   > use the uncompressed instruction. So you may tune your RISC-V   
   > compiler to prefer registers r8-r15 for those pseudo-registers that   
   > occur in instructions where such a register allocation may lead to a   
   > compressed encoding.   
   >   
   > Write-after-read and write-after-write do not reduce the IPC of OoO   
   > implementations. On the contrary, write-after-read may be beneficial   
   > by releasing the old physical register for the register name. And   
   > designing a compressed CPU instruction set for in-order processing is   
   > not a good idea for general-purpose computing.   
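
To make the renaming point above concrete, here is a minimal sketch in
Python. It is a toy model, not any real core's renamer: the tuple format
(dest, src, src, ...), the helper names rename() and hazards(), and the
unbounded supply of physical registers p0, p1, ... are all assumptions
made for illustration. It reuses r8 and r9 the way a compiler chasing
compressed encodings might, then shows that only the true
(read-after-write) dependences survive renaming.

```python
from itertools import count


def hazards(program):
    """Return (raw, war, waw) as sets of (earlier, later) index pairs.

    program is a list of (dest, src, src, ...) tuples of register names.
    RAW pairs link a read to the most recent writer of that register;
    WAR and WAW pairs are the name dependences caused by register reuse.
    """
    raw, war, waw = set(), set(), set()
    last_writer = {}
    for j, (dest, *srcs) in enumerate(program):
        for s in srcs:
            if s in last_writer:
                raw.add((last_writer[s], j))
        for i in range(j):
            earlier_dest, *earlier_srcs = program[i]
            if dest in earlier_srcs:
                war.add((i, j))
            if dest == earlier_dest:
                waw.add((i, j))
        last_writer[dest] = j
    return raw, war, waw


def rename(program):
    """Give every write a fresh physical register (hypothetical p0, p1, ...)."""
    fresh = count()
    mapping = {}  # architectural name -> current physical register
    renamed = []
    for dest, *srcs in program:
        phys_srcs = []
        for s in srcs:
            # A register that is read before any write is a live-in value.
            if s not in mapping:
                mapping[s] = f"p{next(fresh)}"
            phys_srcs.append(mapping[s])
        # Sources were looked up first, so "r8 = r8 op r9" still reads
        # the old r8 before the new mapping replaces it.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], *phys_srcs))
    return renamed


program = [
    ("r8", "r9", "r10"),   # r8  = r9  op r10
    ("r11", "r8", "r9"),   # r11 = r8  op r9
    ("r8", "r12", "r13"),  # r8  = r12 op r13  (reuses r8)
    ("r9", "r8", "r11"),   # r9  = r8  op r11  (reuses r9)
]

print("before renaming:", hazards(program))
print("after renaming: ", hazards(rename(program)))
```

In this model the WAR and WAW sets are empty after renaming while the
RAW set is unchanged. A real renamer has a finite physical register
file, which is where the point about a later write releasing the old
physical register comes in; the sketch does not model that.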
   >   
   > >Things like ALU status flags aren't free either.   
   >   
   > Yes, they cost their own renaming resources.   
   >   
   > >Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.   
   > >   
   > >Major limitations here being more:   
   > > Things like register forwarding cost have non-linear scaling;   
   > > For an in-order machine, usable ILP drops off very rapidly;   
   > > ...   
   >   
   > ILP is a property of a program. I assume that what you mean is that   
   > the IPC benefits of more width have quickly diminishing returns on   
   > in-order machines.   
   >   
   > >There seems to be a local optimum between 2 and 3.   
   > >   
   > >   
   > >Say, for example, if one had an in-order machine with 5 ALUs, one   
   > >would be hard pressed to find much code that could actually make use   
   > >of the 5 ALUs. One can sorta make use of 3 ALUs, but even then, the   
   > >3rd lane is more often useful for spare register ports and similar   
   > >(with 3-wide ALU being a minority case)   
   >   
   > We have some interesting case studies: The Alpha 21164(a) and the ARM   
   > Cortex-A53 and A55. They are all in-order designs, their numbers of   
   > functional units are pretty similar, and, in particular, they all have   
   > 2 integer ALUs. But the 21164 can decode and execute 4 instructions   
   > per cycle, while the Cortex-A53 and A55 are only two-wide. My guess   
   > is that this is due to the decoding cost of ARM A32/T32 and A64   
   > (decoders for two instruction sets, one of which has 16-bit and 32-bit   
   > instructions).   
   >   
   > The Cortex-A55 was succeeded by the A510, which is three-wide, and   
   > that was succeeded by the A520, which is three-wide with two ALUs and   
   > supports only ARM A64.   
   >   
   > Widening the A510, which still supports both instruction sets, is   
   > (weak) counterevidence for my theory about why A53/A55 are only   
   > two-wide at decoding. The fact that the A520 returns to two integer   
   > ALUs indicates that the third integer ALU provides little IPC benefit   
   > in an in-order design.   
   >   
   > - anton   
      
      
   Do you happen to have benchmarks that compare the performance of the   
   Alpha EV5 (21164) with the in-order Cortex-A cores?   
      