From: user5857@newsgrouper.org.invalid   
      
   Robert Finch posted:   
      
   > On 2025-12-31 12:12 p.m., MitchAlsup wrote:   
   > >   
   > > MitchAlsup posted:   
   > >   
   > >>   
   > >> BGB posted:   
   > >>   
   > >>> On 12/30/2025 1:36 AM, Anton Ertl wrote:   
   > >>>> BGB writes:   
   > >>>>> On 12/29/2025 12:35 PM, Anton Ertl wrote:   
   > >>>> [...]   
   > >>>>> One usual downside is that to utilize a 16-bit ISA with a smaller   
   > >>>>> register space, one needs to reuse registers more frequently, which then
   > >>>>> reduces ILP due to register conflicts. So, smaller code at the expense   
   > >>>>> of worse performance.   
   > >>>>   
   > >>>> For designs like RISC-V C and Thumb2, there is always the option to   
   > >>>> use the uncompressed instruction. So you may tune your RISC-V   
   > >>>> compiler to prefer registers r8-r15 for those pseudo-registers that   
   > >>>> occur in instructions where such a register allocation may lead to a   
   > >>>> compressed encoding.   
   > >>>>   
   > >>>> Write-after-read and write-after-write does not reduce the IPC of OoO   
   > >>>> implementations. On the contrary, write-after-read may be beneficial   
   > >>>> by releasing the old physical register for the register name. And   
   > >>>> designing a compressed CPU instruction set for in-order processing is   
   > >>>> not a good idea for general-purpose computing.   
   > >>>>   
   > >>>   
   > >>> Though, the main place where compressed instructions are likely to
   > >>> bring meaningful benefit is on small in-order machines.
   > >>   
   > >> Coincidentally, this is exactly where a fatter ISA wins big::
   > >> compare::   
   > >>   
   > >> LDD R7,[IP,R3<<3,DISP32]   
   > >>   
   > >> 1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against   
   > >   
   > > It is only 2 words   
   > >   
   > >> AUIPC Rt,lo(DISP32)
   > >> SLL Ri,R3,#3   
   > >> ADD Rt,Rt,hi(DISP32)   
   > >> ADD Rt,Rt,Ri   
   > >> LDD R7,0(Rt)   
   > >>   
   > >> 5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.   
   > >   
   > > This should be::   
   > >   
   > > AUIPC Rt,hi(DISP32)
   > > SLL Ri,R3,#3   
   > > ADD Rt,Rt,Ri   
   > > LDD R7,lo(DISP32)(Rt)   
   > >   
   > > 4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum   
   >   
   > An even fatter ISA (Qupls4) in theory:   
   >   
   > LOAD r7, disp56(ip+r3*8)   
      
   I could have shown the DISP64 version: 3 words.
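The small Python model below is a sketch of my own; it is not taken from either poster. It checks that the corrected AUIPC/SLLI/ADD/LD sequence quoted above forms the same effective address as the single `LDD R7,[IP,R3<<3,DISP32]`.

It relies on a few assumptions:

* `split_disp` follows the usual RISC-V convention that the low 12 bits are sign-extended by the load, so the upper part is rounded by 0x800.
* Real RV64 mnemonics are `SLLI` and `LD`, where the thread uses `SLL` and `LDD` as shorthand.
* The cycle notes assume single-cycle ALU ops and two-wide issue.
* The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers with wraparound


def split_disp(disp):
    """Split a signed 32-bit displacement into (hi20, lo12) such that
    hi20 * 4096 + lo12 == disp and lo12 fits a signed 12-bit field."""
    lo12 = ((disp & 0xFFF) ^ 0x800) - 0x800  # sign-extend the low 12 bits
    hi20 = (disp - lo12) >> 12               # exact: disp - lo12 is a multiple of 4096
    return hi20, lo12


def ea_fat(ip, r3, disp):
    """LDD R7,[IP,R3<<3,DISP32]: one instruction forms the whole address."""
    return (ip + (r3 << 3) + disp) & MASK64


def ea_riscv(ip, r3, disp):
    """Corrected four-instruction sequence; each line mirrors one instruction."""
    hi20, lo12 = split_disp(disp)
    rt = (ip + (hi20 << 12)) & MASK64  # AUIPC Rt,hi(DISP32)      (cycle 1)
    ri = (r3 << 3) & MASK64            # SLLI  Ri,R3,3            (cycle 1, independent)
    rt = (rt + ri) & MASK64            # ADD   Rt,Rt,Ri           (cycle 2)
    return (rt + lo12) & MASK64        # LD    R7,lo(DISP32)(Rt)  (address in cycle 3)


if __name__ == "__main__":
    for d in (0x1234, 0x1800, -4):
        print(hex(d), split_disp(d), ea_riscv(0x10000, 5, d) == ea_fat(0x10000, 5, d))
```

The comments trace the dependency chain behind the "3-cycles+cache hit" figure. AUIPC and SLLI are independent of each other, ADD needs both results, and the load needs the ADD.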
      
   > 1 instruction + 1 postfix = 2 words (96 bits), 1 cycle + cache hit minimum
   >   
   > The ISA is becoming a bit more stable now; the latest change was for
   > constant postfix instructions. Qupls used to have a somewhat convoluted
   > means of addressing constants on the cache line. Now it’s just
   > postfixes. The constant-routing information is now in the postfix and
   > uses four bits: two to select a register override and two to select the
   > constant quadrant. So, each postfix extends the constant in the
   > instruction (or previous postfix) by 36 bits.
   >   
   > Qupls can do   
   > ADD r7, r8, $64_bit_constant   
   >   
   > Using only two words (96 bits) and just a single cycle.   
      
   So can My 66000, but everyone and his brother thinks 96 bits is 3 words.
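The post does not give the Qupls field layout, so what follows is only a hypothetical model of how 36-bit postfix payloads could extend a base immediate.

The model makes these assumptions:

* `BASE_IMM_BITS = 28` is chosen so that one postfix reaches exactly 64 bits. That matches the two-word `ADD r7, r8, $64_bit_constant` example.
* The payloads are placed low to high and zero-extended.
* The register-override and quadrant bits are not modeled.

```python
POSTFIX_BITS = 36   # from the post: each postfix extends the constant by 36 bits
BASE_IMM_BITS = 28  # HYPOTHETICAL base-instruction immediate width (not from the post)


def extend_constant(base_imm, postfixes):
    """Concatenate the base immediate with postfix payloads, low bits first.
    Field order and zero extension are assumptions for illustration."""
    value = base_imm & ((1 << BASE_IMM_BITS) - 1)
    shift = BASE_IMM_BITS
    for payload in postfixes:
        value |= (payload & ((1 << POSTFIX_BITS) - 1)) << shift
        shift += POSTFIX_BITS
    return value


def postfixes_needed(const_bits):
    """How many postfixes a constant of const_bits bits would need."""
    extra = max(0, const_bits - BASE_IMM_BITS)
    return -(-extra // POSTFIX_BITS)  # ceiling division


if __name__ == "__main__":
    c = 0x123456789ABCDEF0
    low, high = c & ((1 << BASE_IMM_BITS) - 1), c >> BASE_IMM_BITS
    print(hex(extend_constant(low, [high])), postfixes_needed(64))
```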
      
   > I prefer to use multiply ‘*’ rather than shift in scaled indexed
   > addressing, as a couple of CPUs had multiply by five and ten in addition
   > to 1, 2, 4, and 8. What if one wants to scale by 3?
      
   If you have the bits, why not.   
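To make the scaling question concrete: a power-of-two scale is just a shift. A scale such as 3, 5, or 10 can be built from shifts plus adds: 3i = 2i + i, 5i = 4i + i, and 10i = (4i + i) << 1.

The sketch below is generic Python and does not model any particular address-generation unit. `adds_needed` counts the additions a naive shift-and-add needs. That is one rough way to gauge what "having the bits" costs in hardware.

```python
def scale_shift_add(index, scale):
    """Compute index * scale (scale >= 1) using only shifts and adds."""
    result, bit = 0, 0
    while scale:
        if scale & 1:
            result += index << bit  # one shifted copy per set bit of the scale
        scale >>= 1
        bit += 1
    return result


def adds_needed(scale):
    """Additions a naive shift-and-add needs: one fewer than the set bits."""
    return bin(scale).count("1") - 1


if __name__ == "__main__":
    for s in (1, 2, 3, 4, 5, 8, 10):
        print(s, scale_shift_add(7, s), adds_needed(s))
```

By this count, 1, 2, 4, and 8 need no adder at all, while 3, 5, and 10 each need a single add in front of the usual base + index + displacement sum.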
      
   > It is also possible to encode 128-bit constants, but the current   
   > implementation does not support them.   
   >   
   > Managed to get to some early synthesis trials and found the instruction
   > dispatch to be on the critical timing path. I am a bit stumped as to how
   > to improve it, as it is very simple already: it just copies from one set
   > of pipeline registers to another headed towards the reservation
   > stations. The tools report timing good to 37 MHz; I was shooting for at
   > least 40 MHz.
   >   
   > Found a couple of spots where the code was simple but too slow. One was
   > in dynamic register selection: the code packed the register selections
   > to a minimum, but that took way too many logic levels.
      
   Those are some of the driving inputs to "An architecture is as much about   
   what gets left out as what gets put in."   
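Robert's register-selection code is not shown. The following is only a generic Python illustration of why "packing" selections costs logic depth.

When valid entries are compacted into the lowest slots, each entry's destination slot is the count of valid entries below it (a prefix sum). Every output therefore depends on every lower input. Leaving each entry in a fixed slot with its valid bit needs no such count. The function names are hypothetical.

```python
def destination_slots(valid):
    """Packed slot for each entry: the number of valid entries below it."""
    slots, count = [], 0
    for v in valid:
        slots.append(count if v else None)  # depends on all lower valid bits
        count += v
    return slots


def pack_selections(valid, regs):
    """Compact the selected register numbers into the lowest slots."""
    out = [None] * sum(valid)
    for slot, r in zip(destination_slots(valid), regs):
        if slot is not None:
            out[slot] = r
    return out


def unpacked_selections(valid, regs):
    """Fixed slot per entry: no prefix count, invalid slots just stay empty."""
    return [r if v else None for v, r in zip(valid, regs)]


if __name__ == "__main__":
    valid = [False, True, False, True, True]
    regs = [3, 7, 9, 12, 30]
    print(pack_selections(valid, regs), unpacked_selections(valid, regs))
```

In hardware the prefix count becomes an adder chain. Even a parallel-prefix tree needs on the order of log2(n) adder stages feeding a wide multiplexer, which is the kind of depth that shows up on a critical path. The fixed-slot form avoids that logic at the cost of wider storage or ports downstream.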
      
   > It is quite an art to get something working in minimum clock cycles and   
   > fast clock frequency.   
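For scale, the gap between the reported 37 MHz and the 40 MHz target works out as follows, where T is the clock period and f the frequency:

```latex
\begin{aligned}
T &= \frac{1}{f}, \qquad
T_{37} = \frac{1}{37\,\text{MHz}} \approx 27.03\,\text{ns}, \qquad
T_{40} = \frac{1}{40\,\text{MHz}} = 25.00\,\text{ns} \\
\Delta T &= T_{37} - T_{40} \approx 2.03\,\text{ns}, \qquad
\frac{\Delta T}{T_{37}} = 1 - \frac{37}{40} = 0.075
\end{aligned}
```

So the dispatch path needs to lose roughly 2 ns, about 7.5% of the period at 37 MHz.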
   >   
   > >>   
   > >>> Any OoO machine is also likely to have a lot of RAM and a decent sized   
   > >>> I$, so much of any benefit is likely to go away in this case.   
   > >>   
   > >> s/go away/greatly ameliorated/   
   > >>   
   > >> ------------------------   
   > >>>> ILP is a property of a program. I assume that what you mean is that   
   > >>>> the IPC benefits of more width have quickly diminishing returns on   
   > >>>> in-order machines.   
   > >>>>   
   > >>>   
   > >>> The ILP is a property of the code, yes, but how much exists, and how
   > >>> much of it is actually usable, is affected by the processor implementation.
   > >>   
   > >> I agree that ILP is more aligned with code than with program.   
   > >> {see above example where 1 instruction does the work of 5}   
   >   
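Anton's point earlier in the thread is that write-after-read and write-after-write do not cost an OoO core IPC. That follows from register renaming.

The sketch below is a simplified Python model with these assumptions:

* There are no branches, no retirement logic, and no recovery.
* The register file sizes are 32 architectural and 64 physical registers.
* The function name is hypothetical.

The program reuses r5, as a register-starved 16-bit encoding might force.

```python
from collections import deque


def rename(program, num_arch=32, num_phys=64):
    """Rename (dest, srcs) tuples onto physical registers.
    Returns (pdest, psrcs, pfree) per instruction, where pfree is the old
    physical register that becomes reclaimable when this instruction retires."""
    table = list(range(num_arch))              # arch reg i -> phys reg i at start
    free = deque(range(num_arch, num_phys))    # unallocated physical registers
    out = []
    for dest, srcs in program:
        psrcs = [table[s] for s in srcs]       # read current mappings first
        pfree = table[dest]                    # previous mapping of dest
        pdest = free.popleft()                 # fresh register: no WAR/WAW stall
        table[dest] = pdest
        out.append((pdest, psrcs, pfree))
    return out


if __name__ == "__main__":
    prog = [(5, [1, 2]),   # r5 = r1 + r2
            (6, [5, 3]),   # r6 = r5 + r3
            (5, [4, 4]),   # r5 = r4 + r4  (WAR with #2, WAW with #1)
            (7, [5, 6])]   # r7 = r5 + r6
    for entry in rename(prog):
        print(entry)
```

After renaming, the second write to r5 goes to a different physical register. It has no dependence on the first two instructions and can issue at once. Its `pfree` entry is the old r5 mapping, which is the "releasing the old physical register" effect Anton mentions.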
      