From: cr88192@gmail.com   
      
   On 1/1/2026 12:13 PM, MitchAlsup wrote:   
   >   
   > Robert Finch posted:   
   >   
   >> On 2025-12-31 12:12 p.m., MitchAlsup wrote:   
   >>>   
   >>> MitchAlsup posted:   
   >>>   
   >>>>   
   >>>> BGB posted:   
   >>>>   
   >>>>> On 12/30/2025 1:36 AM, Anton Ertl wrote:   
   >>>>>> BGB writes:   
   >>>>>>> On 12/29/2025 12:35 PM, Anton Ertl wrote:   
   >>>>>> [...]   
   >>>>>>> One usual downside is that to utilize a 16-bit ISA with a smaller   
   >>>>>>> register space, one needs to reuse registers more frequently, which
   >>>>>>> then reduces ILP due to register conflicts. So, smaller code at the
   >>>>>>> expense of worse performance.
   >>>>>>   
   >>>>>> For designs like RISC-V C and Thumb2, there is always the option to   
   >>>>>> use the uncompressed instruction. So you may tune your RISC-V   
   >>>>>> compiler to prefer registers r8-r15 for those pseudo-registers that   
   >>>>>> occur in instructions where such a register allocation may lead to a   
   >>>>>> compressed encoding.   
   >>>>>>   
   >>>>>> Write-after-read and write-after-write does not reduce the IPC of OoO   
   >>>>>> implementations. On the contrary, write-after-read may be beneficial   
   >>>>>> by releasing the old physical register for the register name. And   
   >>>>>> designing a compressed CPU instruction set for in-order processing is   
   >>>>>> not a good idea for general-purpose computing.   
   >>>>>>   
   >>>>>   
   >>>>> Though, the main places where compressed instructions are likely to   
   >>>>> bring meaningful benefit, is on small in-order machines.   
   >>>>   
   >>>> Coincidentally; this is exactly where a fatter-ISA wins big::   
   >>>> compare::   
   >>>>   
   >>>> LDD R7,[IP,R3<<3,DISP32]   
   >>>>   
   >>>> 1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against   
   >>>   
   >>> It is only 2 words   
   >>>   
   >>>> AUIPC Rt,lo(DISP32)
   >>>> SLL Ri,R3,#3   
   >>>> ADD Rt,Rt,hi(DISP32)   
   >>>> ADD Rt,Rt,Ri   
   >>>> LDD R7,0(Rt)   
   >>>>   
   >>>> 5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.   
   >>>   
   >>> This should be::   
   >>>   
   >>> AUIPC Rt,hi(DISP32)
   >>> SLL Ri,R3,#3   
   >>> ADD Rt,Rt,Ri   
   >>> LDD R7,lo(DISP32)(Rt)   
   >>>   
   >>> 4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum   
   >>   
   >> An even fatter ISA (Qupls4) in theory:   
   >>   
   >> LOAD r7, disp56(ip+r3*8)   
   >   
   > I could have shown the DISP64 version--3-words   
   >   
      
   At 64-bits, displacements cease to make sense as a displacement.   
   Seems to make more sense to interpret these as [Abs64+Rb] rather than   
   [Rb+Disp64].   
      
   Except, then I have to debate what exactly I would do if I decide to   
   allow this case in XG2/XG3.   
      
      
   As noted:   
     [Rb+Disp10]: Usually scaled (excluding some special-cases);
     [Rb+Disp33]: Either scaled or unscaled.
       BGBCC is typically using unscaled displacements in this case.
         Unscaled range, +/- 4GB
         DW: +/- 16GB, QW: +/- 32GB
       XG2 and XG3 effectively left 1 bit extra, which indicates scale.
         0: Scaled by element size;
         1: Unscaled.
     [Rb+Disp64]: Would be understood as unscaled.
       TBD: Scale register (more likely to be useful, breaks symmetry);
         Unscaled register, preserves symmetry, but less likely useful.
           Would be consistent with the handling of RISC-V,
             which is always unscaled in this case.
       May be moot, as plain Abs64 would be the dominant case here.
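
   As a rough illustration of the Disp33 ranges listed above, here is a
   small Python sketch. The helper names are made up for this example, and
   it assumes a signed two's-complement displacement field with DW = 4
   bytes and QW = 8 bytes; it is not taken from BGBCC or any actual
   decoder.

```python
GB = 1 << 30

def disp_reach_bytes(disp_bits, elem_size=1):
    """Return (lowest, highest) byte offset for a signed displacement field.

    disp_bits: field width including the sign bit (assumed two's complement).
    elem_size: scale in bytes; 1 means unscaled.
    """
    lowest = -(1 << (disp_bits - 1)) * elem_size
    highest = ((1 << (disp_bits - 1)) - 1) * elem_size
    return lowest, highest

def effective_disp(disp_field, elem_size, unscaled_bit):
    """Apply the scale bit as described: 0 = scaled by element size, 1 = unscaled."""
    return disp_field if unscaled_bit else disp_field * elem_size

# Hypothetical element sizes: unscaled = 1 byte, DW = 4 bytes, QW = 8 bytes.
for name, size in [("unscaled", 1), ("DW", 4), ("QW", 8)]:
    lowest, _ = disp_reach_bytes(33, size)
    print(name, lowest // GB, "GB")   # unscaled -4, DW -16, QW -32
```

   The positive limit is one element short of the negative one, which is
   why the ranges are usually quoted as "+/- 4GB" and so on.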
      
      
   >> 1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum   
   >>   
   >> The ISA is becoming a bit more stable now; the latest change was for   
   >> constant postfix instructions. Qupls used to have a somewhat convoluted   
   >> means of addressing constants on the cache-line. Now it’s just   
   >> postfixes. The constant routing information is in the postfix now which   
   >> uses four bits. Two to select a register override, two to select   
   >> constant quadrant. So, postfixes extend constants in the instruction (or   
   >> previous postfix) by 36 bits.   
   >>   
   >> Qupls can do   
   >> ADD r7, r8, $64_bit_constant   
   >>   
   >> Using only two words (96 bits) and just a single cycle.   
   >   
   > So can My 66000, but everyone and his brother thinks 96-bits is 3 words.   
   >   
      
   So can XG2 and XG3.   
    And, now, can add RV+JX to this category.   
      
   Though, I am likely to still consider 96-bit ops as an extension of JX   
   (as supporting them would be a much bigger burden on a 2-wide machine   
   with a 64-bit instruction fetch; would require a 2-wide machine to still   
   support 96-bit fetch).   
      
   Well, and then there is another issue:   
   RV64GC + 96-bit encodings, reintroduces another potential problem that   
   existed in XG1:   
   At certain alignments, the 96-bit fetch can cross a boundary of 2   
   half-line fetches with a 16B line size.   
      
   Say, one letter per 16-bit word:   
    AAAA-BBBB //Line A   
    CCCC-DDDD //Line B   
   Then (low 4b of PC):   
    0: AAAABB   
    2: AAABBB   
    4: AABBBB   
    6: ABBBBC //Violates two half-lines   
    8: BBBBCC   
    A: BBBCCC   
    C: BBCCCC   
    E: BCCCCD //Violates two half-lines   
      
   Granted, the partial workaround is to fetch 144 bits internally (16-bits   
   past the end of the half-line); which does technically "fix" the problem   
   as far as architecturally-visible behavior is concerned.   
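
   The alignment table above can be reproduced with a few lines of Python.
   This is only a sketch under stated assumptions: 16-byte lines split
   into two 64-bit half-lines, 16-bit instruction granularity, and a fetch
   window that starts at the base of the half-line containing PC. The
   function names are invented for the example.

```python
WORD_BITS = 16
HALF_LINE_WORDS = 4   # one 64-bit half-line = four 16-bit words
INSN_WORDS = 6        # one 96-bit instruction = six 16-bit words

def half_lines_touched(pc_low, insn_words=INSN_WORDS):
    """Count the half-lines spanned by an instruction starting at byte offset pc_low."""
    first_word = pc_low // 2
    last_word = first_word + insn_words - 1
    return last_word // HALF_LINE_WORDS - first_word // HALF_LINE_WORDS + 1

def fits_in_window(pc_low, window_bits, insn_words=INSN_WORDS):
    """True if the instruction fits in a window starting at its half-line base."""
    offset_in_half = (pc_low // 2) % HALF_LINE_WORDS
    return (offset_in_half + insn_words) * WORD_BITS <= window_bits

bad = [pc for pc in range(0, 16, 2) if half_lines_touched(pc) > 2]
print([f"{pc:X}" for pc in bad])                                  # ['6', 'E']
print(all(fits_in_window(pc, 144) for pc in range(0, 16, 2)))     # True
print(all(fits_in_window(pc, 128) for pc in range(0, 16, 2)))     # False
```

   Under these assumptions, offsets 6 and E are the only ones that touch a
   third half-line, and a 144-bit window covers every case.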
      
   Or, just use the same "small brain" trick that BGBCC had used:   
   If free-form variable length instructions, insert a NOP pad if we step   
   on this turd;   
   Or, for code sequences where this turd would be unavoidable (running   
   through the WEXifier): Realign to 32 bits before entering WEX encoding   
   (scenario can't happen if 32-bit aligned).   
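
   A minimal sketch of the NOP-pad idea, in Python. This is not BGBCC's
   actual code; the function name and the (address, kind) output format
   are hypothetical, and it assumes a 16-bit NOP is available and that
   only 96-bit instructions starting at offsets 6 or E (mod 16) are a
   problem.

```python
BAD_OFFSETS = (0x6, 0xE)   # low 4 bits of PC where a 96-bit op spans 3 half-lines

def pad_layout(sizes_bits, start=0):
    """Assign addresses to instructions, padding before badly aligned 96-bit ops.

    sizes_bits: instruction sizes in bits (16, 32, 64 or 96).
    Returns a list of (address, kind), where kind is "nop" or "insn<bits>".
    """
    layout = []
    addr = start
    for bits in sizes_bits:
        if bits == 96 and (addr & 0xF) in BAD_OFFSETS:
            layout.append((addr, "nop"))   # one 16-bit NOP moves 6 -> 8 or E -> 0
            addr += 2
        layout.append((addr, f"insn{bits}"))
        addr += bits // 8
    return layout

print(pad_layout([16, 32, 96]))
# [(0, 'insn16'), (2, 'insn32'), (6, 'nop'), (8, 'insn96')]
```

   The realign-to-32-bits variant works for a similar reason: if a run
   starts 32-bit aligned and contains only sizes that are multiples of 32
   bits, every address stays at 0, 4, 8 or C (mod 16), so offsets 6 and E
   never come up.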
      
      
   Arguably, the latter scenario wouldn't have applied to RISC-V (and my JX
   encodings), except that (very recently) I did end up expanding BGBCC's
   WEXifier mechanism to cover RISC-V and XG3 (even if its role is slightly
   different in this case), which does technically reintroduce the issue if
   targeting RV64GC.
      
   Though, currently, it is only enabled for RV64 if using RV64G and speed   
   optimization.   
      
   In this case, since RV64G and XG3 don't use explicit bundling, its role   
   is instead to shuffle instructions to try to optimize how they fit in   
   the pipeline.   
      
      
      
   >> I prefer to use multiply ‘*’ rather than shift in scaled indexed   
   >> addressing as a couple of CPUs had multiply by five and ten in addition   
   >> to 1,2,4,8. What if one wants to scale by 3?   
   >   
   > If you have the bits, why not.   
   >   
      
   Higher resource cost and latency are a concern...
      
      
   >> It is also possible to encode 128-bit constants, but the current   
   >> implementation does not support them.   
   >>   
   >> Managed to get to some early synthesis trials and found the instruction   
   >> dispatch to be on the critical timing path. I am a bit stumped as to how   
   >> to improve it as it is very simple already. It just copies from one set   
      
   [continued in next message]   
      