From: user5857@newsgrouper.org.invalid   
      
   BGB posted:   
      
   > On 1/2/2026 12:48 PM, MitchAlsup wrote:   
   > > -----merciful snip----------   
   > > I have heard arguments in both directions::   
   > >   
   > > a) DISP64 only contains 33-bits of actual information   
   > > b) If DISP64 is absolute do you still need Rbase ??   
   > > when you have Rindex<
   > > c) how can the HW KNOW ?!?
   > > ------------------   
   > >>>> I prefer to use multiply ‘*’ rather than shift in scaled indexed   
   > >>>> addressing as a couple of CPUs had multiply by five and ten in addition   
   > >>>> to 1,2,4,8. What if one wants to scale by 3?   
   > >>>   
   > >>> If you have the bits, why not.   
   > >>>   
   > >>   
   > >> Higher resource cost and latency is a concern...   
   > >   
   > > Yes, your design is living on the edge.   
      
   This, BTW, is a compliment--the best an architect can do is make every
   stage of the pipeline have the same delay !!   
      
   > I am not sure how it would be pulled off for larger displacements or   
   > more general scales.   
      
   Better adder technology. We routinely pound an 11-gate adder into the   
   delay of 8×Fan4 gate delays.   
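
For anyone following along, here is a rough C model of the scaled-index address arithmetic being argued about above. It is only a sketch: ea_scaled and its parameters are made-up names, not any real ISA's definition. The power-of-two scales are a plain shift, while 3, 5 and 10 each need one extra add ahead of the final address add, which is where the latency concern comes from.

```c
#include <stdint.h>

/* Hypothetical model of [Rbase + Rindex*scale + disp].
 * Arithmetic is done modulo 2^64, as a 64-bit AGEN adder would. */
static uint64_t ea_scaled(uint64_t base, uint64_t index,
                          unsigned scale, int64_t disp)
{
    uint64_t scaled;

    switch (scale) {
    case 1:  scaled = index;                       break; /* wire       */
    case 2:  scaled = index << 1;                  break; /* shift only */
    case 4:  scaled = index << 2;                  break;
    case 8:  scaled = index << 3;                  break;
    case 3:  scaled = (index << 1) + index;        break; /* one extra add */
    case 5:  scaled = (index << 2) + index;        break;
    case 10: scaled = ((index << 2) + index) << 1; break;
    default: scaled = index * scale;               break; /* general multiply */
    }
    return base + scaled + (uint64_t)disp;
}
```

The default case stands in for a fully general multiplier, which is the expensive option nobody above is seriously proposing for the address path.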
   ------------------------------------   
   > Say:   
   > void _mem_cpy16bytes(void *dst, void *src)   
   > {   
   > byte *cs, *ct;   
   > cs=src; ct=dst;   
   > ct[ 0]=cs[ 0]; ct[ 1]=cs[ 1]; ct[ 2]=cs[ 2]; ct[ 3]=cs[ 3];   
   > ct[ 4]=cs[ 4]; ct[ 5]=cs[ 5]; ct[ 6]=cs[ 6]; ct[ 7]=cs[ 7];   
   > ct[ 8]=cs[ 8]; ct[ 9]=cs[ 9]; ct[10]=cs[10]; ct[11]=cs[11];   
   > ct[12]=cs[12]; ct[13]=cs[13]; ct[14]=cs[14]; ct[15]=cs[15];   
   > }   
   > Is, slow...   
   Better ISA::   
      
    MM Rto,Rfrom,#16   
      
   and let HW do all the tricky/cool stuff--just make sure that, if you put it
   in, you fully support all the cool/tricky stuff.
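
Lacking such an instruction, the closest thing in portable C is probably something like the sketch below. copy16 is a hypothetical name, and whether a given compiler lowers the fixed-size memcpy calls to wide loads and stores depends on the target, so treat it as an illustration rather than a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy exactly 16 bytes.
 * Both halves are loaded before either is stored, so the result is
 * correct even when the two 16-byte regions overlap. */
static void copy16(void *dst, const void *src)
{
    uint64_t lo, hi;

    memcpy(&lo, src, 8);
    memcpy(&hi, (const unsigned char *)src + 8, 8);
    memcpy(dst, &lo, 8);
    memcpy((unsigned char *)dst + 8, &hi, 8);
}
```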
      
   > The store-to-load forwarding penalty being because LZ4 decompression   
   > often involves copying memory on top of itself, and the possible   
   > workarounds for this issue only offer competitive performance for blocks   
   > that are much longer than the typical copy (in the common case of a   
   > match under 20 bytes, it often being faster to just copy bytes and eat   
   > the cost).   
   >   
   > if(dist>=16)   
   > {   
   > if(len>=20)   
   > { more generalized/faster copy }   
   > else   
   > { just copy 20 bytes. }   
   > }else   
   > {   
   > if(len>=20)   
   > { generate pattern and fill with stride }   
   > else   
   > { copy 20 bytes over itself. }   
   > }   
      
   This problem is easier to solve in HW than in source code.
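
For reference, the source-level version of the overlapping case can be sketched as below. This is not the quoted code: it swaps in a doubling copy instead of the fixed 20-byte over-copy, and match_copy is a made-up name. It assumes dist >= 1 and that the caller has room for len bytes at dst.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical LZ4-style match copy: append len bytes at dst, taken
 * from dist bytes behind dst. When dist < len the regions overlap and
 * the output is the dist-byte pattern repeated. */
static unsigned char *match_copy(unsigned char *dst, size_t dist, size_t len)
{
    const unsigned char *src = dst - dist;
    size_t done = 0;

    while (done < len) {
        size_t avail = dist + done;   /* valid bytes starting at src */
        size_t n = len - done;

        if (n > avail)
            n = avail;
        /* src + n <= dst + done, so this memcpy never overlaps. */
        memcpy(dst + done, src, n);
        done += n;                    /* stays a multiple of dist until the last chunk */
    }
    return dst + len;
}
```

Each pass copies everything that is already valid, so the chunk size roughly doubles and a short match finishes in one or two memcpy calls. It still pays the store-to-load forwarding cost on each pass, which is the part HW can hide and source code cannot.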
      
   > >   
   > > For reasons like this, I only have   
   > >   
   > > CALL DISP26<<2 // call through DECODE   
   > > and   
   > > CALX [*address] // call through table   
   > > and   
   > > CALA [address] // call through AGEN   
   > >   
   > > which prevents compiler and assembler abuse.   
   >   
   >   
   > They went and defined that you can use any register as a link register,   
      
   Another case where they screwed up.....   
      
   > but in practice there is basically no reason to use alternative link   
   > registers. ASM programmer people could do so, but not seen all that much   
   > evidence of this being a thing thus far.   
      
   In Mc88k we recognized (and made the compiler follow)
    JMP R1 // return from subroutine   
    JMP ~R1 // switch   
   -------------------   
   > Well, say, vs my approach:   
   > LD X1, Disp(SP); ....; JALR X0, 0(X1)   
   > The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it   
   > turns into a predicted unconditional branch).   
   >   
   > But:   
   > LD X1, -8(SP); JALR X0, 0(X1)   
   > Yeah, enjoy those 13 or so clock cycles.   
      
    CALX R0,[address]   
   ....   
      
   The address is computed in the normal AGEN but processed in the ICache,
   which FETCHes wide data (128 bits on a small machine, a whole cache line
   on a larger machine) and runs the result through the instruction buffer.
   4 cycles.
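
At the source level, the pattern a call-through-table instruction serves looks roughly like the sketch below. The names (handler_fn, handlers, dispatch) are invented for illustration, and whether any particular compiler would emit a single table call for it is an assumption on my part.

```c
#include <stddef.h>

typedef int (*handler_fn)(int);

static int add_one(int x) { return x + 1; }
static int twice(int x)   { return x * 2; }

/* Hypothetical table of entry points, e.g. a vtable or a GOT-like array. */
static const handler_fn handlers[] = { add_one, twice };

/* Load the target address from the table, then call it.
 * Returns -1 for an out-of-range index. */
static int dispatch(size_t idx, int arg)
{
    if (idx >= sizeof handlers / sizeof handlers[0])
        return -1;
    return handlers[idx](arg);
}
```

On a load-then-jump ISA this is two dependent instructions (load the pointer, then an indirect call), which is the sequence the latency numbers quoted above are about.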
      
   -------------   
   > Looks over a sliding window of 10 or 12 instructions:   
   > 4 preceding instructions (-4 to -1);   
    4 new instructions on previous predicted path (0 to 3);   
    4 alternate instructions on current predicted path   
   // so one can decode and issue non-sequential instructions   
      