comp.arch
MitchAlsup to All
Re: Variable-length instructions
03 Jan 26 02:05:15
   From: user5857@newsgrouper.org.invalid   
      
   BGB  posted:   
      
   > On 1/2/2026 12:48 PM, MitchAlsup wrote:   
   > > -----merciful snip----------   
   > > I have heard arguments in both directions::   
   > >   
   > > a) DISP64 only contains 33-bits of actual information   
   > > b) If DISP64 is absolute do you still need Rbase ??   
   > >     when you have Rindex<<scale   
   > > c) how can the HW KNOW ?!?   
   > > ------------------   
   > >>>> I prefer to use multiply ‘*’ rather than shift in scaled indexed   
   > >>>> addressing as a couple of CPUs had multiply by five and ten in addition   
   > >>>> to 1,2,4,8. What if one wants to scale by 3?   
   > >>>   
   > >>> If you have the bits, why not.   
   > >>>   
   > >>   
   > >> Higher resource cost and latency is a concern...   
   > >   
   > > Yes, your design is living on the edge.   
      
   This, BTW, is a compliment--the best an architect can do is make every   
   stage of the pipeline have the same delay !!   
      
   > I am not sure how it would be pulled off for larger displacements or   
   > more general scales.   
      
   Better adder technology. We routinely pound an 11-gate adder into   
   8×Fan4 gate delays.   
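
To make the scale-by-3 question concrete, here is a minimal C sketch of the address arithmetic under discussion. The helper names ea_shift and ea_scale3 are hypothetical and only model the arithmetic, not any particular ISA or AGEN design. The point is that a power-of-two scale is a free shift feeding a three-input sum, while a scale of 3 adds one more addend.

```c
#include <stdint.h>

/* Hypothetical helper: effective address with a power-of-two scale.
 * base + (index << shift) + disp is a three-input sum. */
static uint64_t ea_shift(uint64_t base, uint64_t index, unsigned shift, int64_t disp)
{
    return base + (index << shift) + (uint64_t)disp;
}

/* Hypothetical helper: scale by 3 rewritten as (index << 1) + index.
 * The sum now has four inputs, which is where extra adder cost comes from. */
static uint64_t ea_scale3(uint64_t base, uint64_t index, int64_t disp)
{
    return base + (index << 1) + index + (uint64_t)disp;
}
```

Whether the fourth addend fits in the cycle is exactly the adder-delay argument above; the sketch only shows where the extra input appears.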
   ------------------------------------   
   > Say:   
   >    void _mem_cpy16bytes(void *dst, void *src)   
   >    {   
   >      byte *cs, *ct;   
   >      cs=src; ct=dst;   
   >      ct[ 0]=cs[ 0];    ct[ 1]=cs[ 1];    ct[ 2]=cs[ 2];    ct[ 3]=cs[ 3];   
   >      ct[ 4]=cs[ 4];    ct[ 5]=cs[ 5];    ct[ 6]=cs[ 6];    ct[ 7]=cs[ 7];   
   >      ct[ 8]=cs[ 8];    ct[ 9]=cs[ 9];    ct[10]=cs[10];    ct[11]=cs[11];   
   >      ct[12]=cs[12];    ct[13]=cs[13];    ct[14]=cs[14];    ct[15]=cs[15];   
   >    }   
   > Is, slow...   
   Better ISA::   
      
        MM   Rto,Rfrom,#16   
      
   and let HW do all the tricky/cool stuff--just make sure that, if you put   
   it in, you fully support all the cool/tricky stuff.   
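
On an ISA without such an instruction, the nearest portable analogue is a fixed-size library copy. The sketch below assumes MM behaves like a memmove (overlap-safe); copy16 is a hypothetical replacement for the quoted byte-by-byte routine, not anyone's actual code. With a constant size, compilers commonly lower the call to a few wide loads and stores.

```c
#include <string.h>

/* Hypothetical stand-in for the quoted _mem_cpy16bytes.
 * The constant length lets the compiler choose wide loads/stores;
 * memmove is used so overlapping src/dst still give the right result. */
static void copy16(void *dst, const void *src)
{
    memmove(dst, src, 16);
}
```

If the caller can guarantee the two ranges never overlap, memcpy would serve equally well here.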
      
   > The store-to-load forwarding penalty being because LZ4 decompression   
   > often involves copying memory on top of itself, and the possible   
   > workarounds for this issue only offer competitive performance for blocks   
   > that are much longer than the typical copy (in the common case of a   
   > match under 20 bytes, it often being faster to just copy bytes and eat   
   > the cost).   
   >   
   >    if(dist>=16)   
   >    {   
   >      if(len>=20)   
   >        { more generalized/faster copy }   
   >      else   
   >        { just copy 20 bytes. }   
   >    }else   
   >    {   
   >      if(len>=20)   
   >         { generate pattern and fill with stride }   
   >      else   
   >         { copy 20 bytes over itself. }   
   >    }   
      
   This problem is more easily solved in HW than in source code.   
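
For reference, the overlap semantics that make this awkward in software can be sketched in C as follows. The name lz_match_copy and its signature are hypothetical, and unlike the quoted pseudocode this version never copies past len. When dist < len the source and destination overlap, and the copy must run forward byte by byte so that the dist-byte pattern repeats.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from (dst - dist) to dst.
 * Caller guarantees dist >= 1, that dist valid bytes precede dst,
 * and that len bytes of room follow dst. */
static void lz_match_copy(unsigned char *dst, size_t dist, size_t len)
{
    const unsigned char *src = dst - dist;

    if (dist >= len) {
        /* Ranges do not overlap: a plain block copy is fine. */
        memcpy(dst, src, len);
        return;
    }
    /* Ranges overlap: a forward byte copy re-reads bytes it has just
     * written, replicating the dist-byte pattern. */
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}
```

Note that memmove is not a substitute in the overlapping case: it preserves the original source bytes instead of replicating the pattern, which is not what an LZ match means.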
      
   > >   
   > > For reasons like this, I only have   
   > >   
   > >        CALL  DISP26<<2       // call through DECODE   
   > > and   
   > >        CALX  [*address]      // call through table   
   > > and   
   > >        CALA  [address]       // call through AGEN   
   > >   
   > > which prevents compiler and assembler abuse.   
   >   
   >   
   > They went and defined that you can use any register as a link register,   
      
   Another case where they screwed up.....   
      
   > but in practice there is basically no reason to use alternative link   
   > registers. ASM programmer people could do so, but not seen all that much   
   > evidence of this being a thing thus far.   
      
   In Mc88k we recognized (and made the compiler follow)::   
          JMP   R1     // return from subroutine   
          JMP  ~R1     // switch   
   -------------------   
   > Well, say, vs my approach:   
   >    LD X1, Disp(SP); ....; JALR X0, 0(X1)   
   > The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it   
   > turns into a predicted unconditional branch).   
   >   
   > But:   
   >    LD X1, -8(SP); JALR X0, 0(X1)   
   > Yeah, enjoy those 13 or so clock cycles.   
      
        CALX    R0,[address]   
   ....   
      
   Address is computed in normal AGEN, but processed in ICache, where it   
   FETCHes wide data (128 bits on a small machine, a whole cache line on a   
   larger machine), and runs the result through the Instruction buffer.   
   4 cycles.   
      
   -------------   
   > Looks over a sliding window of 10 or 12 instructions:   
   >    4 preceding instructions (-4 to -1);   
        4 new instructions on previous predicted path (0 to 3);   
        4 alternate instructions on current predicted path   
   // so one can decode and issue non-sequential instructions   
      