comp.arch
Message 130,535 of 131,241
BGB to All
Re: trap and emulate, Lessons from the A
17 Dec 25 02:27:33
   From: cr88192@gmail.com   
      
   On 12/17/2025 1:11 AM, Lawrence D’Oliveiro wrote:   
   > On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:   
   >   
   >> Misaligned access is common enough here that, if it were not supported   
   >> natively, this would likely tank performance...   
   >   
   > Still there are/were some architectures that refused to support it.   
      
   Yes.   
      
   Or, like the "SiFive U74" and similar: the funny thing being that the
   RISC-V ISA uses unscaled displacements, but then the CPU handles
   misaligned access via internal traps (and is horribly slow at it)...
      
   Meanwhile, I prefer to have memcpy and LZ decompression where   
   "performance doesn't suck".   
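
A minimal sketch of why LZ decompression leans on misaligned access: match copies start at arbitrary byte offsets, so the fast path wants 8-byte loads and stores at any address. The helper names (`load_u64`, `store_u64`, `lz_copy_match`) are hypothetical and not taken from any particular decompressor, and the caller is assumed to leave at least 8 bytes of slack after the output, since the fast path may overshoot by up to 7 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned 64-bit load/store. A fixed-size memcpy is the portable way to
   express this; compilers typically lower it to a single load/store on
   targets with native misaligned access. */
static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, 8);
    return v;
}

static inline void store_u64(void *p, uint64_t v)
{
    memcpy(p, &v, 8);
}

/* Copy an LZ match of 'len' bytes from 'dist' bytes back in the output.
   Assumption: the caller guarantees 8 bytes of slack past dst+len. */
static uint8_t *lz_copy_match(uint8_t *dst, size_t dist, size_t len)
{
    const uint8_t *src = dst - dist;
    uint8_t *end = dst + len;

    assert(dist >= 1);
    if (dist >= 8) {
        /* Each 8-byte chunk reads only bytes that are already written. */
        while (dst < end) {
            store_u64(dst, load_u64(src));
            dst += 8;
            src += 8;
        }
    } else {
        /* Short distances replicate a pattern; copy bytewise. */
        while (dst < end)
            *dst++ = *src++;
    }
    return end;
}
```

On a core that traps on misaligned access, every `store_u64` at an odd address would take the slow path, which is the case being complained about above.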
      
   Also useful for things like Huffman and Rice decoding, etc. Say, for   
   Huffman decoding, if one needs to use branches to detect when to pull in   
   more bytes, this eats more clock-cycles than advancing the bit-stream   
   position implicitly via arithmetic tricks.   
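
As a rough illustration of "advancing the bit-stream position implicitly", here is a sketch of a branch-free LSB-first bit reader. The names (`BitReader`, `br_peek`, `br_skip`) are hypothetical; the sketch assumes a little-endian host and at least 3 bytes of padding after the end of the input, since it always loads a 4-byte window from an arbitrary (possibly misaligned) byte address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *cs;  /* current byte position (any alignment) */
    unsigned pos;       /* bit offset within *cs, always 0..7 */
} BitReader;

/* Peek the next n bits (n <= 24), LSB first, without consuming them. */
static inline uint32_t br_peek(const BitReader *br, unsigned n)
{
    uint32_t win;
    assert(n <= 24);
    memcpy(&win, br->cs, 4);            /* unaligned 32-bit window load */
    return (win >> br->pos) & ((1u << n) - 1);
}

/* Consume n bits. No branch is needed to decide when to fetch more bytes:
   the byte pointer simply advances by however many whole bytes were crossed. */
static inline void br_skip(BitReader *br, unsigned n)
{
    br->pos += n;
    br->cs += br->pos >> 3;
    br->pos &= 7;
}
```

The only data-dependent work per symbol is the shift and mask; the cost of this design is that the load address is misaligned most of the time.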
      
   Well, this is also an example of why to use LSB-first bit ordering, and
   not to use FF escape encodings and similar:
   MSB-first ordering, FF escapes, the 16-bit code length limit, etc, manage
   to make JPEG bit-stream handling a lot slower than it could have been.
      
   Whereas, say, LSB-first and imposing a 12-bit length limit allows some   
   speedup here.   
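
For contrast, a sketch of what JPEG-style bit reading tends to look like. In JPEG entropy-coded data a literal 0xFF byte is followed by a stuffed 0x00, so each byte fetch needs a check, and the MSB-first accumulator needs a data-dependent refill loop. The names (`JpgBits`, `jb_next_byte`, `jb_read`) are hypothetical, and marker handling (0xFF followed by a non-zero byte) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *cs, *end;
    uint32_t win;   /* bit accumulator, oldest bits toward the MSB end */
    int bits;       /* number of valid bits in win */
} JpgBits;

/* Fetch one entropy-coded byte, skipping the 0x00 stuffed after 0xFF. */
static unsigned jb_next_byte(JpgBits *jb)
{
    unsigned c;
    if (jb->cs >= jb->end)
        return 0;                       /* pad with zeros past the end */
    c = *jb->cs++;
    if (c == 0xFF && jb->cs < jb->end && *jb->cs == 0x00)
        jb->cs++;                       /* skip stuffed zero */
    return c;
}

/* Read n bits (1..16), MSB first. */
static unsigned jb_read(JpgBits *jb, int n)
{
    assert(n >= 1 && n <= 16);
    while (jb->bits < n) {              /* branchy, byte-at-a-time refill */
        jb->win = (jb->win << 8) | jb_next_byte(jb);
        jb->bits += 8;
    }
    jb->bits -= n;
    return (jb->win >> jb->bits) & ((1u << n) - 1);
}
```

Because any input byte may be an escape, the reader cannot just load a 32-bit window and shift. With a 12-bit length limit and no escapes, a single table indexed by the low 12 bits of an LSB-first window would cover every code in one lookup.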
      
   Though, the Rice coder in UPIC effectively uses an 8-bit lookup, but   
   this is because it uses 3 bits for the Rk factor. So, sadly, it needs a   
   fallback path to decode symbols that exceed 8 bits.   
      
   So, pseudo-code (for AdRice Decoding):   
      win=*(u32 *)cs;   
      b=win>>pos;   
      ix=(rk<<8)|(b&255);   
      v=ricefasttab[ix];  //constant lookup table for Rice-code state space   
      l=(v>>8)&15;   
      if(l<=8)   
      {   
        //faster path   
        pos+=l;   
        cs+=pos>>3;   
        pos&=7;   
        rk=(v>>12);   
        return(v&255);   
      }   
      // ... slower path ...   
      q=riceqtab[b&255];  //count bits for Q prefix.   
      if(q==8)   
      {   
        //escape case, Q==8 escapes a raw max-length symbol   
        l=16;   
        v=(b>>8)&255;   
        rk+=2;   
        if(rk>7)rk=7;   
      }else   
      {   
        l=q+rk+1;   
        v=((b>>(q+1))&((1<<rk)-1))|(q<<rk);  //rk remainder bits, Q as high part
        if((q==0) && (rk>0)) rk--;
        if((q>=2) && (rk<7)) rk++;   
      }   
      pos+=l;   
      cs+=pos>>3;   
      pos&=7;   
      return(v);   
      
   Which may not seem very fast, but could be a lot worse.   
      
   In this case (for L1 cache reasons) the slightly more complicated   
   approach here works out faster on average than using a single giant   
   lookup table.   
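
To put numbers on the table-size trade-off, here is a self-contained sketch of how the two tables used by the pseudo-code could be built, plus the decode step wrapped as a function. Everything not visible in the pseudo-code is an assumption: the Q prefix is taken to be a run of 1 bits ended by a 0 bit (LSB first), the entry layout (symbol in bits 0..7, length in bits 8..11, next rk in bits 12..14) is inferred from the masks used above, the value extraction uses the usual Rice form (Q shifted up by rk, OR'd with rk remainder bits), and the names `RiceDec`, `rice_init_tables` and `rice_decode` are hypothetical. UPIC's real tables may differ. It also assumes a little-endian host and 3 bytes of padding after the input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t  riceqtab[256];        /* Q prefix length for each low byte */
static uint16_t ricefasttab[8 * 256]; /* 8 rk states x 256 bytes = 4 KiB */

typedef struct {
    const uint8_t *cs;  /* byte position (any alignment) */
    unsigned pos;       /* bit offset within *cs, 0..7 */
    unsigned rk;        /* adaptive Rice parameter, 0..7 */
} RiceDec;

static void rice_init_tables(void)
{
    int x, rk, q, l, v, nrk;
    for (x = 0; x < 256; x++) {
        for (q = 0; q < 8 && ((x >> q) & 1); q++) { }  /* count low 1 bits */
        riceqtab[x] = (uint8_t)q;
    }
    for (rk = 0; rk < 8; rk++) {
        for (x = 0; x < 256; x++) {
            q = riceqtab[x]; l = q + rk + 1; v = 0; nrk = rk;
            if (l <= 8) {
                /* Whole symbol fits in the 8-bit index: precompute it. */
                v = ((x >> (q + 1)) & ((1 << rk) - 1)) | (q << rk);
                if (q == 0 && nrk > 0) nrk--;
                if (q >= 2 && nrk < 7) nrk++;
            } else {
                l = 15;  /* any length > 8 sends the decoder to the slow path */
            }
            ricefasttab[(rk << 8) | x] = (uint16_t)((nrk << 12) | (l << 8) | v);
        }
    }
}

static int rice_decode(RiceDec *d)
{
    uint32_t win, b;
    unsigned v, l, q;
    assert(d->rk < 8 && d->pos < 8);
    memcpy(&win, d->cs, 4);             /* unaligned 32-bit window load */
    b = win >> d->pos;
    v = ricefasttab[(d->rk << 8) | (b & 255)];
    l = (v >> 8) & 15;
    if (l <= 8) {                       /* fast path: one table lookup */
        d->rk = v >> 12;
        v &= 255;
    } else {
        q = riceqtab[b & 255];
        if (q == 8) {                   /* escape: raw 8-bit symbol follows */
            l = 16;
            v = (b >> 8) & 255;
            d->rk += 2;
            if (d->rk > 7) d->rk = 7;
        } else {                        /* long code: decode arithmetically */
            l = q + d->rk + 1;
            v = ((b >> (q + 1)) & ((1u << d->rk) - 1)) | (q << d->rk);
            if (q == 0 && d->rk > 0) d->rk--;
            if (q >= 2 && d->rk < 7) d->rk++;
        }
    }
    d->pos += l;
    d->cs += d->pos >> 3;
    d->pos &= 7;
    return (int)v;
}
```

With 16-bit entries the fast table is 8 x 256 x 2 bytes = 4 KiB, which sits comfortably in a small L1 cache. A single table covering the full 16-bit maximum code length would need 8 x 65536 entries, or 1 MiB at the same entry width, so most lookups would miss L1.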
      
      
      
   So, my CPU supports misaligned access natively.   
      
      
   Can make sense to skip it for microcontroller class cores though; since   
   in this case "cheaper L1 cache" is likely to be a higher priority.   
      
   Doesn't make sense for things bigger than a microcontroller though, as   
   allowing for misaligned memory accesses is too useful IMO.   
      
   ...   
      