From: terje.mathisen@tmsw.no   
      
   Michael S wrote:   
   > On Sat, 21 Feb 2026 20:36:51 +0200   
   > Michael S wrote:   
   >   
   >>>   
   >>> Using a more brainiac approach would likely cut the performance in   
   >>> half and use a lot more LUTs.   
   >>>   
   >>   
   >> According to my experiments, combining 2 steps has a very small
   >> negative effect on achievable clock and increases area by ~1.8x. So,
   >> to me it looks like a win.
   >> That's the combined step I am talking about:   
   >> x = n/4;   
   >> switch (x % 4) {   
   >> case 0: n = x; break;   
   >> case 1: n = 3*x+1; break;   
   >> case 2: n = 3*x+2; break;
   >> case 3: n = 9*x+8; break;   
   >> }   
   >>   
   >>   
   >>   
   >   
   > Mistake in C above. Should be   
   > switch (n % 4) {   
      
   I did notice that. :-)   
      
   The one thing that worries me (sw on a 64-bit platform) about the code   
   is the 9* on 128-bit variables:   
      
    9*x+8 =>   
      
   Do we use SHLD + SHL here or something else?   
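
   For reference, the shift-and-add route can be sketched in C, leaving
   the actual SHLD/SHL selection to the compiler. The struct and function
   names here are made up for illustration, and unsigned __int128 (a
   GCC/Clang extension, not ISO C) is used only as a cross-check:

```c
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128_pair;

/* 9*x + 8 computed as (x << 3) + x + 8, truncated to 128 bits. */
static u128_pair mul9_add8_shift(u128_pair x)
{
    u128_pair s, r;

    /* x << 3 across both halves (the SHLD + SHL pair). */
    s.hi = (x.hi << 3) | (x.lo >> 61);
    s.lo = x.lo << 3;

    /* Add x, carrying from the low half into the high half. */
    r.lo = s.lo + x.lo;
    r.hi = s.hi + x.hi + (r.lo < s.lo);

    /* Add 8, again propagating the carry. */
    r.lo += 8;
    r.hi += (r.lo < 8);
    return r;
}

/* Cross-check using the compiler's own 128-bit arithmetic. */
static u128_pair mul9_add8_ref(u128_pair x)
{
    unsigned __int128 v = ((unsigned __int128)x.hi << 64) | x.lo;
    v = 9 * v + 8;
    u128_pair r = { (uint64_t)(v >> 64), (uint64_t)v };
    return r;
}
```

   Both functions should agree for every input, including those where
   the low half carries into the high half.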
      
   How about MUL & LEA?   
      
   ; Input in r10:r9, output in rdx:rax
   mov rax,r9
   mul rbx ;; RBX == 9, rdx:rax = low half * 9
   lea r10,[r10+r10*8] ;; high half * 9
   add rdx,r10
   ; (the +8 would need a further ADD/ADC pair)
      
   That looks like 5-6 clock cycles, so the branch misses from the switch   
   statement would probably dominate unless you do as I suggested and use   
   lookup tables instead:   
      
    let bot2 = n & 3;   
    let x = n >> 2;   
    n = x*multab[bot2] + addtab[bot2];   
      
   but if we do that, then (at least for a sw implementation) it would be   
   better to pick a lot more of the LS bits, at least 8-12?   
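
   A rough C sketch of how such tables could be generated for any number
   of low bits. It assumes the underlying single step is n/2 for even n
   and (3n+1)/2 for odd n, which is consistent with the 3*x+1 and 9*x+8
   cases above. With k = 2 it yields multipliers {1,3,3,9} and addends
   {0,1,2,8}, so under that assumption case 2 comes out as 3*x+2. All
   names are illustrative, and unsigned __int128 is a GCC/Clang
   extension:

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, not ISO C */

enum { MAXBITS = 12 };
static uint32_t multab[1u << MAXBITS];   /* 3^(odd steps), at most 3^12 */
static uint32_t addtab[1u << MAXBITS];

/* Assumed single step: n/2 for even n, (3n+1)/2 for odd n. */
static u128 single_step(u128 n)
{
    return (n & 1) ? (3 * n + 1) / 2 : n / 2;
}

/* Fill the tables so that k single steps map (x << k) + r to
   multab[r]*x + addtab[r]. k must be in 1..MAXBITS. */
static void build_tables(unsigned k)
{
    for (uint32_t r = 0; r < (1u << k); r++) {
        uint32_t m = 1u << k;   /* coefficient of x, still holds factors of 2 */
        uint32_t b = r;         /* constant term; its parity decides each step */
        for (unsigned j = 0; j < k; j++) {
            if (b & 1) { m = 3 * (m / 2); b = (3 * b + 1) / 2; }
            else       { m /= 2;          b /= 2; }
        }
        multab[r] = m;
        addtab[r] = b;
    }
}

/* k steps at once; build_tables(k) must have been called with the same k.
   The result wraps silently if it does not fit in 128 bits. */
static u128 multi_step(u128 n, unsigned k)
{
    uint32_t r = (uint32_t)(n & ((1u << k) - 1));
    u128 x = n >> k;
    return x * multab[r] + addtab[r];
}
```

   At 12 bits the two tables take 16 KB each with 32-bit entries, so the
   useful number of bits probably depends on how much of that stays in
   L1.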
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      