... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"

comp.lang.asm.x86

Ahh, the lost art of x86 assembly

4,675 messages

[ << oldest | < older | list | newer > | newest >> ]

Message 3,378 of 4,675

wolfgang kern to All

Re: Reciprocal MUL LUT

27 Apr 18 13:46:05

   From: nowhere@never.at   
      
   Thanks Terje,   
      
   >> I'm looking at the possibility of just calculating these on the fly:   
      
   >> Use either the SSE reciprocal estimator or a FP division to get a   
   >> starting approximation, then use a NR iteration to double the number   
   >> of bits each round.   
      
   >> When you get past 64 bits you have to start working with 128, then   
   >> 256 and 512 bit values using multiple 64-bit chunks, but it should   
   >> still be relatively fast.   
      
   > OK, I've done the math:   
   >   
   > If you start with a SSE reciprocal lookup you get a fp number with ~12   
   > signficant bits = app_1_over_n.   
   >   
   > You then run one iteration with 32-bit float:   
   >   
   >  prod = n * app_1_over_n;   // This value will be close to 1!   
   >   
   >  adjust = 2 - prod; // Close to 1.0   
   >   
   >  app_1_over_n *= adjust; // Approximately 24 bits accurate   
   >   
   > For the next iteration we switch to double and do the same:   
   >   
   >  app_1_over_n_d = (double) app_1_over_n;   
   >   
   >  prod_d = n * app_1_over_n_d;   
   >   
   >  adjust_d = 2 - prod_d;   
   >   
   >  app_1_over_n_d *= adjust; // Approximately 48 bits accurate   
   >   
   > (Up to now we have used about 20 cycles, so a chip with a fast divider   
   > could possibly give us 53 bits in about the same time.)   
   >   
   > At this point we switch to integer code only and run one iteration with   
   > 64-bit values resulting in nearly 64-bit accurate values.   
   >   
   > For the rest we run a slightly different NR iteration:   
   >   
   >   adj = 1 - n*approx;   
   >   approx += adj*approx;   
   >   
   > We only need 64 bits for the adjustment factor since it will be close to   
   > zero, the adj*approx product gives us 128 bits to add to the 64-bit   
   > approx value and we still double the precision.   
   >   
   > The same way we can work with 128-bit numbers for almost the entire   
   > 256-bit stage and 256-bit numbers for the 512-bit.   
      
   I'll check if such a multistage NR doesn't show any rounding issues,   
   which I was/am afraid of.   
   __   
   wolfgang   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)

[ << oldest | < older | list | newer > | newest >> ]