home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 3,377 of 4,675   
   Terje Mathisen to Terje Mathisen   
   Re: Reciprocal MUL LUT   
   26 Apr 18 21:25:49   
   
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Terje Mathisen wrote:   
   > James Van Buskirk wrote:   
   >> "wolfgang kern"  wrote in message   
   >> news:pbsmop$1fig$1@gioia.aioe.org...   
   >>> Rod Pemberton asked:   
   >>   
   >>>> What is the smallest prime in your LUT which has a period of   
   >>>> 512 bits?   
   >>   
   >>> haven't checked it yet, I have to write a period detector first.   
   >>   
   >> Check 1238926361552897 .   
   >>   
   >   
   > I'm looking at the possibility of just calculating these on the fly:   
   >   
   > Use either the SSE reciprocal estimator or a FP division to get a   
   > starting approximation, then use a NR iteration to double the number   
   > of bits each round.   
   >   
   > When you get past 64 bits you have to start working with 128, then   
   > 256 and 512 bit values using multiple 64-bit chunks, but it should   
   > still be relatively fast.   
      
   OK, I've done the math:   
      
   If you start with a SSE reciprocal lookup you get a fp number with ~12   
   signficant bits = app_1_over_n.   
      
   You then run one iteration with 32-bit float:   
      
     prod = n * app_1_over_n;   // This value will be close to 1!   
      
     adjust = 2 - prod;	// Close to 1.0   
      
     app_1_over_n *= adjust; // Approximately 24 bits accurate   
      
   For the next iteration we switch to double and do the same:   
      
     app_1_over_n_d = (double) app_1_over_n;   
      
     prod_d = n * app_1_over_n_d;   
      
     adjust_d = 2 - prod_d;   
      
     app_1_over_n_d *= adjust; // Approximately 48 bits accurate   
      
   (Up to now we have used about 20 cycles, so a chip with a fast divider   
   could possibly give us 53 bits in about the same time.)   
      
   At this point we switch to integer code only and run one iteration with   
   64-bit values resulting in nearly 64-bit accurate values.   
      
   For the rest we run a slightly different NR iteration:   
      
      adj = 1 - n*approx;   
      approx += adj*approx;   
      
   We only need 64 bits for the adjustment factor since it will be close to   
   zero, the adj*approx product gives us 128 bits to add to the 64-bit   
   approx value and we still double the precision.   
      
   The same way we can work with 128-bit numbers for almost the entire   
   256-bit stage and 256-bit numbers for the 512-bit.   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca