Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,377 of 4,675    |
|    Terje Mathisen to Terje Mathisen    |
|    Re: Reciprocal MUL LUT    |
|    26 Apr 18 21:25:49    |
      From: terje.mathisen@nospicedham.tmsw.no              Terje Mathisen wrote:       > James Van Buskirk wrote:       >> "wolfgang kern" wrote in message       >> news:pbsmop$1fig$1@gioia.aioe.org...       >>> Rod Pemberton asked:       >>       >>>> What is the smallest prime in your LUT which has a period of       >>>> 512 bits?       >>       >>> haven't checked it yet, I have to write a period detector first.       >>       >> Check 1238926361552897 .       >>       >       > I'm looking at the possibility of just calculating these on the fly:       >       > Use either the SSE reciprocal estimator or a FP division to get a       > starting approximation, then use a NR iteration to double the number       > of bits each round.       >       > When you get past 64 bits you have to start working with 128, then       > 256 and 512 bit values using multiple 64-bit chunks, but it should       > still be relatively fast.              OK, I've done the math:              If you start with a SSE reciprocal lookup you get a fp number with ~12       signficant bits = app_1_over_n.              You then run one iteration with 32-bit float:               prod = n * app_1_over_n; // This value will be close to 1!               adjust = 2 - prod; // Close to 1.0               app_1_over_n *= adjust; // Approximately 24 bits accurate              For the next iteration we switch to double and do the same:               app_1_over_n_d = (double) app_1_over_n;               prod_d = n * app_1_over_n_d;               adjust_d = 2 - prod_d;               app_1_over_n_d *= adjust; // Approximately 48 bits accurate              (Up to now we have used about 20 cycles, so a chip with a fast divider       could possibly give us 53 bits in about the same time.)              At this point we switch to integer code only and run one iteration with       64-bit values resulting in nearly 64-bit accurate values.              For the rest we run a slightly different NR iteration:               adj = 1 - n*approx;        approx += adj*approx;              We only need 64 bits for the adjustment factor since it will be close to       zero, the adj*approx product gives us 128 bits to add to the 64-bit       approx value and we still double the precision.              The same way we can work with 128-bit numbers for almost the entire       256-bit stage and 256-bit numbers for the 512-bit.              Terje              --       - |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca