From: already5chosen@yahoo.com   
      
   On Wed, 17 Dec 2025 20:06:57 GMT   
   MitchAlsup wrote:   
      
   > Michael S posted:   
   >   
   > > On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)   
   > > Thomas Koenig wrote:   
   > >   
   > > > BGB schrieb:   
   > > >   
   > > > > double sin(double ang)   
   > > > > {   
   > > > > double t, x, th, th2;   
   > > > > int i;   
   > > > > i=ang*M_TAU_R;   
   > > > > th=ang-(i*M_TAU);   
   > > > > th2=th*th;   
   > > >   
   > > > That leaves a lot of room for improvement; you would be better   
   > > > off reducing to [-pi/4,+pi/4].   
   > >   
   > > It's not what I'd do in practice. The degree of poly required after   
   > > reduction to [-pi/4,+pi/4] is way too high. It seems you would need   
   > > a Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch   
   > > calls 'faithful rounding'. For something better, like 0.51 ULP, you   
   > > would need one more term.   
   >   
   > To be clear, my coefficients are not restricted to 53 bits as in a SW   
   > implementation.   
      
   There exist tricks that can achieve the same in software, sometimes at   
   the cost of one additional FP op and sometimes even for free. The latter   
   is especially common when FMA costs the same as FMUL.   
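
   As an illustration (the constant 1/3 and all names here are my own, not
   from the thread): one such trick keeps a coefficient that needs more than
   53 bits as an unevaluated pair c_hi + c_lo and folds the low part into
   the Horner step with an FMA, so the extra cost is a single FP add.

   ```c
   #include <assert.h>
   #include <math.h>
   #include <stdio.h>

   /* Hypothetical sketch of a "wide coefficient" in software:
    *
    *     plain: p = fma(p, x, c);            // c rounded to 53 bits
    *     wide:  p = c_hi + fma(p, x, c_lo);  // one extra FP add
    */
   int main(void)
   {
       double c_hi = 1.0 / 3.0;
       /* fma(-3, c_hi, 1) computes 1 - 3*c_hi exactly, so c_lo recovers
        * the part of 1/3 that did not fit into c_hi. */
       double c_lo = fma(-3.0, c_hi, 1.0) / 3.0;

       /* c_hi + c_lo now carries roughly 107 bits of 1/3. */
       double resid = fma(3.0, c_lo, fma(3.0, c_hi, -1.0));
       assert(fabs(resid) < 1e-30);

       /* One Horner step with the widened constant. */
       double x = 0.125, p = -0.5;
       double plain = fma(p, x, c_hi);
       double wide  = c_hi + fma(p, x, c_lo);
       printf("plain = %.17g\nwide  = %.17g\n", plain, wide);
       return 0;
   }
   ```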
      
   >   
   > > There are better methods, like reducing to a much smaller interval,   
   > > e.g. to [-1/64,+1/64], maybe even to [-1/128,+1/128]. The details   
   > > of the trade-off between the size of the reduction table and the   
   > > length of the polynomial depend on how often you plan to use your   
   > > sin() function.   
   >   
   > With every shrink of the argument range, the table size blows up   
   > exponentially. For my transcendentals, the combined table sizes   
   > are about the same as the table sizes for FDIV/FSQRT with   
   > Goldschmidt iteration using 11-bit-in, 9-bit-out tables.   
      
   Software trade-offs are different.   
   Assuming the argument is already in the [0:+pi/2] range, reduction down   
   to [-1/128:+1/128] requires about pi/2*64 = 100 table entries. Each   
   entry occupies, depending on the format used, 18 to 24 bytes, so 1800   
   to 2400 bytes total. That size fits very comfortably in the L1D cache,   
   so from the perspective of hit rate there is no incentive to use a   
   smaller table.   
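
   A minimal C sketch of that reduction scheme, assuming the argument is
   already in [0, pi/2] (all names are mine). The table here holds plain
   doubles for readability; a production version would widen each entry,
   which is where the 18 to 24 bytes per entry above come from.

   ```c
   #include <assert.h>
   #include <math.h>
   #include <stdio.h>

   /* Pick the nearest multiple c = k/64 of the argument x, so the
    * residual r = x - c lies in [-1/128, +1/128], then combine with
    *     sin(x) = sin(c)*cos(r) + cos(c)*sin(r).
    */
   #define NSTEP 64          /* table step of 1/64 */
   #define NENT  102         /* covers k = 0 .. round(64*pi/2) = 101 */

   static double tab_sin[NENT], tab_cos[NENT];

   static void init_tab(void)
   {
       for (int k = 0; k < NENT; k++) {
           tab_sin[k] = sin((double)k / NSTEP);
           tab_cos[k] = cos((double)k / NSTEP);
       }
   }

   static double sin_reduced(double x)   /* assumes 0 <= x <= pi/2 */
   {
       int k = (int)(x * NSTEP + 0.5);   /* nearest multiple of 1/64 */
       double r  = x - (double)k / NSTEP;          /* |r| <= 1/128 */
       double r2 = r * r;
       /* Short Taylor polynomials are plenty on [-1/128, +1/128]. */
       double sin_r = r * (1.0 - r2 / 6.0 + r2 * r2 / 120.0);
       double cos_r = 1.0 - r2 / 2.0 + r2 * r2 / 24.0;
       return tab_sin[k] * cos_r + tab_cos[k] * sin_r;
   }

   int main(void)
   {
       init_tab();
       double xs[] = { 0.0, 0.3, 1.0, 1.5707 };
       for (int i = 0; i < 4; i++) {
           assert(fabs(sin_reduced(xs[i]) - sin(xs[i])) < 1e-13);
           printf("x = %.4f  table+poly = %.17g  libm = %.17g\n",
                  xs[i], sin_reduced(xs[i]), sin(xs[i]));
       }
       return 0;
   }
   ```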
      
   At first glance, reduction to [-1/128:+1/128] appears especially   
   attractive for an implementation that does not aim for very high   
   precision. Something like 0.65 ULP looks attainable with a poly of the   
   form (A*x**4 + B*x**2 + C)*x. That's just my back-of-the-envelope   
   estimate in the late evening hours, so I could be wrong about it. But I   
   could also be right.   
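
   For what it's worth, the (A*x**4 + B*x**2 + C)*x form is cheap to try
   with Taylor coefficients as stand-ins (a minimax fit would be a bit
   tighter; on this small interval even the Taylor truncation error is
   below 2^-53 relative):

   ```c
   #include <assert.h>
   #include <math.h>
   #include <stdio.h>

   /* (A*x^4 + B*x^2 + C)*x on [-1/128, +1/128], Taylor coefficients
    * used as placeholders for a proper minimax fit. */
   static double sin_poly(double x)
   {
       const double A = 1.0 / 120.0, B = -1.0 / 6.0, C = 1.0;
       double x2 = x * x;
       return fma(fma(A, x2, B), x2, C) * x;   /* two FMAs + one multiply */
   }

   int main(void)
   {
       double xs[] = { 0.0078125, 0.005, -0.0009765625 };
       for (int i = 0; i < 3; i++) {
           double rel = sin_poly(xs[i]) / sin(xs[i]) - 1.0;
           assert(fabs(rel) < 5e-15);
           printf("x = %+.10f  rel. err vs libm = %.3g\n", xs[i], rel);
       }
       return 0;
   }
   ```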
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   