From: user5857@newsgrouper.org.invalid   
      
   BGB posted:   
      
   > On 8/23/2025 10:11 AM, Terje Mathisen wrote:   
   > > BGB wrote:   
   -------------   
   > >   
   > > Mitch and I have repeated this too many times already:   
   > >   
   > > If you are implementing a current-standards FPU, including FMAC support,   
   > > then you already have the very wide normalizer which is the only   
   > > expensive item needed to allow zero-cycle denorm cost.   
   > >   
   >   
   > Errm, no single-rounded FMA in my case, as single rounded FMA (for   
   > Binary64) would also require Trap-and-Emulate...   
   >   
   > But, yeah, Free if you have FMA, is not the same as FMA being free.   
   >   
   > Partial issue is that single rounded FMA would effectively itself have   
   > too high of cost (and an FMA unit would require higher latency than   
   > separate FMUL and FADD units).   
      
   FMA latency < (FMUL + FADD) latency   
   FMA latency >= FMUL latency   
   FMA latency >= FADD latency   
      
   > Ironically, what FMA operations exist tend to be slower for Binary32 ops   
   > than using separate MUL and ADD ops in the default (non-IEEE) mode.   
   > Though for Binary64, it would be slightly faster, though still   
   > double-rounded-ish. They can mimic Single-Rounded behavior with Binary32   
   > and Binary16 though mostly for sake of internally operating on Binary64.   
      
   You must accept that::   
      
    FMA Rd,Rs1,Rs2,Rs3   
    FSUB Re,Rd,Rs3   
      
   leaves all the proper bits in Re; whereas you cannot even argue::   
      
    FMUL Rd,Rs1,Rs2   
    FADD Re,Rd,Rs3   
    FSUB Re,Re,Rs3   
      
   leaves all the proper bits in Re !! in all cases !!   
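
To make the single-rounding point concrete, here is a minimal Python sketch, not anyone's actual FPU. It models the fused operation by computing a*b + c exactly with `fractions.Fraction` and rounding once to Binary64, and compares that against a separately rounded multiply followed by an add. The function names and the operand values are my own assumptions, chosen so the product lands exactly halfway between two doubles. The sketch relies on CPython's integer true division being correctly rounded and only handles finite inputs.

```python
from fractions import Fraction


def fma_single_rounded(a, b, c):
    """Model a fused multiply-add: exact a*b + c, then ONE rounding to Binary64."""
    # Fraction arithmetic is exact; converting back to float rounds to nearest once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a, b, c):
    """Separate FMUL and FADD: the product is rounded before the add sees it."""
    return a * b + c


# Hypothetical operands: the exact product is 1 - 2**-54, which is not
# representable in Binary64 and rounds (ties-to-even) to 1.0.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

fused = fma_single_rounded(a, b, c)  # keeps the low-order term: -2**-54
split = mul_then_add(a, b, c)        # the term was rounded away: 0.0

print("fused:", fused.hex())
print("split:", split.hex())
```

With these inputs the fused path returns -2**-54 while the two-step path returns 0.0, so every significant bit of the answer is lost to the intermediate rounding. That is only one case, but it shows why no argument about the FMUL/FADD sequence can hold in all cases.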
      