
   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 130,141 of 131,241   
   Terje Mathisen to Michael S   
   Re: Tonights Tradeoff   
   05 Nov 25 15:42:37   
   
   From: terje.mathisen@tmsw.no   
      
   Michael S wrote:   
   > On Tue, 4 Nov 2025 22:52:46 +0100   
   > Terje Mathisen  wrote:   
   >>   
   >> For the Intel binary-mantissa dfp128, normalization is the hard issue;
   >> Michael S has figured out some really nice tricks to speed it up,
   >   
   > I remember that I played with that, but I don't remember exactly what
   > I did. I dimly recollect that the fastest solution was relatively
   > straightforward: it tried to minimize the length of the dependency
   > chains rather than the total number of multiplications.
   > An important point here is that I was working on relatively old x86-64
   > hardware, so my solution is not necessarily optimal for newer parts.
   > The differences between old and new are two-fold, and they push the
   > optimal solution in different directions:
   > 1. Increased throughput of the integer multiplier.
   > 2. Decreased latency of integer division.
   >   
   > The first factor suggests an even stronger push toward "eager"
   > solutions.
   >
   > The second factor suggests possibly much simpler code, especially in
   > the common case of dividing by 1 to 27 decimal digits (5**27 < 2**64).
   > As the saying goes: sometimes a division is just a division.
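
   The parenthetical (5**27 < 2**64) is what makes the plain-division
   route attractive: since 10**k = 2**k * 5**k, a divide by 10**k splits
   into a right shift and a divide by 5**k, and for k up to 27 that
   divisor fits a single 64-bit word. A minimal sketch of that split (the
   function name and decomposition are illustrative, not from the post):

```python
# 5**27 is the largest power of five below 2**64, so for k <= 27 the
# divisor of the "divide by 5**k" half fits one machine word.
POW5_27 = 5**27
assert POW5_27 < 2**64

def div_pow10(x: int, k: int) -> int:
    """floor(x / 10**k) via shift plus single-word divide.

    Relies on the nested-floor identity
        floor(floor(x / 2**k) / 5**k) == floor(x / 10**k),
    so the hardware divide only ever sees a <= 64-bit divisor.
    """
    assert x >= 0 and 0 <= k <= 27
    return (x >> k) // (5**k)

# Matches the direct wide division:
x = (1 << 113) - 1          # a full-width decimal128-sized significand
assert div_pow10(x, 27) == x // 10**27
```

   On x86-64 the body maps onto a shift and one (or two chained) 128/64
   DIV instructions, which is where the reduced division latency of newer
   cores pays off.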
      
   I suspect that a model using pre-calculated reciprocals that generate
   ~10+ approximate digits, then back-multiplies, subtracts, and repeats
   once or twice, could perform OK.
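
   That multiply / back-multiply / subtract loop can be sketched as
   follows. The 64-bit reciprocal width (~19 decimal digits per pass,
   rather than the ~10+ mentioned above) and all names are illustrative
   assumptions, not anything from the post:

```python
F = 64  # fraction bits kept in the stored reciprocal (an assumption)

def make_recip(d: int) -> tuple[int, int]:
    """Precompute floor(2**k / d) with k = F + bitlength(d).

    Because the reciprocal is truncated, every quotient estimate made
    with it underestimates the true quotient, never overshoots."""
    k = F + d.bit_length()
    return (1 << k) // d, k

def div_by_recip(x: int, d: int) -> tuple[int, int]:
    """divmod(x, d) using only multiplies by a precomputed reciprocal."""
    recip, k = make_recip(d)
    q, r = 0, x
    while r >= d:
        q_hat = (r * recip) >> k   # approximate quotient block, <= r // d
        if q_hat == 0:             # residual too small for the estimate;
            q_hat = 1              # finish with a plain subtraction
        q += q_hat
        r -= q_hat * d             # back-multiply and subtract
    return q, r

# 128-bit dividend, 63-bit divisor: converges in two or three passes.
assert div_by_recip(2**127 + 12345, 5**27) == divmod(2**127 + 12345, 5**27)
```

   Each pass shrinks the remainder by roughly 2**F, so a 128-bit dividend
   settles in two passes plus a short fix-up, and the passes are all
   multiply-latency rather than divide-latency.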
      
   Having full ~225-bit reciprocals, in order to generate the exact result
   in a single iteration, would require 256-bit storage for each of them,
   and the 256x256->512 MUL would use 16 64x64->128 MULs. Here, however,
   we do have the possibility of starting from the top: as soon as the
   high 128 bits of the mantissa are fixed (modulo any propagating carries
   from lower down), you can inspect the preliminary result and see that
   it is usually far enough away from a tipping point that you can stop
   there.
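
   A sketch of that early-out, assuming a plain schoolbook decomposition
   into the 16 64x64->128 partial products, accumulated heaviest diagonal
   first; the carry-bound test and all names here are illustrative, not
   Terje's actual scheme:

```python
W = 64

def limbs(x: int, n: int = 4) -> list[int]:
    """Split a < 2**256 value into four 64-bit limbs, little-endian."""
    return [(x >> (W * i)) & ((1 << W) - 1) for i in range(n)]

def mul_top128(a: int, b: int) -> int:
    """Top 128 bits of the 512-bit product a*b, stopping as soon as the
    partial products not yet added can no longer carry into them."""
    al, bl = limbs(a), limbs(b)
    pending = [(i + j, al[i] * bl[j]) for i in range(4) for j in range(4)]
    pending.sort(key=lambda t: -t[0])        # heaviest weight first
    total = 0
    for idx, (w, p) in enumerate(pending):
        total += p << (W * w)
        # Upper bound on everything still unprocessed: each remaining
        # 64x64 product is at most (2**64 - 1)**2 at its weight.
        bound = sum(((1 << W) - 1) ** 2 << (W * wr)
                    for wr, _ in pending[idx + 1:])
        # If adding the worst case cannot flip bit 384 or above, the top
        # 128 bits are settled and we can stop (the "far enough away from
        # a tipping point" check).
        if (total + bound) >> 384 == total >> 384:
            return total >> 384
    return total >> 384                      # all 16 products consumed

a = (1 << 256) - 123456789
b = (1 << 255) + 987654321
assert mul_top128(a, b) == (a * b) >> 384    # exact either way
```

   The result is exact whether or not the loop exits early: the bound
   check only fires when no carry from the unprocessed products can reach
   the top 128 bits, so the usual case skips the low partial products
   entirely.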
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca