From: already5chosen@yahoo.com   
      
   On Mon, 10 Nov 2025 21:25:47 -0600   
   BGB wrote:   
      
   > On 11/10/2025 4:08 PM, Michael S wrote:   
   > > On Mon, 10 Nov 2025 13:54:23 -0600   
   > > BGB wrote:   
   > >    
   > >> On 11/10/2025 1:16 AM, Terje Mathisen wrote:    
   > >>> BGB wrote:    
   > >>>> DIV uses Newton-Raphson.
   > >>>> The process of converging is a lot more fiddly than with binary
   > >>>> FP, partly because the strategy for generating the initial guess
   > >>>> is far less accurate.
   > >>>>   
   > >>>> So, it first uses a loop with hard-coded checks and scales to get   
   > >>>> it in the general area, before then letting N-R take over. If the   
   > >>>> value isn't close enough (seemingly +/- 25% or so), N-R flies off   
   > >>>> into space.   
   > >>>>   
   > >>>> Namely:   
   > >>>> Exponent is wrong:   
   > >>>> Scale by factors of 2 until correct;   
   > >>>> Off by more than 50%, scale by +/- 25%;   
   > >>>> Off by more than 25%, scale by +/- 12.5%;   
   > >>>> Else: Good enough, let normal N-R take over.    
   > >>>   
   > >>> My possibly naive idea would extract the top 9-15 digits from   
   > >>> divisor and dividend, convert both to binary FP, do the division   
   > >>> and convert back.   
   > >>>   
   > >>> That would reduce the NR step to two or three iterations, right?   
   > >>>    
   > >>   
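
Terje's seeding idea, sketched (illustrative Python with my own names and
fixed-point scaling, not anyone's actual code): collapse the divisor to an
ordinary double for a ~53-bit reciprocal, then let N-R double the correct
bits each round.

```python
F = 128  # fixed-point scale: an integer v stands for v / 2**F

def nr_reciprocal(d):
    """Reciprocal of fixed-point d in [1, 2), i.e. 2**F <= d < 2**(F+1).
    Returns x ~= (1/d_real) * 2**F after two Newton-Raphson iterations."""
    # Seed from binary FP: d collapses to a double, giving ~53 good bits.
    x = int(2.0 ** F / (d / 2.0 ** F))
    # x = x*(2 - d*x) roughly doubles the correct bits: 53 -> 106 -> >128.
    for _ in range(2):
        x = (x * ((2 << F) - ((d * x) >> F))) >> F
    return x
```

With a ~53-bit seed, two iterations already clear 128 bits, which matches
the "two or three" estimate.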
   > >> After adding code to convert to/from 'double', and using this
   > >> for the initial reciprocal and square root:
   > >> DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than   
   > >> MUL);    
   > >    
   > > That is your timing for Decimal128 on a modern desktop PC?
   > > Dependent divisions or independent?   
   > > Even for dependent, it sounds slow.   
   > >    
   >    
   > Modern-ish...   
   >    
      
   Zen2?
   I consider it the last of the non-modern cores; Zen3 and Ice Lake are
   the first of the modern ones. 128-by-64-bit integer division on Zen2
   is still quite slow, and overall the uArch is even less advanced than
   the 10-year-old Intel Skylake. In the majority of real-world workloads
   that is partially compensated by Zen2's bigger L3 cache, but in our
   case a big cache does not help.
   Still, even the last of the non-modern CPUs should be able to divide
   faster than your numbers suggest.
      
      
   > I am running a CPU type that was originally released 7 years ago,   
   > with slower RAM than it was designed to work with.   
   >    
   >    
   > > Did you try to compare against brute force calculation using GMP?   
   > > https://gmplib.org/   
   > > I.e. assuming that num < den < 10*num, use GMP to calculate 40
   > > decimal digits of the intermediate result y as follows:
   > > Numx = num * 1e40;   
   > > y = Numx/den;   
   > > Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure   
   > > out why).   
   > > If Yf != 5e5 then you are finished. Only in the extremely rare
   > > case (1 in a million) of Yf == 5e5 will you have to calculate the
   > > remainder of Numx/den to find the correct rounding.
   > > Somehow, I suspect that on a modern PC even a non-optimized method
   > > like the above will be faster than 670 nsec.
   > >    
   > >    
   >    
   >    
   > Well, first step is building with GCC rather than MSVC...   
   >    
   > It would appear that it gets roughly 79% faster when built with GCC.   
   > So, around 2 million divides per second.   
   >    
   >    
   >    
   > As for GMP, dividing two 40 digit numbers:   
   > 22 million per second.   
   > If I do both a divide and a remainder:   
   > 16 million.   
   >    
   > I don't really get what you are wanting me to measure exactly   
   > though...   
      
      
   I want you to measure division of a 74-digit integer by a 34-digit
   integer, because that is the slowest part [of a brute-force
   implementation] of Decimal128 division. The rest of the division costs
   approximately the same as multiplication.
   So, [unoptimized] Decimal128 division time should be no worse than
   t1+t2, where t1 is the duration of a Decimal128 multiplication and t2
   is the duration of the above-mentioned integer division. The estimate
   is pessimistic, because post-division normalization tends to be
   simpler than post-multiplication normalization.
   An optimized division would be faster still.
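
For what it's worth, the recipe quoted above, sketched in Python
(built-in big integers standing in for GMP's mpz; the function name and
the half-to-even tie-break are my assumptions):

```python
def div34(num, den):
    """34 significant digits of num/den, correctly rounded,
    assuming num < den < 10*num so the quotient is in (0.1, 1)."""
    numx = num * 10**40             # ~40 decimal digits of headroom
    y = numx // den                 # the one slow multi-word division
    yi, yf = divmod(y, 10**6)       # cheap: division by a small constant
    if yf != 500_000:
        return yi + (yf > 500_000)  # six guard digits settle the rounding
    # ~1-in-a-million apparent tie: the remainder of numx/den decides
    if numx % den:
        return yi + 1               # true value is just above halfway
    return yi + (yi & 1)            # exact tie: round half to even
```

The expensive step is the single numx // den, i.e. exactly the 74-digit
by 34-digit division discussed here; everything else is small-constant
arithmetic.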
      
   >    
   >    
   > If I compare against the IBM decNumber library:   
   > Multiply: 14 million.   
   > Divide: 7 million   
   >    
   > The decNumber library doesn't appear to have a square-root function...   
   >    
   >    
   > Granted, there are possibly faster ways to do divide, versus using    
   > Newton-Raphson in this case...   
   >    
   > It was not the point that I could pull the fastest possible
   > implementation out of thin air. But it does appear I am beating
   > decNumber, at least for multiply performance and similar.
   >    
   >    
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   