From: cr88192@gmail.com   
      
   On 11/11/2025 6:03 AM, Michael S wrote:   
   > On Tue, 11 Nov 2025 04:44:48 -0600   
   > BGB wrote:   
   >   
   >> On 11/11/2025 4:02 AM, Michael S wrote:   
   >>> On Mon, 10 Nov 2025 21:25:47 -0600   
   >>> BGB wrote:   
   >>>   
   >>>> On 11/10/2025 4:08 PM, Michael S wrote:   
   >>>>> On Mon, 10 Nov 2025 13:54:23 -0600   
   >>>>> BGB wrote:   
   >>>>>   
   >>>>>> On 11/10/2025 1:16 AM, Terje Mathisen wrote:   
   >>>>>>> BGB wrote:   
   >>>>>>>> DIV uses Newton-Raphson   
   >>>>>>>> The process of converging is a lot more fiddly than with Binary   
   >>>>>>>> FP. Partly as the strategy for generating the initial guess is   
   >>>>>>>> far less accurate.   
   >>>>>>>>   
   >>>>>>>> So, it first uses a loop with hard-coded checks and scales to   
   >>>>>>>> get it in the general area, before then letting N-R take over.   
   >>>>>>>> If the value isn't close enough (seemingly +/- 25% or so), N-R   
   >>>>>>>> flies off into space.   
   >>>>>>>>   
   >>>>>>>> Namely:   
   >>>>>>>> Exponent is wrong:   
   >>>>>>>> Scale by factors of 2 until correct;   
   >>>>>>>> Off by more than 50%, scale by +/- 25%;   
   >>>>>>>> Off by more than 25%, scale by +/- 12.5%;   
   >>>>>>>> Else: Good enough, let normal N-R take over.   
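For reference, the core Newton-Raphson reciprocal step is x' = x*(2 - d*x), whose error e = 1 - d*x is exactly squared on each pass: in exact arithmetic it converges for any seed with 0 < d*x < 2 and diverges outside that band (with truncated decimal intermediates the practical capture range is narrower, hence the scaling ladder above). A minimal sketch in binary doubles, not the decimal code itself:

```c
#include <stdint.h>

/* One Newton-Raphson step for 1/d: the error e = 1 - d*x satisfies
   e' = e*e, so |e| < 1 converges quadratically and |e| > 1 blows up. */
static double nr_recip_step(double d, double x) {
    return x * (2.0 - d * x);
}

/* Iterate n steps from seed x0. */
static double nr_recip(double d, double x0, int n) {
    double x = x0;
    for (int i = 0; i < n; i++)
        x = nr_recip_step(d, x);
    return x;
}
```

With d = 4, a seed of 0.3 (d*x = 1.2) converges in a handful of steps, while 0.6 (d*x = 2.4 > 2) flies off into space, matching the behavior described above.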
   >>>>>>>   
   >>>>>>> My possibly naive idea would extract the top 9-15 digits from   
   >>>>>>> divisor and dividend, convert both to binary FP, do the division   
   >>>>>>> and convert back.   
   >>>>>>>   
   >>>>>>> That would reduce the NR step to two or three iterations, right?   
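The arithmetic behind this: a double carries roughly 15-16 significant decimal digits, so a seed formed this way starts with about 1e-15 relative error, and each N-R pass squares that, clearing 34 digits in two iterations (three for margin). A hedged sketch of the seeding step only (names are hypothetical; a 64-bit significand stands in for the top digits of a Decimal128 one):

```c
#include <stdint.h>

/* Build a power of ten iteratively to avoid pulling in libm's pow(). */
static double pow10i(int e) {
    double p = 1.0;
    int i;
    if (e >= 0) for (i = 0; i < e; i++) p *= 10.0;
    else        for (i = 0; i > e; i--) p /= 10.0;
    return p;
}

/* Terje's idea, scaled down: take the top decimal digits of the divisor
   (modeled as integer significand 'sig' with decimal exponent 'exp10'),
   form a binary double, and use its reciprocal as the N-R seed. */
static double seed_recip_from_decimal(uint64_t sig, int exp10) {
    return 1.0 / ((double)sig * pow10i(exp10));
}
```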
   >>>>>>>   
   >>>>>>   
   >>>>>> After adding code to convert to/from 'double', and using   
   >>>>>> this for initial reciprocal and square-root:   
   >>>>>> DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than   
   >>>>>> MUL);   
   >>>>>   
   >>>>> That is your timing for Decimal128 on modern desktop PC?   
   >>>>> Dependent divisions or independent?   
   >>>>> Even for dependent, it sounds slow.   
   >>>>>   
   >>>>   
   >>>> Modern-ish...   
   >>>>   
   >>>   
   >>> Zen2 ?   
   >>> I consider it the last of non-modern. Zen3 and Ice Lake are first   
   >>> of the modern. 128-by-64-bit integer division on Zen2 is still   
   >>> quite slow, and the overall uArch is even less advanced than the   
   >>> 10-year-old Intel Skylake. In the majority of real-world workloads   
   >>> this is partially compensated by Zen2's bigger L3 cache. In our   
   >>> case, a big cache does not help.   
   >>> But even the last non-modern CPU should be capable of dividing   
   >>> faster than your numbers suggest.   
   >>>   
   >>   
   >> Zen+   
   >>   
   >> Or, a slightly tweaked version of Zen1.   
   >>   
   >>   
   >> It is entirely possible to do a big-integer divide faster than   
   >> this, such as via shift-and-add.   
   >>   
   >> But, as for decimal, this makes it harder.   
   >>   
   >>   
   >> I could do long division, but this is a much more complicated   
   >> algorithm (versus using Newton-Raphson).   
   >>   
   >> But, N-R is slow as it is basically a bunch of operations, each of   
   >> which is, granted, itself kinda slow.   
   >>   
   >>   
   >>>   
   >>>> I am running a CPU type that was originally released 7 years ago,   
   >>>> with slower RAM than it was designed to work with.   
   >>>>   
   >>>>   
   >>>>> Did you try to compare against brute force calculation using GMP?   
   >>>>> https://gmplib.org/   
   >>>>> I.e. assuming that num < den < 10*num, use GMP to calculate 40   
   >>>>> decimal digits of intermediate result y as follows:   
   >>>>> Numx = num * 1e40;   
   >>>>> y = Numx/den;   
   >>>>> Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure   
   >>>>> out why).   
   >>>>> If Yf != 5e5 then you are finished. Only in the extremely rare   
   >>>>> case (1 in a million) of Yf == 5e5 will you have to calculate the   
   >>>>> remainder of Numx/den to find the correct rounding.   
   >>>>> Somehow, I suspect that on a modern PC even a non-optimized   
   >>>>> method like the above will be faster than 670 usec.   
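A scaled-down model of this scheme, with 7-digit significands standing in for Decimal128's 34, and C's `unsigned __int128` (a GCC/Clang extension) standing in for GMP; the function name is made up for illustration:

```c
#include <stdint.h>

/* Scaled-down model of the brute-force scheme: one wide integer division
   produces the result digits plus six guard digits, and only the rare
   guard value of exactly 500000 requires the remainder to break the tie.
   The num < den < 10*num restriction above merely fixes the quotient's
   digit count; the rounding logic works regardless (den must be > 0). */
static uint64_t div7_round_nearest(uint64_t num, uint64_t den) {
    unsigned __int128 numx = (unsigned __int128)num * 10000000000000ULL; /* num * 1e13 */
    uint64_t y  = (uint64_t)(numx / den);   /* 13-digit intermediate quotient */
    uint64_t yi = y / 1000000;              /* 7 result digits */
    uint64_t yf = y % 1000000;              /* 6 guard digits */
    if (yf > 500000) return yi + 1;
    if (yf < 500000) return yi;
    /* tie on the guard digits: consult the remainder */
    if (numx % den != 0) return yi + 1;     /* just above the half: up */
    return (yi & 1) ? yi + 1 : yi;          /* exact half: round to even */
}
```

Six guard digits make the ambiguous Yf == 5e5 case land only on exact or near-exact halves, where a single remainder check settles the rounding.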
   >>>>>   
   >>>>>   
   >>>>   
   >>>>   
   >>>> Well, first step is building with GCC rather than MSVC...   
   >>>>   
   >>>> It would appear that it gets roughly 79% faster when built with   
   >>>> GCC. So, around 2 million divides per second.   
   >>>>   
   >>>>   
   >>>>   
   >>>> As for GMP, dividing two 40 digit numbers:   
   >>>> 22 million per second.   
   >>>> If I do both a divide and a remainder:   
   >>>> 16 million.   
   >>>>   
   >>>> I don't really get what you are wanting me to measure exactly   
   >>>> though...   
   >>>   
   >>>   
   >>> I want you to measure division of 74-digit integer by 34-digit   
   >>> integer, because it is the slowest part [of brute force   
   >>> implementation] of Decimal128 division. The rest of division is   
   >>> approximately the same as multiplication.   
   >>> So, [unoptimized] Decimal128 division time should be no worse than   
   >>> t1+t2, where t1 is duration of Decimal128 multiplication and t2 is   
   >>> duration of the above-mentioned integer division. The estimate is   
   >>> pessimistic, because post-division normalization tends to be simpler   
   >>> than post-multiplication normalization.   
   >>> Optimized division would be faster yet.   
   >>>   
   >>   
   >> If it is a big-integer divide, this is not quite the same thing.   
   >>   
   >> And, if I were to use big-integer divide (probably not via GMP   
   >> though,   
   >   
   > Certainly not via GMP in final product. But doing 1st version via GMP   
   > makes perfect sense.   
   >   
      
   GMP is only really an option for targets where GMP exists; I needed   
   to jump over to GCC in WSL just to test GMP here.   
      
   If avoidable, you don't want to use anything beyond the C standard   
   library, and ideally limit things to a C95 style dialect for maximum   
   portability.   
      
   Granted, the GMP divider does appear to be faster than expected;   
   possibly it is using something faster than "ye olde   
   shift-and-subtract".   
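For comparison, "ye olde shift-and-subtract" is textbook restoring division, producing one quotient bit per step; GMP instead divides a limb at a time using the hardware divider and reciprocal tricks, which would explain the gap. A minimal 64-bit sketch (not GMP's code):

```c
#include <stdint.h>

/* Restoring shift-and-subtract division: 64 iterations, one quotient bit
   each.  Assumes den != 0 and, to keep the partial remainder from
   overflowing on the shift, den < 2^63. */
static uint64_t shift_sub_div(uint64_t num, uint64_t den, uint64_t *rem) {
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1);    /* bring down the next bit */
        if (r >= den) {                     /* subtract fits: quotient bit 1 */
            r -= den;
            q |= 1ULL << i;
        }
    }
    *rem = r;
    return q;
}
```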
      
      
      
      
   Though, I can note a curious property:   
    This code is around 79% faster when built with GCC vs MSVC;   
    Also, the relative speeds of MUL and ADD trade places:   
    In MSVC, MUL is faster;   
    In GCC, ADD is faster.   
      
   Though, the code in question tends to frequently use struct members   
   directly, rather than caching multiply-accessed struct members in local   
   variables. MSVC tends not to fully optimize away this sort of thing,   
   whereas GCC tends to act as if the struct members had in fact been   
   cached in local variables.   
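An illustration of the pattern in question (not the actual code): when stores through another pointer might alias the struct, the compiler has to reload the members every iteration unless they are cached in locals up front.

```c
#include <stdint.h>

typedef struct { uint32_t a, b; } pair_t;

/* Reads the fields on every use.  Because dst and p point at the same
   element type, dst[i] = ... could legally modify p->a or p->b, so a
   conservative compiler reloads both fields each iteration. */
static uint64_t sum_direct(const pair_t *p, uint32_t *dst, int n) {
    uint64_t s = 0;
    for (int i = 0; i < n; i++) {
        dst[i] = p->a * (uint32_t)i + p->b;
        s += dst[i];
    }
    return s;
}

/* Same loop with the fields cached in locals up front: the loads happen
   once, and the aliasing question no longer matters. */
static uint64_t sum_cached(const pair_t *p, uint32_t *dst, int n) {
    uint32_t a = p->a, b = p->b;
    uint64_t s = 0;
    for (int i = 0; i < n; i++) {
        dst[i] = a * (uint32_t)i + b;
        s += dst[i];
    }
    return s;
}
```

When dst does not actually alias p, both functions compute the same result; the cached form just makes the hoisting legal for the compiler instead of asking it to prove non-aliasing.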
      
      
   >> this would be too big of a dependency), there is still the   
   >> issue of efficiently converting between big-integer and the "groups   
   >> of 9 digits in 32-bits" format.   
   >   
   > No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64   
   > binary 'digits' per 64-bit word.   
   >   
      
   Alas, the code was written mostly to use 9-digit groupings, and going   
   between 9-digit groupings and 128-bit integers is a bigger chunk of code   
   than I want to have for this.   
      
   This would mean an additional ~500 LOC, plus probably whatever code I   
   need for a semi-fast 256-by-128-bit integer divider.   
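The conversion itself is just Horner evaluation in base 1e9 one way and repeated div/mod the other, sketched below for a 36-digit, 4-group value using the GCC/Clang `unsigned __int128` extension; the bulk of the ~500 LOC would presumably be the wide divider rather than this.

```c
#include <stdint.h>

/* Pack four base-1e9 "digits" (g[0] least significant) into one 128-bit
   binary integer: Horner evaluation in base 1e9. */
static unsigned __int128 groups_to_u128(const uint32_t g[4]) {
    unsigned __int128 acc = 0;
    for (int i = 3; i >= 0; i--)
        acc = acc * 1000000000u + g[i];
    return acc;
}

/* Inverse: peel off one base-1e9 digit at a time via div/mod. */
static void u128_to_groups(unsigned __int128 v, uint32_t g[4]) {
    for (int i = 0; i < 4; i++) {
        g[i] = (uint32_t)(v % 1000000000u);
        v /= 1000000000u;
    }
}
```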
      
      
   >>   
   >>   
   >> This is partly why I removed the BID code:   
   >> At first, it seemed like the DPD and BID converters were similar   
   >> speed; But, turns out I was still testing the DPD converter, and   
      
   [continued in next message]   
      