From: cr88192@gmail.com   
      
   On 11/6/2025 1:11 PM, BGB wrote:   
   > On 11/6/2025 3:24 AM, Michael S wrote:   
   >> On Wed, 05 Nov 2025 21:06:16 GMT   
   >> MitchAlsup wrote:   
   >>   
   >>> Michael S posted:   
   >>>   
   >>>> On Tue, 04 Nov 2025 22:51:28 GMT   
   >>>> MitchAlsup wrote:   
   >>>>> Thomas Koenig posted:   
   >>>>>> Terje Mathisen schrieb:   
   >>>>>>> I still think the IBM DFP people did an impressively good job   
   >>>>>>> packing that much data into a decimal representation. :-)   
   >>>>>>   
   >>>>>> Yes, that modulo 1000 packing is quite clever. It is relatively   
   >>>>>> cheap to implement in hardware (which is the point, of course).   
   >>>>>> Not sure how easy it would be in software.   
   >>>>>   
   >>>>> Brain dead easy: 1 table of 1024 entries each 12-bits wide,   
   >>>>> 1 table of 4096 entries each 10-bits wide,
   >>>>> isolate the 10-bit field, LD the converted value.   
   >>>>> isolate the 12-bit field, LD the converted value.   
   >>>>>   
   >>>>> Other than "crap loads" of {deMorganizing and gate optimization}   
   >>>>> that is essentially what HW actually does.   
   >>>>>   
   >>>>> You still need to build 12-bit decimal ALUs to string together   
   >>>>   
   >>>> Are you talking about hardware or software?
   >>> A SW solution based on how it would be done in HW.   
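   [A minimal sketch of that "SW based on how HW does it" approach, in C. The
   decode cases follow the standard DPD declet rules (IEEE 754-2008); the
   reverse table is filled by inverting the decoder rather than hand-coding
   the encode rules. Function and table names here are illustrative, not from
   any particular implementation.]

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 10-bit DPD declet to its 3-digit value (0..999),
   following the standard DPD decode table. */
static unsigned dpd_decode(unsigned p)
{
    unsigned a = (p >> 9) & 1, b = (p >> 8) & 1, c = (p >> 7) & 1;
    unsigned d = (p >> 6) & 1, e = (p >> 5) & 1, f = (p >> 4) & 1;
    unsigned g = (p >> 2) & 1, h = (p >> 1) & 1, i = p & 1;
    unsigned d2, d1, d0;
    if (((p >> 3) & 1) == 0) {                 /* all three digits 0..7 */
        d2 = (a<<2)|(b<<1)|c; d1 = (d<<2)|(e<<1)|f; d0 = (g<<2)|(h<<1)|i;
    } else switch ((g << 1) | h) {             /* b3=1: large-digit cases */
    case 0:  d2 = (a<<2)|(b<<1)|c; d1 = (d<<2)|(e<<1)|f; d0 = 8|i; break;
    case 1:  d2 = (a<<2)|(b<<1)|c; d1 = 8|f; d0 = (d<<2)|(e<<1)|i; break;
    case 2:  d2 = 8|c; d1 = (d<<2)|(e<<1)|f; d0 = (a<<2)|(b<<1)|i; break;
    default:                                   /* b2b1 == 11 */
        switch ((d << 1) | e) {
        case 0:  d2 = 8|c; d1 = 8|f; d0 = (a<<2)|(b<<1)|i; break;
        case 1:  d2 = 8|c; d1 = (a<<2)|(b<<1)|f; d0 = 8|i; break;
        case 2:  d2 = (a<<2)|(b<<1)|c; d1 = 8|f; d0 = 8|i; break;
        default: d2 = 8|c; d1 = 8|f; d0 = 8|i; break;
        }
    }
    return d2 * 100 + d1 * 10 + d0;
}

/* The 1000-entry reverse table (value -> canonical declet), filled by
   inverting the decoder; the lowest declet per value is the canonical one. */
static uint16_t bin_to_dpd[1000];

static void dpd_init(void)
{
    static unsigned filled[1000];
    for (unsigned p = 0; p < 1024; p++) {
        unsigned v = dpd_decode(p);
        if (!filled[v]) { filled[v] = 1; bin_to_dpd[v] = (uint16_t)p; }
    }
}
```

   [In HW the same mapping is "crap loads" of gates; in SW the 1024- and
   1000-entry tables stand in for them, as Mitch describes.]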
   >>   
   >> Then, I suspect that you didn't understand Thomas Koenig's objection.
   >>   
   >> 1. Format of interest is Decimal128.   
   >> https://en.wikipedia.org/wiki/Decimal128_floating-point_format   
   >>   
   >> 2. According to my understanding, Thomas didn't suggest that a *slow*
   >> software implementation of DPD-encoded DFP, i.e. an implementation
   >> that only cares about correctness, is hard.
   >>   
   >> 3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
   >> software implementation, one comparable in speed (say, within a
   >> factor of 1.5-2) to a competent implementation of the same DFP
   >> operations in BID format, is not easy. If at all possible.
   >>   
   >> 4. All of the above assumes the absence of HW assists.
   >>   
   >>   
   >>   
   >> BTW, at least for multiplication, I would probably not do my
   >> arithmetic in the BCD domain.
   >> Instead, I'd convert the 10+ DPD digit groups to two Base_1e18 digits
   >> (11 lookups per operand, 22 total lookups + ~40 shifts + ~20 ANDs +
   >> ~20 additions).
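   [As a hedged sketch of that conversion step: assuming the 11 declets have
   already been decoded to 0..999 via table lookup, packing them into two
   base-1e18 limbs is just Horner evaluation in base 1000. The 34th digit of
   Decimal128 comes from the combination field and would be added to the high
   limb separately; names here are illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* dec[0] holds the least significant 3 digits. Limbs: lo = digits 0..17,
   hi = digits 18..32 (the 33 declet digits; digit 33 from the combination
   field would be folded into hi by the caller). */
static void declets_to_base1e18(const unsigned dec[11],
                                uint64_t *hi, uint64_t *lo)
{
    uint64_t l = 0, h = 0;
    for (int k = 5; k >= 0; k--)  l = l * 1000 + dec[k];   /* 18 digits */
    for (int k = 10; k >= 6; k--) h = h * 1000 + dec[k];   /* 15 digits */
    *lo = l;
    *hi = h;
}
```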
   >>   
   >> Then I'd do multiplication and normalization and rounding in Base_1e18.   
   >>   
   >> Then I'd convert from Base_1e18 to Base_1000. The ideas of such a
   >> conversion are similar to the fast binary-to-BCD conversion that I
   >> demonstrated here a decade or so ago. AVX2 could be quite helpful at
   >> that stage.
   >>   
   >> Then I'd have to convert the result from Base_1000 to DPD. Here, again,   
   >> 11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.   
   >> Maybe, at that stage, SIMD gather can be of help, but I have my
   >> doubts. So far, every time I have tried gather I was disappointed
   >> with the performance.
   >>   
   >> Overall, even with a seemingly decent plan like the one sketched
   >> above, I'd expect DPD multiplication to be 2.5x to 3x slower than
   >> BID. But, then again, in the past my early performance estimates
   >> were wrong quite often.
   >>   
   >   
   > I decided to start working on a mockup (quickly thrown together).   
   > I don't expect to have much use for it, but meh.   
   >   
   >   
   > It works by packing/unpacking the values into an internal format along   
   > vaguely similar lines to the .NET format, just bigger to accommodate   
   > more digits:   
   > 4x 32-bit values each holding 9 digits   
   > Except the top one generally holding 7 digits.   
   > 16-bit exponent, sign byte.   
   >   
   > Then wrote a few pack/unpack scenarios:   
   > X30: Directly packing 20/30 bit chunks, non-standard;   
   > DPD: Use the DPD format;   
   > BID: Use the BID format.   
   >   
   > For the pack/unpack step (taken in isolation):   
   > X30 is around 10x faster than either DPD or BID;   
   > Both DPD and BID need a similar amount of time.   
   > BID needs a bunch of 128-bit arithmetic handlers.   
   > DPD needs a bunch of merge/split and table lookups.   
   > Seems to mostly balance out in this case.   
   >   
   >   
   > For DPD, merge is effectively:   
   > Do the table lookups;   
   > v=v0+(v1*1000)+(v2*1000000);   
   > With a split step like:   
   > v0=v;   
   > v1=v/1000;   
   > v0-=v1*1000;   
   > v2=v1/1000;   
   > v1-=v2*1000;   
   > Then, use table lookups to go back to DPD.   
   >   
   > Did look into possibly faster ways of doing the splitting, but then
   > noted that I have not yet found a faster way that gives correct
   > results (where one can assume the compiler already knows how to turn
   > a divide by a constant into a multiply by the reciprocal).
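   [The merge/split above, written out as runnable C (a sketch; the names
   split3/merge3 are hypothetical). Each 32-bit chunk holds at most 9 digits,
   so the /1000 divisions compile down to multiply-by-reciprocal as noted.]

```c
#include <assert.h>
#include <stdint.h>

/* Split a 9-digit value into three 3-digit groups, d[0] least significant. */
static inline void split3(uint32_t v, uint32_t d[3])
{
    uint32_t q = v / 1000;          /* compiler: multiply by reciprocal */
    d[0] = v - q * 1000;
    d[2] = q / 1000;
    d[1] = q - d[2] * 1000;
}

/* Merge three 3-digit groups (e.g. table-decoded declets) back into one
   9-digit value. */
static inline uint32_t merge3(const uint32_t d[3])
{
    return d[0] + d[1] * 1000u + d[2] * 1000000u;
}
```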
   >   
   >   
   > At first this seemed like a strong reason to favor X30 over either
   > DPD or BID. Except that the cost of the ADD and MUL operations
   > effectively dwarfs that of the pack/unpack operations, so the
   > relative cost difference between X30 and DPD may not matter much.
   >   
   >   
   > As is, it seems that MUL and ADD cost roughly 6x more than the DPD
   > pack/unpack steps.
   >   
   > So, it seems, while DPD pack/unpack isn't free, it is not something
   > that would make X30 a decisive win in terms of performance, either.
   >   
   >   
   >   
   > It might make more sense, if supporting BID, to just do it as its own   
   > thing (and embrace just using a bunch of 128-bit arithmetic, and a   
   > 128*128=>256 bit widening multiply, ...). Also, can note that the BID   
   > case ends up needing a lot more clutter, mostly again because C lacks   
   > native support for 128-bit arithmetic.   
   >   
   > If working based on digit chunks, it is likely better to stick with
   > DPD due to less clutter, etc. Though, this part would be less bad if
   > C had widespread support for 128-bit integers.
   >   
   >   
   >   
   > Though, in this case, the ADD and MUL operations currently work by   
   > internally doubling the width and then narrowing the result after   
   > normalization. This is slower, but could give exact results.   
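   [A hedged sketch of that widen-then-narrow idea for MUL, assuming the
   4x 9-digit chunk layout with the least significant chunk first (the
   function name is illustrative). Each partial product is below 1e18 and at
   most four of them land in one column, so the column sums stay below 2^63
   and everything fits in plain 64-bit arithmetic; the 8-chunk result would
   then be normalized and rounded back down.]

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_MOD 1000000000u       /* 9 decimal digits per chunk */

/* Schoolbook multiply of two 4-chunk significands into an 8-chunk result. */
static void mul_chunks(const uint32_t a[4], const uint32_t b[4],
                       uint32_t r[8])
{
    uint64_t acc[8] = {0};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i + j] += (uint64_t)a[i] * b[j];  /* each product < 1e18 */

    uint64_t carry = 0;
    for (int k = 0; k < 8; k++) {   /* propagate carries, base 1e9 */
        uint64_t s = acc[k] + carry;
        r[k]  = (uint32_t)(s % CHUNK_MOD);
        carry = s / CHUNK_MOD;
    }
    /* carry is 0 here: 34x34 digits cannot overflow 72 result digits */
}
```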
   >   
   >   
   > Though, it is still not complete, nor confirmed to produce correct
   > results.
   >   
   >   
   >   
   > But, yeah, might be more worthwhile to look into digit chunking:   
   > 12x 3 digits (16b chunk)   
   > 4x 9 digits (32b chunk)   
   > 2x 18 digits (64b chunk)   
   > 3x 12 digits (64b chunk)   
   >   
   > What I think is likely:
   > 3 digits, likely slower because of needing significantly more operations;   
   > 9 digits, seemed sensible, option I went with, internal operations fully   
   > fit within the limits of 64 bit arithmetic;   
   > 18 digits, possible, but runs into many cases internally that would   
   > require using 128-bit arithmetic.   
   >   
   > 12 digits, fits more easily into 64-bit arithmetic, but would still   
   > sometimes exceed it; and isn't that much more than 9 digits (but would   
   > reduce the number of chunks needed from 4 to 3).   
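   [For instance, a hedged sketch of why the 9-digit option stays comfortably
   inside 64-bit arithmetic: a chunk-wise ADD with carry never produces a
   column sum above 2*10^9, far below the 64-bit limit. Assumes the 4x 32-bit
   chunk layout above, least significant chunk first; the name is
   illustrative.]

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_MOD 1000000000u       /* 9 decimal digits per chunk */

/* Add two 4-chunk significands, chunk values in 0..999999999. */
static void add_chunks4(const uint32_t a[4], const uint32_t b[4],
                        uint32_t r[4])
{
    uint32_t carry = 0;
    for (int k = 0; k < 4; k++) {
        uint64_t s = (uint64_t)a[k] + b[k] + carry;  /* at most ~2e9 */
        carry = (s >= CHUNK_MOD);
        r[k]  = (uint32_t)(s - (carry ? CHUNK_MOD : 0));
    }
    /* a final carry out would mean renormalizing (shift digits, bump
       the exponent) */
}
```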
   >   
   >   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   