From: cr88192@gmail.com   
      
   On 11/6/2025 3:24 AM, Michael S wrote:   
   > On Wed, 05 Nov 2025 21:06:16 GMT   
   > MitchAlsup wrote:   
   >   
   >> Michael S posted:   
   >>   
   >>> On Tue, 04 Nov 2025 22:51:28 GMT   
   >>> MitchAlsup wrote:   
   >>>   
   >>>> Thomas Koenig posted:   
   >>>>   
   >>>>> Terje Mathisen schrieb:   
   >>>>>   
   >>>>>> I still think the IBM DFP people did an impressively good job   
   >>>>>> packing that much data into a decimal representation. :-)   
   >>>>>   
   >>>>> Yes, that modulo 1000 packing is quite clever. It is relatively   
   >>>>> cheap to implement in hardware (which is the point, of course).   
   >>>>> Not sure how easy it would be in software.   
   >>>>   
   >>>> Brain dead easy: 1 table of 1024 entries each 12-bits wide,   
   >>>> 1 table of 4096 entries each 10-bits wide,   
   >>>> isolate the 10-bit field, LD the converted value.   
   >>>> isolate the 12-bit field, LD the converted value.   
   >>>>   
   >>>> Other than "crap loads" of {deMorganizing and gate optimization}   
   >>>> that is essentially what HW actually does.   
   >>>>   
   >>>> You still need to build 12-bit decimal ALUs to string together   
   >>>   
   >>> Are we talking about hardware or software?   
   >>>   
   >> A SW solution based on how it would be done in HW.   
   >   
   > Then, I suspect that you didn't understand Thomas Koenig's objection.   
   >   
   > 1. Format of interest is Decimal128.   
   > https://en.wikipedia.org/wiki/Decimal128_floating-point_format   
   >   
   > 2. According to my understanding, Thomas didn't suggest that a *slow*   
   > software implementation of DPD-encoded DFP, i.e. an implementation   
   > that only cares about correctness, is hard.   
   >   
   > 3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*   
   > software implementation, one comparable in speed (say, within a   
   > factor of 1.5-2) to a competent implementation of the same DFP   
   > operations in BID format, is not easy. If at all possible.   
   >   
   > 4. All said above assumes an absence of HW assists.   
   >   
   >   
   >   
   > BTW, at least for multiplication, I would probably not do my   
   > arithmetic in the BCD domain.   
   > Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 lookups   
   > per operand, 22 lookups total + ~40 shifts + ~20 ANDs + ~20   
   > additions).   
   >   
   > Then I'd do multiplication and normalization and rounding in Base_1e18.   
   >   
   > Then I'd convert from Base_1e18 to Base_1000. The ideas of such   
   > conversion are similar to the fast binary-to-BCD conversion that I   
   > demonstrated here a decade or so ago. AVX2 could be quite helpful at   
   > that stage.   
   >   
   > Then I'd have to convert the result from Base_1000 to DPD. Here, again,   
   > 11 table lookups + plenty of ANDs/shifts/ORs seem inevitable.   
   > Maybe, at that stage, SIMD gather can be of help, but I have my doubts.   
   > So far, every time I tried gather I was disappointed with the performance.   
   >   
   > Overall, even with a seemingly decent plan like the one sketched above,   
   > I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But,   
   > then again, in the past my early performance estimates were wrong quite   
   > often.   
   >   
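
A minimal sketch of the Base_1e18 multiply step described above, assuming
the `unsigned __int128` GCC/Clang extension is available (the names and
limb layout here are my illustration, not Michael's actual code):

```c
#include <stdint.h>

#define LIMB_BASE 1000000000000000000ull  /* 1e18: one "digit" = 18 decimal digits */

/* a, b: {low, high} limbs, value = high*1e18 + low.
   r: 4 result limbs, least significant first.
   Every intermediate fits: a partial product is < 1e36, far below 2^128. */
static void mul_2x2_base1e18(uint64_t r[4], const uint64_t a[2],
                             const uint64_t b[2])
{
    unsigned __int128 acc = 0;

    acc += (unsigned __int128)a[0] * b[0];
    r[0] = (uint64_t)(acc % LIMB_BASE);
    acc /= LIMB_BASE;

    acc += (unsigned __int128)a[0] * b[1];
    acc += (unsigned __int128)a[1] * b[0];
    r[1] = (uint64_t)(acc % LIMB_BASE);
    acc /= LIMB_BASE;

    acc += (unsigned __int128)a[1] * b[1];
    r[2] = (uint64_t)(acc % LIMB_BASE);
    r[3] = (uint64_t)(acc / LIMB_BASE);
}
```

The `% 1e18` / `/ 1e18` steps are the pricey part relative to a binary
multiply; that cost is part of why the BID-vs-DPD comparison is not obvious.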
      
   I decided to start working on a mockup (quickly thrown together).   
    I don't expect to have much use for it, but meh.   
      
      
   It works by packing/unpacking the values into an internal format along   
   lines vaguely similar to the .NET decimal format, just bigger, to   
   accommodate more digits:   
    4x 32-bit values, each holding 9 digits   
     (except the top one, which generally holds 7 digits);   
    16-bit exponent, sign byte.   
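
In C, the unpacked working format described above might look roughly like
this (a hypothetical sketch; the names are illustrative guesses, not the
mockup's actual identifiers):

```c
#include <stdint.h>

typedef struct {
    uint32_t frac[4];  /* frac[0..2]: 9 decimal digits each (0..999999999),
                          frac[3]: the top ~7 digits of the significand   */
    int16_t  exp;      /* decimal exponent */
    uint8_t  sign;     /* nonzero if negative */
} DfpTemp;

/* Load a small (<= 18-digit) integer significand into the low chunks;
   purely illustrative helper. */
static void dfp_from_u64(DfpTemp *d, uint64_t v, int exp10, int neg)
{
    d->frac[0] = (uint32_t)(v % 1000000000u);
    d->frac[1] = (uint32_t)(v / 1000000000u);
    d->frac[2] = 0;
    d->frac[3] = 0;
    d->exp  = (int16_t)exp10;
    d->sign = neg ? 1 : 0;
}
```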
      
   Then I wrote a few pack/unpack scenarios:   
    X30: Directly packing 20/30 bit chunks, non-standard;   
    DPD: Use the DPD format;   
    BID: Use the BID format.   
      
   For the pack/unpack step (taken in isolation):   
    X30 is around 10x faster than either DPD or BID;   
    Both DPD and BID need a similar amount of time.   
    BID needs a bunch of 128-bit arithmetic handlers.   
    DPD needs a bunch of merge/split and table lookups.   
    Seems to mostly balance out in this case.   
      
      
   For DPD, merge is effectively:   
    Do the table lookups;   
    v=v0+(v1*1000)+(v2*1000000);   
   With a split step like:   
    v0=v;   
    v1=v/1000;   
    v0-=v1*1000;   
    v2=v1/1000;   
    v1-=v2*1000;   
    Then, use table lookups to go back to DPD.   
      
   I did look into possibly faster ways of doing the splitting, but have   
   not yet found one that gives correct results (where one can assume the   
   compiler already knows how to turn division by a constant into   
   multiplication by a reciprocal).   
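
As a standalone sanity check, the merge/split arithmetic above round-trips
cleanly; a minimal self-contained version (the divisions by 1000 are
exactly the ones the compiler should strength-reduce to reciprocal
multiplies):

```c
#include <stdint.h>

/* Combine three base-1000 digits into one 0..999999999 value. */
static uint32_t merge3(uint32_t v0, uint32_t v1, uint32_t v2)
{
    return v0 + v1 * 1000u + v2 * 1000000u;
}

/* Split back into three base-1000 digits. */
static void split3(uint32_t v, uint32_t *v0, uint32_t *v1, uint32_t *v2)
{
    uint32_t t = v / 1000u;       /* divide-by-constant -> multiply by reciprocal */
    *v0 = v - t * 1000u;
    *v2 = t / 1000u;
    *v1 = t - *v2 * 1000u;
}
```

The DPD declet-to-table lookups then sit on either side of these two
helpers: decode gives the three base-1000 digits fed to merge3, and the
digits from split3 index the encode table.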
      
      
   At first this seemed like a strong reason to favor X30 over either DPD   
   or BID. Except that the cost of the ADD and MUL operations effectively   
   dwarfs that of the pack/unpack operations, so the relative cost   
   difference between X30 and DPD may not matter much.   
      
      
   As-is, MUL and ADD each seem to cost roughly 6x as much as the DPD   
   pack/unpack steps.   
      
   So, it seems, while DPD pack/unpack isn't free, it is not enough to make   
   X30 a decisive performance win either.   
      
      
      
   It might make more sense, if supporting BID, to just do it as its own   
   thing (and embrace using a bunch of 128-bit arithmetic, a   
   128*128=>256-bit widening multiply, ...). Also, of note, the BID case   
   ends up needing a lot more clutter, mostly again because C lacks native   
   support for 128-bit arithmetic.   
      
   If working based on digit chunks, it is likely better to stick with DPD   
   due to less clutter, etc. Though, this part would be less bad if C had   
   widespread support for 128-bit integers.   
      
      
      
   Though, in this case, the ADD and MUL operations currently work by   
   internally doubling the width and then narrowing the result after   
   normalization. This is slower, but could give exact results.   
      
      
   Though, it is still neither complete nor confirmed to produce correct   
   results.   
      
      
      
   But, yeah, might be more worthwhile to look into digit chunking:   
    12x 3 digits (16b chunk)   
    4x 9 digits (32b chunk)   
    2x 18 digits (64b chunk)   
    3x 12 digits (64b chunk)   
      
   Likely, I think:   
   3 digits: likely slower, because of needing significantly more operations;   
   9 digits: seemed sensible, and is the option I went with; internal   
   operations fully fit within the limits of 64-bit arithmetic;   
   18 digits: possible, but runs into many cases internally that would   
   require using 128-bit arithmetic;   
   12 digits: fits more easily into 64-bit arithmetic, but would still   
   sometimes exceed it; and isn't that much more than 9 digits (though it   
   would reduce the number of chunks needed from 4 to 3).   
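
The reason the 9-digit case fits 64-bit arithmetic: a chunk product is at
most (1e9-1)^2, and even with an added chunk and carry the running total
stays below 2^63, so a schoolbook multiply can accumulate in uint64_t
without overflow. A hedged sketch (my code, not the mockup's):

```c
#include <stdint.h>

#define CHUNK_BASE 1000000000ull

/* Schoolbook multiply of base-1e9 chunk arrays (least significant chunk
   first); r must have room for na+nb chunks. Every intermediate fits in
   64 bits: (1e9-1)^2 + 2*(1e9-1) < 2^63. */
static void mul_chunks(uint32_t r[], const uint32_t a[], int na,
                       const uint32_t b[], int nb)
{
    for (int i = 0; i < na + nb; i++) r[i] = 0;
    for (int i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < nb; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)(t % CHUNK_BASE);
            carry    = t / CHUNK_BASE;
        }
        r[i + nb] = (uint32_t)carry;  /* carry < 1e9, no further propagation */
    }
}
```

With 18-digit chunks the same partial product would be just under 1e36,
which is exactly where the 128-bit arithmetic requirement comes from.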
      
      
   While 18 digits conceptually needs fewer abstract operations than 9   
   digits, it would suffer the drawback of many of these operations being   
   notably slower.   
      
   However, if running on RV64G with the standard ABI, it is likely the   
   9-digit case would also take a performance hit due to sign-extended   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   