Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,199 of 131,241    |
|    Robert Finch to BGB    |
|    Re: Tonights Tradeoff (2/3)    |
|    07 Nov 25 22:18:08    |
      [continued from previous message]

>> require using 128-bit arithmetic.
>>
>> 12 digits fits more easily into 64-bit arithmetic, but would still
>> sometimes exceed it; and isn't that much more than 9 digits (though it
>> would reduce the number of chunks needed from 4 to 3).
>>
>> While 18 digits conceptually needs fewer abstract operations than 9
>> digits, it would suffer the drawback of many of these operations being
>> notably slower.
>>
>> However, if running on RV64G with the standard ABI, it is likely the
>> 9-digit case would also take a performance hit due to sign-extended
>> unsigned int (and needing to spend 2 shifts whenever zero-extending a
>> value).
>>
>> With 3x 12 digits, while not exactly the densest scheme, there is a
>> little more "working space", which would reduce the cases that exceed
>> the limits of 64-bit arithmetic. Well, except multiply, where 24 > 18 ...
>>
>> The main merit of 9-digit chunking here is that it fully stays within
>> the limits of 64-bit arithmetic (where multiply temporarily widens to
>> working with 18 digits, but then narrows back to 9-digit chunks).
>>
>> Also, 9-digit chunking may be preferable when one has a faster
>> 32*32=>64 bit multiplier, but 64*64=>128 is slower.
>>
>> One other possibility could be to use BCD rather than chunking, but I
>> expect BCD emulation to be painfully slow in the absence of ISA-level
>> helpers.
>>
>
> I don't know yet if my implementation of DPD is actually correct.
>
> It seems Decimal128 DPD is obscure enough that I don't currently have
> any alternate options to confirm whether my encoding is correct.
>
> Here is an example value:
>   2DFFCC1AEB53B3FB_B4E262D0DAB5E680
>
> Which, in theory, should resemble PI.
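The 9-digit-chunk argument above can be made concrete with a small sketch (mine, not anything from the thread): with limbs held below 10^9, even the worst-case temporary in a schoolbook multiply stays under 2^63, which is the property being pointed at.

```python
# A sketch (not the poster's code) of base-10**9 "chunked" decimal integers:
# each limb is < 10**9, so a limb*limb product is < 10**18, and the largest
# temporary in a schoolbook multiply still fits comfortably in 64 bits.

BASE = 10**9  # one chunk = 9 decimal digits

def chunk_mul(a, b):
    """Schoolbook multiply of little-endian base-10**9 limb arrays."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            # ai*bj < 10**18, plus two values < ~2*10**9: still below 2**63.
            t = out[i + j] + ai * bj + carry
            carry, out[i + j] = divmod(t, BASE)
        out[i + len(b)] += carry
    return out

def to_int(limbs):
    """Reassemble a limb array into a Python int (for checking)."""
    return sum(v * BASE**i for i, v in enumerate(limbs))
```

The same layout with 12-digit limbs would not have this property, since (10^12)^2 = 10^24 overflows 64-bit arithmetic, matching the "except multiply, where 24 > 18" caveat above.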
>
> Annoyingly, it seems like pretty much everyone else either went with
> BID, or with other non-standard decimal encodings.
>
> Can't seem to find:
>   Any examples of hard-coded numbers in this format on the internet;
>   Any obvious way to generate them involving "stuff I already have"
>   (as in, not going and using some proprietary IBM library or similar).
>
> Also, Grok wasn't much help here; it just keeps trying to use Python's
> "decimal", which, it quickly becomes obvious, is not using Decimal128
> (much less DPD), but seemingly some other 256-bit format.
>
> And Grok fails to notice that what it is saying is nowhere close to
> correct in this case.
>
> Neither DeepSeek nor QWen was much help either... Both just sort of go
> down a rabbit hole, and eventually fall back to "Here is how you might
> go about trying to decode this format...".
>
> Not helpful; I mostly just want some way to confirm whether or not I
> got the format correct.
>
> Which is easier if one has some example numbers that can be decoded and
> verified, or something that is able to decode these numbers (which isn't
> just trying to stupidly shove them into Python's Decimal class...).
>
> Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
> and Boost C++, but these are less helpful because they went with BID.
>
> ...
>
> Checking, after things are a little more complete, in MHz (millions of
> operations per second) on my desktop PC:
>   DPD Pack/Unpack: 63.7 MHz (58 cycles)
>   X30 Pack/Unpack:  567 MHz ( 7 cycles) ?...
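On the lack-of-reference-vectors problem: the 3-digit-to-10-bit declet mapping at the core of DPD is small enough to transcribe directly from the published IEEE 754-2008 tables. The sketch below is my transcription, offered as a cross-check rather than a verified reference; it does not cover the Decimal128 combination field, only the declets.

```python
# Sketch (my transcription of the standard DPD declet tables, not the
# poster's code): encode/decode 3 decimal digits <-> one 10-bit declet.

def dpd_encode(d2, d1, d0):
    """Pack three decimal digits (d2 d1 d0) into a 10-bit DPD declet."""
    a, b, c, d = (d2 >> 3) & 1, (d2 >> 2) & 1, (d2 >> 1) & 1, d2 & 1
    e, f, g, h = (d1 >> 3) & 1, (d1 >> 2) & 1, (d1 >> 1) & 1, d1 & 1
    i, j, k, m = (d0 >> 3) & 1, (d0 >> 2) & 1, (d0 >> 1) & 1, d0 & 1
    # Case split on which digits are "large" (>= 8), per the standard table.
    if   (a, e, i) == (0, 0, 0): bits = (b, c, d, f, g, h, 0, j, k, m)
    elif (a, e, i) == (0, 0, 1): bits = (b, c, d, f, g, h, 1, 0, 0, m)
    elif (a, e, i) == (0, 1, 0): bits = (b, c, d, j, k, h, 1, 0, 1, m)
    elif (a, e, i) == (1, 0, 0): bits = (j, k, d, f, g, h, 1, 1, 0, m)
    elif (a, e, i) == (1, 1, 0): bits = (j, k, d, 0, 0, h, 1, 1, 1, m)
    elif (a, e, i) == (1, 0, 1): bits = (f, g, d, 0, 1, h, 1, 1, 1, m)
    elif (a, e, i) == (0, 1, 1): bits = (b, c, d, 1, 0, h, 1, 1, 1, m)
    else:                        bits = (0, 0, d, 1, 1, h, 1, 1, 1, m)
    v = 0
    for bit in bits:
        v = (v << 1) | bit
    return v

def dpd_decode(declet):
    """Unpack a 10-bit declet back into three decimal digits."""
    p, q, r, s, t, u, v, w, x, y = [(declet >> (9 - n)) & 1 for n in range(10)]
    if v == 0:
        return 4*p + 2*q + r, 4*s + 2*t + u, 4*w + 2*x + y
    if (w, x) == (0, 0):
        return 4*p + 2*q + r, 4*s + 2*t + u, 8 + y
    if (w, x) == (0, 1):
        return 4*p + 2*q + r, 8 + u, 4*s + 2*t + y
    if (w, x) == (1, 0):
        return 8 + r, 4*s + 2*t + u, 4*p + 2*q + y
    # (w, x) == (1, 1): disambiguate on (s, t)
    if (s, t) == (0, 0):
        return 8 + r, 8 + u, 4*p + 2*q + y
    if (s, t) == (0, 1):
        return 8 + r, 4*p + 2*q + u, 8 + y
    if (s, t) == (1, 0):
        return 4*p + 2*q + r, 8 + u, 8 + y
    return 8 + r, 8 + u, 8 + y
```

A useful anchor when checking a declet implementation: all-small digit pairs pass through as BCD-like bit fields (e.g. digits 0,1,2 encode as 0x012), while 999 encodes as 0x0FF.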
>
>   FMUL (unwrap) : 21.0 MHz (176 cycles)
>   FADD (unwrap) : 11.9 MHz (311 cycles)
>
>   FDIV          :  0.4 MHz (very slow; Newton-Raphson)
>
>   FMUL (DPD)    : 11.2 MHz (330 cycles)
>   FADD (DPD)    :  8.6 MHz (430 cycles)
>   FMUL (X30)    : 12.4 MHz (298 cycles)
>   FADD (X30)    :  9.8 MHz (378 cycles)
>
> The relative performance impact of the wrap/unwrap step is somewhat
> larger than expected (vs the unwrapped case).
>
> Though there seems to be only a small difference here between DPD and
> X30 (so likely whatever is affecting performance here is not directly
> related to the cost of the pack/unpack process).
>
> The wrapped cases basically just add a wrapper function that unpacks
> the input values to the internal format, and then re-packs the result.
>
> Using the wrapped functions to estimate pack/unpack cost:
>   DPD cost: 51 cycles;
>   X30 cost: 41 cycles.
>
> There is not really a good way to make X30 much faster. It does pay the
> cost of dealing with the combination field.
>
> Not sure why they would be so close:
>   the DPD case does a whole lot of stuff;
>   the X30 case is mostly some shifts and similar.
>
> Though, in this case, it does use these functions by passing/returning
> structs by value. It is possible a by-reference design might be faster
> here.
>
> This could possibly be cheapened slightly by going to, say:
>   S.E13.M114
> in effect trading off some exponent range for cheaper handling of the
> exponent.
>
> Can note:
>   MUL and ADD use a double-width internal mantissa, so should be
>   accurate;
>   the current test doesn't implement rounding modes, though it could;
>   it is currently hard-wired at Round-Nearest-Even.
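For what it's worth, the quoted pack/unpack estimates appear to fall out of the FMUL deltas: a wrapped call does two unpacks plus one repack, so dividing the wrapped-minus-unwrapped difference by three conversions reproduces the 51- and 41-cycle figures (my reading of the numbers, not a stated derivation):

```python
# Back-of-envelope check of the quoted pack/unpack costs: a wrapped op
# performs two unpacks plus one repack, so the per-conversion cost is
# (wrapped_cycles - unwrapped_cycles) / 3. Figures are from the FMUL row.

UNWRAPPED_FMUL = 176  # cycles, unwrapped FMUL from the table above

def pack_unpack_cost(wrapped_cycles, unwrapped_cycles=UNWRAPPED_FMUL):
    """Estimate cycles per pack/unpack from the wrapped-vs-unwrapped delta."""
    return round((wrapped_cycles - unwrapped_cycles) / 3)
```

The FADD deltas give smaller per-conversion numbers, so the estimate is rough at best.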
>
> DIV uses Newton-Raphson.
> The process of converging is a lot more fiddly than with binary FP,
> partly because the strategy for generating the initial guess is far
> less accurate.
>
> So, it first uses a loop with hard-coded checks and scales to get it in
> the general area, before then letting N-R take over. If the value isn't
> close enough (seemingly +/- 25% or so), N-R flies off into space.
>
> Namely:
>   Exponent is wrong: scale by factors of 2 until correct;
>   Off by more than 50%: scale by +/- 25%;
>   Off by more than 25%: scale by +/- 12.5%;
>   Else: good enough, let normal N-R take over.
>
> The preconditioning step is usually simpler with binary FP, as the
> initial guess is usually within the correct range. So, one can use a
> single modified N-R step (that undershoots), followed by letting N-R
> take over.
>
> It is more of an issue, though, when the initial guess is "maybe within
> a factor of 10", because the usual reciprocal-approximation strategy
> used for binary FP isn't quite as effective.
>
> ...
>
> Still don't have a use-case; mostly just messing around with this...
>

When I built my decimal float code I ran into the same issue. There are
not really any examples on the web. I built integer to decimal-float and
decimal-float to integer converters, then compared results.

Some DFP encodings for 1,10,100,1000,1000000,12345678 (I hope these are
right, no guarantees):
  Integer                           decimal-float
u 00000000000000000000000000000001 25ffc000000000000000000000000000
u 0000000000000000000000000000000a 26000000000000000000000000000000
u 00000000000000000000000000000064 26004000000000000000000000000000
u 000000000000000000000000000003e8 26008000000000000000000000000000
u 000000000000000000000000000f4240 26014000000000000000000000000000
u 00000000000000000000000000bc614e 2601934b9c0c00000000000000000000
u 00000000000000000000000000000002 29ffc000000000000000000000000000

I have used the decimal float code (96-bit version) with Tiny BASIC and
it seems to work.

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
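Stepping back to the Newton-Raphson discussion upthread: the staged preconditioning described there can be sketched in ordinary binary doubles (a toy illustration of the idea, not the actual decimal code). The coarse loop only has to get d*x within the basin where the standard iteration converges; each N-R step then squares the relative error.

```python
def nr_reciprocal(d, guess):
    """Toy sketch (not the poster's code) of the staged scheme upthread:
    coarsely nudge a poor initial guess toward 1/d, then run plain
    Newton-Raphson x = x*(2 - d*x), which only converges once the guess
    is within roughly +/-25% of the true reciprocal."""
    x = guess
    # Precondition: walk r = d*x into [0.75, 1.25] step by step.
    while not (0.75 <= d * x <= 1.25):
        r = d * x
        if r < 0.5 or r > 2.0:
            x *= 2.0 if r < 1.0 else 0.5   # exponent wrong: powers of two
        elif r < 0.75:
            x *= 1.25                      # well short: nudge up ~25%
        else:
            x *= 0.875                     # well over: nudge down ~12.5%
    # Now plain N-R; the relative error squares on every step.
    for _ in range(5):
        x = x * (2.0 - d * x)
    return x
```

Starting N-R directly from a guess that is off by a factor of 10 makes the iteration diverge, which is the "flies off into space" behaviour described above.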
(c) 1994, bbs@darkrealms.ca