Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,199 of 131,241    |
|    Robert Finch to BGB    |
|    Re: Tonights Tradeoff (2/3)    |
|    07 Nov 25 22:18:08    |
      [continued from previous message]

>> require using 128-bit arithmetic.
>>
>> 12 digits fits more easily into 64-bit arithmetic, but would still
>> sometimes exceed it; and isn't that much more than 9 digits (though it
>> would reduce the number of chunks needed from 4 to 3).
>>
>> While 18 digits conceptually needs fewer abstract operations than 9
>> digits, it would suffer the drawback of many of these operations being
>> notably slower.
>>
>> However, if running on RV64G with the standard ABI, it is likely the
>> 9-digit case would also take a performance hit due to sign-extended
>> unsigned int (and needing to spend 2 shifts whenever zero-extending a
>> value).
>>
>> With 3x 12 digits, while not exactly the densest scheme, there is a
>> little more "working space", which would reduce the cases that exceed
>> the limits of 64-bit arithmetic. Well, except multiply, where 24 > 18 ...
>>
>> The main merit of 9-digit chunking here is that it fully stays within
>> the limits of 64-bit arithmetic (where multiply temporarily widens to
>> working with 18 digits, but then narrows back to 9-digit chunks).
>>
>> Also, 9-digit chunking may be preferable when one has a faster
>> 32*32=>64 bit multiplier, but 64*64=>128 is slower.
>>
>> One other possibility could be to use BCD rather than chunking, but I
>> expect BCD emulation to be painfully slow in the absence of ISA-level
>> helpers.
>>
>
> I don't know yet if my implementation of DPD is actually correct.
>
> It seems Decimal128 DPD is obscure enough that I don't currently have
> any alternate options to confirm whether my encoding is correct.
>
> Here is an example value:
>   2DFFCC1AEB53B3FB_B4E262D0DAB5E680
>
> Which, in theory, should resemble PI.
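The 9-digit-chunk argument above can be made concrete with a small sketch (mine, not anything from the thread): with limbs held below 10^9, even the worst-case temporary in a schoolbook multiply stays under 2^63, which is the property being pointed at.

```python
# A sketch (not the poster's code) of base-10**9 "chunked" decimal integers:
# each limb is < 10**9, so a limb*limb product is < 10**18, and the largest
# temporary in a schoolbook multiply still fits comfortably in 64 bits.

BASE = 10**9  # one chunk = 9 decimal digits

def chunk_mul(a, b):
    """Schoolbook multiply of little-endian base-10**9 limb arrays."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            # ai*bj < 10**18, plus two values < ~2*10**9: still below 2**63.
            t = out[i + j] + ai * bj + carry
            carry, out[i + j] = divmod(t, BASE)
        out[i + len(b)] += carry
    return out

def to_int(limbs):
    """Reassemble a limb array into a Python int (for checking)."""
    return sum(v * BASE**i for i, v in enumerate(limbs))
```

The same layout with 12-digit limbs would not have this property, since (10^12)^2 = 10^24 overflows 64-bit arithmetic, matching the "except multiply, where 24 > 18" caveat above.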
>
> Annoyingly, it seems like pretty much everyone else either went with
> BID, or with other non-standard decimal encodings.
>
> Can't seem to find:
>   Any examples of hard-coded numbers in this format on the internet;
>   Any obvious way to generate them involving "stuff I already have"
>   (as in, not going and using some proprietary IBM library or similar).
>
> Also, Grok wasn't much help here; it just keeps trying to use Python's
> "decimal", which, it quickly becomes obvious, is not using Decimal128
> (much less DPD), but seemingly some other 256-bit format.
>
> And Grok fails to notice that what it is saying is nowhere close to
> correct in this case.
>
> Neither DeepSeek nor QWen was much help either... Both just sort of go
> down a rabbit hole, and eventually fall back to "Here is how you might
> go about trying to decode this format...".
>
> Not helpful; I mostly just want some way to confirm whether or not I
> got the format correct.
>
> Which is easier if one has some example numbers that can be decoded and
> verified, or something that is able to decode these numbers (which isn't
> just trying to stupidly shove them into Python's Decimal class...).
>
> Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
> and Boost C++, but these are less helpful because they went with BID.
>
> ...
>
> Checking, after things are a little more complete, in MHz (millions of
> operations per second) on my desktop PC:
>   DPD Pack/Unpack: 63.7 MHz (58 cycles)
>   X30 Pack/Unpack:  567 MHz ( 7 cycles) ?...
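On the lack-of-reference-vectors problem: the 3-digit-to-10-bit declet mapping at the core of DPD is small enough to transcribe directly from the published IEEE 754-2008 tables. The sketch below is my transcription, offered as a cross-check rather than a verified reference; it does not cover the Decimal128 combination field, only the declets.

```python
# Sketch (my transcription of the standard DPD declet tables, not the
# poster's code): encode/decode 3 decimal digits <-> one 10-bit declet.

def dpd_encode(d2, d1, d0):
    """Pack three decimal digits (d2 d1 d0) into a 10-bit DPD declet."""
    a, b, c, d = (d2 >> 3) & 1, (d2 >> 2) & 1, (d2 >> 1) & 1, d2 & 1
    e, f, g, h = (d1 >> 3) & 1, (d1 >> 2) & 1, (d1 >> 1) & 1, d1 & 1
    i, j, k, m = (d0 >> 3) & 1, (d0 >> 2) & 1, (d0 >> 1) & 1, d0 & 1
    # Case split on which digits are "large" (>= 8), per the standard table.
    if   (a, e, i) == (0, 0, 0): bits = (b, c, d, f, g, h, 0, j, k, m)
    elif (a, e, i) == (0, 0, 1): bits = (b, c, d, f, g, h, 1, 0, 0, m)
    elif (a, e, i) == (0, 1, 0): bits = (b, c, d, j, k, h, 1, 0, 1, m)
    elif (a, e, i) == (1, 0, 0): bits = (j, k, d, f, g, h, 1, 1, 0, m)
    elif (a, e, i) == (1, 1, 0): bits = (j, k, d, 0, 0, h, 1, 1, 1, m)
    elif (a, e, i) == (1, 0, 1): bits = (f, g, d, 0, 1, h, 1, 1, 1, m)
    elif (a, e, i) == (0, 1, 1): bits = (b, c, d, 1, 0, h, 1, 1, 1, m)
    else:                        bits = (0, 0, d, 1, 1, h, 1, 1, 1, m)
    v = 0
    for bit in bits:
        v = (v << 1) | bit
    return v

def dpd_decode(declet):
    """Unpack a 10-bit declet back into three decimal digits."""
    p, q, r, s, t, u, v, w, x, y = [(declet >> (9 - n)) & 1 for n in range(10)]
    if v == 0:
        return 4*p + 2*q + r, 4*s + 2*t + u, 4*w + 2*x + y
    if (w, x) == (0, 0):
        return 4*p + 2*q + r, 4*s + 2*t + u, 8 + y
    if (w, x) == (0, 1):
        return 4*p + 2*q + r, 8 + u, 4*s + 2*t + y
    if (w, x) == (1, 0):
        return 8 + r, 4*s + 2*t + u, 4*p + 2*q + y
    # (w, x) == (1, 1): disambiguate on (s, t)
    if (s, t) == (0, 0):
        return 8 + r, 8 + u, 4*p + 2*q + y
    if (s, t) == (0, 1):
        return 8 + r, 4*p + 2*q + u, 8 + y
    if (s, t) == (1, 0):
        return 4*p + 2*q + r, 8 + u, 8 + y
    return 8 + r, 8 + u, 8 + y
```

A useful anchor when checking a declet implementation: all-small digit pairs pass through as BCD-like bit fields (e.g. digits 0,1,2 encode as 0x012), while 999 encodes as 0x0FF.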
>
>   FMUL (unwrap) : 21.0 MHz (176 cycles)
>   FADD (unwrap) : 11.9 MHz (311 cycles)
>
>   FDIV          :  0.4 MHz (very slow; Newton-Raphson)
>
>   FMUL (DPD)    : 11.2 MHz (330 cycles)
>   FADD (DPD)    :  8.6 MHz (430 cycles)
>   FMUL (X30)    : 12.4 MHz (298 cycles)
>   FADD (X30)    :  9.8 MHz (378 cycles)
>
> The relative performance impact of the wrap/unwrap step is somewhat
> larger than expected (vs the unwrapped case).
>
> Though there seems to be only a small difference here between DPD and
> X30 (so likely whatever is affecting performance here is not directly
> related to the cost of the pack/unpack process).
>
> The wrapped cases basically just add a wrapper function that unpacks
> the input values to the internal format, and then re-packs the result.
>
> Using the wrapped functions to estimate pack/unpack cost:
>   DPD cost: 51 cycles;
>   X30 cost: 41 cycles.
>
> There is not really a good way to make X30 much faster. It does pay the
> cost of dealing with the combination field.
>
> Not sure why they would be so close:
>   the DPD case does a whole lot of stuff;
>   the X30 case is mostly some shifts and similar.
>
> Though, in this case, it does use these functions by passing/returning
> structs by value. It is possible a by-reference design might be faster
> here.
>
> This could possibly be cheapened slightly by going to, say:
>   S.E13.M114
> in effect trading off some exponent range for cheaper handling of the
> exponent.
>
> Can note:
>   MUL and ADD use a double-width internal mantissa, so should be
>   accurate;
>   the current test doesn't implement rounding modes, though it could;
>   it is currently hard-wired at Round-Nearest-Even.
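For what it's worth, the quoted pack/unpack estimates appear to fall out of the FMUL deltas: a wrapped call does two unpacks plus one repack, so dividing the wrapped-minus-unwrapped difference by three conversions reproduces the 51- and 41-cycle figures (my reading of the numbers, not a stated derivation):

```python
# Back-of-envelope check of the quoted pack/unpack costs: a wrapped op
# performs two unpacks plus one repack, so the per-conversion cost is
# (wrapped_cycles - unwrapped_cycles) / 3. Figures are from the FMUL row.

UNWRAPPED_FMUL = 176  # cycles, unwrapped FMUL from the table above

def pack_unpack_cost(wrapped_cycles, unwrapped_cycles=UNWRAPPED_FMUL):
    """Estimate cycles per pack/unpack from the wrapped-vs-unwrapped delta."""
    return round((wrapped_cycles - unwrapped_cycles) / 3)
```

The FADD deltas give smaller per-conversion numbers, so the estimate is rough at best.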
>
> DIV uses Newton-Raphson.
> The process of converging is a lot more fiddly than with binary FP,
> partly because the strategy for generating the initial guess is far
> less accurate.
>
> So, it first uses a loop with hard-coded checks and scales to get it in
> the general area, before then letting N-R take over. If the value isn't
> close enough (seemingly +/- 25% or so), N-R flies off into space.
>
> Namely:
>   Exponent is wrong: scale by factors of 2 until correct;
>   Off by more than 50%: scale by +/- 25%;
>   Off by more than 25%: scale by +/- 12.5%;
>   Else: good enough, let normal N-R take over.
>
> The preconditioning step is usually simpler with binary FP, as the
> initial guess is usually within the correct range. So, one can use a
> single modified N-R step (that undershoots), followed by letting N-R
> take over.
>
> It is more of an issue, though, when the initial guess is "maybe within
> a factor of 10", because the usual reciprocal-approximation strategy
> used for binary FP isn't quite as effective.
>
> ...
>
> Still don't have a use-case; mostly just messing around with this...
>

When I built my decimal float code I ran into the same issue. There are
not really any examples on the web. I built integer to decimal-float and
decimal-float to integer converters, then compared results.

Some DFP encodings for 1,10,100,1000,1000000,12345678 (I hope these are
right, no guarantees):
  Integer                           decimal-float
u 00000000000000000000000000000001 25ffc000000000000000000000000000
u 0000000000000000000000000000000a 26000000000000000000000000000000
u 00000000000000000000000000000064 26004000000000000000000000000000
u 000000000000000000000000000003e8 26008000000000000000000000000000
u 000000000000000000000000000f4240 26014000000000000000000000000000
u 00000000000000000000000000bc614e 2601934b9c0c00000000000000000000
u 00000000000000000000000000000002 29ffc000000000000000000000000000

I have used the decimal float code (96-bit version) with Tiny BASIC and
it seems to work.

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
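Stepping back to the Newton-Raphson discussion upthread: the staged preconditioning described there can be sketched in ordinary binary doubles (a toy illustration of the idea, not the actual decimal code). The coarse loop only has to get d*x within the basin where the standard iteration converges; each N-R step then squares the relative error.

```python
def nr_reciprocal(d, guess):
    """Toy sketch (not the poster's code) of the staged scheme upthread:
    coarsely nudge a poor initial guess toward 1/d, then run plain
    Newton-Raphson x = x*(2 - d*x), which only converges once the guess
    is within roughly +/-25% of the true reciprocal."""
    x = guess
    # Precondition: walk r = d*x into [0.75, 1.25] step by step.
    while not (0.75 <= d * x <= 1.25):
        r = d * x
        if r < 0.5 or r > 2.0:
            x *= 2.0 if r < 1.0 else 0.5   # exponent wrong: powers of two
        elif r < 0.75:
            x *= 1.25                      # well short: nudge up ~25%
        else:
            x *= 0.875                     # well over: nudge down ~12.5%
    # Now plain N-R; the relative error squares on every step.
    for _ in range(5):
        x = x * (2.0 - d * x)
    return x
```

Starting N-R directly from a guess that is off by a factor of 10 makes the iteration diverge, which is the "flies off into space" behaviour described above.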
(c) 1994, bbs@darkrealms.ca