
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 129,518 of 131,241   
   BGB to MitchAlsup   
   Re: Random: Very Low Precision FP   
   27 Aug 25 01:17:47   
   
   From: cr88192@gmail.com   
      
   On 8/26/2025 4:17 PM, MitchAlsup wrote:   
   >   
   > BGB  posted:   
   >   
   >> Well, idea here is that sometimes one wants to be able to do   
   >> floating-point math where accuracy is a very low priority.   
   >>   
   >> Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16   
   >> for (though, what I am thinking of here is low-precision even by   
   >> Binary16 standards).   
   >   
   > For 8-bit stuff, just use 5 memory tables [256×256]   
   >   
      
   Would work OK for scalar ops on a CPU; less great for SIMD (or other   
   cases where one can't afford a 64K lookup table).   
      
   For 4x FP8 on a CPU, it makes sense to just use a normal SIMD unit for this.   
      
   But, say, what if one wants 8x or 16x wide SIMD with FP8; or use within   
   more specialized units?...   
      
      
   Granted, FP8 multiply is fairly cheap either way (eg: the mantissa   
   multiply already fits into LUT6's and can be pulled off in combinatorial   
   logic). It is mostly FP8 FADD that needs to be made cheaper and low   
   latency; sadly approximating FADD/FSUB in the general case has typically   
   been the harder problem.   
      
      
   And, while a special case has been found (works for simple add with   
   similar exponents), it doesn't extend to the general case.   
      
      
   >> But, will use Binary16 and BF16 as the example formats.   
   >>   
   >> So, can note that one can approximate some ops with modified integer   
   >> ADD/SUB (excluding sign-bit handling):   
   >>    a*b    : A+B-0x3C00  (0x3F80 for BF16)   
   >>    a/b    : A-B+0x3C00   
   >>    sqrt(a): (A>>1)+0x1E00   
   >   
   > You are aware that GPUs perform elementary transcendental functions   
   > (32-bits) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.   
   > These functions get within 1.5-2 ULP. See authors: Oberman, Pierno,   
   > Matula circa 2000-2005 for relevant data. I did a crack at this   
   > (patented: Samsung) that got within 0.7 and 1.2 ULP using a three   
   > term polynomial instead of a 2 term polynomial.   
   > Standard GPU FP math (32-bit and 16-bit) are 4 cycles and are now   
   > IEEE 754 accurate (except for a couple of outlying cases.)   
   >   
   > So, I don't see this suggestion bringing value to the table.   
   >   
      
   These can do some operations more cheaply.   
      
   But, as noted, primarily for low precision values, likely in dedicated   
   logic.   
      
   This approach would not make as much sense in general for larger   
   formats, given that the accuracy is far below what would be   
   considered acceptable.   
      
   Though, a few of these were already in use (as CPU helper ops),   
   usually to provide "starter values" for Newton-Raphson.   
      
      
   But, this sort of thing is unlikely to replace general-purpose SIMD   
   ops on a CPU or similar in any case.   
      
   And, for the SIMD unit, can continue doing floating-point in ways that   
   "are not complete garbage".   
      
      
   But, say, for working with HDR values inside the rasterizer module or   
   similar, this is more where this sort of thing could matter.   
      
   Or, maybe, could be relevant for perspective-correct texture filtering   
   (well, if it were working with floating-point texture coords rather than   
   fixed point).   
      
   Might be better if the module could also do transform and deal with full   
   primitives, but this is asking too much.   
      
   Or, failing this, if it could be used for 2D "blit" operations   
   (currently only deals with square or rectangular power-of-2 images in   
   Morton Order, which isn't terribly useful for "blit").   
      
   Though, as noted, TKRA-GL keeps its textures internally in Morton Order.   
      
   Currently, TKRA-GL uses a 12-bit linear Z-Buffer (with 4 bits for   
   stencil), though it is possible that it could make sense to use floating   
   point for the Z-buffer (maybe S.E3.F8; as it mostly only needs to hold   
   values between -1.0 and 1.0, etc).   
      
      
   Some of the audio modules also use values mostly in A-Law form.   
   Though, annoyingly, it seems I have now ended up stuck with A-Law   
   formats at both Bias=7 and Bias=8. Initially I added the ops primarily   
   for audio, where Bias=8 made sense, but for other (non-audio) uses I   
   needed Bias=7 (renamed as FP8A). So, annoyingly, there are now two sets   
   of converter ops differing primarily in bias.   
      
   But, I am mostly phasing out FP8S (E4.F3.S) in favor of plain FP8   
   (S.E4.F3). Though, it lingers on as a sort of design mistake, much   
   like making my A-Law ops originally Bias=8. But, then, FP8A remains   
   preferable mostly because it has a slightly larger mantissa than normal FP8.   
      
      
   ...   
      
      
   >> The harder ones though, are ADD/SUB.   
   >>   
   >> A partial ADD seems to be:   
   >>     a+b: A+((B-A)>>1)+0x0400   
   >>   
   >> But, this simple case seems not to hold up when either doing subtract,   
   >> or when A and B are far apart.   
   >>   
   >> So, it would appear either that there is a 4th term or the bias is   
   >> variable (depending on the B-A term; and for ADD/SUB).   
   >>   
   >> Seems like the high bits (exponent and operator) could be used to drive   
   >> a lookup table, but this is lame. The magic bias appears to have   
   >> non-linear properties so isn't as easily represented with basic integer   
   >> operations.   
   >>   
   >> Then again, probably other people know about all of this and might know   
   >> what I am missing.   
   >   
   > I still recommend getting the right answer over getting a close but wrong   
   > answer a couple cycles earlier.   
      
      
   A lot depends on what is needed...   
      
   In cases where a person is doing math using FP8, any real semblance of   
   "accuracy" or "right answer" has already gone out the window. Usually   
   the math is just sorta throwing darts at a dartboard and hoping they   
   land somewhere in the right area.   
      
   Though, that said, even with these sorts of approximations (such as   
   approximating an FMUL with a modified ADD), often the top 3-5 bits of   
   the mantissa are correct. So, for FP8 or BF16, the approximate answer   
   may in many cases still be close to the answer from real   
   floating-point logic.   
      
      
   But, even for something like Binary16, it is a bit iffy.   
      
   There are 10 bits of mantissa, and math ops that only give around 3-5   
   bits of accuracy aren't great in this case.   
      
   Though, sometimes the accuracy doesn't matter that much, but one may   
   still want to avoid "stair steps" as the artifacts generated by this may   
   be much more obvious.   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca