Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,294 of 131,241    |
|    BGB to Terje Mathisen    |
|    Re: Tonights Tradeoff (1/2)    |
|    14 Nov 25 14:39:30    |
   
   From: cr88192@gmail.com   
      
   On 11/14/2025 8:57 AM, Terje Mathisen wrote:   
   > BGB wrote:   
   >> It is possible to use an approach similar to double-dabble (feeding in   
   >> the binary number 1 bit at a time, and adding the decimal vector to   
   >> itself and incrementing for each 1 bit seen). But, alas, this is also   
   >> slow in this case (takes around 128 iterations to convert the Int128   
   >> to 4x 10e9). Though, still slightly faster than using a shift-subtract   
   >> divider to crack off 9 digit chunks by successively dividing by   
   >> 1000000000.   
   >>   
   >>   
   >> Or, maybe make another attempt at Radix-10e9 long division and see if   
   >> I can get it to actually work and give the correct result.   
   >   
   > I used division by 1e9 to extract groups of 9 digits from the binary   
   > result I got when calculating pi with arbitrary precision, back then (on   
   > a 386) I did it with the obvious edx:eax / 1e9 (in ebx) -> remainder   
   > (edx) and result (eax) in a loop, which was fast enough for tsomething I   
   > only needed to do once.   
   >   
   > Today, with 64-bit cpus, why not use a reciprocal mul to get a value   
   > that cannot be too high, save the result, then back-multiply and subtract?   
   >   
      
   Dunno.   
      
   I guess, 128/64 bit IDIV could be possible on the HW, but there isn't a   
   good way to access this from C (absent using functionality outside what   
   exists in portable C95).   
      
   Could be possible, but at the moment don't really want to go the   
   direction of alternate code paths, and wildly different performance   
   based on what compiler one is using.   
      
      
   Though, in simple cases, the compilers are smart enough to turn   
   divide-by-constant into multiply-by-reciprocal internally (if they   
   support the type in question).   
      
      
      
   If anything, GCC having __int128 support, leans a lot more in favor of   
   using BID rather than DPD.   
      
   I decided against using BID with this code for the reason that this   
   bottleneck would unnecessarily penalize BID, which would be better   
   handled in a different way (namely writing code which works natively   
   with numbers in linear integer form; and not in 10e9 form).   
      
      
   As noted, a double-dabble approach is, say:   
    v=a.hi;   
    for(i=0; i<64; i++)   
    {   
    TKD128_AddArray4(arr, arr, arr);   
    arr[0]+=v>>63;   
    v=v<<1;   
    }   
    v=a.lo;   
    for(i=0; i<64; i++)   
    {   
    TKD128_AddArray4(arr, arr, arr);   
    arr[0]+=v>>63;   
    v=v<<1;   
    }   
   Which, technically works, but doesn't really win any awards for speed...   
      
      
   > Any off-by-one error will be caught by the next iteration.   
   >   
      
   Yeah...   
      
      
   Though, did make another attempt at 10e9 long division, and got it   
   working correctly this time...   
      
   /* arem is both dividend and remainder, 8 elements; padded.   
    * adiv is the divisor (4 elements).   
    * aquo is the output quotient (also 8 elements)   
    * result derived from high elements.   
    */   
   void TKD128_LongDivArray8x4(u32 *arem, u32 *adiv, u32 *aquo)   
   {   
    u32 adtmp[8];   
    u64 adx, ady, tdiv;   
    u32 ad0, ad1, ad2, ad3, or8;   
    int i, j, n, re;   
      
    memset(aquo, 0, 8*sizeof(u32));   
    adtmp[0]=0; adtmp[1]=0;   
    adtmp[2]=0; adtmp[3]=0;   
    adtmp[4]=adiv[0]; adtmp[5]=adiv[1];   
    adtmp[6]=adiv[2]; adtmp[7]=adiv[3];   
      
    tdiv=adiv[3]; /* assume not zero... */   
      
    for(i=0; i<5; i++)   
    {   
    /* doesn't always work in a single pass, usually 1 or 2 */   
    for(j=0; j<4; j++)   
    {   
    ad0=arem[7-i];   
    ad1=arem[8-i];   
    adx=(ad1*1000000000ULL)+ad0;   
    ady=adx/(tdiv+1);   
    if(!ady)   
    break; /* if was zero, this position is done */   
    ad2=ady;   
    if(ady>=2000000000) /* range limit so no overflow */   
    { ad2=1999999999; }   
    if(ad2>0)   
    {   
    TKD128_SubScaleArray8X_30(arem, adtmp, ad2, arem);   
    ad3=aquo[0];   
    ad3+=ad2;   
    if(ad3>=1000000000)   
    { aquo[1]++; ad3-=1000000000; }   
    aquo[0]=ad3;   
    }   
    }   
    TKD128_ScaleLeftArray8_S9(aquo);   
    TKD128_ScaleRightArray8_S9(adtmp);   
    }   
    for(; i<8; i++)   
    TKD128_ScaleLeftArray8_S9(aquo);   
   }   
      
   TKD128_ScaleLeftArray8_S9:   
   Copy elements left (towards higher index) by 1 position (32 bits).   
      
   TKD128_ScaleRightArray8_S9:   
   Copy elements right (towards a lower index) by 1 position (32 bits).   
      
   TKD128_SubScaleArray8X_30(c, a, b, d):   
    v=a[i]*b;   
    v_h=v/1000000000;   
    v_l=v-v_h*1000000000;   
    d[i+0]=c[i+0]-v_l;   
    d[i+1]=c[i+1]-v_h;   
    With extra parts to deal with borrow propagation and similar.   
    May access out-of-bounds for c/d arrays   
    (arem needs to be padded by a few extra elements).   
      
   Note that is runs for 5 iterations (vs 4) because this is how one gets   
   it to produce a full fraction rather than an integer divide (the integer   
   divide results are similar, but differ on the low order digits).   
      
   Running for 5 appears sufficient (could run for 6..8, but these appear   
   to deliver the same final result and are slower).   
      
   One could debate whether stopping early could effect the results, but   
   the low-order digits are initialized to 0, and 1000000000-999999999 is   
   000000001, meaning that in the worst case, the low-order borrows would   
   be absorbed in this case.   
      
      
      
   Performance:   
    Slightly faster than using N-R.   
    But, still nowhere near what I had hoped.   
      
      
   >>   
   >> Though, might be worthwhile, since if I could make the DIV operator   
   >> faster, I could claim a result of "faster than IBM's decNumber library".   
   >   
   > :-)   
   >   
      
   Currently, ADD/SUB/MUL seem to be faster in my case.   
      
   DIV is still slower, and seems to be putting up a big fight here.   
    decNumber seems to have a DIV that is around 1/2 the speed of the MUL.   
    In my case, DIV is still around 10x slower than MUL.   
      
   Currently I have it at 3.4 MHz in GCC, 2.7 MHz in MSVC.   
      
   To match decNumber, would need to get closer to around 6 million divides   
   per second (in GCC).   
      
      
   Current stats in a GCC build are:   
    ADD/SUB: 36 MHz (unpacked), 17 MHz (DPD)   
    MUL: 27 MHz (unpacked), 14 MHz (DPD)   
    DIV: 3.4 MHz (both)   
    SQRT: 1.0 MHz (both)   
      
      
   MSVC scores:   
    ADD/SUB: 13 MHz (unpacked), 0.8 MHz (DPD)   
    MUL: 18 MHz (unpacked), 1.0 MHz (DPD)   
    DIV: 2.7 MHz, 0.7 MHz (DPD)   
    SQRT: 0.8 MHz, 0.6 MHz (DPD)   
      
   Everything is slower here with MSVC it seems...   
    The DPD pack/unpack kinda wrecks things.   
    The X30 pack/unpack is around 36% faster than DPD.   
      
   Not entirely sure why MSVC is sucking so badly here (it doesn't usually   
   suck this bad).   
      
   Checking Clang:   
   It is slightly faster than MSVC, but much closer to the MSVC performance   
   than the GCC performance in this case (so, whatever issue seems to be   
   effecting MSVC here also appears to effect Clang).   
      
      
   This may require investigation, but then again, a lot of this isn't   
   exactly "high performance" code (and does a lot of stuff I would   
   normally avoid, but was basically unavoidable due to the whole   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca