home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 131,114 of 131,241   
   Terje Mathisen to Robert Finch   
   Re: A useless machine   
   16 Feb 26 16:08:35   
   
   From: terje.mathisen@tmsw.no   
      
   Robert Finch wrote:   
   > On 2026-02-15 4:59 p.m., Terje Mathisen wrote:   
   >> Thomas Koenig wrote:   
   >>> Terje Mathisen  schrieb:   
   >>>   
   >>>> Just thinking about a way to replace ~N iterations by a MUL, ADD and    
   >>>> SHR   
   >>>>   
   >>>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC    
   >>>> pairs   
   >>>> and a SHRD/SHR, so maybe 8-10 cycles?   
   >>>   
   >>> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives   
   >>> an improved algorithm, which they are using for the supercomputer   
   >>> version.  They are doing:   
   >>>   
   >>> n = n0   
   >>> repeat   
   >>>    n = n + 1   
   >>>    alpha = ctz(n)   
   >>>    n = n * 3^alpha / 2^alpha   
   >>   
   >> Strange syntax...   
   >>   
   >> ctz(n) must count trailing zeroes, right?   
   >> If so, then ctz(n+1) will return 1 or more for odd n, so that gives    
   >> 3n+3, which after the division by 2 returns a value which is one too    
   >> high.   
   >>   
   >>>    n = n - 1   
   >>   
   >> but this is corrected here!   
   >>   
   >> The key must be that the multiplication by a power of three also works    
   >> for n with multiple trailing 1 bits.   
   >>   
   >> The "x/2^alpha" part is of course just x>>alpha.   
   >>   
   >> For an even n they still do a multiplication by 1 and a SHR 0, but the    
   >> code is branchless.   
   >>   
   >>>    beta = ctz(n)   
   >>>    n = n / 2^beta   
   >>> until n < n_0   
   >>   
   >> I am guessing that the best way to do the n*3^alpha parts is a table    
   >> of powers of 3, so   
   >>   
   >> let mut n = n0;   
   >> loop {   
   >>  Â  n += 1;   
   >>  Â  let alpha = ctz(n);   
   >>  Â  n = (n * power_of_three[alpha]) >> alpha;   
   >>  Â  n -= 1;   
   >>  Â  beta = ctz(n);   
   >>  Â  n >>= beta;   
   >>  Â  if n <= n_0 {break};   
   >> }   
   >>   
   >>>   
   >>> They are also playing some strange tricks with parallel solving that   
   >>> are not described clearly, or I am not clever enough to understand   
   >>> what they are doing. I would have to look at their code, I guess.   
   >>>   
   >> :-)   
   >>   
   >> Terje   
   >>   
   > How many iterations are done in one loop? I think it would need to be at    
   > least 10 on average to beat the simple approach. The simple approach    
   > only taking one clock per iteration. The 128-bit multiply and 128-bit    
   > count trailing zeros are slow in an FPGA.   
      
   Right.   
      
   The first part will handle trailing strings of 1's, while the second    
   part does the same for all trailing zeroes after that initial    
   mul_by_power_of_three.   
   >    
   > And, can the approach consume a more limited number of iterations? For   
   > example counting only the 6 LSB zeros?   
      
   It should work directly with a max(ctz,N) wrapper.   
   >    
   > I found a mistake in my code handling groups. It can only do about a    
   > 24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4   
   > million years?   
      
   FPGAs typically run at much, much lower frequencies than modern CPUs, so   
   if the code above can average 3-5 iterations per loop, and do it in    
   around 10 clock cycles, then the FPGA will be hard pressed to keep up?   
      
   Terje   
      
   --    
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca