From: terje.mathisen@nospicedham.tmsw.no   
      
   Anton Ertl wrote:   
   > Terje Mathisen writes:   
   >> For a 32-bit division by X you calculate (1/X) * 2^n, with n just   
   >> big enough for the result to be between 2^31 and 2^32-1.   
   >   
   > An alternative is to compute RX=ceil(2^64/X), giving a 64-bit number   
   > (except for X=1, so you have to special-case for that if you want it   
   > totally general). Then you can do unsigned division of Y by X as   
   > follows:   
   >   
   > P = (Y * RX) >> 64   
   >   
   > You need to implement 64*32->96 multiplication for this (requiring 2   
   > 32*32->64 multiplications), and then take only the upper 32 bits of   
   > the result.   
   >   
   > This eliminates the shift by a non-multiple of 32, and the   
   > correction step that Terje's algorithm performs; in the literature   
   > you find an algorithm that requires a 33-bit reciprocal and shifts,   
   > but that is definitely slower than the method above.   
      
   Anton, your approach is _possibly_ faster if you have both a very fast   
   and pipelined (so the two MULs can run overlapped) 32x32->64 MUL, and   
   ADD/ADC for easy carry propagation.   
      
   For half of all possible divisors the 33-bit reciprocal will end with a   
   zero, so you don't need any fixup code, and for the remainder you can do   
   that fixup quite quickly as long as the shift is single-cycle.   
      
   If you want generic code then you also need a mask to apply/don't apply   
   the fixup, so I can imagine how modern 3 or 4-cycle MULs make your   
   alternative quite attractive.   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|