... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 3,162 of 4,675
Terje Mathisen to Bernhard Schornak
Re: Palindromic number
07 Dec 17 11:51:49
   From: terje.mathisen@nospicedham.tmsw.no   
      
   Bernhard Schornak wrote:   
   > Terje Mathisen wrote:   
      
   >> For 32-bit unsigned that algorithm uses about 30 cycles on a 10+   
   >> year old cpu, the 64-bit version, with 3 or 4-way split, should do   
   >> it in less than twice that time.   
   >   
   > 30 cycles is impossible. The below code is branch-free and needs 152   
   > cycles (tested on AMD Bulldozer) without formatting. Leaving out the   
   > formatting and other specific stuff not required for the given task,   
   > you could reduce execution time to about 120 clocks, but surely not   
   > less.   
      
   Read again, I never claimed 30 cycles for 64-bit, but that something   
   less than 60 should be possible on a cpu with more than 2 or 3-way   
   superscalarity.   
   >   
   > (We have to wait for the result of the MulDiv to get a remainder for   
   > the next step, so it is impossible to serialise processing.)   
      
   I didn't even try to understand your algorithm from the (mostly)   
   uncommented asm code, can you please explain it?   
      
   The way my algorithm works (for 32-bit) is like this:   
      
   Do a reciprocal mul in order to divide by 100 000, then backmultiply and   
   subtract in order to get the two parts: This is two serial MULs.   
      
   The rest of the algorithm can run in parallel for the two halves:   
      
   Scale the number (another MUL) so that it ends up as a fixed-point   
   m.nnnnn value with the first digit (m) in the top 4 bits, extract that   
   digit with a SHR 28 and save it to the top slot.   
      
      REPM 4   
        mask away the previous high digit   
        LEA reg,[reg+reg*4] to multiply by 5   
        Extract the new digit with a shift value one less (27,26,25,24)   
      ENDM   
      
   When a MUL takes 4 cycles we have 3 serial MULs and a set of   
   single-cycle SHR/AND/LEA operations that perfectly overlap the   
   processing of the two halves.   
      
   For 64 bit I have considered both a 4-way split, i.e. wrapping two   
   instances of the 32-bit code into an initial split along 1e10, this   
   would require 64-bit operations on the low half which could contain   
   values up to 9.999..e9   
      
   The alternative is to split 3 ways, with 7 digits in each part, this   
   could be a little bit faster since it matches better with the current   
   number of execution units in modern cpus, but I haven't checked   
   carefully that a scaled 7-digit value would work correctly for all   
   possible inputs (using just 32-bit ops). However since the number of   
   inputs is so low a full exhaustive test would be quick to do.   
      
   With a 3-way split I was thinking that it should be possible to overlap   
   some of the work:   
      
      top13 = x / 10000000LL;		;; ~5 cycles via reciprocal MUL   
      top6 = x / 100000000000000LL;		;; One additional cycle   
      
      bottom7 = x - top13*10000000LL;	;; ~5 cycles   
      middle7 = top13 - top6*10000000LL;	;; One additional cycle   
      
      
      top6 *= scale;			;; ~4 cycles   
      bottom7 *= scale;			;; +1   
      middle7 *= scale;			;; +1   
      
   So at this point we have used about 18 cycles, add 3 more for each   
   additional cycle of MUL latency above 4.   
      
   Past this point we're back in SHR/AND/LEA territory but with three pipes   
   in parallel.   
      
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]