Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,162 of 4,675    |
|    Terje Mathisen to Bernhard Schornak    |
|    Re: Palindromic number    |
|    07 Dec 17 11:51:49    |
      From: terje.mathisen@nospicedham.tmsw.no              Bernhard Schornak wrote:       > Terje Mathisen wrote:              >> For 32-bit unsigned that algorithm uses about 30 cycles on a 10+       >> year old cpu, the 64-bit version, with 3 or 4-way split, should do       >> it in less than twice that time.       >       > 30 cycles is impossible. The below code is branch-free and needs 152       > cycles (tested on AMD Bulldozer) without formatting. Leaving out the       > formatting and other specific stuff not required for the given task,       > you could reduce execution time to about 120 clocks, but surely not       > less.              Read again, I never claimed 30 cycles for 64-bit, but that something       less than 60 should be possible on a cpu with more than 2 or 3-way       superscalarity.       >       > (We have to wait for the result of the MulDiv to get a remainder for       > the next step, so it is impossible to serialise processing.)              I didn't even try to understand your algorithm from the (mostly)       uncommented asm code, can you please explain it?              The way my algorithm works (for 32-bit) is like this:              Do a reciprocal mul in order to divide by 100 000, then backmultiply and       subtract in order to get the two parts: This is two serial MULs.              The rest of the algorithm can run in parallel for the two halves:              Scale the number (another MUL) so that it ends up as a fixed-point       m.nnnnn value with the first digit (m) in the top 4 bits, extract that       digit with a SHR 28 and save it to the top slot.               REPM 4        mask away the previous high digit        LEA reg,[reg+reg*4] to multiply by 5        Extract the new digit with a shift value one less (27,26,25,24)        ENDM              When a MUL takes 4 cycles we have 3 serial MULs and a set of       single-cycle SHR/AND/LEA operations that perfectly overlap the       processing of the two halves.              For 64 bit I have considered both a 4-way split, i.e. wrapping two       instances of the 32-bit code into an initial split along 1e10, this       would require 64-bit operations on the low half which could contain       values up to 9.999..e9              The alternative is to split 3 ways, with 7 digits in each part, this       could be a little bit faster since it matches better with the current       number of execution units in modern cpus, but I haven't checked       carefully that a scaled 7-digit value would work correctly for all       possible inputs (using just 32-bit ops). However since the number of       inputs is so low a full exhaustive test would be quick to do.              With a 3-way split I was thinking that it should be possible to overlap       some of the work:               top13 = x / 10000000LL; ;; ~5 cycles via reciprocal MUL        top6 = x / 100000000000000LL; ;; One additional cycle               bottom7 = x - top13*10000000LL; ;; ~5 cycles        middle7 = top13 - top6*10000000LL; ;; One additional cycle                      top6 *= scale; ;; ~4 cycles        bottom7 *= scale; ;; +1        middle7 *= scale; ;; +1              So at this point we have used about 18 cycles, add 3 more for each       additional cycle of MUL latency above 4.              Past this point we're back in SHR/AND/LEA territory but with three pipes       in parallel.                     Terje              --       - |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca