home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 3,170 of 4,675   
   Bernhard Schornak to Terje Mathisen   
   Re: Palindromic number   
   08 Dec 17 15:43:48   
   
   From: schornak@nospicedham.web.de   
      
   Terje Mathisen wrote:   
      
      
   > Bernhard Schornak wrote:   
   >> Terje Mathisen wrote:   
   >   
   >>> For 32-bit unsigned that algorithm uses about 30 cycles on a 10+   
   >>> year old cpu, the 64-bit version, with 3 or 4-way split, should do   
   >>> it in less than twice that time.   
   >>   
   >> 30 cycles is impossible. The below code is branch-free and needs 152   
   >> cycles (tested on AMD Bulldozer) without formatting. Leaving out the   
   >> formatting and other specific stuff not required for the given task,   
   >> you could reduce execution time to about 120 clocks, but surely not   
   >> less.   
   >   
   > Read again, I never claimed 30 cycles for 64-bit, but that something less   
   than 60 should be possible   
   > on a cpu with more than 2 or 3-way superscalarity.   
      
      
   Ditto - the posted code uses the 32 portions of 64 bit registers,   
   so the 152 clocks cannot be the number for a 64 bit conversion. I   
   do not know how many ways of superscalarity Bulldozer has, but it   
   was the predecessor of Ryzen, so it shouldn't be too outdated.   
      
      
   >> (We have to wait for the result of the MulDiv to get a remainder for   
   >> the next step, so it is impossible to serialise processing.)   
   >   
   > I didn't even try to understand your algorithm from the (mostly) uncommented   
   asm code, can you   
   > please explain it?   
      
      
   Of course, just tell me which part of it. The entire function has   
   523 lines of code and actually is two functions (Q2dec and D2dec)   
   packed into one. It converts either 64 or 32 bit hex numbers into   
   a (optionally formatted) decimal ASCII string (with the option to   
   insert a pseudo FP and re-format the string). To extract decimals   
   from hex, the input number is divided (MulDiv) by 10E10 or 10E20,   
   the result is stored as ASCII digit and as well multiplied by the   
   power of ten used as divisor. This result then is subtracted from   
   the input number to get the remainder.   
      
   Q2dec() repeats this step nine times and continues with D2dec()'s   
   code, while D2dec() skips the not required 64 bit part and starts   
   with 10E10 and performs eight consecutive MulDiv-divisions (digit   
   number 10 (or 20) is the last remainder). Depending on the flags,   
   the string then either is formatted with three posible separators   
   (blank, dot or comma) or passed unprocessed. A pseudo FP might be   
   inserted, as well.   
      
   Depending on the passed flags, both functions will treat input as   
   signed or unsigned. Regardless of the optional formatting, output   
   always is aligned to the right side and an optional sign preceeds   
   the proper amount of blanks to provide proper alignment of powers   
   of ten in multi-line lists.   
      
      
   > The way my algorithm works (for 32-bit) is like this:   
   >   
   > Do a reciprocal mul in order to divide by 100 000, then backmultiply and   
   subtract in order to get   
   > the two parts: This is two serial MULs.   
   >   
   > The rest of the algorithm can run in parallel for the two halves:   
   >   
   > Scale the number (another MUL) so that it ends up as a fixed-point m.nnnnn   
   value with the first   
   > digit (m) in the top 4 bits, extract that digit with a SHR 28 and save it to   
   the top slot.   
   >   
   >    REPM 4   
   >      mask away the previous high digit   
   >      LEA reg,[reg+reg*4] to multiply by 5   
   >      Extract the new digit with a shift value one less (27,26,25,24)   
   >    ENDM   
   >   
   > When a MUL takes 4 cycles we have 3 serial MULs and a set of single-cycle   
   SHR/AND/LEA operations   
   > that perfectly overlap the processing of the two halves.   
   >   
   > For 64 bit I have considered both a 4-way split, i.e. wrapping two instances   
   of the 32-bit code into   
   > an initial split along 1e10, this would require 64-bit operations on the low   
   half which could   
   > contain values up to 9.999..e9   
   >   
   > The alternative is to split 3 ways, with 7 digits in each part, this could   
   be a little bit faster   
   > since it matches better with the current number of execution units in modern   
   cpus, but I haven't   
   > checked carefully that a scaled 7-digit value would work correctly for all   
   possible inputs (using   
   > just 32-bit ops). However since the number of inputs is so low a full   
   exhaustive test would be quick   
   > to do.   
   >   
   > With a 3-way split I was thinking that it should be possible to overlap some   
   of the work:   
   >   
   >    top13 = x / 10000000LL;        ;; ~5 cycles via reciprocal MUL   
   >    top6 = x / 100000000000000LL;        ;; One additional cycle   
   >   
   >    bottom7 = x - top13*10000000LL;    ;; ~5 cycles   
   >    middle7 = top13 - top6*10000000LL;    ;; One additional cycle   
   >   
   >   
   >    top6 *= scale;            ;; ~4 cycles   
   >    bottom7 *= scale;            ;; +1   
   >    middle7 *= scale;            ;; +1   
   >   
   > So at this point we have used about 18 cycles, add 3 more for each   
   additional cycle of MUL latency   
   > above 4.   
   >   
   > Past this point we're back in SHR/AND/LEA territory but with three pipes in   
   parallel.   
      
      
   I'm not a math genius, so it's all Greek to me... ;)   
      
   I surely have to read this another 1,000 times until I get a clue   
   how to write a working implementation. I am not really happy with   
   my current solution (too slow for my taste), but it still is fast   
   compared to most hex2dec() functions I have tested. (Which do not   
   offer my sophisticated formatting options - most skip the leading   
   zeroes, and thusly pass an ugly left aligned string).   
      
   If you risk a closer look: My code generally is optimised for AMD   
   Bulldozer's four pipe design. I always try to feed all pipes with   
   appropriate instructions to reduce idling caused by dependencies.   
   This also should speed up processor designs with less pipes.   
      
      
   Greetings from Augsburg   
      
   Bernhard Schornak   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca