Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,170 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Palindromic number    |
|    08 Dec 17 15:43:48    |
      From: schornak@nospicedham.web.de              Terje Mathisen wrote:                     > Bernhard Schornak wrote:       >> Terje Mathisen wrote:       >       >>> For 32-bit unsigned that algorithm uses about 30 cycles on a 10+       >>> year old cpu, the 64-bit version, with 3 or 4-way split, should do       >>> it in less than twice that time.       >>       >> 30 cycles is impossible. The below code is branch-free and needs 152       >> cycles (tested on AMD Bulldozer) without formatting. Leaving out the       >> formatting and other specific stuff not required for the given task,       >> you could reduce execution time to about 120 clocks, but surely not       >> less.       >       > Read again, I never claimed 30 cycles for 64-bit, but that something less       than 60 should be possible       > on a cpu with more than 2 or 3-way superscalarity.                     Ditto - the posted code uses the 32 portions of 64 bit registers,       so the 152 clocks cannot be the number for a 64 bit conversion. I       do not know how many ways of superscalarity Bulldozer has, but it       was the predecessor of Ryzen, so it shouldn't be too outdated.                     >> (We have to wait for the result of the MulDiv to get a remainder for       >> the next step, so it is impossible to serialise processing.)       >       > I didn't even try to understand your algorithm from the (mostly) uncommented       asm code, can you       > please explain it?                     Of course, just tell me which part of it. The entire function has       523 lines of code and actually is two functions (Q2dec and D2dec)       packed into one. It converts either 64 or 32 bit hex numbers into       a (optionally formatted) decimal ASCII string (with the option to       insert a pseudo FP and re-format the string). To extract decimals       from hex, the input number is divided (MulDiv) by 10E10 or 10E20,       the result is stored as ASCII digit and as well multiplied by the       power of ten used as divisor. This result then is subtracted from       the input number to get the remainder.              Q2dec() repeats this step nine times and continues with D2dec()'s       code, while D2dec() skips the not required 64 bit part and starts       with 10E10 and performs eight consecutive MulDiv-divisions (digit       number 10 (or 20) is the last remainder). Depending on the flags,       the string then either is formatted with three posible separators       (blank, dot or comma) or passed unprocessed. A pseudo FP might be       inserted, as well.              Depending on the passed flags, both functions will treat input as       signed or unsigned. Regardless of the optional formatting, output       always is aligned to the right side and an optional sign preceeds       the proper amount of blanks to provide proper alignment of powers       of ten in multi-line lists.                     > The way my algorithm works (for 32-bit) is like this:       >       > Do a reciprocal mul in order to divide by 100 000, then backmultiply and       subtract in order to get       > the two parts: This is two serial MULs.       >       > The rest of the algorithm can run in parallel for the two halves:       >       > Scale the number (another MUL) so that it ends up as a fixed-point m.nnnnn       value with the first       > digit (m) in the top 4 bits, extract that digit with a SHR 28 and save it to       the top slot.       >       > REPM 4       > mask away the previous high digit       > LEA reg,[reg+reg*4] to multiply by 5       > Extract the new digit with a shift value one less (27,26,25,24)       > ENDM       >       > When a MUL takes 4 cycles we have 3 serial MULs and a set of single-cycle       SHR/AND/LEA operations       > that perfectly overlap the processing of the two halves.       >       > For 64 bit I have considered both a 4-way split, i.e. wrapping two instances       of the 32-bit code into       > an initial split along 1e10, this would require 64-bit operations on the low       half which could       > contain values up to 9.999..e9       >       > The alternative is to split 3 ways, with 7 digits in each part, this could       be a little bit faster       > since it matches better with the current number of execution units in modern       cpus, but I haven't       > checked carefully that a scaled 7-digit value would work correctly for all       possible inputs (using       > just 32-bit ops). However since the number of inputs is so low a full       exhaustive test would be quick       > to do.       >       > With a 3-way split I was thinking that it should be possible to overlap some       of the work:       >       > top13 = x / 10000000LL; ;; ~5 cycles via reciprocal MUL       > top6 = x / 100000000000000LL; ;; One additional cycle       >       > bottom7 = x - top13*10000000LL; ;; ~5 cycles       > middle7 = top13 - top6*10000000LL; ;; One additional cycle       >       >       > top6 *= scale; ;; ~4 cycles       > bottom7 *= scale; ;; +1       > middle7 *= scale; ;; +1       >       > So at this point we have used about 18 cycles, add 3 more for each       additional cycle of MUL latency       > above 4.       >       > Past this point we're back in SHR/AND/LEA territory but with three pipes in       parallel.                     I'm not a math genius, so it's all Greek to me... ;)              I surely have to read this another 1,000 times until I get a clue       how to write a working implementation. I am not really happy with       my current solution (too slow for my taste), but it still is fast       compared to most hex2dec() functions I have tested. (Which do not       offer my sophisticated formatting options - most skip the leading       zeroes, and thusly pass an ugly left aligned string).              If you risk a closer look: My code generally is optimised for AMD       Bulldozer's four pipe design. I always try to feed all pipes with       appropriate instructions to reduce idling caused by dependencies.       This also should speed up processor designs with less pipes.                     Greetings from Augsburg              Bernhard Schornak              --- SoupGate-Win32 v1.05        * Origin: you cannot sedate... all the things you hate (1:229/2)    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca