Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86 | Ahh, the lost art of x86 assembly | 4,675 messages
Message 3,059 of 4,675
James Van Buskirk to All
Re: cycles
11 Nov 17 08:48:02
From: not_valid@nospicedham.comcast.net

wrote in message news:5a064e0e.24212140@NNTP.AIOE.ORG...

> Since I use it only at the start and the end of the code, and it runs
> a million times, the overhead of RDTSC and not completed instructions
> can be neglected I think.

Ah, I missed that part. Since you didn't provide raw output
I kind of glossed over your code.

> That's why i run the code 1,000,000 times, which doesn't make it so
> short after all.

> The timings I made with RDTSC on a Pentium have been very consistent
> with the tables in Agner Fog's Manual, but then the underlying
> processor architecture was much simpler. That makes it so very hard
> on the newer cpus to follow what's going on underneath :-(

Looking at your code more carefully, it seems to consist of three
blocks like:

    mov rax, rdi
    and rax, [mem]
    mul [mem]
    sub rdi, rdx

There is a fourth block in there, but since it doesn't write rdi,
it doesn't contribute to overall latency. Each instruction is
dependent on each previous instruction, so one would think
the latency would add up to 6*3 = 18 clock cycles. Surprising
that MUL m64 has lower latency than MUL m32.

4*3 = 12 clock cycles might be believable if the MOV
instruction could be handled by register renaming and
the latency on rdx in the MUL instruction were only 2 clock
cycles. I usually do timing on throughput-limited floating
point code rather than latency-limited integer code so I have
more limited experience on such issues. You might want
to devise some more tests to check these possibilities out.

Also, a million iterations is quite a lot and could catch an
interrupt quite frequently. Normally I would do far fewer
iterations -- 100 would be plenty -- and store the results
in an array. After doing a small number of timings, 10 or
fewer, print out the array. Make some sort of judgment
about the noise you see in your data. And print out the
raw results of your program: not everyone has a Nehalem
running 64-bit Linux for direct verification.

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
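For anyone who wants to try the short-run approach described above, here is a rough C sketch of it: time 100 iterations, store each result in an array, and print the raw array after 10 timings. It is not the original poster's code. The function names, the TIMING_DEMO_MAIN switch, the constants, and the loop body in dependent_chain are all made up for illustration, and the loop only loosely imitates a chain where each step waits on the previous one. It assumes GCC or Clang for the __rdtsc intrinsic and falls back to a nanosecond clock on non-x86 builds. RDTSC is not serialized here, so treat small differences as noise.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in the fallback path */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter through the compiler intrinsic. */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Non-x86 fallback: nanoseconds from a monotonic clock, not cycles. */
static uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

#define ITERATIONS 100  /* short run, less likely to catch an interrupt */
#define SAMPLES 10      /* number of raw timings to keep */

static volatile uint64_t sink;  /* keeps the computed result alive */

/* Hypothetical stand-in for the code under test: every step needs the
   previous value of x, so the loop is bound by latency. */
uint64_t dependent_chain(uint64_t x, int iterations) {
    /* volatile so each iteration reloads the operands from memory */
    volatile uint64_t mask = 0x0000ffffffffffffULL;
    volatile uint64_t factor = 0x9e3779b97f4a7c15ULL;
    for (int i = 0; i < iterations; i++) {
        uint64_t a = x & mask;
        x -= (a * factor) >> 32;
    }
    return x;
}

/* Time `count` short runs and store each raw tick difference. */
void collect_samples(uint64_t samples[], int count, int iterations) {
    assert(samples != NULL && count > 0);
    uint64_t x = 12345;
    for (int s = 0; s < count; s++) {
        uint64_t start = read_ticks();
        x = dependent_chain(x, iterations);
        uint64_t stop = read_ticks();
        samples[s] = stop - start;
    }
    sink = x;
}

/* Print the raw array so the noise can be judged by eye. */
void print_samples(const uint64_t samples[], int count) {
    for (int s = 0; s < count; s++) {
        printf("run %2d: %llu ticks\n", s, (unsigned long long)samples[s]);
    }
}

#ifdef TIMING_DEMO_MAIN
int main(void) {
    uint64_t samples[SAMPLES];
    collect_samples(samples, SAMPLES, ITERATIONS);
    print_samples(samples, SAMPLES);
    return 0;
}
#endif
```

Building with something like cc -O2 -DTIMING_DEMO_MAIN gives a standalone program. Dividing each printed value by ITERATIONS gives a rough per-iteration figure, and a run that caught an interrupt should stand out as an outlier in the raw list.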
(c) 1994, bbs@darkrealms.ca