
comp.lang.asm.x86


James Van Buskirk to All

Re: cycles

11 Nov 17 08:48:02

   From: not_valid@nospicedham.comcast.net   
      
   wrote in message news:5a064e0e.24212140@NNTP.AIOE.ORG...   
      
   > Since I use it only at the start and the end of the code, and it runs   
   > a million times, the overhead of RDTSC and not completed instructions   
   > can be neglected I think.   
      
   Ah, I missed that part.  Since you didn't provide raw output   
   I kind of glossed over your code.   
      
   > That's why i run the code 1,000,000 times, which doesn't make it so   
   > short after all.   
      
   > The timings I made with RDTSC on a Pentium have been very consistent   
   > with the tables in Agner Fog's Manual, but then the underlying   
   > processor architecture was much simpler.  That makes it so very hard   
   > on the newer cpus to follow what's going on underneath :-(   
      
   Looking at your code more carefully, it seems to consist of three   
   blocks like:   
      
   mov rax, rdi   
   and rax, [mem]   
   mul [mem]   
   sub rdi, rdx   
      
   There is a fourth block in there, but since it doesn't write rdi,
   it doesn't contribute to the overall latency.  Each instruction
   depends on the one before it, so one would think the latency
   would add up to 6*3 = 18 clock cycles.  It is surprising that
   MUL m64 has lower latency than MUL m32.
      
   4*3 = 12 clock cycles might be believable if the MOV   
   instruction could be handled by register renaming and   
   the latency on rdx in the MUL instruction were only 2 clock   
   cycles.  I usually do timing on throughput-limited floating   
   point code rather than latency-limited integer code so I have   
   more limited experience on such issues.  You might want   
   to devise some more tests to check these possibilities out.   
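As a rough check on the arithmetic above, the two totals can be reproduced by summing per-instruction latencies along the dependency chain. The figures in this sketch are assumptions chosen only to match 18 and 12 (1 cycle each for MOV, AND and SUB and 3 for MUL in the first case; 0 for a renamed MOV and 2 for MUL in the second). They are not measured values, and the names are made up for illustration.

```c
#include <stddef.h>

/* Assumed latencies in cycles for mov, and, mul, sub. Illustrative only,
   picked to reproduce the 6-cycle and 4-cycle per-block totals. */
static const int serial_latency[4]  = {1, 1, 3, 1};
static const int renamed_latency[4] = {0, 1, 2, 1};

/* Sum the latencies of one block and multiply by the number of
   chained blocks, since each block waits on the previous one. */
static int chain_latency(const int *latency, size_t count, int blocks)
{
    int per_block = 0;
    for (size_t i = 0; i < count; i++)
        per_block += latency[i];
    return per_block * blocks;
}
```

With three blocks, chain_latency(serial_latency, 4, 3) gives 18 and chain_latency(renamed_latency, 4, 3) gives 12.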
      
   Also, a million iterations is quite a lot and could catch an
   interrupt quite frequently.  Normally I would do far fewer
   iterations -- 100 would be plenty -- and store the results
   in an array.  After doing a small number of timings, 10 or
   fewer, print out the array.  Make some sort of judgment
   about the noise you see in your data.  And print out the
   raw results of your program: not everyone has a Nehalem
   running 64-bit Linux for direct verification.
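The procedure described above might look something like the following in C. This is only a sketch, not the original poster's code. It assumes a 64-bit GCC or Clang build (for unsigned __int128 and __rdtsc), and the constant standing in for [mem] and all function names are hypothetical. On non-x86 builds it falls back to clock(), which is much coarser.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__)
#include <x86intrin.h>            /* __rdtsc() on GCC and Clang */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>                 /* coarse fallback, not cycle-accurate */
static uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

#define ITERATIONS 100            /* iterations per timing */
#define SAMPLES    10             /* timings stored before printing */

/* Hypothetical stand-in for [mem]; volatile forces a load each time. */
static volatile uint64_t mem_operand = 0x9E3779B97F4A7C15ULL;

/* One dependent block: and, widening multiply, subtract the high half. */
static uint64_t chain_step(uint64_t rdi)
{
    uint64_t rax = rdi & mem_operand;
    unsigned __int128 product = (unsigned __int128)rax * mem_operand;
    uint64_t rdx = (uint64_t)(product >> 64);
    return rdi - rdx;
}

/* Time ITERATIONS passes of three chained blocks, SAMPLES times over,
   and store each raw tick count in the array. */
static void collect_samples(uint64_t *samples, int count, uint64_t *state)
{
    for (int s = 0; s < count; s++) {
        uint64_t rdi = *state;
        uint64_t start = read_ticks();
        for (int i = 0; i < ITERATIONS; i++) {
            rdi = chain_step(rdi);
            rdi = chain_step(rdi);
            rdi = chain_step(rdi);
        }
        uint64_t stop = read_ticks();
        *state = rdi;             /* keep the result live */
        samples[s] = stop - start;
    }
}

/* Print the raw results so the noise can be judged by eye. */
static void print_samples(const uint64_t *samples, int count)
{
    for (int s = 0; s < count; s++)
        printf("sample %d: %" PRIu64 " ticks\n", s, samples[s]);
}
```

Calling collect_samples on an array of SAMPLES entries and then print_samples from main prints ten raw tick counts. An outlier among them is a hint that an interrupt or similar disturbance landed inside that timing.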
      
