   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   

   Message 3,062 of 4,675   
   aen@nospicedham.spamtrap.com to not_valid@nospicedham.comcast.net   
   Re: cycles   
   11 Nov 17 19:35:22   
   
   On Sat, 11 Nov 2017 08:48:02 -0700, "James Van Buskirk" wrote:
      
   >Looking at your code more carefully, it seems to consist of three   
   >blocks like:   
   >   
   >mov rax, rdi   
   >and rax, [mem]   
   >mul [mem]   
   >sub rdi, rdx   
   >   
   These parts of the code   
      
     mul     qword ptr [c_1]     #  3   2   2        1        3  2   
     sub     rdi,rdx             #  1   1   x  x  x           1 0.33   
     mov     rax,rdi             #  1   1   x  x  x           1 0.33   
     and     rax,qword ptr [m_2] #  1   1   x  x  x  1           1   
      
   I figured fit the rule below (is that right?):
   #####   
   Instruction decoding   
      
   The decoders can handle four instructions per clock cycle, or five in   
   the case of macro-op fusion. Only the first one of the four decoders   
   can handle instructions that generate more than one uop. The minimum   
   output of the decoders is therefore 2 uops per clock cycle in   
   the case that all instructions generate 2 uops each so that only the   
   first decoder can be used. Instructions may be ordered according to   
   the 4-1-1-1 pattern for optimal decoder throughput.   
   #####   
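A rough way to check a sequence against the quoted rule is a toy model of the decoders. This is only a sketch under assumptions: the function name `decode_cycles` is made up for illustration, the uop counts are taken from the first numeric column of the listing above (assumed to be the fused-domain uop count), and macro-op fusion, fetch limits and instruction lengths are ignored.

```python
def decode_cycles(uop_counts, n_decoders=4, max_complex_uops=4):
    """Count cycles a toy 4-1-1-1 decoder needs for an in-order stream.

    uop_counts: uops generated by each instruction, in program order.
    Assumption: decoder 0 takes any instruction of up to 4 uops,
    decoders 1..3 take single-uop instructions only.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        if not 1 <= uop_counts[i] <= max_complex_uops:
            raise ValueError("instruction is outside this toy model")
        # Decoder 0 accepts the next instruction, simple or complex.
        i += 1
        # Decoders 1..3 continue in program order while the
        # following instructions generate exactly one uop each.
        slot = 1
        while slot < n_decoders and i < n and uop_counts[i] == 1:
            i += 1
            slot += 1
        cycles += 1
    return cycles


# mul / sub / mov / and from the listing: 3-1-1-1 fits within 4-1-1-1.
print(decode_cycles([3, 1, 1, 1]))  # 1
# Worst case from the quoted text: every instruction generates 2 uops.
print(decode_cycles([2, 2, 2, 2]))  # 4
```

In this model the block decodes in a single cycle, while a stream of 2-uop instructions decodes one instruction (2 uops) per cycle, which is the minimum the quoted text describes.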
   I tried to verify this by inserting RET at the places marked with
   lines like this:
   # ? cycles
   For instance:
   RET
   # 6 cycles
   gives an output of ~6,000,000.
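The arithmetic behind that reading, as a minimal sketch: it assumes the printed number is a total cycle count over the one million loop iterations mentioned in this thread, and the helper name `cycles_per_iteration` is hypothetical.

```python
def cycles_per_iteration(total_cycles, iterations=1_000_000):
    """Average cycles per loop iteration from a total cycle count."""
    return total_cycles / iterations


# RET placed at the "# 6 cycles" marker:
print(cycles_per_iteration(6_000_000))   # 6.0
# The full loop, as reported further down in this post:
print(cycles_per_iteration(12_000_000))  # 12.0
```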
   >There is a fourth block in there, but since it doesn't write rdi,   
   < ...   
   >to devise some more tests to check these possibilities out.   
      
   >Also, a million iterations is quite a lot and could catch an   
   >interrupt quite frequently.  Normally I would do a lot less   
      
   I assemble and link it in Geany, and when I click the Execute button
   directly in the GUI the output varies a lot from run to run, so I
   think it catches many interrupts there. In an X terminal the output
   hardly varies: it mostly reads ~12,000,000, with an occasional
   higher value.
      
   >iterations -- 100 would be plenty -- and store the results   
   >in an array.  After doing a small number of timings, 10 or   
   >less, print out the array.  Make some sort of judgment   
   >about the noise you see in your data.  And print out the   
   >raw results of your program: not everyone has a Nehalem   
   >running 64-bit linux for direct verification.   
   >   
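The measurement scheme suggested in the quote (short batches, results kept in an array, raw values printed, noise judged afterwards) could look roughly like this. It is a sketch only: `time.perf_counter_ns` stands in for whatever cycle counter the real program reads, and the names `time_batches` and `summarize` are made up for illustration.

```python
import statistics
import time


def time_batches(work, iterations=100, batches=10):
    """Time `batches` runs of `iterations` calls to `work`; keep raw results."""
    results = []
    for _ in range(batches):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            work()
        results.append(time.perf_counter_ns() - start)
    return results


def summarize(results):
    """Min, median, max and relative spread of the raw timings."""
    lo, hi = min(results), max(results)
    return {
        "min": lo,
        "median": statistics.median(results),
        "max": hi,
        # Relative spread: how far the worst run is above the best one.
        "spread": (hi - lo) / lo if lo else 0.0,
    }


if __name__ == "__main__":
    raw = time_batches(lambda: None)
    print(raw)             # raw results first, so others can judge them
    print(summarize(raw))
```

The minimum is usually the value least disturbed by interrupts; a large spread between minimum and maximum suggests the runs were interrupted.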
   I should have mentioned that I actually run this in a guest ("Linux
   debian 4.9.0-3-amd64") in VirtualBox on a Windows 10 host, and the CPU
   runs at 3.4GHz, but AFAICT the guest does run in real time (not sure
   though).
   --   
   aen   
      

(c) 1994,  bbs@darkrealms.ca