On Sat, 11 Nov 2017 08:48:02 -0700, "James Van Buskirk"   
    wrote:   
      
   >Looking at your code more carefully, it seems to consist of three   
   >blocks like:   
   >   
   >mov rax, rdi   
   >and rax, [mem]   
   >mul [mem]   
   >sub rdi, rdx   
   >   
   These parts of the code   
      
    mul qword ptr [c_1] # 3 2 2 1 3 2   
    sub rdi,rdx # 1 1 x x x 1 0.33   
    mov rax,rdi # 1 1 x x x 1 0.33   
    and rax,qword ptr [m_2] # 1 1 x x x 1 1   
      
   I figured would fit the following rule, wouldn't they?   
   #####   
   Instruction decoding   
      
   The decoders can handle four instructions per clock cycle, or five in   
   the case of macro-op fusion. Only the first one of the four decoders   
   can handle instructions that generate more than one µop. The minimum   
   output of the decoders is therefore 2 uops per clock cycle in   
   the case that all instructions generate 2 uops each so that only the   
   first decoder can be used. Instructions may be ordered according to   
   the 4-1-1-1 pattern for optimal decoder throughput.   
   #####   
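
   To make the quoted rule concrete, here is a rough model of it. It is
   only a sketch under my own assumptions: instructions are decoded
   strictly in order, macro-op fusion and instruction-fetch limits are
   ignored, and every instruction generates 1 to 4 uops. The function name
   decode_cycles and the uop counts in the usage lines are made up for
   illustration; they are not measured values for the instructions above.

```python
def decode_cycles(uops_per_insn, n_decoders=4):
    """Count decode cycles under a simplified 4-1-1-1 model.

    Assumptions (mine, not the manual's): in-order decoding, no macro-op
    fusion, no fetch limits, and 1 to 4 uops per instruction.
    """
    for u in uops_per_insn:
        if not 1 <= u <= 4:
            raise ValueError("this sketch only handles 1 to 4 uops per instruction")

    cycles = 0
    i = 0
    n = len(uops_per_insn)
    while i < n:
        cycles += 1
        # The first decoder accepts whatever instruction comes next.
        i += 1
        # The remaining decoders only accept single-uop instructions;
        # the first multi-uop instruction has to wait for the next cycle.
        slot = 1
        while slot < n_decoders and i < n and uops_per_insn[i] == 1:
            i += 1
            slot += 1
    return cycles


# Hypothetical uop counts, chosen only to show the pattern.
print(decode_cycles([2, 1, 1, 1]))  # 1: one multi-uop instruction, then three simple ones
print(decode_cycles([2, 2, 2, 2]))  # 4: only the first decoder can be used
print(decode_cycles([1, 2, 1, 1]))  # 2: the 2-uop instruction must wait for decoder 0
```

   In this model a block that starts with its one multi-uop instruction
   decodes in a single cycle, while the same instructions in a worse order
   take two, which is the point of the 4-1-1-1 ordering advice.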
   I tried to verify this by inserting RET at the places with these   
   lines:   
   # ? cycles   
   for instance:   
   RET   
   # 6 cycles   
   gives an output of ~6,000,000.   
   >There is a fourth block in there, but since it doesn't write rdi,   
   < ...   
   >to devise some more tests to check these possibilities out.   
      
   >Also, a million iterations is quite a lot and could catch an   
   >interrupt quite frequently. Normally I would do a lot less   
      
   I assemble and link it in geany, and when I click the execute button   
   directly in the GUI the output varies a lot, so I think it catches   
   many interrupts there. In an X-terminal there are very few differences   
   in the output: it mostly reads ~12,000,000 with an occasional higher one.   
      
   >iterations -- 100 would be plenty -- and store the results   
   >in an array. After doing a small number of timings, 10 or   
   >less, print out the array. Make some sort of judgment   
   >about the noise you see in your data. And print out the   
   >raw results of your program: not everyone has a Nehalem   
   >running 64-bit linux for direct verification.   
   >   
   I should have mentioned that I actually run this in a guest "Linux debian   
   4.9.0-3-amd64" in VirtualBox on a Windows 10 host, and the CPU runs at   
   3.4 GHz, but AFAICT the guest does run in real time (not sure though).   
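
   The suggestion quoted above (short runs, results stored in an array,
   then a judgment about the noise) could be followed up with something
   like this. It is a sketch in Python rather than in the assembler of the
   test program. The values in runs are invented placeholders, not
   measurements, and summarize and cycles_per_iteration are names made up
   for this example.

```python
import statistics


def summarize(samples):
    """Return (minimum, median, spread) for a list of cycle counts."""
    lo = min(samples)
    med = statistics.median(samples)
    spread = max(samples) - lo
    return lo, med, spread


def cycles_per_iteration(total_cycles, iterations):
    """Average cycle count of one loop iteration."""
    return total_cycles / iterations


# Hypothetical raw results of ten runs of 100 iterations each (NOT real
# data); the one large value stands in for a run that caught an interrupt.
runs = [1210, 1204, 1200, 1203, 1200, 4800, 1201, 1200, 1206, 1202]

lo, med, spread = summarize(runs)
print("raw:", runs)
print("min:", lo, "median:", med, "spread:", spread)
# The minimum is the run least disturbed by interrupts or the hypervisor.
print("cycles per iteration:", cycles_per_iteration(lo, 100))
```

   Taking the minimum rather than the mean keeps a single disturbed run
   from dominating the result, and printing the raw array lets others
   judge the noise for themselves.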
   --   
   aen   
      