From: anton@mips.complang.tuwien.ac.at   
      
   EricP writes:   
   >Anton Ertl wrote:   
   >> When I compile   
   >>   
   >> long foo(long a, long b)   
   >> {   
   >> if (a+b<0)   
   >> return a-b;   
   >> else   
   >> return a*b;   
   >> }   
   >>   
   >> with gcc-12.2.0 -O -c on AMD64, I get   
   >>   
   >> 0000000000000000 :   
   >> 0: 48 89 f8 mov %rdi,%rax   
   >> 3: 48 89 fa mov %rdi,%rdx   
   >> 6: 48 01 f2 add %rsi,%rdx   
9: 78 05 js 10 <foo+0x10>
   >> b: 48 0f af c6 imul %rsi,%rax   
   >> f: c3 ret   
   >> 10: 48 29 f0 sub %rsi,%rax   
   >> 13: c3 ret   
   ...   
   >This could be 1 MOV shorter.   
   >It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.   
   >Just ADD %rsi,%rdi and after that use the %rax copy.   
      
   Yes, I often see more register-register moves in gcc-generated code   
   than necessary.   
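
The one-MOV-shorter sequence EricP describes would look something like this (a hand-written sketch, not compiler output; %rdi is dead after the branch, so it can absorb the add):

```asm
foo:
	mov  %rdi,%rax        # rax = a, kept live for both arms
	add  %rsi,%rdi        # rdi = a+b, sets SF; rdi not needed afterwards
	js   .Lneg
	imul %rsi,%rax        # rax = a*b
	ret
.Lneg:
	sub  %rsi,%rax        # rax = a-b
	ret
```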
      
   >For that optimization { ADD CMP Bcc } => { ADD Bcc }   
   >to work those three instructions must be adjacent.   
   >In this case it wouldn't make a difference but in general   
   >I think they would want the freedom to move code about and not have   
   >the ADD bound to the Bcc too early so this would have to be about   
   >the very last optimization so it didn't interfere with code motion.   
      
   Yes, possible. When I look at what clang-14.0.6 -O -c produces, it's   
   this:   
      
0000000000000000 <foo>:
    0: 48 89 f9 mov %rdi,%rcx   
    3: 48 29 f1 sub %rsi,%rcx   
    6: 48 89 f0 mov %rsi,%rax   
    9: 48 0f af c7 imul %rdi,%rax   
    d: 48 01 fe add %rdi,%rsi   
    10: 48 0f 48 c1 cmovs %rcx,%rax   
    14: c3 ret   
      
   clang seems to prefer using cmov. The interesting thing here is that   
   it puts the add right in front of the cmovs, after the code for "a-b"   
   and "a*b". When I do   
      
   long foo(long a, long b)   
   {   
    if (a+b*111<0)   
    return a-b;   
    else   
    return a*b;   
   }   
      
   clang produces this code:   
      
0000000000000000 <foo>:
    0: 48 6b ce 6f imul $0x6f,%rsi,%rcx   
    4: 48 89 f8 mov %rdi,%rax   
    7: 48 29 f0 sub %rsi,%rax   
    a: 48 0f af f7 imul %rdi,%rsi   
    e: 48 01 f9 add %rdi,%rcx   
    11: 48 0f 49 c6 cmovns %rsi,%rax   
    15: c3 ret   
      
I.e., rcx=b*111 is computed first, but a+rcx comes late, right before the
cmovns. So clang seems to have some mechanism for keeping the add and
the cmov(n)s together as one unit. That makes sense: the cmov consumes
the flags the add produces, so no flag-clobbering instruction (such as
the imul) may come between them.
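
A C-level paraphrase of what the cmov-style code computes (my sketch, not clang's actual IR): both arms are evaluated unconditionally, and the add contributes only the sign that drives the final select.

```c
#include <assert.h>

/* The original branchy function from the post. */
long foo(long a, long b)
{
    if (a + b < 0)
        return a - b;
    else
        return a * b;
}

/* Branch-free paraphrase of clang's cmov sequence (a sketch):
   compute both arms up front, then select on the sign of a+b,
   as the cmovs does on the sign flag set by the add. */
long foo_cmov(long a, long b)
{
    long diff = a - b;             /* sub:  the "a+b<0" arm      */
    long prod = a * b;             /* imul: the other arm        */
    long sum  = a + b;             /* add:  only its sign is used */
    return sum < 0 ? diff : prod;  /* cmovs picks diff when SF=1 */
}
```

Both versions return the same value for every input; the cmov form just trades a branch for computing both arms.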
      
   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
    Mitch Alsup,    
      