comp.lang.asm.x86

Ahh, the lost art of x86 assembly

4,675 messages


Message 3,907 of 4,675

antispam@nospicedham.math.uni.wroc. to All

Prologue and epilogue

12 Jul 19 01:15:45

   There was recently a discussion about the speed of various
   ways of saving and restoring registers.  I did a little
   microbenchmark, saving and restoring 7 registers in four
   different ways:
      
   a - save and restore using moves in ascending order   
   b - save using moves in descending order, restore in ascending order   
   c - save using pushes, restore using moves in ascending order   
   d - save and restore using pushes and pops   
      
   The called function had 17 instructions (two arithmetic instructions
   for stack adjustment, 14 save/restore instructions and a return).
   Version b (with descending stores and ascending loads):
      
   foo1:   
      subq  $0x78, %rsp   
      movq  %rbp,  0x70(%rsp)   
      movq  %r10,  0x68(%rsp)   
      movq  %r11,  0x60(%rsp)   
      movq  %r12,  0x58(%rsp)   
      movq  %r13,  0x50(%rsp)   
      movq  %r14,  0x48(%rsp)   
      movq  %r15,  0x40(%rsp)   
      
      movq  0x40(%rsp), %r15   
      movq  0x48(%rsp), %r14   
      movq  0x50(%rsp), %r13   
      movq  0x58(%rsp), %r12   
      movq  0x60(%rsp), %r11   
      movq  0x68(%rsp), %r10   
      movq  0x70(%rsp), %rbp   
      addq  $0x78, %rsp
      ret
      
   Version a had the stores in the opposite order, version c replaced
   the stores by pushes and moved the stack adjustment after the pushes,
   version d additionally replaced the loads by pops and moved the
   stack adjustment before the pushes.
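
   As an illustration, versions c and d might look like the sketch
   below.  It is reconstructed from the description above and is not
   the code that was benchmarked: the names foo_c, foo_d and call_n,
   the register order, and the 0x40-byte local area (0x78 minus the
   7 * 8 bytes of saved registers) are assumptions taken from the
   version b layout.  The assembly is wrapped in GNU file-scope asm
   so that it can be built and called from C on x86-64 ELF targets;
   elsewhere empty stubs are used.

```c
#include <assert.h>

#if defined(__x86_64__) && defined(__ELF__)
__asm__(
    ".pushsection .text\n"

    /* Version c (assumed layout): save with pushes, adjust the stack
       after the pushes, restore with moves in ascending order. */
    ".globl foo_c\n"
    ".type  foo_c, @function\n"
    "foo_c:\n"
    "    pushq %rbp\n"
    "    pushq %r10\n"
    "    pushq %r11\n"
    "    pushq %r12\n"
    "    pushq %r13\n"
    "    pushq %r14\n"
    "    pushq %r15\n"
    "    subq  $0x40, %rsp\n"
    "    movq  0x40(%rsp), %r15\n"
    "    movq  0x48(%rsp), %r14\n"
    "    movq  0x50(%rsp), %r13\n"
    "    movq  0x58(%rsp), %r12\n"
    "    movq  0x60(%rsp), %r11\n"
    "    movq  0x68(%rsp), %r10\n"
    "    movq  0x70(%rsp), %rbp\n"
    "    addq  $0x78, %rsp\n"
    "    ret\n"

    /* Version d (assumed layout): stack adjustment before the pushes,
       save with pushes, restore with pops in reverse order. */
    ".globl foo_d\n"
    ".type  foo_d, @function\n"
    "foo_d:\n"
    "    subq  $0x40, %rsp\n"
    "    pushq %rbp\n"
    "    pushq %r10\n"
    "    pushq %r11\n"
    "    pushq %r12\n"
    "    pushq %r13\n"
    "    pushq %r14\n"
    "    pushq %r15\n"
    "    popq  %r15\n"
    "    popq  %r14\n"
    "    popq  %r13\n"
    "    popq  %r12\n"
    "    popq  %r11\n"
    "    popq  %r10\n"
    "    popq  %rbp\n"
    "    addq  $0x40, %rsp\n"
    "    ret\n"

    ".popsection\n"
);
void foo_c(void);
void foo_d(void);
#else
/* Fallback so the file still builds on other targets; these stubs do nothing. */
void foo_c(void) {}
void foo_d(void) {}
#endif

/* Hypothetical helper: calls fn n times and returns the number of calls made. */
long call_n(void (*fn)(void), long n)
{
    long calls = 0;
    assert(fn != 0 && n >= 0);
    for (; n != 0; n--) {
        fn();
        calls++;
    }
    return calls;
}
```

   Both routines keep the 17-instruction count (two stack
   adjustments, 14 save/restore instructions and a return), so only
   the kind of save/restore instruction differs between versions.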
      
   This function was called from a loop consisting of 3 instructions:
      
     4003d0:       e8 2b 01 00 00          callq  400500    
     4003d5:       48 83 eb 01             sub    $0x1,%rbx   
     4003d9:       75 f5                   jne    4003d0    
      
   (the loop was actually compiled from C code).  So the critical loop
   has 20 instructions and 16 memory transfers (7 data stores, pushing
   the return address, 7 data loads and reading the return address).
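
   A driver of roughly the following shape could produce such a
   loop and turn elapsed time into clocks per iteration.  This is
   only a sketch: the post does not show the C source or say how the
   time was measured, so stub_routine, time_loop and
   clocks_per_iteration are hypothetical names, and clock() is an
   assumed timing method.  In a real measurement the routine would
   live in a separate assembly file so that the compiler cannot
   inline the call.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the assembly routine; it only counts calls. */
volatile long stub_calls = 0;
void stub_routine(void) { stub_calls++; }

/* Countdown loop: decrementing to zero is the shape that typically
   compiles to a call, a subtract and a conditional jump.
   Returns the elapsed CPU time in seconds. */
double time_loop(long iterations)
{
    clock_t start, end;
    long n = iterations;

    assert(iterations > 0);
    start = clock();
    do {
        stub_routine();
    } while (--n != 0);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Converts elapsed seconds into clocks per loop iteration, assuming
   the core ran at a constant frequency of hz for the whole run. */
double clocks_per_iteration(double seconds, double hz, long iterations)
{
    assert(iterations > 0);
    return seconds * hz / (double)iterations;
}
```

   For example, clocks_per_iteration(1.0, 1.6e9, 100000000L) gives
   16: a run of 10^8 iterations that takes one second at 1.6 GHz
   costs 16 clocks per iteration.  Turbo or frequency scaling would
   break the constant-frequency assumption.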
      
   I tested on a 1.7 GHz i5 and on a 1.60 GHz Celeron N3060.
   On the i5 all versions took 8 clocks per loop iteration with
   a small error (less than 2%).  On the Celeron, versions a, b and c
   each take 16 clocks (with very small error).  Version d
   needs 18 clocks.  Removing the stack adjustments from version d
   reduced the time to 17 clocks.  So, at least on modern Intel
   processors, the differences between moves and pushes are very
   small.
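
   Dividing the per-iteration counts given earlier (20 instructions,
   16 memory transfers) by the measured clocks gives a rough
   throughput figure.  The sketch below just does that arithmetic;
   per_clock and the two constants are names introduced here for
   illustration.

```c
#include <assert.h>

/* Per-iteration counts as stated in the post. */
enum { INSTRUCTIONS_PER_ITER = 20, MEMORY_OPS_PER_ITER = 16 };

/* Average number of events retired per clock for one loop iteration. */
double per_clock(int count, double clocks_per_iter)
{
    assert(clocks_per_iter > 0.0);
    return (double)count / clocks_per_iter;
}
```

   With 8 clocks per iteration the i5 averages 2.5 instructions and
   2 memory transfers per clock; with 16 clocks the Celeron averages
   1.25 instructions and 1 memory transfer per clock for versions
   a, b and c.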
      
   Of course this is a very naive benchmark and it only covers
   two processor types.
      
   --   
                                 Waldek Hebisch   
      
