
   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   


   Message 3,760 of 4,675   
   Bernhard Schornak to Terje Mathisen   
   Re: Stack management strategies (1/2)   
   05 Jan 19 07:59:46   
   
   From: schornak@nospicedham.web.de   
      
   Terje Mathisen wrote:   
      
      
   > Bernhard Schornak wrote:   
   >> R.Wieser wrote:   
   >>   
   >>> Terje,   
   >>>   
   >>>> As long as you are using a post-1986 CPU you can use stack-relative   
   >>>> addressing, in which case EBP is perfectly usable as a regular register,
   >>>   
   >>> I know, and I'm sure Bernhard knows that as well.
   >>   
   >> Yes. What I called "Intelligent Design" is code for recent
   >> processors, not for hardware shown in museums. I developed
   >> it before iNTEL came up with the less sophisticated, downgraded
   >> version they have published in their 'optimisation guides'
   >> for a couple of years.
   >>
   >> And, since I have researched this in depth for more than a
   >> decade now, I do know (and can prove experimentally) that this
   >> kind of stack management is faster than abusing rBP. As it
   >   
   > This is where you are wrong!   
      
      
   Am I? That implies you are an entity empowered to determine what
   *has to be* wrong and what *has to be* right...
      
      
   > Not specifically, i.e. using a single ESP update followed by MOV is probably
   > faster on many cpus, but not in general:
   >
   > Simply because long series of PUSH/POP are so common in both compiler and
   > assembler code, cpu architects have a lot of good reasons for trying to make
   > this sort of code significantly faster, and they almost certainly will do so.
   > (The most obvious hw optimization is to regard multiple sequential PUSH or POP
   > operations as a single macro op, this is effectively the same as using
   > explicit MOV operations while avoiding the code size impact.)
      
      
   Have a look at the AMD and iNTEL optimisation guides regarding
   write combining (WC). Both tell you to prefer full 64-byte writes
   if you want large improvements. The WC logic is triggered by
   consecutive writes to ascending memory locations. When you force
   a WC cycle via
      
      
               .p2align    5,,31   
             0:subq        $0xF8, %rsp   
               movq        %r15,  0x88(%rsp)   
               movq        %r14,  0x90(%rsp)   
               movq        %r13,  0x98(%rsp)   
               movq        %r12,  0xA0(%rsp)   
               movq        %r11,  0xA8(%rsp)   
               movq        %r10,  0xB0(%rsp)   
               movq        %rbp,  0xB8(%rsp)   
      
   it is much faster than   
      
               .p2align    5,,31   
             0:movq        %rsp,  %rbp   
               push        %r15   
               push        %r14   
               push        %r13   
               push        %r12   
               push        %r11   
               push        %r10   
               push        %rbx             # replaces RBP above   
               subq        $0xC0, %rsp   
      
   because the PUSHes work downwards, which prevents the internal
   logic from switching to a WC sequence. If properly coded, the
   PUSHless version combines those 7 register saves into one write
   sequence, updating the corresponding cache line in one gulp. The
   latest processor generations all have optimised prefetch
   mechanisms, so it is no problem to feed them lengthy instructions,
   as long as the code to be executed internally can be reduced to
   simple execution blocks. A MOV is always a basic operation, while
   PUSH and POP are complex operations (move register or data and
   update rSP before / after moving).
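
   To make the "complex operation" point concrete, here is a minimal
   sketch that spells one PUSH out as a stack pointer update plus a
   plain store and checks that both leave the same rSP and the same
   stored value. It assumes Linux x86-64 and GNU as (not the OS/2 and
   Windows targets used for the tests in this post); the _start entry
   point, the test value and the exit-status convention are made up
   for illustration only.

   ```asm
           # Assumption: Linux x86-64, GNU as.
           # Build and run:  as demo.s -o demo.o && ld demo.o -o demo && ./demo; echo $?
           .globl  _start
           .text
   _start:
           movabsq $0x1122334455667788, %r15   # arbitrary test value
           movq    %rsp, %rbx                  # remember rSP so both variants start equal

           # (a) PUSH: one instruction that lowers rSP and stores
           push    %r15
           movq    %rsp, %r12                  # rSP after the PUSH
           movq    (%rsp), %r13                # value the PUSH stored
           movq    %rbx, %rsp                  # rewind
           movq    $0, -8(%rsp)                # wipe the slot before variant (b)

           # (b) the same effect written as two basic operations
           subq    $8, %rsp                    # stack pointer update
           movq    %r15, (%rsp)                # plain store
           movq    %rsp, %r14                  # rSP after the SUB
           movq    (%rsp), %rax                # value the MOV stored
           movq    %rbx, %rsp                  # rewind

           movl    $1, %edi                    # exit status 1 = mismatch
           cmpq    %r12, %r14                  # same final rSP?
           jne     1f
           cmpq    %r13, %rax                  # same stored value?
           jne     1f
           xorl    %edi, %edi                  # exit status 0 = identical effect
   1:      movl    $60, %eax                   # SYS_exit on Linux x86-64
           syscall
   ```

   Apart from the flags (SUB modifies them, PUSH does not), both
   variants are equivalent. A prologue with one SUB and N MOVs pays
   for the rSP update once, while N PUSHes carry it N times.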
      
      
   > In the meantime you have to ask yourself:   
      
      
   Do I? Sorry, but I hate psycho-games... ;) Your sentence implies
   "You are wrong and I am right!". If you mean that, just say so in
   plain text instead of beating around the bush.
      
      
   > Did you measure these speedups in smaller micro benchmarks, or as part of a
   > substantial code base?
   > The reason I'm asking is because time and time again it turns out that
   > smaller code is faster code!
      
      
   My test suite for OS/2 (until 2009) was   
      
            .align 2,0x90   
      _test:subl $0x80, %esp   
            nop   
            nop   
            movl %edx,  0x68(%esp)   
            movl %ecx,  0x6C(%esp)   
            movl %ebx,  0x70(%esp)   
            movl %edi,  0x74(%esp)   
            movl %esi,  0x78(%esp)   
            movl %ebp,  0x7C(%esp)   
            ...   
            do something to let the load/store unit retire   
            ...   
            movl 0x68(%esp), %edx   
            movl 0x6C(%esp), %ecx   
            movl 0x70(%esp), %ebx   
            movl 0x74(%esp), %edi   
            movl 0x78(%esp), %esi   
            movl 0x7C(%esp), %ebp   
            addl $0x80,      %esp   
            ret   
      
   versus   
      
            .align 2,0x90   
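      _test:push %ebp              # save the caller's frame pointer
            movl %esp,  %ebp       # set up the EBP frame
            push %edx
            push %ecx
            push %ebx
            push %edi
            push %esi
            ...
            do something to let the load/store unit retire
            ...
            pop  %esi              # restore in reverse order of the PUSHes
            pop  %edi
            pop  %ebx
            pop  %ecx
            pop  %edx
            leave                  # esp = ebp, then pop ebp
            ret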
      
   tested on Athlon and Phenom, and for 64-bit Windows
      
           .p2align    5,,31   
      _test:subq $0xF8, %rsp   
            movq %r15,  0xA0(%rsp)   
            movq %r14,  0xA8(%rsp)   
            movq %r13,  0xB0(%rsp)   
            movq %r12,  0xB8(%rsp)   
            movq %r11,  0xC0(%rsp)   
            movq %r10,  0xC8(%rsp)   
            movq %rbx,  0xD0(%rsp)   
            movq %r9,   0xD8(%rsp)   
            movq %r8,   0xE0(%rsp)   
            movq %rdx,  0xE8(%rsp)   
            movq %rcx,  0xF0(%rsp)   
            ...   
            do something to let the load/store unit retire   
            ...   
            movq 0xA0(%rsp), %r15   
            movq 0xA8(%rsp), %r14   
            movq 0xB0(%rsp), %r13   
            movq 0xB8(%rsp), %r12   
            movq 0xC0(%rsp), %r11   
            movq 0xC8(%rsp), %r10   
            movq 0xD0(%rsp), %rbx   
            movq 0xD8(%rsp), %r9   
            movq 0xE0(%rsp), %r8   
            movq 0xE8(%rsp), %rdx   
            movq 0xF0(%rsp), %rcx   
            addq $0xF8,      %rsp   
            ret   
      
   versus   
      
           .p2align    5,,31   
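      _test:push %rbp              # save the caller's frame pointer
            movq %rsp,  %rbp       # set up the RBP frame
            push %r15
            push %r14
            push %r13
            push %r12
            push %r11
            push %r10
            push %rbx
            push %r9
            push %r8
            push %rdx
            push %rcx
            subq $0xC0, %rsp       # room for locals
            ...
            do something to let the load/store unit retire
            ...
            addq $0xC0, %rsp       # drop the locals, rsp points at saved rcx
            pop  %rcx              # restore in reverse order of the PUSHes
            pop  %rdx
            pop  %r8
            pop  %r9
            pop  %rbx
            pop  %r10
            pop  %r11
            pop  %r12
            pop  %r13
            pop  %r14
            pop  %r15
            leave                  # rsp = rbp, then pop rbp
            ret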
      
   tested on Phenom, Bulldozer and Ryzen.   
      
   It turned out that the two NOPs at the beginning (initially meant
   as filler for the remaining execution pipes) were superfluous.
   Both the 32-bit and the 64-bit tests show advantages for the
   PUSHless versions. The gain for the 32-bit version is larger, so
   I guess that iNTEL and AMD 'tuned' their schedulers to detect
   PUSH/POP sequences and apply some special handling to them.
      
      
   [continued in next message]   
      