Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.asm.x86    |    Ahh, the lost art of x86 assembly    |    4,675 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 3,760 of 4,675    |
|    Bernhard Schornak to Terje Mathisen    |
|    Re: Stack management strategies (1/2)    |
|    05 Jan 19 07:59:46    |
From: schornak@nospicedham.web.de

Terje Mathisen wrote:

> Bernhard Schornak wrote:
>> R.Wieser wrote:
>>
>>> Terje,
>>>
>>>> As long as you are using a post-1986 CPU you can use stack-relative
>>>> addressing, in which case EBP is perfectly usable as a regular register,
>>>
>>> I know, and I'm sure Bernhard knows that as well.
>>
>> Yes. What I called "Intelligent Design" is code for recent
>> processors, not for hardware shown in museums. I developed
>> it before iNTEL came up with the less sophisticated, down-
>> graded version they have published in their 'optimisation
>> guides' for some years now.
>>
>> And, since I have researched this in depth for more than a
>> decade, I do know (and can prove it experimentally) that
>> this kind of stack management is faster than abusing rBP.
>> As it
>
> This is where you are wrong!

Am I? That implies you are an entity empowered to determine what
*has to be* wrong and what *has to be* right...

> Not specifically, i.e. using a single ESP update followed by MOV is
> probably faster on many CPUs, but not in general:
>
> Simply because long series of PUSH/POP are so common in both compiler
> and assembler code, CPU architects have a lot of good reasons for
> trying to make this sort of code significantly faster, and they
> almost certainly will do so. (The most obvious hw optimization is to
> regard multiple sequential PUSH or POP operations as a single macro
> op; this is effectively the same as using explicit MOV operations
> while avoiding the code size impact.)

Have a look at the AMD and iNTEL optimisation guides regarding write
combining (WC). Both tell you to prefer full 64 byte writes when you
want to gain immense improvements.
The WC logic is triggered by consecutive writes to ascending memory
locations. When you force a WC cycle via

        .p2align 5,,31
0:      subq    $0xF8, %rsp
        movq    %r15, 0x88(%rsp)
        movq    %r14, 0x90(%rsp)
        movq    %r13, 0x98(%rsp)
        movq    %r12, 0xA0(%rsp)
        movq    %r11, 0xA8(%rsp)
        movq    %r10, 0xB0(%rsp)
        movq    %rbp, 0xB8(%rsp)

it is much faster than

        .p2align 5,,31
0:      movq    %rsp, %rbp
        push    %r15
        push    %r14
        push    %r13
        push    %r12
        push    %r11
        push    %r10
        push    %rbx            # replaces RBP above
        subq    $0xC0, %rsp

because the PUSHes work downwards, preventing the internal logic from
switching to a WC sequence. If properly coded, the PUSHless version
combines those seven register saves into one write sequence, updating
the corresponding cache line in one gulp. The last few processor
generations all have optimised prefetch mechanisms, so it is no
problem to feed them with lengthy instructions, as long as the code
to be executed internally can be reduced to simple execution blocks.
A MOV is always a basic operation, while PUSH and POP are complex
operations (move register or data and update rSP before / after
moving).

> In the meantime you have to ask yourself:

Do I? Sorry, but I hate psycho-games... ;) Your sentence implies "You
are wrong and I am right!". If you mean that, just write it down as
clear text instead of beating around the bush.

> Did you measure these speedups in smaller micro benchmarks, or as
> part of a substantial code base? The reason I'm asking is because
> time and time again it turns out that smaller code is faster code!
My test suite for OS/2 (until 2009) was

        .align  2,0x90
_test:  subl    $0x80, %esp
        nop
        nop
        movl    %edx, 0x68(%esp)
        movl    %ecx, 0x6C(%esp)
        movl    %ebx, 0x70(%esp)
        movl    %edi, 0x74(%esp)
        movl    %esi, 0x78(%esp)
        movl    %ebp, 0x7C(%esp)
        ...
        do something to let the load/store unit retire
        ...
        movl    0x68(%esp), %edx
        movl    0x6C(%esp), %ecx
        movl    0x70(%esp), %ebx
        movl    0x74(%esp), %edi
        movl    0x78(%esp), %esi
        movl    0x7C(%esp), %ebp
        addl    $0x80, %esp
        ret

versus

        .align  2,0x90
_test:  movl    %esp, %ebp
        push    %edx
        push    %ecx
        push    %ebx
        push    %edi
        push    %esi
        push    %ebp
        ...
        do something to let the load/store unit retire
        ...
        pop     %edx
        pop     %ecx
        pop     %ebx
        pop     %edi
        pop     %esi
        pop     %ebp
        leave
        ret

tested on Athlon and Phenom, and for 64 bit Windows

        .p2align 5,,31
_test:  subq    $0xF8, %rsp
        movq    %r15, 0xA0(%rsp)
        movq    %r14, 0xA8(%rsp)
        movq    %r13, 0xB0(%rsp)
        movq    %r12, 0xB8(%rsp)
        movq    %r11, 0xC0(%rsp)
        movq    %r10, 0xC8(%rsp)
        movq    %rbx, 0xD0(%rsp)
        movq    %r9, 0xD8(%rsp)
        movq    %r8, 0xE0(%rsp)
        movq    %rdx, 0xE8(%rsp)
        movq    %rcx, 0xF0(%rsp)
        ...
        do something to let the load/store unit retire
        ...
        movq    0xA0(%rsp), %r15
        movq    0xA8(%rsp), %r14
        movq    0xB0(%rsp), %r13
        movq    0xB8(%rsp), %r12
        movq    0xC0(%rsp), %r11
        movq    0xC8(%rsp), %r10
        movq    0xD0(%rsp), %rbx
        movq    0xD8(%rsp), %r9
        movq    0xE0(%rsp), %r8
        movq    0xE8(%rsp), %rdx
        movq    0xF0(%rsp), %rcx
        addq    $0xF8, %rsp
        ret

versus

        .p2align 5,,31
_test:  movq    %rsp, %rbp
        push    %r15
        push    %r14
        push    %r13
        push    %r12
        push    %r11
        push    %r10
        push    %rbx
        push    %r9
        push    %r8
        push    %rdx
        push    %rcx
        subq    $0xC0, %rsp
        ...
        do something to let the load/store unit retire
        ...
        pop     %r15
        pop     %r14
        pop     %r13
        pop     %r12
        pop     %r11
        pop     %r10
        pop     %rbx
        pop     %r9
        pop     %r8
        pop     %rdx
        pop     %rcx
        leave
        ret

tested on Phenom, Bulldozer and Ryzen.

It turned out that the two NOPs at the beginning (initially intended
as filler for the remaining execution pipes) were superfluous. Both
(32 and 64 bit) tests show advantages for the PUSHless versions. The
gain for the 32 bit version is larger, so I guess that iNTEL and AMD
'tuned' their schedulers to detect PUSH/POP sequences and apply some
exceptional handling.

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)    |
(c) 1994, bbs@darkrealms.ca