From: NeedNotReplyHere@nospicedham.xrsevnneqk.cem   
      
   On Sat, 24 Jun 2017 08:03:26 -0700 (PDT)   
   "Rick C. Hodgin" wrote:   
      
   > Can anybody help me optimize this code?   
   >   
   > [link]   
   >   
   > It doesn't use the stack for parameters, but two global variables   
   > called leftptr and righptr which hold the input values. It does   
   > return in the eax register, so it can be used normally that way   
   > for assignment.   
   >   
   > It's designed to be used as a custom assembly algorithm for a   
   > stricmp() algorithm which follows this general pattern, which   
   > is designed to be the target of a qsort() callback:   
   >   
   > int caseless_compare(const void *p1, const void *p2)   
   > {   
   > int d;   
   > const unsigned char *s1 = p1;   
   > const unsigned char *s2 = p2;   
   >   
   > while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)   
   > s1++, s2++;   
   >   
   > return d;   
   > }   
   >   
   > Note that this algorithm returns negative, 0, or positive values,   
   > which is different than other strcmp() algorithms which return -1,   
   > 0, 1, but it can still work generically in this case because it's   
   > the return result to qsort(), which only checks negative, positive,   
   > or equal.   
   >   
   > I tried variations of the branch direction in my asm code, and this   
   > one I posted was the best of those I tried.   
   >   
   > There's room for improvement, but I don't know what to look for.   
   > Right now it is faster than Visual Studio 2015 can generate in   
   > 32-bit code for that caseless_compare() block above. It scores   
   > a value of 131 in my testing, and the assembly version I created   
   > scores 107, with the time being in milliseconds on this test.   
   >   
   > Any help is appreciated. Thank you in advance.   
      
   Rick,   
      
   I'm wondering what happens if you break up the comparison into two   
   parts. E.g., do a basic character comparison without translation to   
   lowercase. If that fails, then do a comparison of the characters after   
   translation to lowercase.   
      
   I'm guessing that breaking the comparison into two would actually   
   improve the overall speed of comparisons overall, as large portions of   
   normal text strings will be lower-cased by default, e.g., this sentence   
   only has one uppercase character. A list of names or lists of   
   dictionary words will only have the first letter upper-cased also.   
      
   In other words, Terje/Ben algorithm has a fixed instruction cost per   
   comparison which includes lower-casing every character in the string   
   when, most likely, many of the characters will not need to be   
   lower-cased for a proper comparison. I.e., a variable rate of   
   comparison might be quicker if the text is mostly lowercase.   
      
   E.g., you might want something more like this (minimally tested):   
      
   int rp_stricmp(void *p0, void *p1)   
   {   
    int a,b;   
    char *x,*y;   
      
    a=!0;   
    x=p0;   
    y=p1;   
      
    while(1)   
    {   
    if(!a)   
    break;   
    a=*x;   
    b=*y;   
    x++;   
    y++;   
    if(a==b)   
    continue;   
    if(lower[a]==lower[b])   
    continue;   
    break;   
    }   
    return(a-b);   
   }   
      
   With -O2, 32-bit GCC (DJGPP for DOS) generates:   
      
   .globl _rp_stricmp   
   _rp_stricmp:   
    pushl %ebp   
    movl $1, %ecx   
    movl %esp, %ebp   
    pushl %esi   
    pushl %ebx   
    movl 12(%ebp), %edx   
    movl 8(%ebp), %ebx   
   L10:   
    testl %ecx, %ecx   
    je L3   
    movsbl (%ebx),%ecx   
    movsbl (%edx),%esi   
    incl %ebx   
    incl %edx   
    cmpl %esi, %ecx   
    je L10   
    movb _lower(%esi), %al   
    cmpb %al, _lower(%ecx)   
    je L10   
   L3:   
    subl %esi, %ecx   
    popl %ebx   
    movl %ecx, %eax   
    popl %esi   
    popl %ebp   
    ret   
      
   With -O2, 64-bit GCC (for Linux) generates:   
      
    .globl rp_stricmp   
    .type rp_stricmp, @function   
   rp_stricmp:   
   .LFB15:   
    .cfi_startproc   
    pushq %rbx   
    .cfi_def_cfa_offset 16   
    .cfi_offset 3, -16   
   .L2:   
    movsbl (%rdi), %eax   
    movsbl (%rsi), %edx   
    addq $1, %rdi   
    addq $1, %rsi   
    cmpl %edx, %eax   
    je .L4   
    movslq %edx, %rcx   
    movslq %eax, %r8   
    movzbl lower(%rcx), %ebx   
    cmpb %bl, lower(%r8)   
    je .L4   
   .L3:   
    subl %edx, %eax   
    popq %rbx   
    .cfi_remember_state   
    .cfi_def_cfa_offset 8   
    ret   
    .p2align 4,,10   
    .p2align 3   
   .L4:   
    .cfi_restore_state   
    testl %eax, %eax   
    jne .L2   
    .p2align 4,,7   
    jmp .L3   
    .cfi_endproc   
      
   I'm not real sure why 64-bit GCC is dropping in a few extra resizing   
   instructions in the main part of the loop. It appears to me that GCC   
   is picking incorrect register sizes ... Adjusting the declared types   
   didn't seem to help any, e.g., int to char, etc.   
      
      
   Rod Pemberton   
   --   
   The entire idea that the U.S. government obtained information from deep   
   sources in the Russian government that Putin ordered a cyber campaign   
   to disrupt our democracy is itself dezinformatsiya. Wake up sheeple.   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|