From: terje.mathisen@nospicedham.tmsw.no   
      
   Rod, this is almost certainly faster in real-life situations, very nice   
   catch. :-)   
      
   For unicode it is even nicer because the overhead of monocasing is so   
   much larger, and the extra overhead of doing an extra compare of the raw   
   values is close to zero (maximum is the cost of the branch instruction   
   itself).   
      
   I have also considered a sort-of-hybrid approach where you write a   
   flipcase() function, i.e. given a char it returns the opposite case if   
   that is defined:   
      
   int caseless_compare(const void *p1, const void *p2)   
   {   
    wchar a, b;   
    const unsigned wchar *s1 = p1;   
    const unsigned wchar *s2 = p2;   
      
    do {   
    while ((a = *s1++) == (b = *s2++)) {   
    if (!a) return 0;   
    }   
    } while (flipcase(a) == b);   
      
   // I have to monocase before I can know which wchar is larger!   
    return wtolower(a) - wtolower(b);   
   }   
      
   The savings here is that I do just one flipcase() call for each   
   non-matching position, but then I have to do two wide wtolower() calls   
   to determine the actual order.   
      
   I also considered using the result of flipcase(): It is only when that   
   is different from the input that we would need to monocase the other   
   char, right?   
      
   Except one crucial detail here:   
      
   How is stricmp() defined?   
      
   Does it always force uppercase/force lowercase before comparing?   
      
   The problem occurs because there are ascii characters which are located   
   between 'Z' and 'a', right?   
      
   These will be sorted either before or after an alphabetic char depending   
   on the case used for the compare, all other chars gives the same result.   
   ...   
   A bit of googling seems to indicate that the function is defined to   
   lowercase all chars _before_ the compare, so the OP had it right, and   
   the fastest sort will almost certanly come from a separate lowercased   
   set of strings!   
      
   Terje   
      
   Rod Pemberton wrote:   
   > On Sat, 24 Jun 2017 08:03:26 -0700 (PDT)   
   > "Rick C. Hodgin" wrote:   
   >   
   >> Can anybody help me optimize this code?   
   >>   
   >> [link]   
   >>   
   >> It doesn't use the stack for parameters, but two global variables   
   >> called leftptr and righptr which hold the input values. It does   
   >> return in the eax register, so it can be used normally that way   
   >> for assignment.   
   >>   
   >> It's designed to be used as a custom assembly algorithm for a   
   >> stricmp() algorithm which follows this general pattern, which   
   >> is designed to be the target of a qsort() callback:   
   >>   
   >> int caseless_compare(const void *p1, const void *p2)   
   >> {   
   >> int d;   
   >> const unsigned char *s1 = p1;   
   >> const unsigned char *s2 = p2;   
   >>   
   >> while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)   
   >> s1++, s2++;   
   >>   
   >> return d;   
   >> }   
   >>   
   >> Note that this algorithm returns negative, 0, or positive values,   
   >> which is different than other strcmp() algorithms which return -1,   
   >> 0, 1, but it can still work generically in this case because it's   
   >> the return result to qsort(), which only checks negative, positive,   
   >> or equal.   
   >>   
   >> I tried variations of the branch direction in my asm code, and this   
   >> one I posted was the best of those I tried.   
   >>   
   >> There's room for improvement, but I don't know what to look for.   
   >> Right now it is faster than Visual Studio 2015 can generate in   
   >> 32-bit code for that caseless_compare() block above. It scores   
   >> a value of 131 in my testing, and the assembly version I created   
   >> scores 107, with the time being in milliseconds on this test.   
   >>   
   >> Any help is appreciated. Thank you in advance.   
   >   
   > Rick,   
   >   
   > I'm wondering what happens if you break up the comparison into two   
   > parts. E.g., do a basic character comparison without translation to   
   > lowercase. If that fails, then do a comparison of the characters after   
   > translation to lowercase.   
   >   
   > I'm guessing that breaking the comparison into two would actually   
   > improve the overall speed of comparisons overall, as large portions of   
   > normal text strings will be lower-cased by default, e.g., this sentence   
   > only has one uppercase character. A list of names or lists of   
   > dictionary words will only have the first letter upper-cased also.   
   >   
   > In other words, Terje/Ben algorithm has a fixed instruction cost per   
   > comparison which includes lower-casing every character in the string   
   > when, most likely, many of the characters will not need to be   
   > lower-cased for a proper comparison. I.e., a variable rate of   
   > comparison might be quicker if the text is mostly lowercase.   
   >   
   > E.g., you might want something more like this (minimally tested):   
   >   
   > int rp_stricmp(void *p0, void *p1)   
   > {   
   > int a,b;   
   > char *x,*y;   
   >   
   > a=!0;   
   > x=p0;   
   > y=p1;   
   >   
   > while(1)   
   > {   
   > if(!a)   
   > break;   
   > a=*x;   
   > b=*y;   
   > x++;   
   > y++;   
   > if(a==b)   
   > continue;   
   > if(lower[a]==lower[b])   
   > continue;   
   > break;   
   > }   
   > return(a-b);   
   > }   
   >   
   > With -O2, 32-bit GCC (DJGPP for DOS) generates:   
   >   
   > .globl _rp_stricmp   
   > _rp_stricmp:   
   > pushl %ebp   
   > movl $1, %ecx   
   > movl %esp, %ebp   
   > pushl %esi   
   > pushl %ebx   
   > movl 12(%ebp), %edx   
   > movl 8(%ebp), %ebx   
   > L10:   
   > testl %ecx, %ecx   
   > je L3   
   > movsbl (%ebx),%ecx   
   > movsbl (%edx),%esi   
   > incl %ebx   
   > incl %edx   
   > cmpl %esi, %ecx   
   > je L10   
   > movb _lower(%esi), %al   
   > cmpb %al, _lower(%ecx)   
   > je L10   
   > L3:   
   > subl %esi, %ecx   
   > popl %ebx   
   > movl %ecx, %eax   
   > popl %esi   
   > popl %ebp   
   > ret   
   >   
   > With -O2, 64-bit GCC (for Linux) generates:   
   >   
   > .globl rp_stricmp   
   > .type rp_stricmp, @function   
   > rp_stricmp:   
   > .LFB15:   
   > .cfi_startproc   
   > pushq %rbx   
   > .cfi_def_cfa_offset 16   
   > .cfi_offset 3, -16   
   > .L2:   
   > movsbl (%rdi), %eax   
   > movsbl (%rsi), %edx   
   > addq $1, %rdi   
   > addq $1, %rsi   
   > cmpl %edx, %eax   
   > je .L4   
   > movslq %edx, %rcx   
   > movslq %eax, %r8   
   > movzbl lower(%rcx), %ebx   
   > cmpb %bl, lower(%r8)   
   > je .L4   
   > .L3:   
   > subl %edx, %eax   
   > popq %rbx   
   > .cfi_remember_state   
   > .cfi_def_cfa_offset 8   
   > ret   
   > .p2align 4,,10   
   > .p2align 3   
   > .L4:   
   > .cfi_restore_state   
   > testl %eax, %eax   
   > jne .L2   
   > .p2align 4,,7   
   > jmp .L3   
   > .cfi_endproc   
   >   
   > I'm not real sure why 64-bit GCC is dropping in a few extra resizing   
   > instructions in the main part of the loop. It appears to me that GCC   
   > is picking incorrect register sizes ... Adjusting the declared types   
   > didn't seem to help any, e.g., int to char, etc.   
   >   
   >   
   > Rod Pemberton   
   >   
      
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
|