... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 2,711 of 4,675
Rod Pemberton to Rick C. Hodgin
Re: Optimize stricmp() algorithm (casele
25 Jun 17 03:53:28
   From: NeedNotReplyHere@nospicedham.xrsevnneqk.cem   
      
   On Sat, 24 Jun 2017 08:03:26 -0700 (PDT)   
   "Rick C. Hodgin"  wrote:   
      
   > Can anybody help me optimize this code?   
   >   
   >     [link]   
   >   
   > It doesn't use the stack for parameters, but two global variables   
   > called leftptr and righptr which hold the input values.  It does   
   > return in the eax register, so it can be used normally that way   
   > for assignment.   
   >   
   > It's designed to be used as a custom assembly algorithm for a   
   > stricmp() algorithm which follows this general pattern, which   
   > is designed to be the target of a qsort() callback:   
   >   
   >     int caseless_compare(const void *p1, const void *p2)   
   >     {   
   >         int d;   
   >         const unsigned char *s1 = p1;   
   >         const unsigned char *s2 = p2;   
   >   
   >         while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)   
   >             s1++, s2++;   
   >   
   >         return d;   
   >     }   
   >   
   > Note that this algorithm returns negative, 0, or positive values,   
   > which is different than other strcmp() algorithms which return -1,   
   > 0, 1, but it can still work generically in this case because it's   
   > the return result to qsort(), which only checks negative, positive,   
   > or equal.   
   >   
   > I tried variations of the branch direction in my asm code, and this   
   > one I posted was the best of those I tried.   
   >   
   > There's room for improvement, but I don't know what to look for.   
   > Right now it is faster than Visual Studio 2015 can generate in   
   > 32-bit code for that caseless_compare() block above.  It scores   
   > a value of 131 in my testing, and the assembly version I created   
   > scores 107, with the time being in milliseconds on this test.   
   >   
   > Any help is appreciated.  Thank you in advance.   
      
   Rick,   
      
   I'm wondering what happens if you break up the comparison into two   
   parts.  E.g., do a basic character comparison without translation to   
   lowercase.  If that fails, then do a comparison of the characters after   
   translation to lowercase.   
      
   I'm guessing that breaking the comparison into two would actually   
   improve the overall speed of comparisons overall, as large portions of   
   normal text strings will be lower-cased by default, e.g., this sentence   
   only has one uppercase character.  A list of names or lists of   
   dictionary words will only have the first letter upper-cased also.   
      
   In other words, Terje/Ben algorithm has a fixed instruction cost per   
   comparison which includes lower-casing every character in the string   
   when, most likely, many of the characters will not need to be   
   lower-cased for a proper comparison.  I.e., a variable rate of   
   comparison might be quicker if the text is mostly lowercase.   
      
   E.g., you might want something more like this (minimally tested):   
      
   int rp_stricmp(void *p0, void *p1)   
   {   
     int a,b;   
     char *x,*y;   
      
     a=!0;   
     x=p0;   
     y=p1;   
      
     while(1)   
     {   
       if(!a)   
         break;   
       a=*x;   
       b=*y;   
       x++;   
       y++;   
       if(a==b)   
         continue;   
       if(lower[a]==lower[b])   
         continue;   
       break;   
     }   
     return(a-b);   
   }   
      
   With -O2, 32-bit GCC (DJGPP for DOS) generates:   
      
   .globl _rp_stricmp   
   _rp_stricmp:   
   	pushl	%ebp   
   	movl	$1, %ecx   
   	movl	%esp, %ebp   
   	pushl	%esi   
   	pushl	%ebx   
   	movl	12(%ebp), %edx   
   	movl	8(%ebp), %ebx   
   L10:   
   	testl	%ecx, %ecx   
   	je	L3   
   	movsbl	(%ebx),%ecx   
   	movsbl	(%edx),%esi   
   	incl	%ebx   
   	incl	%edx   
   	cmpl	%esi, %ecx   
   	je	L10   
   	movb	_lower(%esi), %al   
   	cmpb	%al, _lower(%ecx)   
   	je	L10   
   L3:   
   	subl	%esi, %ecx   
   	popl	%ebx   
   	movl	%ecx, %eax   
   	popl	%esi   
   	popl	%ebp   
   	ret   
      
   With -O2, 64-bit GCC (for Linux) generates:   
      
   	.globl	rp_stricmp   
   	.type	rp_stricmp, @function   
   rp_stricmp:   
   .LFB15:   
   	.cfi_startproc   
   	pushq	%rbx   
   	.cfi_def_cfa_offset 16   
   	.cfi_offset 3, -16   
   .L2:   
   	movsbl	(%rdi), %eax   
   	movsbl	(%rsi), %edx   
   	addq	$1, %rdi   
   	addq	$1, %rsi   
   	cmpl	%edx, %eax   
   	je	.L4   
   	movslq	%edx, %rcx   
   	movslq	%eax, %r8   
   	movzbl	lower(%rcx), %ebx   
   	cmpb	%bl, lower(%r8)   
   	je	.L4   
   .L3:   
   	subl	%edx, %eax   
   	popq	%rbx   
   	.cfi_remember_state   
   	.cfi_def_cfa_offset 8   
   	ret   
   	.p2align 4,,10   
   	.p2align 3   
   .L4:   
   	.cfi_restore_state   
   	testl	%eax, %eax   
   	jne	.L2   
   	.p2align 4,,7   
   	jmp	.L3   
   	.cfi_endproc   
      
   I'm not real sure why 64-bit GCC is dropping in a few extra resizing   
   instructions in the main part of the loop.  It appears to me that GCC   
   is picking incorrect register sizes ...  Adjusting the declared types   
   didn't seem to help any, e.g., int to char, etc.   
      
      
   Rod Pemberton   
   --   
   The entire idea that the U.S. government obtained information from deep   
   sources in the Russian government that Putin ordered a cyber campaign   
   to disrupt our democracy is itself dezinformatsiya.  Wake up sheeple.   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]