... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 2,716 of 4,675
Rick C. Hodgin to Rod Pemberton
Re: Optimize stricmp() algorithm (casele
26 Jun 17 01:45:25
   From: rick.c.hodgin@nospicedham.gmail.com   
      
   On Sunday, June 25, 2017 at 4:05:54 AM UTC-4, Rod Pemberton wrote:   
   > On Sat, 24 Jun 2017 08:03:26 -0700 (PDT)   
   > "Rick C. Hodgin"  wrote:   
   >   
   > > Can anybody help me optimize this code?   
   > >   
   > >     [link]   
   > >   
   > > It doesn't use the stack for parameters, but two global variables   
   > > called leftptr and righptr which hold the input values.  It does   
   > > return in the eax register, so it can be used normally that way   
   > > for assignment.   
   > >   
   > > It's designed to be used as a custom assembly algorithm for a   
   > > stricmp() algorithm which follows this general pattern, which   
   > > is designed to be the target of a qsort() callback:   
   > >   
   > >     int caseless_compare(const void *p1, const void *p2)   
   > >     {   
   > >         int d;   
   > >         const unsigned char *s1 = p1;   
   > >         const unsigned char *s2 = p2;   
   > >   
   > >         while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)   
   > >             s1++, s2++;   
   > >   
   > >         return d;   
   > >     }   
   > >   
   > > Note that this algorithm returns negative, 0, or positive values,   
   > > which is different than other strcmp() algorithms which return -1,   
   > > 0, 1, but it can still work generically in this case because it's   
   > > the return result to qsort(), which only checks negative, positive,   
   > > or equal.   
   > >   
   > > I tried variations of the branch direction in my asm code, and this   
   > > one I posted was the best of those I tried.   
   > >   
   > > There's room for improvement, but I don't know what to look for.   
   > > Right now it is faster than Visual Studio 2015 can generate in   
   > > 32-bit code for that caseless_compare() block above.  It scores   
   > > a value of 131 in my testing, and the assembly version I created   
   > > scores 107, with the time being in milliseconds on this test.   
   > >   
   > > Any help is appreciated.  Thank you in advance.   
   >   
   > Rick,   
   >   
   > I'm wondering what happens if you break up the comparison into two   
   > parts.  E.g., do a basic character comparison without translation to   
   > lowercase.  If that fails, then do a comparison of the characters after   
   > translation to lowercase.   
      
   Good idea.  I'll implement that and do some testing.   
      
   > I'm guessing that breaking the comparison into two would actually   
   > improve the overall speed of comparisons overall, as large portions of   
   > normal text strings will be lower-cased by default, e.g., this sentence   
   > only has one uppercase character.  A list of names or lists of   
   > dictionary words will only have the first letter upper-cased also.   
   >   
   > In other words, Terje/Ben algorithm has a fixed instruction cost per   
   > comparison which includes lower-casing every character in the string   
   > when, most likely, many of the characters will not need to be   
   > lower-cased for a proper comparison.  I.e., a variable rate of   
   > comparison might be quicker if the text is mostly lowercase.   
   >   
   > E.g., you might want something more like this (minimally tested):   
   >   
   > int rp_stricmp(void *p0, void *p1)   
   > {   
   >   int a,b;   
   >   char *x,*y;   
   >   
   >   a=!0;   
   >   x=p0;   
   >   y=p1;   
   >   
   >   while(1)   
   >   {   
   >     if(!a)   
   >       break;   
   >     a=*x;   
   >     b=*y;   
   >     x++;   
   >     y++;   
   >     if(a==b)   
   >       continue;   
   >     if(lower[a]==lower[b])   
   >       continue;   
   >     break;   
   >   }   
   >   return(a-b);   
   > }   
   >   
   > With -O2, 32-bit GCC (DJGPP for DOS) generates:   
   >   
   > .globl _rp_stricmp   
   > _rp_stricmp:   
   > 	pushl	%ebp   
   > 	movl	$1, %ecx   
   > 	movl	%esp, %ebp   
   > 	pushl	%esi   
   > 	pushl	%ebx   
   > 	movl	12(%ebp), %edx   
   > 	movl	8(%ebp), %ebx   
   > L10:   
   > 	testl	%ecx, %ecx   
   > 	je	L3   
   > 	movsbl	(%ebx),%ecx   
   > 	movsbl	(%edx),%esi   
   > 	incl	%ebx   
   > 	incl	%edx   
   > 	cmpl	%esi, %ecx   
   > 	je	L10   
   > 	movb	_lower(%esi), %al   
   > 	cmpb	%al, _lower(%ecx)   
   > 	je	L10   
   > L3:   
   > 	subl	%esi, %ecx   
   > 	popl	%ebx   
   > 	movl	%ecx, %eax   
   > 	popl	%esi   
   > 	popl	%ebp   
   > 	ret   
   >   
   > With -O2, 64-bit GCC (for Linux) generates:   
   >   
   > 	.globl	rp_stricmp   
   > 	.type	rp_stricmp, @function   
   > rp_stricmp:   
   > .LFB15:   
   > 	.cfi_startproc   
   > 	pushq	%rbx   
   > 	.cfi_def_cfa_offset 16   
   > 	.cfi_offset 3, -16   
   > .L2:   
   > 	movsbl	(%rdi), %eax   
   > 	movsbl	(%rsi), %edx   
   > 	addq	$1, %rdi   
   > 	addq	$1, %rsi   
   > 	cmpl	%edx, %eax   
   > 	je	.L4   
   > 	movslq	%edx, %rcx   
   > 	movslq	%eax, %r8   
   > 	movzbl	lower(%rcx), %ebx   
   > 	cmpb	%bl, lower(%r8)   
   > 	je	.L4   
   > .L3:   
   > 	subl	%edx, %eax   
   > 	popq	%rbx   
   > 	.cfi_remember_state   
   > 	.cfi_def_cfa_offset 8   
   > 	ret   
   > 	.p2align 4,,10   
   > 	.p2align 3   
   > .L4:   
   > 	.cfi_restore_state   
   > 	testl	%eax, %eax   
   > 	jne	.L2   
   > 	.p2align 4,,7   
   > 	jmp	.L3   
   > 	.cfi_endproc   
      
   Oh my ... AT&T syntax.  Rod, you'll have to use the -masm=intel if you   
   want me to try to read it. :-)  I have dyslexia and have a hard enough   
   time reading without all the redundant text everywhere.  It is actually   
   very very difficult for me to read AT&T syntax.   
      
   > I'm not real sure why 64-bit GCC is dropping in a few extra resizing   
   > instructions in the main part of the loop.  It appears to me that GCC   
   > is picking incorrect register sizes ...  Adjusting the declared types   
   > didn't seem to help any, e.g., int to char, etc.   
      
   I was surprised I was able to write an algorithm which beat modern C/C++   
   compilers.  I would've expected I could not.  And to be honest, apart   
   from some basic examination of the effect of JMPing on this or !this   
   conditions, I didn't do much to try and create an optimized algorithm,   
   but just to do it simply step-by-step in minimal assembly instructions   
   while maintaining a logical and natural flow through what I was seeing   
   in my mind as the standard C/C++ algorithm.   
      
   I had hoped Terje would jump up on the code and find some LEAs or   
   whatever to make it go much faster. :-)   
      
   Thank you,   
   Rick C. Hodgin   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]