... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.lang.asm.x86
Ahh, the lost art of x86 assembly
4,675 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 2,747 of 4,675
Rod Pemberton to James Harris
Re: Optimize stricmp() algorithm (casele
27 Jun 17 23:31:03
   From: NeedNotReplyHere@nospicedham.xrsevnneqk.cem   
      
   On Wed, 28 Jun 2017 01:42:31 +0100   
   James Harris  wrote:   
      
   > On 27/06/2017 22:48, Rod Pemberton wrote:   
   > > On Tue, 27 Jun 2017 09:51:06 +0100   
   > > James Harris  wrote:   
      
   > > I kept the temporaries (local   
   > > variables) to encourage the compiler to place them in registers.   
   >   
   > Looking at the generated code, below, it picks up ESI and EDI from   
   > the stack - i.e. they are p0 and p1. It keeps them unchanged and it   
   > uses them in the body of the code. x and y would, therefore, seem to   
   > be redundant. (Assuming your p0 and p1 could be declared as char *).   
      
   Maybe.  I've not confirmed for this code whether that's needed or not.   
      
   I've generally found with GCC that using auto's "encourages" GCC to   
   keep values in registers, instead of reloading them from the stack or   
   memory.  Sometimes this is optimized away, i.e., unnecessary as you've   
   pointed out may be the situation.  But, it ensures that GCC doesn't   
   generate bad code by default, as is normal.  E.g., GCC is exceptionally   
   bad at selecting when to use 8-bit registers.  It can take plenty of   
   declaration type juggling to fix this.   
      
   > > Is this faster?  I can't tell from comparing the code.   
   >   
   > You know, it's funny how much I've forgotten from specific   
   > optimisation guides. I used to know what the Intel Atom did   
   > differently from others in this regard, for example (there is a   
   > specific difference), but I don't now. But however vague I am on   
   > those specifics I'm pretty sure that address formation has been in   
   > earlier pipeline stages for many generations, and that it has its own   
   > adders. What was the target CPU for your compilation? Your compiler   
   > evidently thought it was a good idea to use [X+Y] in the movsx   
   > instruction for that CPU and above.   
      
   I overlooked this earlier, but the movsx's, by adding the offsets, both   
   *access* the same register, i.e., +rdx (64-bit) or +ebx (32-bit).  I'm   
   not sure if that creates a register dependency or not, as the register   
   being accessed is not modified.  I would assume that it wouldn't, but   
   it might.   
      
   Either way, due to the pairing, pipelining, loading time, decoding time,   
   etc, I'm not sure which is faster without testing.   
      
   > > int rp_xor_stricmp(void *p0, void *p1)   
   > > {   
   > >    int t;   
   > >    char *x,*y;   
   > >   
   > >    x=p0;   
   > >    y=p1;   
   >   
   > Could that section be replaced with this?   
   >   
   >    int rp_xor_stricmp(char *x, char *y)   
   >    {   
   >      char t;   
      
   You can replace lots of things with lots of things, e.g., you could use   
   a for() loop instead of while(1).  You could use *x++ or *sp++ as   
   Rick or Terje do.  (Albeit, this is not recommended due to   
   sequence-point issues.  The post- and pre- increment/decrement   
   operators are always supposed to be alone on a line for ANSI C.)   
      
   Are you asking if it makes a difference in the emitted code for 32-bit   
   or 64-bit? ...  With -O2, 64-bit generates the same code.  With -O0,   
   64-bit routines are similar, except stack offsets and some setup lines.   
   With -O2, 32-bit generates the same code.  With -0O, 32-bit routines   
   are similar, except stack offsets, some setup lines, and an 'lea'   
   instruction which has been relocated.   
      
   So, yes, you can replace it, at least for the GCC compiler, without   
   issue.   
      
   > >    while(1)   
   > >    {   
   > >      if(!((*x)&(*y)))   
   >   
   > Do you need to test both chars?   
      
   How do you stop when either string reaches the End-Of-String (EOS) or   
   nul terminator '\0' without checking both strings for EOS or nul   
   terminator '\0'?   
      
   It's not valid in C to read past the string's EOS or nul terminator.   
   Perhaps, doing so is fine for assembly, if the buffer is large enough   
   to not access other important data.   
      
   FYI, the earlier routine actually checks  *both*  strings for the EOS   
   or nul terminator, with the "if(!a)" statement at the top of the loop,   
   but only after "a==b" on a string match "continues" to the top of the   
   loop.  So, a and b are both '\0' for a match when they reach "if(!a)".   
   I.e., only one needs to be checked.  When a or b is '\0' for a   
   non-match, the routine will hit the "break;" at the end of the loop,   
   and never loop to reach "if(!a)" at the top of the loop.   
      
   In this routine, t==0 on any character match, due to the XOR, including   
   '\0'=='\0' for the EOS or nul terminator.  So, you can't use t's value   
   to match '\0'=='\0', unless you read  *past*  the EOS or nul terminator   
   of at least one of the strings.  This wouldn't be valid for C code,   
   i.e., undefined behavior.   
      
   Do you see a way to only check one string for the EOS without   
   potentially reading past the end of the other string?  If so, I'd be   
   interested in seeing that solution.   
      
   > >        break;   
   > >      t=(*x)^(*y);   
   > >      x++;   
   > >      y++;   
   > >      if(!t)   
   > >        continue;   
   > >      if(t==0x20)   
   > >        continue;   
   >   
   > t being 0x20 means that they /might/ match. You would still need to   
   > use lower() (or lower[]) to check.   
      
   t being 0x20 means means a lowercase alphabetic matches an uppercase   
   alphabetic character, or vice-versa, at least for the ASCII character   
   set.  Alphabetic characters are all 0x20 (32) apart in ASCII.   
      
   A-Z is 0x41 (65) to 0x5A (90).   
   a-z is 0x61 (97) to 0x7A (122).   
   0x41 | 0x20 = 0x61.   
   0x5A | 0x20 = 0x7A.   
      
   t being 0x20 wouldn't be a match when dealing with graphics characters,   
   i.e., non-alphanumeric.  In that case, lower() won't fix the issue   
   either.  lower() only works on alphabetic characters.  You'd need to add   
   a range check or an isgraph() call etc.  IIRC, the results of isgraph()   
   is not consistent or accurate across various C compiler platforms for   
   ASCII.  I.e., you might consider constructing your own from other   
   ctype.h functions.   
      
   I.e., while unconfirmed by testing, the routine above should work for   
   alphabetic characters and digits, if I didn't make any mistakes.  Then   
   again, maybe it doesn't.  As always, it's up you to test it, if you use   
   it.  Generally, I'd run the entire ASCII set through such a routine to   
   see what it emits with a judiciously placed printf().  (Did I get word   
   of the day again?)   
      
   > >      break;   
   > >    }   
   > >    x--;   
   > >    y--;   
   > >    return((*x)-(*y));   
   > > }   
   >   
   > With array notation you would be able to finish with   
      
   s/array notation/use of the subscript operator/   
      
   (Remember, despite what those self-proclaimed "C experts" over on c.l.c.   
   say, there are  *NO*  arrays in C.  C only has an array declaration that   
   allocates space for storage and declares a pointer to it.  All use of   
   "arrays" in the C language are actually use of the "subscript   
   operator" [] .  The subscript operator works with any pointer to a   
   valid type, i.e., non-void, and an offset.  Hence, no arrays in C.)   
      
   >    return x[i] - y[i];   
   >   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]