
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.lang.c++.moderated      Moderated discussion of C++ superhackery      33,346 messages   


   Message 32,156 of 33,346   
   Ulrich Eckhardt to All   
   Re: Find/replace function   
   18 Apr 12 05:02:00   
   
   0034716b   
   From: ulrich.eckhardt@dominolaser.com   
      
   Am 18.04.2012 02:05, schrieb Jimmy H.:   
   > string replace(const string &input,   
   >                const string &to_find,   
   >                const string &replacement) {   
   >     auto end = input.end();   
   >     string result;   
   >     for(auto it = input.begin(); it < end; it++) {   
   >         if(*it == to_find[0] && equal(to_find.begin(),   
   >                                       to_find.end(),   
   >                                       it)) {   
   >             result += replacement;   
   >             it += to_find.length() - 1;   
   >         } else {   
   >             result += *it;   
   >         }   
   >     }   
   >     return result;   
   > }   
   [...]   
   > Three questions:   
   > (a) Are there any glaring (or even not glaring) inefficiencies   
      
   There are a few:   
   1. You repeatedly re-allocate the storage for the result string while   
   concatenating. I would simply assume that the output size is roughly the   
   input size and preallocate the same amount of storage. First scanning   
   for to_find and from that determining the exact size is also something   
   worth testing, but scanning the sequence twice could eat up all the   
   benefits of not reallocating the storage.   
   2. You are basically rolling your own strstr() function, without reaping   
   the benefits of optimized implementations. For that reason, I wouldn't   
   use iterators but the index-based string functions. With that, just use   
   string::find() to locate the string. Then, append the whole piece   
   between the current position and where the string is found in one go,   
   instead of adding each character separately. Make sure you have tests in   
   place for this; it's easy to miss corner cases.   
   3. You return by value. Most compilers will get this right and optimize   
   the copy away (return value optimization).   
   4. I don't understand why you are skipping "to_find.length() - 1" steps   
   forward (presumably the loop's it++ supplies the missing step, which is   
   easy to overlook). The sequence [it, it+length) is exactly equal to   
   to_find, so the next possible occurrence is at it+length. Add a test   
   that replaces the string "fouf" with "barb" inside "foufouf". You must   
   get "barbouf" as the result!   
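   Taken together, these points suggest an index-based version along these
   lines (a sketch only, not benchmarked; the name replace_all is mine):

```cpp
#include <string>

// Sketch combining the points above: string::find() to locate matches,
// reserve() to preallocate, and whole-segment appends instead of
// per-character concatenation. The name replace_all is illustrative.
std::string replace_all(const std::string& input,
                        const std::string& to_find,
                        const std::string& replacement)
{
    if (to_find.empty())
        return input;                  // avoid an endless match at every position
    std::string result;
    result.reserve(input.size());      // point 1: preallocate roughly enough
    std::string::size_type pos = 0;
    for (std::string::size_type hit;
         (hit = input.find(to_find, pos)) != std::string::npos;
         pos = hit + to_find.size()) {           // point 4: next match starts at it+length
        result.append(input, pos, hit - pos);    // point 2: copy the unmatched piece in one go
        result += replacement;
    }
    result.append(input, pos, std::string::npos);  // trailing remainder
    return result;
}
```

   If the exact-count variant from point 1 turns out to win in a benchmark,
   the reserve() call is the only line that would change.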
      
   Note: Since you are after performance, you need benchmarks; there is no   
   way around this.   
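   For what it's worth, a minimal timing harness could look like this
   (std::chrono based; the run count and the function under test are
   placeholders of mine, not anything from your code):

```cpp
#include <chrono>
#include <string>

// Hypothetical harness: times one replace implementation over repeated
// runs and returns the average nanoseconds per call. 'replace' stands for
// whichever candidate function is under test.
template <typename Fn>
long long time_ns(Fn replace, const std::string& input,
                  const std::string& to_find, const std::string& replacement,
                  int runs = 100)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        // keep the result observable so the call isn't optimized away
        volatile std::size_t sink = replace(input, to_find, replacement).size();
        (void)sink;
    }
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               clock::now() - start).count() / runs;
}
```

   Feed it realistic inputs; micro-benchmarks on tiny strings mostly
   measure allocator noise.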
      
      
   > points of bad style in this implementation?   
      
   1. Even for strings, I would use size() instead of length().   
   2. Use prefix increment (++it) instead of postfix increment (it++).   
   3. Use equality comparison (it != end) instead of less-than (it < end).   
      
   > Is there a standard library function, outside of C++11 regexes,   
   > that basically does this for me?   
      
   Not that I'm aware of.   
      
      
   > (b) Are there any easy (or even hard) ways to make this faster,   
   > without using the regular expression library?   
      
   I don't think that the regex library would necessarily make this faster,   
   although a regex library could be more optimized than a hand-rolled   
   solution.   
      
   A hard way to make this faster could be to optimize out unnecessary   
   copying, using e.g. a rope instead of a string, or a linked list of   
   COW-optimized strings. This isn't guaranteed though, and the additional   
   overhead could eat up the benefits.   
      
   Another way would be to skip the transformation until you really need   
   the data. If you consume the output string sequentially, processing only   
   one piece of the input sequence at a time would reduce the memory   
   overhead, which directly influences CPU cache use. I don't know much   
   about your use case, but I'm envisioning that you first read the   
   sequence from a file (or socket), then transform it, then write it to   
   another file. That means scanning the same data three times. Processing   
   one chunk of data before taking on the next would be better here. Still,   
   in that case your bottleneck is probably the IO anyway, which you could   
   address by parallelizing. I don't know what you are doing, so it's hard   
   to tell.   
      
      
   > Would a destructive   
   > version that modifies the input string be faster?   
      
   If you know that you are only shrinking the string (i.e. to_find is   
   longer than the replacement), then yes. The reason is memory locality,   
   which is better for the CPU caches.   
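   Such a destructive version for the shrinking (or equal-size) case could
   look like this: a write cursor trails a read cursor through the same
   buffer, so no second allocation is needed (the name is mine):

```cpp
#include <cassert>
#include <string>

// Sketch of an in-place variant for the shrinking case. Because the
// replacement is no longer than the pattern, the write cursor never
// overtakes the read cursor, so we only ever overwrite consumed bytes.
void replace_in_place(std::string& s,
                      const std::string& to_find,
                      const std::string& replacement)
{
    assert(!to_find.empty() && replacement.size() <= to_find.size());
    std::string::size_type read = 0, write = 0;
    while (read < s.size()) {
        if (s.compare(read, to_find.size(), to_find) == 0) {
            for (char c : replacement)
                s[write++] = c;          // emit the replacement in place
            read += to_find.size();      // skip over the matched pattern
        } else {
            s[write++] = s[read++];      // shift one unmatched character down
        }
    }
    s.resize(write);  // drop the now-unused tail
}
```

   For a growing replacement this scheme breaks down, since the write
   cursor would clobber bytes it hasn't read yet.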
      
      
   > (c) How would I do this using the new C++11 regex facility, and would   
   > that be faster?   
      
   Sorry, I have to pass on this one.   
      
      
   Good luck!   
      
   Uli   
      
      
   --   
         [ See http://www.gotw.ca/resources/clcm.htm for info about ]   
         [ comp.lang.c++.moderated.    First time posters: Do this! ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca