Fast string search

This article provides an insight on performance over efficient algorithms meant to search strings within text (exact matching), and uses different flavor of languages including C, C++, C# and Java. The results are astonishing.

What for?

Why spend time on string search? Isn't this supposed to be provided by the core run-times? Is it worth any investment? To all those questions the answer is pretty clear : it depends on what you are trying to achieve. Clearly, if you only want to provide search capabilities in the text edit component of your app, then you might just as well be off with the existing API available for you. Regardless the language you are using, you will always find basic algorithms for just that, so why bother?

Bothering begins when you need high performing processes. It's utterly fine you can find content among your myriad of documents, but if this takes as much as 10 minutes before you get the results page, then you won't do it often. On the web, the success of google is just as much the relevant character of results as well as the response time. So yes, for some application it matters. If you are writing parsers, it matters. Almost 100% developers I can think of have to write parsers at one time or another in the development life cycle, so why not take the best algorithm for that matter, or why not rely on an efficient library?

The additional reason for that article is that I wanted to benchmark technologies as they are evolving. Java introduced a new full object orientation in the mid 90s, and now Microsoft is taking this and reintroducing it in the .NET development platform. So it matters a lot, imagine that Java or .NET equivalent of the former algorithms are 10 or 20 slower, doesn't it hurt the adoption of these technologies? Why should people, including developers and business people invest in technologies that only add bloat to problems trying to get solved? In other words, does Java and .NET really add up anything? Are they part of the problem, or part of the solution?

Introducing algorithms

Algorithms used are either standard, that is part of standard run-time libraries, or deliberately resulting from individual implementations. In fact, none of those individual implementations are optimized, the code was left as simple as it could be as a way to be able to assess the brute force comparison with highly optimized ones from standard libraries.

No less than 4 languages have been used. Although Java and .NET code internally use Unicode to store text, they arguably compare with the C/C++ single-byte algorithms I have used since both can look-up characters arbitrarily, unlike UTF8 for instance.

For the C language, I have used strstr from the MS C run-time. It's written with assembly, that is optimized code. In other words, it uses x86 registers instead of memory stacks or memory data segs any time it is possible. Although assembly code looks odd these days, no one (except teams working on games may be) would ever rely on that language anymore, it has produced tremendous results in the tests I have performed.

For the C++ language, I have used 3 algorithms. They are not optimized. The first is naive string search, it looks characters one after the other, until the whole text to search is all traversed. The second is based on Donald Knuth - Morris - Pratt work. The third is based on Boyer - Moore work. The second and third algorithms deserve explanations, that you'll find here (iteration by iteration tutorial) and here (explanations, time/memory cost, descriptive algorithm, C++ algorithm). In the remainder of the article, the second algorithm shall be referenced by KMP, while the third algorithm shall be by BM. In essence, KMP takes advantage of former lookups to skip characters, but characters in the text to perform the search are never skipped. BM pushes that beyond by skipping characters whenever possible, for instance any time a character within reach is not part of the pattern being searched for, then the whole range of characters is skipped. I have deliberately not used variants of those algorithms, although they provide improvements. C++ algorithms provide two interesting comparisons : are clever algorithms worth it? is the C++ compiler able to produce code equivalent to the highly optimized C routine?

In Java, I have used the default and standard way to look for a pattern within a text. That's implemented by the indexOf method in the System.String class (base class library). The interesting bits is that Sun actually provides the actual source code for the Java VM, so it's possible to know the actual code being executed, in addition to comparing results.

For .NET, I have used C# even though I could have taken any other .NET language. I rely on the String class from the base class library. I have used the IndexOf method, just as for Java. The actual code is not provided so it's only possible to compare resulting figures. But I went a bit beyond that by looking up the actual code from Rotor, the shared-source implementation of the .NET VM and (partial) BCL. I did two executions, one with a normal build, one with ngen'ed code (pregenerated native image).

The test suite

The test suite consists in choosing one algorithm and executing it a great amount of times, great enough so that the cost of initializing the variables is not significant. For instance, the time required to start the Java VM or the CLR VM is not taken into account (although it can account in the real world!). In the downloadable code, the test suite executes 1000 unit tests.

The unit test consists in finding all occurences of 5 patterns in a text. The text is chosen to be generic enough so not to influence one algorithm or another. It's 48kb, providing sufficient lookup iterations to allow algorithms to play at their best.

Patterns are chosen to allow further interesting comparisons. In terms of occurences, the patterns are as below :

pattern	occurences
my pattern	0
IPropertyBag	1
WebClient	6
developer	12
the	323

Astonishing results

The test suite was executed on a Celeron 800Mhz with 256MB of RAM. No other process was running at the same time. Of course, the test suite was executing a release build of the code, for all languages.

Results are below. The number of the duration for the test suite to perform 1000 times all unit tests.

algorithm	duration (millisecs)
C, strstr	101
C++, naive	310
C++, Knuth-Morris-Pratt	331
C++, Boyer-Moore	210
Java, string.indexOf	361
C#, String.IndexOf	2083
C# ngen'ed, String.IndexOf	1993

First impressions

Of course results may vary with your machine if you do the test, but what matters is the ratio between algorithms, not the actual absolute value.

For a simple and often used function call, the .NET code (whether being ngen'd or not) is 20 times slower than the most optimized (and most brutal) native code. So performance leaves a lot to desire.

Other conclusions are :

Java is much faster than .NET code
Productivity in .NET programming should not be obtained at the expense of performance.
No matter the algorithm, the raw and brutal assembly code always wins. The reason behind this is that it uses registers as much as it can instead of memory access.
The resulting duration is the total for all unit tests, which means that, for C++ algorithms, the duration does not necessarily reflect the actual performance of one algorithm in some particular cases. For instance, the Boyer-Moore algorithm which is often regarded as one of the most efficient for fast string search is certainly faster in some special cases.
Regarding the obvious false claims of Redmond, their C++ compiler is way behind. Algorithms like Knuth and Boyer stand way behind the simplest and crude assembly strstr implementation. It's a total shame. They stand behind being 2 or 3 times slower. And even though the most stupid algorithm (naive) gets executed at the same speed than those being much clever.
The C++ compiler team do a very POOR job at compiling the naive, Knutt and Boyer-Moore core into assembly machine code. At least, the resulting code does not seem to reflect a compiler used worldwide for its merits.

Looking at the code

.1 strstr implementation

findnext:
        mov     esi,edi             ; esi = edi = pointers to somewhere in str1
        mov     ecx,[esp + 14h]     ; str2

;use edi instead of esi to eliminate AGI
        mov     al,[edi]            ; al is next char from str1

        add     esi,1               ; increment pointer into str1

        cmp     al,dl
        je      first_char_found

        test    al,al               ; end of str1?
        jz      not_found           ; yes, and no match has been found

loop_start:
        mov     al,[esi]            ; put next char from str1 into al
        add     esi,1               ; increment pointer in str1
in_loop:
        cmp     al,dl
        je      first_char_found

        test    al,al               ; end of str1?
        jnz     loop_start          ; no, go get another char from str1

not_found:
        pop     esi
        pop     ebx
        pop     edi
        xor     eax,eax
        ret

If you don't know assembly code, let me tell you in short that closed brackets reflect memory access. The more memory access, the slower the code. Even though the register keyword can be added in plain C/C++ code to instruct the compiler to try to use registers as much as possible, it obviously didn't or couldn't achieve what the human-written implementation of strstr achieved.

.2 C++ asm code generated for the naive implementation

(only an excerpt of the code, two lines of C++ code are dissected)

; 207  : 		LPTSTR p = m_cpPattern;

  00030	8b 45 f8	 mov	 eax, DWORD PTR _this$[ebp]
  00033	8b 48 10	 mov	 ecx, DWORD PTR [eax+16]
  00036	89 4d ec	 mov	 DWORD PTR _p$64824[ebp], ecx

; 208  : 		long i = 0;

  00039	c7 45 e0 00 00
	00 00		 mov	 DWORD PTR _i$64825[ebp], 0
$L64827:

; 209  : 		while (*(m_cpPtr+i) && *p == *(m_cpPtr+i))

  00040	8b 45 f8	 mov	 eax, DWORD PTR _this$[ebp]
  00043	8b 48 0c	 mov	 ecx, DWORD PTR [eax+12]
  00046	8b 55 e0	 mov	 edx, DWORD PTR _i$64825[ebp]
  00049	0f be 04 11	 movsx	 eax, BYTE PTR [ecx+edx]
  0004d	85 c0		 test	 eax, eax
  0004f	74 2b		 je	 SHORT $L64828
  00051	8b 45 ec	 mov	 eax, DWORD PTR _p$64824[ebp]
  00054	0f be 08	 movsx	 ecx, BYTE PTR [eax]
  00057	8b 55 f8	 mov	 edx, DWORD PTR _this$[ebp]
  0005a	8b 42 0c	 mov	 eax, DWORD PTR [edx+12]
  0005d	8b 55 e0	 mov	 edx, DWORD PTR _i$64825[ebp]
  00060	0f be 04 10	 movsx	 eax, BYTE PTR [eax+edx]
  00064	3b c8		 cmp	 ecx, eax
  00066	75 14		 jne	 SHORT $L64828

.3 Rotor code (non commercial .NET VM and base class library)

INT32 COMNlsInfo::FastIndexOfString(WCHAR *source, INT32 startIndex, INT32 endIndex, 
                                    WCHAR *pattern, INT32 patternLength) 
{    
  int endPattern = endIndex - patternLength + 1;        
  if (endPattern<0) 
  {
    return -1;    
  }    
  
  if (patternLength <= 0) 
  {
    return startIndex;    
  }
  
  WCHAR patternChar0 = pattern[0];    
  
  for (int ctrSrc = startIndex; ctrSrc<=endPattern; ctrSrc++) 
  {        
    if (source[ctrSrc] != patternChar0)            
	  continue;        
	  
    int ctrPat;
    for (ctrPat = 1; 
	       (ctrPat < patternLength) && (source[ctrSrc + ctrPat] == pattern[ctrPat]); 
		     ctrPat++) {          
    ; // just loop
    }
	
    if (ctrPat == patternLength) 
    {
      return (ctrSrc);        
    }
  }
 
  return (-1);
}

.4 Java code (actual VM and base class library)


  static int indexOf(char[] source, int sourceOffset, int sourceCount,
                       char[] target, int targetOffset, int targetCount,
                       int fromIndex) {
	if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
	}
    	if (fromIndex < 0) {
    	    fromIndex = 0;
    	}
	if (targetCount == 0) {
	    return fromIndex;
	}

        char first  = target[targetOffset];
        int i = sourceOffset + fromIndex;
        int max = sourceOffset + (sourceCount - targetCount);

    startSearchForFirstChar:
        while (true) {
	    /* Look for first character. */
	    while (i <= max && source[i] != first) {
		i++;
	    }
	    if (i > max) {
		return -1;
	    }

	    /* Found first character, now look at the rest of v2 */
	    int j = i + 1;
	    int end = j + targetCount - 1;
	    int k = targetOffset + 1;
	    while (j < end) {
		if (source[j++] != target[k++]) {
		    i++;
		    /* Look for str's first char again. */
		    continue startSearchForFirstChar;
		}
	    }
	    return i - sourceOffset;	/* Found whole string. */
        }
    }

Stéphane Rodriguez
December 22, 2003.

Home
Blog