Efficient strings

This topic might sound a bit old to experienced guys, but eh when you see how much simple apps obviously rape the memory, i thought that wouldn't do harm to set my view about it.

i am hereby talking about a C++-based technique to minimize the actual use of memory especially when you rely so much about strings, CString, or any other character buffer that might hold depending on the language and base classes you are using.


why be efficient ?

There is no question that low cost memory is no good excuse for bad programming. Effective use of object oriented programming in general produces side effects like memory hog, so there is got to be ways to have apps running with less requirement, especially when the apps are expected on "old" PCs, those used by your next customers!

So let me introduce strings, probably the most intensively used object in whatever app we might talking about.

What's bad with the C run-time ? it provides all the necessary (asm) routines to perform string processing such as comparisons, concatenations, etc. but it does not handle memory buffers at all. So developers using the C run-time end up messing their code with memory allocation everywhere and, because we are talking memory pointers here, pointers appearing everywhere in the code. As a result, the source code is made unstable by design since the use of pointers usually mess up the app over time.

That's why the MFC CString is such a goodness. It frees the developer from the tedious task of managing the memory. But it has a major side effect : strings are being allocated/deallocated quite often, in fact proportionally ; local copies are created anytime the = operator is being used and, as a result, you are raping the memory with a lot of questionable activity. Oh, it's not your code, but you are using MFC CString objects which were not meant for intensive use.

The interesting point is thus to come up with a slight modification of the CString implementation and have a global dictionnary of allocated strings. Thanks to this dictionnary it's possible to figure out from a buffer whether it's already known somewhere, especially whether it needs a new allocation or not. This drastically changes the performance of an app, proportionally to the use of CString objects.

Globally managed strings is the route taken by .NET String objects by the way, and it's known under the immutable strings vocable.


unmanaged to managed strings

A couple years ago, I needed CString facilities without binary dependency of the MFC. So what I did was extract the relevant code, and then add it to my own core library. It was easy a task, most of the code is in MFC \ strcore.cpp. This resulted in a AString class definition and implementation. Note the different name for the class, only to avoid conflicts when it comes to integrating such code into MFC apps.

When you take a look at the CString implementation you are quite stumped by the fact that CString object instances are being (internally) duplicated for many reasons and as a result can cause a lot of harm to performance. In fact, the Copy() method is used quite often : construction time, assignment time, string manipulation, and so on.

The first thing to do is to reengineer the CString implementation so that a Copy() method is the only method which eventually does allocate memory buffers, regardless of the method being called. Doing so, we ensure that changing the allocation of the object will only result in a single method implementation change, which is great as a thought!

So next to the CString class, we have a globally created hash table instance. The instance is created to be able to be accessed anytime from anywhere in the code. This again simplifies matters. Otherwise, we should have passed the hash table as a parameter of an arbitrary method, and it would have required some serious sequencing.

The hash table is a fast-access table. It basically gives access to buffers and calculates a hash key from each string to build a faked index where each string is going to be stored. Using the key instead of relying on an iterator is really what brings performance. Iterators is basically what you end up to with raw and stupid object oriented code.

The good thing of a hash table is that it brings the opportunity to store strings uniquely, and that's why it's a such a good idea to integrate a hash table into a CString implementation.

The funny thing is that, again, it only takes a change in the Copy() method implementation, the ultimate method being used in string processing.

Below is a link to that source code. The string implementation is for a class called AString. Everything else is simple enough, at least seems simple enough to me!

Link to source code (15kb, MFC dialog-based project)



Stéphane Rodriguez - August 29, 2003.