Efficient object comparison

Stéphane Rodriguez - Aug 8, 2004

What this article is about

In the remainder of this article, I shall discuss object comparison, and how it is possible to take advantage of the way strings get compared to extend the mechanism to arbitrary objects. Although the code provided is made with C++, the principles applies to whatever language. The goal is to show how recurrent operations like comparing objects can be more automated and can indeed serve as a gateway to more efficient comparisons both in terms of maintenance cost, code change cost and last but not least raw CPU performance.

String comparison

This has been covered in a previous article, but we'll get back to the key point. When a newbie first learns C/C++ and creates strings, he makes the mistake to try and compare the content of the strings by just comparing their pointers. Since C/C++ doesn't have a built-in string class per se, and everything boils down to pointers, the following code is obviously flawed :

char* string1 = "hello";
char* string2 = "hello";

// compare the strings, purposedly flawed code
if (string1 == string2)
  printf("strings are the same");
else
  printf("strings are different");

To make a long story short, a pool of strings solves that problem by allocating new buffers only whenever new strings are used. If we replace the previous code with the following :

char* string1 = StringManager.GetPtr("hello");
char* string2 = StringManager.GetPtr("hello");

// compare the strings, this code is ok!
if (string1 == string2)
  printf("strings are the same");
else
  printf("strings are different");

The trick is to delegate string management to a StringManager object whose job is to maintain a pool of unique string buffers. Given that by design the string manager won't allocate twice a buffer for a given string, strings can thus be compared by just comparing their pointers. Ironically enough, the basic newbie assumption now turns out to be perfectly valid, only as a result of introducing an object abstraction, essentially making a string behave like an object.

Whenever you need case insensitive comparisons, a very likely usage indeed, then you need to instruct the string manager to store both versions of buffers, one as is, and one with all cases off. In terms of client code, it boils down to use .GetCaselessPtr() rather than .GetPtr().

How we go from string comparison to object comparison is by reusing the mechanism and serialize objects as strings before we can compare them.

Object comparison

Why object serialization becomes tricky is because of their arbitrary complexness. An object can have a bunch of underlying properties all of a different type, dates, bool, integer, float, and other objects which are also made of arbitrary properties and so on. In the most general case, serialization of objects can be costly in performance that's why developers may prefer to rely on hash codes generated whenever some object is created. Comparison is then performed by just comparing hash codes, typically numbers. Of course, this works well if hash codes are guaranteed to be unique at generation time and that objects can be referenced rather than copied by value. In some scenario, 99% uniqueness granted is not enough and indeed can cause serious bugs in the program, especially database related applications, but for most cases an almost unique hash code generator will meet the needs.

If this isn't clear enough, hash codes can be the serialized strings that reflect the complete object state, and in this case it indeed matches the string comparison described above. This is what we'll concentrate on since string-serialized objects have a bunch of other nice properties we can use for partial comparison, or property change tracking. Of course, hash codes being generated as part of the object creation suffer a major limitation, there is a new hash code for each new object and thus two objects will be automatically said to be different based on their hash codes. The serialized strings however lets you compare different instances of objects and get to know whether these are identical, ie the values of their properties are identical.

It must be said at this point that the following may be useful for your application needs, or not. It really depends on each case, and obviously design choices at the basis of managed code environments like .NET objects is also subject to this : their assumptions around object comparison and hash code generation will fit your needs...or not. Don't underestimate this when you're checking the requirements.

Let's take an example,

class Style
{
 public :
  String _fontname;
  Int _color;
  Int _height;
  EnumStyle _style; // italic, bold, underline combination

  String Serialize()
  {
    return "1.0" /*for versioning purposes*/ + SEP + 
           _fontname + SEP + 
           _color.Serialize() + SEP + 
           _height.Serialize() + SEP +
           _style.Serialize();
  }

  bool Compare(Style& style)
  {
    return ( StringManager.GetPtr( this->Serialize() ) ==
             StringManager.GetPtr( style.Serialize() )
           );
  }

  bool CompareCustom(Style& style, String (*customSerializationFunction)(String&) )
  {
    return ( StringManager.GetPtr( (*customSerializationFunction)(*this) ) == 
             StringManager.GetPtr( (*customSerializationFunction)(style) ) 
           ); 
  }

}

In a real world use case, the Style class comparison allows to know whether the user changes some style in the application, whether it fits an existing one, or if a new instance must be created in the dictionary of known styles. The implementation really revolves around the serialization as defined above, and then ends up comparing strings.

For obvious reasons, if comparisons is intensively used, then there might be a memory problem. Every single call to the StringManager will by default allocate new buffers to different strings. Depending on the distribution of the strings, what the application does, a different implementation of the StringManager is probably more suitable. The alternative implementations are one of the following :

expect a true,false parameter to be passed along in the .GetPtr() call so that no new string buffers are allocated
allocate a string into a large preallocated block, whose size is such that reallocation is rarely required, and promote string buffers to more durable individual memory blocks based on usage. Essentially, strings that are often used would be automatically moved to individual buffers, while others with shorter life would remain temporarily.

Partial comparison

Partial object comparison can be achieved using a custom serialization. Because the class can't or shouldn't implement every possible combination of serialization, for instance I might be interested in all styles with a particular _fontname regardless the values of other properties, it makes complete sense to delegate the serialization to the consumer. So the consumer is responsible for providing a serialization function, which then is used regularly to compare strings just like the full comparison. You end up using the CompareCustom() method.

Home
Blog