Simple xml

This article presents the source code for a really simple xml parser, just under 100 lines of C/C++ code.

Many things revolve around xml these days. Pretty much everyone involved with file formats, configuration files, RSS feeds, and countless other uses of markup languages wants to use xml. Of course, there is a lot of trendiness in that, and often a regular comma-separated data file will do just fine. Yet you need an xml parser any time you are required to read data carried in xml.

For that purpose, full-fledged parsers which also happen to include xslt engines and the like are absolutely overkill. MSXML, Apache Xerces, and the .NET Xml parser are three such examples. In practice, relying on a simple xml parser will be just fine.

What's more, hardly anybody uses the entire set of capabilities of xml, such as processing instructions and entities. Most of the time, people use UTF-8 as the encoding since it is the best compromise between universality and character size, but even that is not a requirement.

In some cases, people don't use attributes at all. After all, attributes are a special kind of element children or, conversely, direct element children can be regarded as attributes as well. There is still that ongoing debate between attributers and elementers...

Finally, it was some kind of a personal challenge to come up with a parser that wouldn't be larger than say 100-200 lines of code. After all, why should xml require anything more if xml is *that* simple?

Features and limitations

The code reads regular xml with the following limitations:

the parser doesn't support entities other than &amp; &lt; &gt; and &quot;
the parser doesn't support DTD declarations

With that in mind, it looks like our parser is pretty dumb. Nevertheless, it will meet most needs, for the simple reason that the limitations above concern features that are often disregarded anyway.
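To illustrate the first limitation, here is a minimal sketch of how the four supported entity references could be decoded. DecodeEntities is a hypothetical helper written for this example and is not taken from the parser's source; it makes a single pass over the text and leaves any other entity untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of simplexml.h): replaces the four
// supported entity references with the characters they stand for.
// Anything else starting with '&' is copied through unchanged.
std::string DecodeEntities(const std::string& text)
{
  static const struct { const char* ref; char ch; } table[] = {
    { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' }, { "&quot;", '"' }
  };

  std::string out;
  std::size_t i = 0;
  while (i < text.size())
  {
    bool matched = false;
    if (text[i] == '&')
    {
      // try each known reference at the current position
      for (std::size_t k = 0; k < sizeof(table) / sizeof(table[0]); ++k)
      {
        const std::string ref(table[k].ref);
        if (text.compare(i, ref.size(), ref) == 0)
        {
          out += table[k].ch;
          i += ref.size();
          matched = true;
          break;
        }
      }
    }
    if (!matched)
      out += text[i++];  // ordinary character, or an unsupported entity
  }
  return out;
}
```

Because the text is scanned only once, an escaped ampersand followed by "lt;" decodes to the literal text of the entity rather than to a < character.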

The xml parser produces a document object model (DOM), that is, a hierarchy of tree nodes which client code can navigate.

Implementation details

The parsing process does little more than keep track of reserved symbols such as < and >. Whenever the parser reaches a < symbol followed by a /, a closing tag has been reached: the parser stores the value that may have been declared, as in <element>myvalue</element>, and goes one level up. A < that is not followed by a / begins a child element, in which case a new node must be created. Since an element can have more than one child, siblings are created whenever required.
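The process described above can be sketched as follows. This is not the article's ReadXml: SketchNode and ParseSketch are hypothetical names, and the sketch handles only nested elements and text values, with no attributes, comments, entities, or self-closing tags.

```cpp
#include <cstddef>
#include <string>

// Pared-down node used by this sketch only.
struct SketchNode {
  std::string name, value;
  SketchNode *parent, *child, *sibling;
  SketchNode() : parent(NULL), child(NULL), sibling(NULL) {}
  ~SketchNode() { delete child; delete sibling; }
};

// Hypothetical function: returns the top-level element, or NULL on malformed input.
SketchNode* ParseSketch(const std::string& xml)
{
  SketchNode root;           // dummy root, owns every node while parsing
  SketchNode* cur = &root;   // current node level
  std::string text;          // characters seen since the last tag
  std::size_t i = 0;
  while (i < xml.size())
  {
    if (xml[i] != '<') { text += xml[i++]; continue; }

    std::size_t end = xml.find('>', i);
    if (end == std::string::npos) return NULL;

    if (i + 1 < xml.size() && xml[i + 1] == '/')
    {
      // closing tag: store the pending value, then go one level up
      std::string name = xml.substr(i + 2, end - i - 2);
      if (cur == &root || name != cur->name) return NULL;
      if (cur->child == NULL) cur->value = text;
      cur = cur->parent;
    }
    else
    {
      // opening tag: create a node and append it as the last child
      SketchNode* node = new SketchNode;
      node->name = xml.substr(i + 1, end - i - 1);
      node->parent = cur;
      if (cur->child == NULL)
        cur->child = node;
      else
      {
        SketchNode* last = cur->child;
        while (last->sibling) last = last->sibling;
        last->sibling = node;
      }
      cur = node;
    }
    text.clear();
    i = end + 1;
  }
  if (cur != &root || root.child == NULL) return NULL;  // unclosed or empty
  SketchNode* result = root.child;
  root.child = NULL;         // detach so the dummy root does not free the result
  result->parent = NULL;
  return result;
}
```

The dummy root lives on the stack, so on any early return its destructor frees whatever nodes were already created. The caller deletes the returned node to free the whole tree.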

The result of the parse is entirely encapsulated by the node hierarchy. At each level, the node defined below can have an arbitrary number of siblings and children, plus a parent node.


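#include <cstddef>
#include <string>

struct Node;
typedef Node* LPNode;

struct Node {

  enum NodeType { elem = 0, attr };

  NodeType type;
  std::string name;   // element or attribute name
  std::string value;  // text content or attribute value
  LPNode parent, child, sibling, attrib;

  Node(NodeType t) : type(t)
  {
    // name and value are already empty; a std::string must not be assigned NULL
    parent = child = sibling = attrib = NULL;
  }

  ~Node()
  {
    // deleting a node recursively frees its children, following siblings and attributes
    delete child; child = NULL;
    delete sibling; sibling = NULL;
    delete attrib; attrib = NULL;
  }

};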

Using the parser is really straightforward:


  #include  "simplexml.h"

  char* xml = "..."; // some xml fragment

  // parse the xml fragment
  LPNode dom = ReadXml(xml);
  if (!dom)
    return FALSE;

  // dump the node tree
  DumpDom(dom,0);

  // remember to delete the resulting dom
  delete dom;

History

August 10, 2004 - First implementation. The parser does not support attributes and comments.
July 3, 2005 - Added attributes, comments and the BOM (Byte-Order-Mark). Also uses STL strings.
Oct 12, 2006 - Fixed a problem in the BOM. Fixed UTF8 characters handling.

Stéphane Rodriguez - Oct 12, 2006.
