CS 340-Fall 2007: Programming Assignment 1

First Phase: Due September 14, 11:59 PM.
Second Phase: Due October 5, 11:59 PM.

(This document was last modified on September 24 at 9:30 PM.)


For this assignment, you are to design two different XML parsing APIs, using different styles, and then implement the APIs.

You may use either C++ or Java for this assignment.

1 Simple XML

For this assignment, you will use a simplified form of XML, containing only start-tags, end-tags, content, and attributes. The format is given below, by example.

< elem1 >
  This is some content.
  <elem2 attr1="value1">Content.</elem2>
  <  elem2 attr1 = "value1" ></  elem2  >
  Some more content.
  <elem3/>
</elem1>

The element <elem3/> is what we call an empty element. It is completely equivalent to <elem3></elem3>.

White space may be present at a number of different points, as shown above. White space can consist of spaces, tabs, or newlines. Note that the content includes any whitespace. For example, the content of this element is exactly a newline followed by two spaces.

<elem>
  </elem>

2 APIs

Any XML parser needs to present its functionality via an API of some kind. There are two popular styles: tree-based and callback-based. You will design and implement both.

In both cases, your API only needs to support accessing an XML document. It does not need to support constructing the actual objects in the API. Of course, there must still be some way of creating those objects, but it can consist of private methods and/or public methods known only to your parser implementation.

2.1 Tree-Based

Tree-based APIs are essentially data structures for tree-like data. They present the XML document as a tree of nodes. They provide a way of representing the document tree, and a way of traversing it. Note that an element may have two kinds of child nodes. The first kind is character content, which would be a string. The second kind is a child element. These kinds may be mixed within the children of a single element.

Elements may also have attributes. Some way should be provided to access the set of attributes of each element.

Your API should also provide some way to access the parent of a given element.
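
As a rough illustration of the requirements above (mixed children, attributes, and parent access), one possible shape in C++ is sketched below. All names here (Node, Text, Element, addText, addElement, depth, buildSample) are hypothetical and are not a required design. The public add methods exist only so the sketch can build a tree by hand; in your own API, construction may be hidden from the user.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; one possible shape, not a required design.
struct Node {
    explicit Node(Node *p) : parent(p) {}
    virtual ~Node() = default;  // virtual destructor makes the type polymorphic
    Node *parent;               // nullptr for the root element
};

struct Text : Node {
    Text(Node *p, std::string s) : Node(p), value(std::move(s)) {}
    std::string value;
};

struct Element : Node {
    Element(Node *p, std::string n) : Node(p), name(std::move(n)) {}
    std::string name;
    std::vector<std::pair<std::string, std::string>> attributes;
    std::vector<std::unique_ptr<Node>> children;  // Text and Element mixed, in document order

    Text *addText(const std::string &s) {
        children.push_back(std::make_unique<Text>(this, s));
        return static_cast<Text *>(children.back().get());
    }
    Element *addElement(const std::string &n) {
        children.push_back(std::make_unique<Element>(this, n));
        return static_cast<Element *>(children.back().get());
    }
};

// Counts the ancestors of a node by following parent links up to the root.
int depth(const Node *n) {
    int d = 0;
    for (const Node *p = n->parent; p != nullptr; p = p->parent) ++d;
    return d;
}

// Builds by hand the tree for: <elem1>Hi<elem2 attr1="value1">Content.</elem2></elem1>
std::unique_ptr<Element> buildSample() {
    auto root = std::make_unique<Element>(nullptr, "elem1");
    root->addText("Hi");
    Element *e2 = root->addElement("elem2");
    e2->attributes.emplace_back("attr1", "value1");
    e2->addText("Content.");
    return root;
}
```

In this sketch each element owns its children, and the parent link is a plain non-owning pointer, so the whole tree is released when the root goes out of scope. Other ownership schemes are equally reasonable.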

As an example of what such an API is like, you can look at the DOM (Level 1) API given here. Wikipedia also provides a good introduction.

2.2 Callback-Based

A problem with tree-based APIs is that they can use a lot of memory, especially for large XML documents. As a solution, people have also designed streaming, callback-based APIs. In this style, the application provides callbacks to the parser, which are then called by the parser when it recognizes specific pieces. No complete representation of the whole XML document ever needs to be in memory at any single point in time.

You should include callbacks to:

Empty elements, as described above, should be handled as if they consisted of a start-tag and end-tag pair with attributes and no content.

Your API will include some class to represent your streaming parser. You also need some way to initialize this parser with the callbacks provided by the application.
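
To make the callback style concrete, the sketch below shows one way an application might hand callbacks to a parser, and the order in which events could arrive. The Callbacks struct, replayEvents, and traceEvents are hypothetical names. No parsing is done here: the events are hard-coded, so the sketch illustrates only the calling convention and the treatment of an empty element.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical callback bundle; the member names are illustrative only.
struct Callbacks {
    std::function<void(const std::string &name)> onStart;
    std::function<void(const std::string &name, const std::string &value)> onAttribute;
    std::function<void(const std::string &text)> onContent;
    std::function<void(const std::string &name)> onEnd;
};

// Stand-in for a parser: replays the events a streaming parser might report for
//   <elem1>Hi<elem3 a="b"/></elem1>
// The calls are hard-coded to show the ordering; nothing is parsed.
void replayEvents(const Callbacks &cb) {
    cb.onStart("elem1");
    cb.onContent("Hi");
    cb.onStart("elem3");       // empty element: reported as a start-tag...
    cb.onAttribute("a", "b");
    cb.onEnd("elem3");         // ...followed directly by its end-tag, with no content
    cb.onEnd("elem1");
}

// Collects the events into one string so that the ordering is easy to inspect.
std::string traceEvents() {
    std::string log;
    Callbacks cb;
    cb.onStart = [&log](const std::string &n) { log += "start(" + n + ") "; };
    cb.onAttribute = [&log](const std::string &n, const std::string &v) {
        log += "attr(" + n + "=" + v + ") ";
    };
    cb.onContent = [&log](const std::string &t) { log += "content(" + t + ") "; };
    cb.onEnd = [&log](const std::string &n) { log += "end(" + n + ") "; };
    replayEvents(cb);
    return log;
}
```

A bundle of function objects is only one option. An abstract handler class that the application derives from is another common choice.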

SAX is the best known example of a callback-based API. Details of SAX can be found here at the official site, and here at Wikipedia.

3 First Phase

Due September 14, 11:59 PM.

For the first phase of this project, you will turn in your API designs, one for each style. This should include enough information so that a user can call and use your parsers, and avoid any syntax or runtime errors.

This design will include the C++/Java classes/interfaces that the user will need. It does not need to include private or other methods that the user does not need to call, but that may be required by your implementation. So it will likely not include all methods in those classes or interfaces. It should also include some explanation of what each method does.

Your design should also include error-handling. You may use exceptions or other mechanisms.
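
If you choose exceptions, a design might document an exception type along the following lines. This is a minimal sketch: XmlParseError, checkEndTag, and describeMismatch are hypothetical names, and reporting a line number is an assumption about what a user might find helpful rather than a requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type; the class name and the line field are illustrative.
class XmlParseError : public std::runtime_error {
public:
    XmlParseError(const std::string &what, std::size_t line)
        : std::runtime_error("line " + std::to_string(line) + ": " + what), line_(line) {}
    std::size_t line() const { return line_; }

private:
    std::size_t line_;
};

// Hypothetical check a parser might run when it reaches an end-tag.
void checkEndTag(const std::string &open, const std::string &close, std::size_t line) {
    if (open != close) {
        throw XmlParseError("expected </" + open + "> but found </" + close + ">", line);
    }
}

// Shows how a caller might observe the failure.
std::string describeMismatch() {
    try {
        checkEndTag("elem1", "elem2", 7);
    } catch (const XmlParseError &e) {
        return e.what();
    }
    return "ok";
}
```

Whatever mechanism you pick, the design document should say which calls can fail and what the user sees when they do.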

3.1 Submission and Evaluation of First Phase

E-mail the design document to cs340-internal@cs.binghamton.edu. It should include some rationale and discussion, and the API. I do not expect this document to total more than about 4 pages for both APIs, though yours may turn out to be shorter or longer.

Evaluation will consist of how easy the API is to use, and the clarity and completeness of your document.

4 Second Phase

Due October 5, 11:59 PM.

For the second phase of this project, you will implement your APIs, after I approve and possibly modify them. Note that you should try to share as much of your code as possible between the two API implementations.

For reference, or if you did not turn in Phase 1, you may refer to the API given below. In Phase 2, you are free to modify your APIs submitted in Phase 1.

Interfaces

class Node

This class is a base class. It is polymorphic, and so will work with RTTI.

class Element inherits from Node

This class represents XML elements. It inherits from the Node class. RTTI can be used to distinguish an Element derived class from a Text derived class.

const std::string &name() const;

Returns the name of the element.

const vector<const Attribute *> &attributes() const;

Returns a vector of attributes.

const vector<const Node *> &children() const;

Returns a vector of child nodes.

class Text inherits from Node

This class represents XML text content. It inherits from the Node class. RTTI can be used to distinguish an Element derived class from a Text derived class.

const std::string &str() const;

Returns the text content as a string.

class Attribute

This class is used to represent an XML attribute.

const std::string &name() const;

Returns the name of the attribute.

const std::string &value() const;

Returns the value of the attribute.

class XmlHandler

This class is used to define the interface that a SAX-style callback handler must use. A user of this library must inherit from this class, and implement all the functions.

virtual void handleOpenStartTag(const std::string &name);

This callback is called by the parser when a start tag is encountered.

virtual void handleAttribute(const Attribute *);

This callback is called by the parser when an attribute is encountered.

virtual void handleCloseStartTag();

This callback is called by the parser to indicate that no more attributes for this start-tag will be encountered.

virtual void handleContent(const std::string &text);

This callback is called by the parser to indicate that some text content has been encountered.

virtual void handleEndTag();

This callback is called by the parser to indicate that the end of the current element has been encountered.

class Parser

static Element *parseDOM(const std::string &);

Parses the XML document given as the string parameter, and returns a pointer to the root element.

static void parseSAX(const std::string &xml, XmlHandler *);

Parses the XML document given as the string parameter, and calls callbacks on the XmlHandler object passed in.

Usage

To create a DOM-style tree, call the Parser::parseDOM() method, passing in the XML document as a string. This method returns a pointer to the root Element object.

Children of the element may be accessed through the Element::children() method. RTTI may be used to determine if a given Node object represents an element or text content.
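
The traversal described above might look roughly like the following. So that the sketch compiles on its own, it defines minimal stand-ins for Node, Text, and Element. Only name(), children(), and str() follow the reference API; the constructors, the add() helper, and the outline functions are assumptions, and attributes are left out for brevity. The tree is built by hand where a real program would call Parser::parseDOM().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-ins for the reference classes. Constructors, data members, and
// add() are assumptions; name(), children(), and str() follow the reference API.
class Node {
public:
    virtual ~Node() = default;
};

class Text : public Node {
public:
    explicit Text(std::string s) : str_(std::move(s)) {}
    const std::string &str() const { return str_; }

private:
    std::string str_;
};

class Element : public Node {
public:
    explicit Element(std::string n) : name_(std::move(n)) {}
    const std::string &name() const { return name_; }
    const std::vector<const Node *> &children() const { return children_; }
    void add(const Node *child) { children_.push_back(child); }  // hypothetical helper

private:
    std::string name_;
    std::vector<const Node *> children_;
};

// Renders a subtree as an indented outline, using RTTI to tell the node kinds apart.
void outline(const Node *node, std::size_t depth, std::string &out) {
    const std::string indent(2 * depth, ' ');
    if (const auto *e = dynamic_cast<const Element *>(node)) {
        out += indent + "element " + e->name() + "\n";
        for (const Node *child : e->children()) outline(child, depth + 1, out);
    } else if (const auto *t = dynamic_cast<const Text *>(node)) {
        out += indent + "text \"" + t->str() + "\"\n";
    }
}

// Hand-built tree for <elem1>Hi<elem3/></elem1>.
std::string outlineSample() {
    Text hi("Hi");
    Element elem3("elem3");
    Element elem1("elem1");
    elem1.add(&hi);
    elem1.add(&elem3);
    std::string out;
    outline(&elem1, 0, out);
    return out;
}
```

The dynamic_cast to Element is tried first and the cast to Text second; a node that is neither is silently skipped in this sketch.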

To parse using SAX-style, call Parser::parseSAX(), passing in the XML document as a string, and a pointer to an XmlHandler object to receive the callbacks.
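
A handler for the SAX-style interface might be written as in the sketch below. The Attribute and XmlHandler stand-ins are included only so that the example compiles alone: the Attribute constructor and the "= 0" pure-virtual markers are assumptions, while the callback signatures follow the reference API. StatsHandler and replayReferenceCalls are hypothetical names, and the calls are replayed by hand where a real program would call Parser::parseSAX(xml, &handler).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Stand-ins so the sketch compiles alone. The Attribute constructor and the
// "= 0" markers are assumptions; the signatures follow the reference API.
class Attribute {
public:
    Attribute(std::string n, std::string v) : name_(std::move(n)), value_(std::move(v)) {}
    const std::string &name() const { return name_; }
    const std::string &value() const { return value_; }

private:
    std::string name_, value_;
};

class XmlHandler {
public:
    virtual ~XmlHandler() = default;
    virtual void handleOpenStartTag(const std::string &name) = 0;
    virtual void handleAttribute(const Attribute *) = 0;
    virtual void handleCloseStartTag() = 0;
    virtual void handleContent(const std::string &text) = 0;
    virtual void handleEndTag() = 0;
};

// Example handler: counts elements and attributes and tracks the deepest nesting.
class StatsHandler : public XmlHandler {
public:
    void handleOpenStartTag(const std::string &) override {
        ++elements;
        if (++depth_ > maxDepth) maxDepth = depth_;
    }
    void handleAttribute(const Attribute *) override { ++attributes; }
    void handleCloseStartTag() override {}
    void handleContent(const std::string &text) override { contentChars += text.size(); }
    void handleEndTag() override { --depth_; }

    int elements = 0, attributes = 0, maxDepth = 0;
    std::size_t contentChars = 0;

private:
    int depth_ = 0;
};

// Replays by hand the calls expected for <elem1>Hi<elem2 attr1="value1"/></elem1>.
void replayReferenceCalls(XmlHandler *h) {
    const Attribute attr1("attr1", "value1");
    h->handleOpenStartTag("elem1");
    h->handleCloseStartTag();
    h->handleContent("Hi");
    h->handleOpenStartTag("elem2");
    h->handleAttribute(&attr1);
    h->handleCloseStartTag();
    h->handleEndTag();  // empty element: the end-tag follows the start-tag directly
    h->handleEndTag();
}
```

Because handleEndTag() takes no name, a handler that needs the name of the element being closed has to keep its own stack of open elements.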

You should also create test XML documents as part of this second phase. These documents should be designed to test the correctness of parsers. I will collect all test documents, and run all parsers against them. You will receive bonus credit if your test document reveals a bug or defect in another person's parser.

4.1 Submission and Evaluation of Second Phase

You should submit a compressed tar file containing your code. It should include a makefile, such that after I type make, there will be two executables (or scripts, if you are using Java) in the directory.

The first executable should be named parse_tree, and should parse an XML document using the tree-based API, and print out the XML document in some easily readable format.

The second executable should be named parse_callback, and should do the same thing, except using a callback-based API.

Both executables should take one argument, which is the name of the file containing the XML document to parse.

You should maintain a clear separation in your code between your parser, and the driver programs. Thus, it should be easy for you to package your parser as a library.
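
One way to keep a driver thin is sketched below: the driver only checks its argument, reads the file into a string, and hands that string to the parser library. The names readDocument and runDriver, the usage message, and the exit codes are all hypothetical. The parser call itself is left as a comment, since it depends on your API.

```cpp
#include <cassert>
#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads an entire file into a string, the form taken by the reference parser entry points.
std::string readDocument(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Hypothetical driver body; main() would simply return runDriver(argc, argv).
int runDriver(int argc, const char *const argv[]) {
    if (argc != 2) {
        std::cerr << "usage: " << (argc > 0 ? argv[0] : "parse_tree") << " FILE\n";
        return 2;
    }
    try {
        const std::string xml = readDocument(argv[1]);
        // The call into the parser library would go here, e.g. Parser::parseDOM(xml),
        // followed by printing the result. This sketch only reports the size read.
        std::cout << "read " << xml.size() << " bytes\n";
        return 0;
    } catch (const std::exception &e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}
```

With this split, parse_tree and parse_callback can share readDocument and differ only in the few lines that call the parser and print the result.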

Your tar file should also contain a directory named test_docs which contains a set of test documents.

You should also include a file named report.txt to give comments and other information. (Other formats are also acceptable, in which case you may have a different extension.) While there is nothing that this document strictly needs to include, you may have things about your submission that you wish to let me know, etc. Basically, this document is a chance for you to enhance your grade by telling me things that may not be obvious from running your code.

Evaluation will be based on design (about 40%), correctness (about 40%), and performance (about 20%). This division of points is not completely rigid, however. There is often a tradeoff between design and performance. So, I will take into consideration that a highly encapsulated, flexible design may have poorer performance. Likewise, I will take into consideration that trying to squeeze the last bit of performance out of some code may compromise the elegance of the design.

5 Notes

The point division between phases one and two reflects not only the relative importance I place on each, but also the amount of work involved.

Your submission will be partially graded by scripts, so you should follow the instructions and formats exactly. Submissions that do not conform to these requirements will not receive credit.