(This document was last modified on September 24 at 9:30 PM.)
For this assignment, you are to design two different XML parsing APIs, using different styles, and then implement the APIs.
You may use either C++ or Java for this assignment.
For this assignment, you will use a simplified form of XML, containing only start-tags, end-tags, content, and attributes. The format is given below, by example.
< elem1 > This is some content. <elem2 attr1="value1">Content.</elem2> < elem2 attr1 = "value1" ></ elem2 > Some more content. <elem3/> </elem1>
The element <elem3/> is what we call an empty element. It is completely equivalent to <elem3></elem3>.
White space may be present at a number of different points, as shown above. White space can consist of spaces, tabs, or newlines. Note that the content includes any whitespace. For example, the content of this element is exactly a newline followed by two spaces.
<elem>
  </elem>
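The whitespace rules for tags can be illustrated with a small sketch that parses a single tag. It is not part of the assignment's API: `TagInfo`, `parseTag`, and the helper functions are hypothetical names, and the rule that names contain only letters, digits, and underscores is an assumption, since the format above is given only by example.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical result type for one tag; not part of the assignment's API.
struct TagInfo {
    std::string name;
    std::map<std::string, std::string> attributes;
    bool isEndTag = false;  // true for </name>
    bool isEmpty = false;   // true for <name/>
};

static void skipSpace(const std::string &s, std::size_t &i) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
}

// Assumption: names consist of letters, digits, and underscores.
static std::string readName(const std::string &s, std::size_t &i) {
    const std::size_t start = i;
    while (i < s.size() &&
           (std::isalnum(static_cast<unsigned char>(s[i])) || s[i] == '_')) ++i;
    if (i == start) throw std::runtime_error("expected a name");
    return s.substr(start, i - start);
}

// Consumes the character c or throws if it is not next.
static void expect(const std::string &s, std::size_t &i, char c) {
    if (i >= s.size() || s[i] != c)
        throw std::runtime_error(std::string("expected '") + c + "'");
    ++i;
}

// Parses a single tag such as "< elem2 attr1 = \"value1\" >".
TagInfo parseTag(const std::string &s) {
    std::size_t i = 0;
    TagInfo tag;
    expect(s, i, '<');
    skipSpace(s, i);
    if (i < s.size() && s[i] == '/') { tag.isEndTag = true; ++i; skipSpace(s, i); }
    tag.name = readName(s, i);
    for (;;) {
        skipSpace(s, i);
        if (i < s.size() && s[i] == '>') break;
        if (i < s.size() && s[i] == '/') {  // empty element: <name/>
            ++i;
            skipSpace(s, i);
            expect(s, i, '>');
            tag.isEmpty = true;
            break;
        }
        const std::string key = readName(s, i);  // throws at end of input
        skipSpace(s, i);
        expect(s, i, '=');
        skipSpace(s, i);
        expect(s, i, '"');
        const std::size_t close = s.find('"', i);
        if (close == std::string::npos)
            throw std::runtime_error("unterminated attribute value");
        tag.attributes[key] = s.substr(i, close - i);
        i = close + 1;
    }
    return tag;
}
```

With this sketch, `parseTag("< elem2 attr1 = \"value1\" >")` and `parseTag("<elem2 attr1=\"value1\">")` yield the same name and attribute, which is the behavior the example document implies.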
Any XML parser needs to present its functionality via an API of some kind. There are two popular styles: tree-based and callback-based. You will design and implement both.
In both cases, your API only needs to support accessing an XML document. It does not need to support constructing the actual objects in the API. Of course, there must still be some way of creating those objects, but it can consist of private methods and/or public methods known only to your parser implementation.
Tree-based APIs are essentially data structures for tree-like data. They present the XML document as a tree of nodes. They provide a way of representing the document tree, and a way of traversing it. Note that an element may have two kinds of child nodes. The first kind is character content, which would be a string. The second kind is a child element. These kinds may be mixed within the children of a single element.
Elements may also have attributes. Some way should be provided to access the set of attributes of each element.
Your API should also provide some way to access the parent of a given element.
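As one possible shape for such a tree, here is a minimal sketch with mixed text and element children, an attribute map, and a parent link. The class and method names are illustrative only, and the public `append` and `setAttribute` helpers are assumptions made so the example can build a tree by itself. A real design could hide construction from the user.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Element;

// Base class for both kinds of child node.
class Node {
public:
    virtual ~Node() = default;
    const Element *parent() const { return parent_; }
private:
    friend class Element;  // lets Element set the parent link when appending
    const Element *parent_ = nullptr;
};

// Character content.
class Text : public Node {
public:
    explicit Text(std::string s) : text_(std::move(s)) {}
    const std::string &str() const { return text_; }
private:
    std::string text_;
};

// An element with attributes and an ordered list of mixed children.
class Element : public Node {
public:
    explicit Element(std::string name) : name_(std::move(name)) {}
    const std::string &name() const { return name_; }
    const std::map<std::string, std::string> &attributes() const { return attrs_; }
    const std::vector<std::unique_ptr<Node>> &children() const { return children_; }

    // Hypothetical construction helpers (a real API might keep these private).
    void setAttribute(const std::string &k, const std::string &v) { attrs_[k] = v; }
    Node *append(std::unique_ptr<Node> child) {
        child->parent_ = this;
        children_.push_back(std::move(child));
        return children_.back().get();
    }
private:
    std::string name_;
    std::map<std::string, std::string> attrs_;
    std::vector<std::unique_ptr<Node>> children_;
};

// Traversal example: concatenates all text content in document order.
std::string collectText(const Element &e) {
    std::string out;
    for (const auto &child : e.children()) {
        if (const Text *t = dynamic_cast<const Text *>(child.get()))
            out += t->str();
        else if (const Element *el = dynamic_cast<const Element *>(child.get()))
            out += collectText(*el);
    }
    return out;
}
```

Storing children as `std::unique_ptr<Node>` makes each element the owner of its subtree, while the parent link is a plain non-owning pointer. This is only one of several reasonable ownership choices.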
As an example of what such an API is like, you can look at the DOM (Level 1) API given here. Wikipedia also provides a good introduction.
A problem with tree-based APIs is that they can use a lot of memory, especially for large XML documents. As a solution, people have also designed streaming, callback-based APIs. In this style, the application provides callbacks to the parser, which are then called by the parser when it recognizes specific pieces. No complete representation of the whole XML document ever needs to be in memory at any single point in time.
You should include callbacks to:
Empty elements, as described above, should be handled as if they consisted of a start-tag and end-tag pair with attributes and no content.
Your API will include some class to represent your streaming parser. You also need some way to initialize this parser with the callbacks provided by the application.
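The callback style can be sketched as follows. This is a deliberately reduced illustration, not a suggested design: `StreamHandler`, `EventLog`, and `streamParse` are hypothetical names, attributes are not handled at all, and no nesting checks are performed. It shows the parser calling back into application code as it scans, and an empty element being reported as a start-tag and end-tag pair.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical callback interface supplied by the application.
class StreamHandler {
public:
    virtual ~StreamHandler() = default;
    virtual void onStartTag(const std::string &name) = 0;
    virtual void onContent(const std::string &text) = 0;
    virtual void onEndTag(const std::string &name) = 0;
};

// Example handler that records each event as a short string.
class EventLog : public StreamHandler {
public:
    std::vector<std::string> events;
    void onStartTag(const std::string &name) override { events.push_back("start:" + name); }
    void onContent(const std::string &text) override { events.push_back("text:" + text); }
    void onEndTag(const std::string &name) override { events.push_back("end:" + name); }
};

// Scans attribute-free XML left to right and fires callbacks as it goes.
// No tree is built; only the current tag or text run is held at any time.
void streamParse(const std::string &xml, StreamHandler &h) {
    std::size_t pos = 0;
    while (pos < xml.size()) {
        std::size_t lt = xml.find('<', pos);
        if (lt == std::string::npos) lt = xml.size();
        if (lt > pos) h.onContent(xml.substr(pos, lt - pos));
        if (lt == xml.size()) break;

        const std::size_t gt = xml.find('>', lt);
        if (gt == std::string::npos) throw std::runtime_error("unterminated tag");

        // Assumption: no attributes, so all whitespace inside the tag can be dropped.
        std::string compact;
        for (char c : xml.substr(lt + 1, gt - lt - 1))
            if (!std::isspace(static_cast<unsigned char>(c))) compact += c;
        if (compact.empty()) throw std::runtime_error("empty tag");

        if (compact.front() == '/') {
            h.onEndTag(compact.substr(1));
        } else if (compact.back() == '/') {
            // Empty element: report it as a start-tag immediately followed by an end-tag.
            const std::string name = compact.substr(0, compact.size() - 1);
            h.onStartTag(name);
            h.onEndTag(name);
        } else {
            h.onStartTag(compact);
        }
        pos = gt + 1;
    }
}
```

An application would pass its own `StreamHandler` subclass to `streamParse`. `EventLog` is only there to make the sequence of callbacks visible.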
SAX is the best known example of a callback-based API. Details of SAX can be found here at the official site, and here at Wikipedia.
For the first phase of this project, you will turn in your API designs, one for each style. This should include enough information so that a user can call and use your parsers, and avoid any syntax or runtime errors.
This design will include the C++/Java classes/interfaces that the user will need. It does not need to include private or other methods that the user does not need to call, but that may be required by your implementation. So it will likely not include all methods in those classes or interfaces. It should also include some explanation of what each method does.
Your design should also include error-handling. You may use exceptions or other mechanisms.
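If exceptions are chosen, one option is a dedicated exception type that carries the position of the error. The sketch below is an assumption about what such a type might look like. `XmlParseError` and `checkEndTag` are hypothetical names, and reporting line and column is a design choice rather than a requirement.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type carrying the position of a parse error.
class XmlParseError : public std::runtime_error {
public:
    XmlParseError(const std::string &msg, std::size_t line, std::size_t column)
        : std::runtime_error(msg + " at line " + std::to_string(line) +
                             ", column " + std::to_string(column)),
          line_(line), column_(column) {}
    std::size_t line() const { return line_; }
    std::size_t column() const { return column_; }
private:
    std::size_t line_;
    std::size_t column_;
};

// Hypothetical helper: throws if an end-tag does not match the open element.
void checkEndTag(const std::string &open, const std::string &close,
                 std::size_t line, std::size_t column) {
    if (open != close)
        throw XmlParseError("mismatched end-tag </" + close + ">, expected </" +
                            open + ">", line, column);
}
```

A caller can then catch `XmlParseError` specifically to report the location, or catch `std::exception` to handle it together with other failures.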
E-mail the design document to cs340-internal@cs.binghamton.edu. It should include some rationale and discussion, and the API. I do not expect this document to total more than about 4 pages for both APIs, though yours may turn out to be shorter or longer.
Evaluation will consist of how easy the API is to use, and the clarity and completeness of your document.
For the second phase of this project, you will implement your APIs, after I approve and possibly modify them. Note that you should try to share as much of your code as possible between the two API implementations.
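One common way to share code, offered here only as a sketch, is to write the scanner once in the callback style and build the tree with a handler that listens to those callbacks. All names below (`Events`, `TreeNode`, `TreeBuilder`) are hypothetical, and attributes are omitted to keep the example short.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event interface that a streaming parser would call.
struct Events {
    virtual ~Events() = default;
    virtual void startTag(const std::string &name) = 0;
    virtual void content(const std::string &text) = 0;
    virtual void endTag() = 0;
};

// Hypothetical tree node: value holds the element name, or the text if isText is set.
struct TreeNode {
    bool isText = false;
    std::string value;
    TreeNode *parent = nullptr;  // non-owning link to the enclosing element
    std::vector<std::unique_ptr<TreeNode>> children;
};

// Builds a tree purely from streaming events, so the scanner is written only once.
class TreeBuilder : public Events {
public:
    void startTag(const std::string &name) override {
        std::unique_ptr<TreeNode> node(new TreeNode);
        node->value = name;
        node->parent = current_;
        TreeNode *raw = node.get();
        if (current_) current_->children.push_back(std::move(node));
        else root_ = std::move(node);
        current_ = raw;  // descend into the new element
    }
    void content(const std::string &text) override {
        if (!current_) return;  // this sketch ignores text outside the root element
        std::unique_ptr<TreeNode> node(new TreeNode);
        node->isText = true;
        node->value = text;
        node->parent = current_;
        current_->children.push_back(std::move(node));
    }
    void endTag() override {
        if (current_) current_ = current_->parent;  // climb back to the parent
    }
    // Hands the finished tree to the caller.
    std::unique_ptr<TreeNode> takeRoot() {
        current_ = nullptr;
        return std::move(root_);
    }
private:
    std::unique_ptr<TreeNode> root_;
    TreeNode *current_ = nullptr;
};
```

Under this arrangement the tree-based entry point would run the streaming parser with a `TreeBuilder` and return `takeRoot()`, so both APIs exercise the same scanning code.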
For reference, or if you did not turn in Phase 1, you may refer to the API given below. In Phase 2, you are free to modify your APIs submitted in Phase 1.
class Node
This class is a base class. It is polymorphic, and so will work with RTTI.
class Element inherits from Node
This class represents XML elements. It inherits from the Node class. RTTI can be used to distinguish an Element derived class from a Text derived class.
const std::string &name() const; Returns the name of the element.
const std::vector<const Attribute *> &attributes() const; Returns a vector of attributes.
const std::vector<const Node *> &children() const; Returns a vector of children nodes.
class Text inherits from Node
This class represents XML text content. It inherits from the Node class. RTTI can be used to distinguish an Element derived class from a Text derived class.
const std::string &str() const; Returns the text content as a string.
class Attribute
This class is used to represent an XML attribute.
const std::string &name() const; Returns the name of the attribute.
const std::string &value() const; Returns the value of the attribute.
class XmlHandler
This class is used to define the interface that a SAX-style callback handler must use. A user of this library must inherit from this class, and implement all the functions.
virtual void handleOpenStartTag(const std::string &name); This callback is called by the parser when a start tag is encountered.
virtual void handleAttribute(const Attribute *); This callback is called by the parser when an attribute is encountered.
virtual void handleCloseStartTag(); This callback is called by the parser to indicate that no more attributes for this start-tag will be encountered.
virtual void handleContent(const std::string &text); This callback is called by the parser to indicate that some text content has been encountered.
virtual void handleEndTag(); This callback is called by the parser to indicate that the end of the current element has been encountered.
class Parser
static Element *parseDOM(const std::string &); Parses the XML document given as the string parameter, and returns a pointer to the root element.
static void parseSAX(const std::string &xml, XmlHandler *); Parses the XML document given as the string parameter, and calls callbacks on the XmlHandler object passed in.
To create a DOM-style tree, call the Parser::parseDOM() method, passing in the XML document as a string. This method returns a pointer to the root Element object. Children of the element may be accessed through the Element::children() method. RTTI may be used to determine if a given Node object represents an element or text content.
To parse using SAX-style, call Parser::parseSAX(), passing in the XML document as a string, and a pointer to an XmlHandler object to receive the callbacks.
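As an illustration of how a user might implement the XmlHandler interface, the sketch below writes the document back out as text. The `Attribute` and `XmlHandler` definitions are minimal stand-ins so the example compiles by itself. The public `Attribute` constructor and the pure-virtual declarations are assumptions, and `Serializer` is a hypothetical user class. Because handleEndTag() receives no name, the handler keeps its own stack of open elements.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the reference Attribute class (the constructor is an assumption).
class Attribute {
public:
    Attribute(std::string n, std::string v) : name_(std::move(n)), value_(std::move(v)) {}
    const std::string &name() const { return name_; }
    const std::string &value() const { return value_; }
private:
    std::string name_, value_;
};

// Stand-in for the reference XmlHandler interface (declared pure virtual here).
class XmlHandler {
public:
    virtual ~XmlHandler() = default;
    virtual void handleOpenStartTag(const std::string &name) = 0;
    virtual void handleAttribute(const Attribute *) = 0;
    virtual void handleCloseStartTag() = 0;
    virtual void handleContent(const std::string &text) = 0;
    virtual void handleEndTag() = 0;
};

// Hypothetical user handler that re-serializes the document from the callbacks.
class Serializer : public XmlHandler {
public:
    void handleOpenStartTag(const std::string &name) override {
        out_ << '<' << name;
        open_.push_back(name);  // remember the name for the matching end-tag
    }
    void handleAttribute(const Attribute *a) override {
        out_ << ' ' << a->name() << "=\"" << a->value() << '"';
    }
    void handleCloseStartTag() override { out_ << '>'; }
    void handleContent(const std::string &text) override { out_ << text; }
    void handleEndTag() override {
        if (open_.empty()) return;  // defensive: ignore an unmatched end-tag
        out_ << "</" << open_.back() << '>';
        open_.pop_back();
    }
    std::string result() const { return out_.str(); }
private:
    std::ostringstream out_;
    std::vector<std::string> open_;
};
```

A user would pass a pointer to a `Serializer` as the second argument of Parser::parseSAX() and read `result()` afterwards.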
You should also create test XML documents as part of this second phase. These documents should be designed to test the correctness of parsers. I will collect all test documents, and run all parsers against them. You will receive bonus credit if your test document reveals a bug or defect in another person's parser.
You should submit a compressed tar file containing your code. It should include a makefile, such that after I type make, there will be two executables (or scripts, if you are using Java) in the directory.
The first executable should be named parse_tree, and should parse an XML document using the tree-based API, and print out the XML document in some easily readable format. The second executable should be named parse_callback, and should do the same thing, except using a callback-based API.
Both executables should take one argument, which is the name of the file containing the XML document to parse.
You should maintain a clear separation in your code between your parser, and the driver programs. Thus, it should be easy for you to package your parser as a library.
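A driver that respects this separation can be very small. The sketch below covers only the part that belongs in the driver: checking the single argument and reading the file. `readFile` and `runDriver` are hypothetical names, the exit codes are an arbitrary choice, and the call into the parser library is left as a placeholder comment.

```cpp
#include <exception>
#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads a whole file into a string; throws if the file cannot be opened.
std::string readFile(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Hypothetical driver body; a real main() would simply return runDriver(argc, argv).
int runDriver(int argc, const char *const argv[]) {
    if (argc != 2) {
        std::cerr << "usage: " << (argc > 0 ? argv[0] : "parse_tree")
                  << " <xml-file>\n";
        return 2;
    }
    try {
        const std::string xml = readFile(argv[1]);
        // Placeholder: call the parser library here and print the document.
        std::cout << "read " << xml.size() << " bytes\n";
        return 0;
    } catch (const std::exception &e) {
        std::cerr << "error: " << e.what() << '\n';
        return 1;
    }
}
```

Keeping file handling and argument checking in the driver means the parser itself only ever sees a string, which makes it straightforward to package as a library.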
Your tar file should also contain a directory named test_docs which contains a set of test documents.
You should also include a file named report.txt to give comments and other information. (Other formats are also acceptable, in which case you may have a different extension.) While there is nothing that this document strictly needs to include, you may have things about your submission that you wish to let me know, etc. Basically, this document is a chance for you to enhance your grade by telling me things that may not be obvious from running your code.
Evaluation will be based on design (about 40%), correctness (about 40%), and performance (about 20%). This division of points is not completely rigid, however. There is often a tradeoff between design and performance. So, I will take into consideration that a highly encapsulated, flexible design may have poorer performance. Likewise, I will take into consideration that trying to squeeze the last bit of performance out of some code may compromise the elegance of the design.
The point division between parts one and two does not necessarily reflect only the relative importance I place on the two, but also the amount of work involved.
Your submission will be partially graded by scripts, so you should follow the instructions and formats exactly. Submissions that do not conform to these requirements will not receive credit.