Zlex: A Scanner Generator

Last Update Time-stamp: "97/06/29 16:04:46 zdu"

zlex is a lex-compatible scanner generator. I have used versions of zlex internally for the past 3-4 years, and students have used it in two compiler construction classes in 1996 and 1997. The major features which zlex offers over comparable scanner generators like lex or flex are the following:

16-bit character support
zlex supports the generation of scanners which process 16-bit character input. Unfortunately, due to the limitations of current editors, zlex still requires its own source file to be specified using 8-bit characters. Hence 16-bit characters need to be specified using their character codes (possibly encapsulated within macros). This feature has not yet been used in a real project.

With this support for 16-bit characters, it should be possible to use zlex to build Unicode scanners, even though zlex does not know anything about Unicode per se.

Intra-token patterns
These are patterns which can be recognized within other tokens. Intra-token patterns are useful for doing the pre-lexical processing required by some programming languages (for example, the deletion of a `\' followed by a newline character in C). Intra-token patterns are used internally by zlex to keep track of line numbers (the yylineno variable of lex) without having to examine every input character.
Column numbers
zlex supports obtaining the column number of the current token (in addition to the undocumented yylineno feature of lex). The method used does not require the generated scanner to test each incoming character to see if it is a newline.
Character count
It is possible to access the count of the number of characters read from the current source file.
Sharing of code among multiple scanners
Much of the code required for a zlex scanner is linked in from the zlex library. This library code can be shared among multiple scanners. The only code unique to each scanner will be a relatively small main scanner function and possibly several auxiliary functions (this will be in addition to several large data tables which will be unique to each scanner).
Ambiguous right-context patterns
Unlike other scanner generators, zlex can handle ambiguous right context, where the pattern to be matched overlaps with the trailing context. The worst-case complexity of the method used to identify such ambiguous trailing context can be quadratic.
Interactive scanners
As long as stdio input functions are not used, all zlex-generated scanners can operate interactively without any performance degradation.
Code scanners
zlex supports the generation of directly encoded scanners in addition to the more conventional scanners which interpret tables. At this point, this option does not appear particularly useful.
Whitespace within patterns
There is an option which makes zlex more tolerant of spaces and comments in the zlex file. This allows the zlex programmer to format patterns so that they are more readable.
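
To make the pre-lexical processing mentioned under intra-token patterns more concrete, here is a minimal C sketch of the job itself: deleting each `\' followed by a newline before the tokens are recognized. This is not zlex code and not how zlex does it; the function name splice_lines is hypothetical, and the sketch runs as a separate pass over a buffer, which is exactly the extra step that an intra-token pattern is meant to avoid.

```c
#include <stddef.h>

/* Hypothetical helper (not part of zlex): delete every backslash-newline
 * pair from the NUL-terminated buffer `buf`, in place.
 * Returns the number of pairs removed, i.e. the number of physical
 * newlines the scanner proper will no longer see, so that a caller can
 * still keep an accurate line count. */
size_t splice_lines(char *buf)
{
    size_t removed = 0;
    char *src = buf;   /* next character to examine */
    char *dst = buf;   /* next position to write a kept character */

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;          /* skip the backslash and the newline */
            removed++;
        } else {
            *dst++ = *src++;   /* keep every other character */
        }
    }
    *dst = '\0';
    return removed;
}
```

Given a buffer in which the keyword int is split as "in", `\', newline, "t", the call leaves "int" in the buffer and returns 1. The returned count is what a caller would add back to its line number, since those newlines never reach the token patterns.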

Unfortunately, I am still not satisfied with the performance of directly encoded scanners generated by zlex. I will release a public version of zlex once I get to the bottom of this problem.


Feedback: Please email any feedback to zdu@acm.org.
