Recursive Descent Parsing
Revision History

2014-03-07
Andrew Duncan cleverly noticed the productions in Example 2 left out the right-recursive terms even though the code examples parsed them. This error was carried over from the original May 2001 column and had gone unnoticed until today. I've updated the right-recursive grammar to include the missing terms. Thanks Andrew!

2012-05-19
Converted and edited 2001 version of article for Web.
This article was published originally in the May 2001 issue of Java Pro with the title “Express Yourself.” That column was especially popular. Over the years, I've run into code in programs copied from it verbatim. Please don't do that. Sample code is for learning, not production use—and you don't learn much from copying. Use what you learn to write your own original and improved code.
The techniques described in this article form the basis for the Vareos Calculator application, a scientific calculator implemented in JavaScript and SVG. You can find an expanded version of the expression grammar from this article in the help page for Calculator.
Changes. I have updated the code to use Java features that didn't exist at the time. Specifically, the code now uses enumerations, generics, CharSequence instead of String for input, the so-called “enhanced for loop,” and java.util.regex instead of jakarta-oro for regular expressions. Also, the exponentiation operator has changed from ** to ^.
XML Isn't Always the Answer
Even though XML provides a versatile information representation structure, it is not appropriate for use by all applications. Sometimes information is expressed more appropriately in a form as close as possible to its natural representation. For example, even though mathematical expressions can be encoded in XML, it is more natural to define a grammar for parsing mathematical expressions if you are writing a calculator or spreadsheet program.
When XML isn't the right tool, you must assume the burden of parsing and interpreting information, rather than use one of many available XML parsers.[1] Even with an XML parser, you must interpret the meaning of what has been parsed relative to your application's semantics. Before you attempt to write a parser, you need to understand the structure of the information you are working with. The information may be passive data, requiring conversion into program data structures, or it may be active data (e.g., a spreadsheet macro), encoding a set of instructions requiring interpretation. In either case, the information must be translated into another form.
Context-Free Grammars
An information stream that a program parses can be thought of as a language, and the specification of its structure as a syntax. You can define the syntax of many languages using a context-free grammar. Context-free grammars were first applied to describing computer languages when John Backus drafted the Algol 60 specification. As a historical note, Algol is the origin of many of the syntactic structures present in the C language and its syntactic brethren, including Java. The notation Backus used later became known as Backus-Naur Form (BNF) in honor of both Backus's and Naur's contributions to the Algol 60 report.
BNF defines a set of productions consisting of a non-terminal followed by a combination of terminals and non-terminals. A terminal is an atomic token, like a parenthesis. A non-terminal is a language construct that expands ultimately into a series of tokens, such as a mathematical expression.
Non-terminals and Operators

    expression ::= expression + term
                 | expression - term
                 | term

    term       ::= term * factor
                 | term / factor
                 | factor

    factor     ::= number ^ factor
                 | number
                 | ( expression )

Terminals

    number     ::= [0-9]+(\.[0-9]+)?
The preceding grammar defines a syntax for simple mathematical expressions, including addition, subtraction, multiplication, division, exponentiation, and parenthetical grouping. The ::= symbol separates the left side of the production from the right side. The | symbol separates possible expansions of a production. In the example grammar, operators of equal precedence occur in the same production. Operators of lower precedence, such as addition and subtraction, occur in productions that are expanded first. Productions are expanded beginning with a non-terminal designated as the start symbol, which is, by convention, defined by the first production (in this case, expression). The number terminal expands to a regular expression that matches number literals.[2]
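As an added illustration of how precedence falls out of the grammar, 1+2*3 can be derived as follows (expanding one non-terminal per step and writing number literals directly):

    expression → expression + term
               → term + term
               → factor + term
               → 1 + term
               → 1 + term * factor
               → 1 + factor * factor
               → 1 + 2 * 3

The multiplication appears deeper in the derivation than the addition, so it binds more tightly.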
The presented grammar has one problem that makes it difficult to use for parsing: the recursion present in the productions for expression and term is left-recursive. A parser has no way of predicting which production to apply at any given stage, making it possible to enter an infinite loop. For a parser to predict which production to apply, it must be able to unambiguously choose a production based on an input token.
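To see why, consider transcribing the left-recursive expression production directly into code (a sketch, not the article's code):

    // expression ::= expression + term | expression - term | term
    // A literal transcription expands the leftmost non-terminal first:
    void parseExpression() {
        parseExpression(); // recurses before consuming any input;
                           // in practice this ends in a StackOverflowError
    }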
Left-recursive productions of the form

    a ::= a b
        | c

can be transformed into right-recursive productions of the following form:

    a ::= c d
    d ::= b d
        | <empty>

The empty expansion after the bar in the second production represents the empty string. Using this transformation rule, the example grammar can be transformed into the following right-recursive grammar:

    expression  ::= term moreterms

    moreterms   ::= + term moreterms
                  | - term moreterms
                  | <empty>

    term        ::= factor morefactors

    morefactors ::= * factor morefactors
                  | / factor morefactors
                  | <empty>

    factor      ::= number ^ factor
                  | number
                  | ( expression )
Tokenization
The right-recursive grammar is well-suited to predictive parsing because each input token indicates unambiguously which production should be applied. Before parsing an expression with this grammar, the expression's constituent elements must be identified. This process is called tokenization: the act of converting an input stream into a sequence of tokens.
Figure 1 presents a Tokenizer class for performing relatively general tokenization based on a set of regular expressions. setInput() defines the input for Tokenizer. The Tokenizer constructor initializes a list of TokenType (see Figure 8) instances, each of which associates a lexeme with a token. A lexeme is an instance of a token in the input. For example, both + and - are arithmetic operators of equal precedence and therefore represent the same type of token, but each is a different lexeme.
Regular expressions are a relatively compact and flexible way of denoting tokens; they are a standard part of lexical analysis tools such as Lex. Even though regular expressions may be overkill for tokenizing mathematical expressions, I didn't want to hard-code lexemes into Tokenizer. I wanted to make it easy to represent tokens such as floating-point numbers, which would otherwise require a separate grammar definition, placing their extraction in the parsing instead of the tokenization process. Therefore, the lexemes corresponding to each TokenType are encoded in a regular expression stored in each TokenType instance.

Tokenizer extracts tokens from the input stream one at a time via the nextToken() method. First, the method skips over any token separators (by default, whitespace). Then it attempts to match each token type pattern, starting from the current input offset, until it finds a match. It is possible to implement the method more efficiently, so that only one match attempt is needed per method call; doing so, however, would make the method harder to understand as an example.
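The article's Tokenizer appears in Figure 1. The following is only a rough, self-contained sketch of the same regex-driven scheme; the patterns, names, and list-of-strings output are my own simplifications, not the Figure 1 code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class SimpleTokenizer {
        // Order matters: earlier patterns are tried first.
        private static final Pattern[] PATTERNS = {
            Pattern.compile("[0-9]+(\\.[0-9]+)?"), // number (no sign; see note [2])
            Pattern.compile("[-+*/^()]")           // operator or parenthesis
        };
        private static final Pattern SEPARATOR = Pattern.compile("\\s+");

        static List<String> tokenize(CharSequence input) {
            List<String> tokens = new ArrayList<>();
            int offset = 0;
            while (offset < input.length()) {
                // Skip token separators (whitespace by default).
                Matcher sep = SEPARATOR.matcher(input);
                sep.region(offset, input.length());
                if (sep.lookingAt()) offset = sep.end();
                if (offset >= input.length()) break;
                boolean matched = false;
                // Try each token type pattern at the current offset.
                for (Pattern p : PATTERNS) {
                    Matcher m = p.matcher(input);
                    m.region(offset, input.length());
                    if (m.lookingAt()) { // match must start at the offset
                        tokens.add(m.group());
                        offset = m.end();
                        matched = true;
                        break;
                    }
                }
                if (!matched) throw new IllegalArgumentException(
                    "unrecognized input at offset " + offset);
            }
            return tokens;
        }
    }

For example, SimpleTokenizer.tokenize("1 + 2*3") returns the tokens 1, +, 2, *, and 3.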
The Token class in Figure 2 represents a token as a token value and a token type. The MathTokenType class in Figure 3 is both an enumeration of token types (each token type stores a corresponding regular expression pattern matching instances of the type) and a container for lexeme definitions.
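Figure 3 contains the real class; the shape of such an enumeration might look like this (the constant names and patterns are illustrative assumptions):

    import java.util.regex.Pattern;

    // An illustrative token-type enumeration, not the article's MathTokenType.
    enum TokenTypeSketch {
        NUMBER("[0-9]+(\\.[0-9]+)?"),
        OPERATOR("[-+*/^]"),
        LEFT_PAREN("\\("),
        RIGHT_PAREN("\\)");

        final Pattern pattern; // matches this token type's lexemes

        TokenTypeSketch(String regex) { this.pattern = Pattern.compile(regex); }
    }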
The MathOperator class in Figure 4 defines an enumeration of operators, a mapping between lexemes and operators, and an evaluation function for each operator.
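Figure 4 holds the actual MathOperator; a minimal sketch of the same idea pairs each lexeme with an evaluation function via constant-specific method bodies (the names here are assumptions):

    // An illustrative operator enumeration, not the article's MathOperator.
    enum Op {
        ADD("+")      { double apply(double a, double b) { return a + b; } },
        SUBTRACT("-") { double apply(double a, double b) { return a - b; } },
        MULTIPLY("*") { double apply(double a, double b) { return a * b; } },
        DIVIDE("/")   { double apply(double a, double b) { return a / b; } },
        POWER("^")    { double apply(double a, double b) { return Math.pow(a, b); } };

        final String lexeme;

        Op(String lexeme) { this.lexeme = lexeme; }

        // Each constant supplies its own evaluation function.
        abstract double apply(double a, double b);

        // Map a lexeme back to its operator.
        static Op fromLexeme(String s) {
            for (Op op : values()) {
                if (op.lexeme.equals(s)) return op;
            }
            throw new IllegalArgumentException("unknown operator: " + s);
        }
    }

With this shape, Op.fromLexeme("^").apply(2, 3) evaluates to 8.0.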
Parsing and Evaluation
Figure 5 shows how to put the tokenization classes together to parse an expression and convert it to an intermediate representation for later evaluation. Even though it is possible to compute the value of an expression as part of the parsing process, I wanted to demonstrate the separation between parsing and evaluation present in systems such as scripting languages, which compile a program into intermediate bytecode that is interpreted later.
MathParser applies recursive descent parsing to convert an infix mathematical expression to a postfix expression stored in a stack. Postfix expressions have only one possible interpretation, making them easy to evaluate with MathEvaluator (see Figure 6). MathEvaluator does not attempt to be efficient; instead, it demonstrates how postfix expressions can be evaluated easily. Operators are popped off the stack and their operands are evaluated before applying the operator and pushing the result back onto the stack.
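The real MathEvaluator is in Figure 6; this sketch shows only the recursive scheme just described, assuming (as in the parser sketch below) a stack of Double operands and Character operators:

    import java.util.Deque;

    class PostfixSketch {
        // Pop the top item; a number evaluates to itself, while an operator
        // recursively evaluates its two operands. The right operand was
        // pushed last, so it comes off the stack first.
        static double eval(Deque<Object> stack) {
            Object top = stack.pop();
            if (top instanceof Double) {
                return (Double) top;
            }
            double right = eval(stack);
            double left = eval(stack);
            switch ((Character) top) {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                case '/': return left / right;
                case '^': return Math.pow(left, right);
                default: throw new IllegalStateException("unknown operator: " + top);
            }
        }
    }

For 1-2+3 the stack holds, from top to bottom, + 3 - 2 1; eval() pops +, evaluates the right operand to 3 and the left operand (1 - 2) to -1, and returns 2.0.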
The recursive descent parsing algorithm used by MathParser is a form of top-down parsing. Top-down parsing applies productions to its input, starting with the start symbol and working its way down the chain of productions, creating a parse tree defined by the sequence of recursive non-terminal expansions. Parsing starts at the top of the tree and works its way down to the leaves, where terminal symbols (in our example, numbers) reside, terminating the recursion.
Recursive descent parsers are straightforward to implement once you have defined a right-recursive grammar. All you have to do is write a function for each production. The functions mirror the productions exactly; they decide how to expand a non-terminal based on a lookahead token. The lookahead token is provided by Tokenizer. For example, parseExpression() directly calls parseTerm() because term is the first non-terminal in the production. Rather than write a separate function for moreterms, I've inlined the function. If the lookahead token is null, moreterms is expanded to the empty string; if it is an additive operator, it is expanded based on the presence of the operator. Eventually, either an unexpected token will be encountered, causing a ParseException (see Figure 10), or the expression will be parsed successfully.
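The complete MathParser appears in Figure 5. To make the pattern concrete, here is a self-contained sketch of the same scheme over the right-recursive grammar; the names are mine, the input is pre-tokenized, and the inlined moreterms and morefactors expansions are written as loops, which is equivalent to their tail recursion:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    class ParserSketch {
        private final List<String> tokens; // e.g. ["1", "+", "2", "*", "3"]
        private int pos;                   // index of the lookahead token
        final Deque<Object> postfix = new ArrayDeque<>();

        ParserSketch(List<String> tokens) { this.tokens = tokens; }

        private String lookahead() {
            return pos < tokens.size() ? tokens.get(pos) : null;
        }

        // expression ::= term moreterms, with moreterms inlined as a loop
        void parseExpression() {
            parseTerm();
            while ("+".equals(lookahead()) || "-".equals(lookahead())) {
                char op = tokens.get(pos++).charAt(0);
                parseTerm();
                postfix.push(op); // operator follows its operands in postfix
            }
        }

        // term ::= factor morefactors, with morefactors inlined as a loop
        void parseTerm() {
            parseFactor();
            while ("*".equals(lookahead()) || "/".equals(lookahead())) {
                char op = tokens.get(pos++).charAt(0);
                parseFactor();
                postfix.push(op);
            }
        }

        // factor ::= number ^ factor | number | ( expression )
        void parseFactor() {
            String token = lookahead();
            if (token == null) {
                throw new IllegalStateException("unexpected end of input");
            }
            if ("(".equals(token)) {
                pos++;              // consume "("
                parseExpression();
                if (!")".equals(lookahead())) {
                    throw new IllegalStateException("expected )");
                }
                pos++;              // consume ")"
            } else {
                pos++;              // consume the number
                postfix.push(Double.valueOf(token));
                if ("^".equals(lookahead())) {
                    pos++;          // consume "^"
                    parseFactor();  // right recursion makes ^ right-associative
                    postfix.push('^');
                }
            }
        }
    }

Parsing 1+2*3 with parseExpression() leaves 1 2 3 * + on the postfix stack, with + on top, ready for the evaluator sketched earlier.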
If you were writing a calculator program with partial result evaluation, you would evaluate the expression while you were parsing it. MathParser instead translates the infix expression to a postfix representation for later evaluation, a technique more appropriate for a command-line calculator such as the POSIX bc command. It performs this translation by pushing operators onto a stack after a non-terminal has been expanded, and pushing numbers onto the stack when they are encountered at the leaves of the parse tree. The expression 1+2*3 becomes 1 2 3 * + and 1-2+3 becomes 1 2 - 3 +. If a LISP-like prefix notation were the target of the translation, operators would be pushed onto the stack before a non-terminal was expanded, converting 1+2*3 into + 1 * 2 3.
The ParseExample driver program in Figure 7 combines MathParser and MathEvaluator to calculate mathematical expressions given on the command line. A sample run verifies that operator precedence is enforced properly:
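(The output format below is illustrative; it assumes ParseExample echoes each command-line expression with its computed value, which is not taken from Figure 7.)

    $ java ParseExample "1+2*3" "(1+2)*3" "2^3^2"
    1+2*3 = 7.0
    (1+2)*3 = 9.0
    2^3^2 = 512.0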
Parser Generators
For complicated grammars, writing your own recursive descent parser can require a lot of work. In addition, not all grammars can be transformed for recursive descent parsing, and complex inputs can produce deep recursion that exhausts stack space. More efficient ways to parse languages exist, based on finite state machines and transition tables. For anything but the simplest grammars, you will want a parser generator: a tool that generates a parser for you from a grammar definition. Perhaps the most well-known tool for generating parsers in C is YACC (Yet Another Compiler Compiler). Similar tools, such as JavaCC and ANTLR,[3] are available for Java. They are both able to parse a class of grammars called LL(k) and differ fundamentally from YACC in how they generate parsers.
I've glossed over many details. Studying the sample code should fill in the gaps. For a more detailed treatment of the topic, you may wish to read Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman (Addison-Wesley, 1986).
Support Classes
[1] At the time the original article was written, XML was being overused and misused for every data-representation need. Today, programmers aren't as wedded to XML as back then (e.g., JSON has become widely accepted), making the introductory paragraphs less apropos of the times.
[2] Regular expressions are used to extract terminal tokens from the input stream. The regular expression for number doesn't match negative numbers. Negative numbers can be handled by adding a unary negation operator to the grammar.
[3] In 2001, these were popular tools. Other tools may be more popular today.