Day 4 – Parsing with Grammars (Book Extract)

The following is an extract from the book Parsing with Perl 6 Regexes and Grammars: A Recursive Descent into Parsing, by Moritz Lenz, published 2017 by Apress Media, LLC. Reprinted with permission.

This book is being published right now. At least the ebook version should become available for purchase this month, and the print version can be pre-ordered from Amazon. It should ship in January 2018 at the latest, but with a bit of luck, it’ll be available by Christmas.

Below you can find chapter 9, Parsing with Grammars. The previous chapters discuss the building blocks of regexes in great detail, how regexes interact with Perl 6 code, match objects, regex mechanics, common regex techniques, and reusing and composing regexes. You can acquire some of this background by reading the official documentation on regexes.

Later chapters cover action classes and objects, how to report high-quality parse errors, Unicode support, and finally three case studies.

And now, enjoy!


Grammars are the proverbial Swiss-army chain saw[1] for parsing.

In this chapter, we will explore them in more detail. Most importantly, we will discuss how to harness their power.

Understanding Grammars

Grammars implement a top-down approach to parsing. The entry point, usually the regex TOP, knows about the coarse-grained structure and calls further regexes that descend into the gory details. Recursion can be involved too. For example, if you parse a mathematical expression, a term can be an arbitrary expression inside a pair of parentheses.

This is a top-down structure, or more precisely a recursive descent parser. If no backtracking is involved, we call it a predictive parser, because at each position in the string, we know exactly what we’re looking for — we can predict what the next token is going to be (even if we can only predict that it might be one of a set of alternatives).

The resulting match tree corresponds in structure exactly to the call structure of regexes in the grammar. Let’s consider parsing a mathematical expression that only includes the operators *, +, and parentheses for grouping:

grammar MathExpression {
    token TOP    { <sum> }
    rule sum     { <product>+ % '+' }
    rule product { <term>+ % '*' }
    rule term    { <number> | <group> }
    rule group   { '(' <sum> ')' }
    token number { \d+ }
}

say MathExpression.parse('2 + 4 * 5 * (1 + 3)');

From the grammar itself you can already see the potential for recursion: sum calls product, which calls term, which calls group, which calls sum again. This allows parsing of nested expressions of arbitrary depth.

Parsing the example above produces the following match object:

⌜2 + 4 * 5 * (1 + 3)⌟
 sum => ⌜2 + 4 * 5 * (1 + 3)⌟
  product => ⌜2 ⌟
   term => ⌜2 ⌟
    number => ⌜2⌟
  product => ⌜4 * 5 * (1 + 3)⌟
   term => ⌜4 ⌟
    number => ⌜4⌟
   term => ⌜5 ⌟
    number => ⌜5⌟
   term => ⌜(1 + 3)⌟
    group => ⌜(1 + 3)⌟
     sum => ⌜1 + 3⌟
      product => ⌜1 ⌟
       term => ⌜1 ⌟
        number => ⌜1⌟
      product => ⌜3⌟
       term => ⌜3⌟
        number => ⌜3⌟

If you want to know how a particular number was parsed, you can follow the path backwards by looking for lines above the current line that are indented less; for instance, the number 1 was parsed by token number, called from term, called from product, and so on.

We can verify this by raising an exception from token number:

    token number {
        (\d+)
        { die "how did I get here?" if $0 eq '1' }
    }

This indeed shows the call chain in the backtrace, with the most immediate context at the top:

how did I get here?
  in regex number at bt.p6 line 9
  in regex term at bt.p6 line 5
  in regex product at bt.p6 line 4
  in regex sum at bt.p6 line 3
  in regex group at bt.p6 line 6
  in regex term at bt.p6 line 5
  in regex product at bt.p6 line 4
  in regex sum at bt.p6 line 3
  in regex TOP at bt.p6 line 2
  in block <unit> at bt.p6 line 13

This grammar only uses tokens and rules, so there is no backtracking involved, and the grammar is a predictive parser. This is fairly typical. Many grammars work fine without backtracking, or with backtracking in just a few places.

Recursive Descent Parsing and Precedence

The MathExpression grammar has two rules which are structurally identical:

    rule sum     { <product>+ % '+' }
    rule product { <term>+ % '*' }

Instead, we could have written

rule  expression { <term>+ % <operator> }
token operator   { '*' | '+' }

or even used the proto token construct discussed in the previous chapter to parse different operators. The reason I chose the first, more repetitive, approach is that it makes the match structure correspond to the precedence of the operators * and +.

When evaluating the mathematical expression 1 + 2 * 5, mathematicians and most programming languages evaluate the 2 * 5 first, because the * operator has tighter precedence than +. The result is then substituted back into the expression, leading to 1 + 10, and finally 11 as the result.

When parsing such expressions with the first version of the grammar, the structure of the parse tree expresses this grouping: it has — as the top level — a single sum, with the operands being 1 and 2 * 5.
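
We can observe this grouping directly in the match object. Here is a small sketch, using the MathExpression grammar from above; the comments show the results I would expect from the structure described:

my $m = MathExpression.parse('1 + 2 * 5');
say $m<sum><product>.elems;               # 2: the two operands of +
say ~$m<sum><product>[1];                 # 2 * 5
say $m<sum><product>[1]<term>.elems;      # 2: the two operands of *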

This comes at a cost: we need a separate rule and name for each precedence level, and the nesting of the resulting match object has at least one level per precedence level. Furthermore, adding more precedence levels later on is not trivial, and very hard to do in a generic way. If you are not willing to accept these costs, you can instead use the flat model with a single token for parsing all operators. If you then need the structure in a way that reflects precedence, you can write code that transforms the list into a tree. This is commonly called an operator precedence parser.

Left Recursion and Other Traps

To avoid infinite recursion, you have to take care that each possible recursion cycle advances the cursor position by at least one character. In the MathExpression grammar, the only possible recursion cycle is sum → product → term → group → sum, and group can only match if it consumes an initial open parenthesis, (.

If recursion does not consume a character, it is called left recursion and needs special language support that Perl 6 does not offer. A case in point is

token a { <a>? 'a' }

which could match the same input as the regex a+, but instead loops infinitely without progressing.
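
The same strings can be matched without left recursion by consuming a character before recursing:

token a { 'a' <a>? }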

A common technique to avoid left recursion is to have a structure where you can order regexes from generic (here sum) to specific (number). You only have to be careful and check for consumed characters when a regex deviates from that order (e.g. group calling sum).

Another potential source of infinite loops is when you quantify a regex that can match the empty string. This can happen when parsing a language that actually allows something to be empty. For instance, in UNIX shells, you can assign variables by potentially leaving the right-hand side empty:

VAR1=value
VAR2=

When writing a grammar for UNIX shell commands, it might be tempting to write a token string { \w* } that would potentially match an empty string. In a situation that allows for more than one string literal, <string>+ can then hang, because the effective regex, [\w*]+, tries to match a zero-width string infinitely many times.

Once you are aware of the problem, the solution is pretty simple: change the token to not allow an empty string (token string { \w+ }), and explicitly take care of situations where an empty string is allowed:

    token assignment {
        <variable> '=' <string>?
    }
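
To illustrate, here is a toy grammar built along those lines; the grammar name Shellish and the TOP, variable, and string tokens are invented for this sketch:

grammar Shellish {
    token TOP        { <assignment>+ % \n \n? }
    token assignment { <variable> '=' <string>? }
    token variable   { \w+ }
    token string     { \w+ }
}

say so Shellish.parse("VAR1=value\nVAR2=");    # True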

Starting Simple

Even though a grammar works from the top down, developing a grammar works best from the bottom up. It is often not obvious from the start what the overall structure of a grammar will be, but you usually have a good idea about the terminal tokens: those that match text directly without calling other subrules.

In the earlier example of parsing mathematical expressions, you might not have known from the start how to arrange the rules that parse sums and products, but it’s likely that you knew you had to parse a number at some point, so you can start by writing:

grammar MathExpression {
    token number { \d+ }
}

This is not much, but it’s also not very complicated, and it’s a good way to get over the writer’s block that programmers sometimes face when challenged with a new problem area. Of course, as soon as you have a token, you can start to write some tests:

grammar MathExpression {
    token number { \d+ }
}

multi sub MAIN(Bool :$test!) {
    use Test;
    plan 2;
    ok MathExpression.parse('1234', :rule<number>),
        '<number> parses 1234';
    nok MathExpression.parse('1+4', :rule<number>),
        '<number> does not parse 1+4';
}

Now you can start to build your way up to more complex expressions:

grammar MathExpression {
    token number { \d+ }
    rule product { <number>+ % '*' }
}

multi sub MAIN(Bool :$test!) {
    use Test;
    plan 5;
    ok MathExpression.parse('1234', :rule<number>),
        '<number> parses 1234';
    nok MathExpression.parse('1+4', :rule<number>),
        '<number> does not parse 1+4';

    ok MathExpression.parse('1234', :rule<product>),
        '<product> can parse a simple number';
    ok MathExpression.parse('1*3*4', :rule<product>),
        '<product> can parse three terms';
    ok MathExpression.parse('1 * 3', :rule<product>),
        '<product> and whitespace';
}

It is worth including whitespace early on in the tests. The example above looks innocent enough, but the last test actually fails: there is no rule that matches the whitespace between the 1 and the *. Adding a space in the regex between the <number> and the + quantifier makes the tests pass again, because the whitespace inserts an implicit <.ws> call.
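
With that extra space, the rule reads:

    rule product { <number> + % '*' }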

Such subtleties are easy to catch if you start really simple and catch them as soon as possible. If instead you give in to the temptation of writing down a whole grammar from top to bottom, you can spend many hours debugging why some seemingly simple thing such as an extra space makes the parse fail.

Assembling Complete Grammars

Once you have written the basic tokens for lexical analysis, you can progress to combining them. Typically the tokens do not parse whitespace at the borders of their matches, so the rules that combine them do that.

In the MathExpression example in the previous section, rule product directly called number, even though we now know that the final version uses an intermediate step, rule term, which can also parse an expression in parentheses. Introducing this extra step does not invalidate the tests we have written for product, because the strings it matched in the early version still match. Introducing more layers happens naturally when you start with a grammar that handles a subset of the language, which you later expand.
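
Such an intermediate stage might have looked like this (a sketch; term does not yet have its <group> alternative):

grammar MathExpression {
    token number  { \d+ }
    rule  term    { <number> }
    rule  product { <term> + % '*' }
}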

Debugging Grammars

There are two failure modes for a regex or a grammar: it can match when it’s not supposed to match (a false positive), or it can fail to match when it’s supposed to match (a false negative). Typically, false positives are easier to understand, because you can inspect the resulting match object and see which regexes matched which part of the string.

There is a handy tool for debugging false negatives: the Grammar::Tracer module. If you load the module in a file containing a grammar, running the grammar produces diagnostic information that can help you find out where a match went wrong.

Note that this is only a diagnostic tool for developers; if you want to give end users better error messages, please read Chapter 11 for improvement suggestions.

You need to install the Perl 6 module Grammar::Debugger, which also contains Grammar::Tracer. If you use the moritzlenz/perl6-regex-alpine docker image, this is already done for you. If you installed Perl 6 via another method, you need to run

zef install Grammar::Debugger

on the command line. If zef is not yet installed, follow the installation instructions on the zef GitHub page.

Let’s look at the Perl 6 module Config::INI by Tadeusz Sośnierz. It contains the following grammar (slightly reformatted here):

grammar INI {
    token TOP {
        ^ <.eol>* <toplevel>? <sections>* <.eol>* $
    }
    token toplevel { <keyval>* }
    token sections { <header> <keyval>* }
    token header   { ^^ \h* '[' ~ ']' $<text>=<-[ \] \n ]>+
                     \h* <.eol>+ }
    token keyval   { ^^ \h* <key> \h* '=' \h* <value>? \h*
                     <.eol>+ }
    regex key      { <![#\[]> <-[;=]>+ }
    regex value    { [ <![#;]> \N ]+ }
    token eol      { [ <[#;]> \N* ]? \n }
}

Suppose we want to understand why it does not parse the following piece of input text:

a = b
[foo]
c: d

So, before the grammar, we insert the line

use Grammar::Tracer;

and after it, add a small piece of code that calls the .parse method of that grammar:

INI.parse(q:to/EOF/);
a = b
[foo]
c: d
EOF

This produces a sizable, but fairly informative piece of output.

Each entry consists of the name of a regex, like TOP or eol (for "end of line"), followed by the indented output of the regexes it calls. After each regex comes a line containing an asterisk (*) and either MATCH, followed by the string segment that the regex matched, or FAIL if the regex failed.

Let’s look at the output piece by piece, even if it comes out in one chunk:

TOP
|  eol
|  * FAIL
|  toplevel
|  |  keyval
|  |  |  key
|  |  |  * MATCH "a "
|  |  |  value
|  |  |  * MATCH "b"
|  |  |  eol
|  |  |  * MATCH "\n"
|  |  |  eol
|  |  |  * FAIL
|  |  * MATCH "a = b\n"
|  |  keyval
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  * MATCH "a = b\n"

This tells us that TOP called eol, which failed to match. Since the call to eol is quantified with *, this does not cause the match of TOP to fail. TOP then calls toplevel, which in turn calls keyval; inside it, key matches the text "a " and value matches "b". The eol regex then matches the newline character, but fails on its second attempt (since there are no two newline characters in a row). This causes the initial keyval token to match successfully. A second call to keyval fails pretty quickly (in its call to key). Then the match of token toplevel concludes successfully, consuming the string "a = b\n".

So far this all looks as expected. Now let’s take a look at the second chunk of output:

|  sections
|  |  header
|  |  |  eol
|  |  |  * MATCH "\n"
|  |  |  eol
|  |  |  * FAIL
|  |  * MATCH "[foo]\n"
|  |  keyval
|  |  |  key
|  |  |  * MATCH "c: d\n"
|  |  * FAIL
|  * MATCH "[foo]\n"

TOP next calls sections, wherein token header successfully matches the string "[foo]\n". Then, keyval calls key, which matches the whole line "c: d\n". Wait, that’s not right, is it? We might have expected key to only match the c. I certainly wouldn’t have expected it to match a newline character at the end. The lack of an equals sign in the input causes the regex engine to never even call regex value. But since keyval is again quantified with the star * quantifier, the match of the calling regex sections succeeds in matching just the header "[foo]\n".

The last part of the Grammar::Tracer output follows:

|  sections
|  |  header
|  |  * FAIL
|  * FAIL
|  eol
|  * FAIL
* FAIL

It’s FAILs from here on. The second call to sections again tries to parse a header, but its next input is still "c: d\n", so it fails. As does the end-of-string anchor $ in token TOP, failing the overall match in method parse.

So we have learned that regex key matched the whole line c: d\n, but since no equals sign (=) follows it, token keyval cannot parse this line. Since no other regex (notably not header) matches it, this is where the match fails.

As you can see from this example run, Grammar::Tracer enables us to pinpoint where a parse failure happens, even though we had to look carefully through its output to locate it. When you run it in a terminal, you automatically get colored output, with FAIL on a red and MATCH on a green background, and regex names standing out in bold white (instead of the usual gray). This makes it easier to scan from the bottom (where a failed match usually leaves a trail of red FAILs) up to the trailing successful matches, and then look in the vicinity of the border between matches and failures.

Since debugging imposes a significant mental burden, and the output from Grammar::Tracer tends to grow quickly, it is generally advisable to reduce the failing input to a minimal specimen. In the case described above, we could have removed the first line of the input string and saved ten lines of Grammar::Tracer output to look through.
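
In this case, that would have meant running just:

INI.parse(q:to/EOF/);
[foo]
c: d
EOF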

Parsing Whitespace and Comments

As mentioned before, the idiomatic way to parse insignificant whitespace is by calling <.ws>, typically implicitly by using whitespace in a rule. The default ws implementation, <!ww>\s*, works well for many languages, but it has its limits.

In a surprising number of file formats and computer languages, there is significant whitespace that <.ws> would just gobble up. These include INI files (where a newline typically indicates a new key/value pair), Python and YAML (where indentation is used for grouping), CSV (where a newline signals a new record), and Makefiles (where indentation is required and must consist of a tab character).

In these cases, it is best practice to override ws in your own grammar to match only insignificant whitespace. Let’s take a look at a second, minimalistic INI parser, independently developed from the one described in the previous section:

grammar INIFile {
    token TOP { <section>* }
    token section {
        <header>
        <keyvalue>*
    }
    rule header {
        '['  <-[ \] \n ]>+ ']' <.eol>
    }
    rule keyvalue {
        ^^
        $<key>=[\w+]
        <[:=]>
        $<value>=[<-[\n;#]>*]
        <.eol>
    }
    token ws { <!ww> \h* }
    token eol {
        \n [\h*\n]*
    }
}

This parses simple INI configuration files like this:

[db]
driver: mysql
host: db01.example.com
port: 122
username: us123
password: s3kr1t

Note how this grammar uses two separate mechanisms for parsing whitespace: a custom ws token that matches only horizontal whitespace (blanks and tabs), and a separate token eol that matches (significant) line breaks. The eol token also gobbles up further lines consisting only of whitespace.

If a language supports comments, and you don’t want them to appear in your parse tree, you can parse them either in your ws token or in eol (or your equivalent thereof). Which one is appropriate depends on where comments are allowed. In INI files, they are only allowed after a key-value pair or on a line of their own, so eol is the fitting place. In contrast, SQL allows comments in every place where whitespace is allowed, so it is natural to parse them in ws:

# comment parsing for SQL:
token ws { <!ww> \s* [ '--' \N* \n ]* }

# comment parsing for INI files:
token eol { [ [ <[#;]> \N* ]? \n ]+ }

Keeping State

Some of the more interesting data formats and languages require the parser to store things (at least temporarily) to be able to correctly parse them. A case in point is the C programming language, and others inspired by its syntax (such as C++ and Java). Such languages allow variable declarations of the form type variable = initial_value, like this:

int x = 42;

This is valid syntax, but only if the first word is a type name. In contrast, the second line here is invalid, because x is not a type:

int x = 42;
x y = 23;

From these examples it is pretty clear that the parser must have a record of all the types it knows. Since users can also declare types in their code files, the parser must be able to update this record.

Many languages also require that symbols (variables, types, and functions) be declared before they are referenced. This too requires the grammar to keep track of what has been declared and what hasn’t. This record of what has been declared (and what is a type or not, and possibly other meta information) is called a symbol table.

Instead of parsing the full C programming language, let’s consider a minimalist language that just allows assignments of lists of numbers and variables to variables:

a = 1
b = 2
c = a, 5, b

If we don’t impose declaration rules, it’s pretty easy to write a grammar:

grammar VariableLists {
    token TOP        { <statement>* }
    rule  statement  { <identifier> '=' <termlist> \n }
    rule  termlist   { <term> * % ',' }
    token term       { <identifier> | <number> }
    token number     { \d+ }
    token identifier { <:alpha> \w* }
    token ws         { <!ww> \h* }
}
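
A quick test confirms that the example input parses; this should print True:

say so VariableLists.parse(q:to/INPUT/);
a = 1
b = 2
c = a, 5, b
INPUT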

Now we demand that variables can only be used after they’ve been assigned to, so that the following input would be invalid, because b is not declared in the second line, where it’s used:

a = 1
c = a, 5, b
b = 2

To maintain a symbol table, we need three new elements: a declaration of the symbol table, some code that adds a variable name to the symbol table when the assignment has been parsed, and finally a check whether a variable has been declared at the time we come across it in a term list:

grammar VariableLists {
    token TOP {
        :my %*SYMBOLS;
        <statement>*
    }
    token ws { <!ww> \h* }
    rule statement {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True }
        '=' <termlist>
        \n
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

In the token TOP, :my %*SYMBOLS; declares a variable. Declarations in regexes start with a colon (:), and end with a semicolon (;). In between they look like normal declarations in Perl 6. The % sigil signals that the variable is a hash — a mapping of string keys to values. The * makes it a dynamic variable — a variable that is not limited to the current scope but also visible to code (or regexes, which are also code) that is called from the current scope. Since this is an unusually large scope, it is customary to write the variable name in CAPITAL LETTERS.

The second part, adding a symbol to the symbol table, happens in the rule statement:

    rule statement {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True }
        '=' <termlist>
        \n
    }

Inside the curly braces is regular (non-regex) Perl 6 code, so we can use it to manipulate the hash %*SYMBOLS. The expression $<identifier> accesses the capture for the variable name[2]. Thus, if this rule parses a variable a, this statement sets %*SYMBOLS{ 'a' } = True.

The placement of the code block is relevant. Putting it before the call to termlist means that the variable is already known when the term list is parsed, so it accepts input like a = 2, a. If we call termlist first, this kind of input is rejected.
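
Two small checks illustrate both cases; the comments show the expected output:

say so VariableLists.parse("a = 2, a\n");    # True: a is declared before <termlist> runs
say so VariableLists.parse("a = b\n");       # False: b has not been declared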

Speaking of rejection: this happens in token variable. Token term now calls the new token variable (previously it called identifier directly), and variable validates that the symbol has already been declared:

    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }

You might remember from earlier examples that <?{ ... }> executes a piece of Perl 6 code, and fails the parse if it returns a false value. If $<identifier> is not in %*SYMBOLS, this is exactly what happens. At this point the non-backtracking nature of tokens is important: if the variable being parsed is abc, and a variable a is in %*SYMBOLS, backtracking would try shorter matches for <identifier> until it hit a, and then succeed[3].

Since %*SYMBOLS is declared in token TOP, you have to duplicate this declaration when you try to call rules other than TOP from outside the grammar. Without a declaration such as my %*SYMBOLS;, a call like

VariableLists.parse('abc', rule => 'variable');

dies with

Dynamic variable %*SYMBOLS not found
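
Declaring the variable in the calling scope avoids the error:

my %*SYMBOLS;
say so VariableLists.parse('abc', rule => 'variable');    # False: abc has not been declared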

Implementing Lexical Scoping with Dynamic Variables

Many programming languages have the concept of a lexical scope. A scope is the area in a program where a symbol is visible. We call a scope lexical if the scope is determined solely by the structure of the text (and not, say, runtime features of the program).

Scopes can typically be nested. A variable declared in one scope is visible in this scope, and in all inner, nested scopes (unless an inner scope declares a variable of the same name, in which case the inner declaration hides the outer).

Coming back to the toy language of lists and assignments, we can introduce a pair of curly braces to denote a new scope, so this is valid:

a = 1
b = 2
{
    c = a, 5, b
}

but the next example is invalid, because it declares b only in an inner scope, and so it is not visible in the outer scope:

a = 1
{
    b = 2
}
c = a, 5, b

To implement these rules in a grammar, we can make use of an important observation: dynamic scoping in a grammar corresponds to lexical scoping in the text it parses. If we have a regex block that parses both the delimiters of a scope and the things inside that scope, its dynamic scope is confined to all of the regexes it calls (directly and indirectly), and that is also the extent of the lexical scope it matches in the input text.

Let’s take a look at how we can implement dynamic scoping:

grammar VariableLists {
    token TOP {
        :my %*SYMBOLS;
        <statement>*
    }
    token ws { <!ww> \h* }
    token statement {
        | <declaration>
        | <block>
    }
    rule declaration {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True; }
        '=' <termlist>
        \n
    }
    rule block {
        :my %*SYMBOLS = CALLERS::<%*SYMBOLS>;
        '{' \n*
            <statement>*
        '}' \n*
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

There are a few changes to the previous version of this grammar: the rule statement has been renamed to declaration, and a new token statement parses either a declaration or a block.

All the interesting bits happen in the block rule. The line :my %*SYMBOLS = CALLERS::<%*SYMBOLS>; declares a new dynamic variable %*SYMBOLS and initializes it with the previous value of that variable. CALLERS::<%*SYMBOLS> looks through the caller, and the caller’s caller, and so on for a variable %*SYMBOLS, and thus looks up the value corresponding to the outer scope. The initialization creates a copy of the hash, such that changes to one copy do not affect the other copies.

Let’s take a look at what happens when this grammar parses the following input:

a = 1
b = 2
{
    c = a, 5, b
}

After the first two lines, %*SYMBOLS has the value {a => True, b => True}. When rule block parses the opening curly bracket on the third line, it creates a copy of %*SYMBOLS. The declaration of c on the fourth line inserts the pair c => True into the copy of %*SYMBOLS. After rule block parses the closing curly brace on the last line, it exits successfully, and the copy of %*SYMBOLS goes out of scope. This leaves us with the earlier version of %*SYMBOLS (with only the keys a and b), which then goes out of scope when TOP exits.
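
Testing both example inputs confirms this behavior; the first call should print True, the second False:

say so VariableLists.parse(q:to/GOOD/);
a = 1
b = 2
{
    c = a, 5, b
}
GOOD

say so VariableLists.parse(q:to/BAD/);
a = 1
{
    b = 2
}
c = a, 5, b
BAD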

Scoping Through Explicit Symbol Tables

Using dynamic variables for managing symbol tables usually works pretty well, but there are some edge cases where a more explicit approach works better. Such edge cases include those where there are so many symbols that copying becomes prohibitively expensive, or where more than the top-most scope must be inspected, or when copying the symbol table is impractical for other reasons.

Consequently, you can write a class for your symbol table (which in the simplest case uses an array as a stack of scopes) and explicitly call methods on it when entering and leaving scopes, when declaring a variable, and for checking whether a variable is known in a scope:

class SymbolTable {
    has @!scopes = {}, ;
    method enter-scope() {
        @!scopes.push({})
    }
    method leave-scope() {
        @!scopes.pop();
    }
    method declare($variable) {
        @!scopes[*-1]{$variable} = True
    }
    method check-declared($variable) {
        for @!scopes.reverse -> %scope {
            return True if %scope{$variable};
        }
        return False;
    }
}

grammar VariableLists {
    token TOP {
        :my $*ST = SymbolTable.new();
        <statement>*
    }
    token ws { <!ww> \h* }
    token statement {
        | <declaration>
        | <block>
    }
    rule declaration {
        <identifier>
        { $*ST.declare( $<identifier> ) }
        '=' <termlist>
        \n
    }
    rule block {
        '{' \n*
            { $*ST.enter-scope() }
            <statement>*
            { $*ST.leave-scope() }
        '}' \n*
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ $*ST.check-declared($<identifier>) }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

The class SymbolTable has the private array attribute @!scopes which is initialized with a list containing a single, empty hash {}. Entering a scope means pushing an empty hash on top of this array, and when leaving the scope it is removed again through the pop method call. A variable declaration adds its name to the top-most hash, @!scopes[*-1].

Checking for the presence of a variable must not just consider the top-most hash, because variables are inherited by inner scopes. Here we go through all the scopes in reverse order, from innermost to outermost. The order of traversal is not relevant for a simple Boolean check, but if you need to look up information associated with the variable, it is important to adhere to this order so that you reference the correct one.
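
We can exercise the class on its own, independently of the grammar; the comments show the expected results:

my $st = SymbolTable.new;
$st.declare('a');
$st.enter-scope();
$st.declare('b');
say $st.check-declared('a');    # True: a is visible from the outer scope
say $st.check-declared('b');    # True: b is declared in the current scope
$st.leave-scope();
say $st.check-declared('b');    # False: b has gone out of scope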

Token TOP creates a new object of class SymbolTable, declaration calls the declare method, and token variable calls method check-declared. The rule block calls enter-scope before parsing the statement list, and leave-scope afterwards. This works, but only if the statement list can be parsed successfully; if not, rule block fails before it manages to call leave-scope.

Perl 6 has a safety feature for such situations: if you prefix a statement with LEAVE, Perl 6 calls it for you at routine exit, in all circumstances where this is possible (even if an exception is thrown). Since the LEAVE phaser only works in regular code and not in regexes, we need to wrap the regex in a method:

    method block {
        $*ST.enter-scope();
        LEAVE $*ST.leave-scope();
        self.block_wrapped();
    }
    rule block_wrapped {
        '{' \n*
            <statement>*
        '}' \n*
    }

Now we have the same robustness as the approach with dynamic variables, and more flexibility to add extra code to the symbol table, at the cost of more code and increased effort.

Summary

Perl 6 grammars are a declarative way to write recursive descent parsers. Without backtracking, they are predictive; at each point, we know what list of tokens to expect.

The recursive nature of grammars comes with the risk of left recursion, a situation where a recursive path does not consume any characters, and so leads to an infinite loop.

In spite of the top-down nature of grammars, writing them typically happens from the bottom up: starting with lexical analysis, and then moving up to parsing larger structures.

Complex languages require additional state for successful and precise parsing. We have seen how you can use dynamic variables to hold state in grammars, how their scope corresponds to lexical scoping in the input, and how symbol tables can be written and integrated into grammars.


  [1] Like a Swiss-army knife, but with much more power.

  [2] At this point it is crucial that identifier does not parse its surrounding whitespace. Hence the principle that tokens do not care about whitespace, and the rules that call those tokens parse the whitespace.

  [3] In this case, this would be harmless, because no other rule could match the rest of the variable, leading to a parse error nonetheless. But in more complicated cases, this kind of unintended backtracking can lead to errors that are very puzzling for the maintainer of the grammar.