Day 4 – Parsing with Grammars (Book Extract)

The following is an extract from the book Parsing with Perl 6 Regexes and Grammars: A Recursive Descent into Parsing, by Moritz Lenz, published 2017 by Apress Media, LLC. Reprinted with permission.

This books is being published right now. At least the ebook version should become available for purchase this month, and print version can be pre-ordered from Amazon. It should ship in January 2018 at the latest, but with a bit of luck, it’ll be available by Christmas.

Below you can find chapter 9, Parsing with Grammars. The previous chapters discuss the building blocks of regexes in great detail, how regexes interact with Perl 6 code, match objects, regex mechanics, common regex techhniques, and reusing and composing regexes. You can acquire some of this background by reading the official documentation on regexes.

Later chapters cover action classes and objects, how to report high-quality parse errors, Unicode support, and finally three case studies.

And now, enjoy!


Grammars are the proverbial Swiss-army chain saw1 for parsing.

In this chapter, we will explore them in more detail. Most importantly, we will discuss how to harness their power.

Understanding Grammars

Grammars implement a top-down approach to parsing. The entry point, usually the regex TOP, knows about the coarse-grained structure and calls further regexes that descend into the gory details. Recursion can be involved too. For example, if you parse a mathematical expression, a term can be an arbitrary expression inside a pair of parentheses.

This is a top-down structure, or more precisely a recursive descent parser. If no backtracking is involved, we call it a predictive parser, because at each position in the string, we know exactly what we’re looking for — we can predict what the next token is going to be (even if we can only predict that it might be one of a set of alternatives).

The resulting match tree corresponds in structure exactly to the call structure of regexes in the grammar. Let’s consider parsing a mathematical expression that only includes the operators *, +, and parentheses for grouping:

grammar MathExpression {
    token TOP    { <sum> }
    rule sum     { <product>+ %  '+' }
    rule product { <term>+ % '*' }
    rule term    { <number> | <group> }
    rule group   { '(' <sum> ')' }
    token number { \d+ }
}

say MathExpression.parse('2 + 4 * 5 * (1 + 3)');

From the grammar itself you can already see the potential for recursion: sum calls product, which calls term, which calls group, which calls sum again. This allows parsing of nested expressions of arbitrary depth.

Parsing the example above produces the following match object:

⌜2 + 4 * 5 * (1 + 3)⌟
 sum => ⌜2 + 4 * 5 * (1 + 3)⌟
  product => ⌜2 ⌟
   term => ⌜2 ⌟
    number => ⌜2⌟
  product => ⌜4 * 5 * (1 + 3)⌟
   term => ⌜4 ⌟
    number => ⌜4⌟
   term => ⌜5 ⌟
    number => ⌜5⌟
   term => ⌜(1 + 3)⌟
    group => ⌜(1 + 3)⌟
     sum => ⌜1 + 3⌟
      product => ⌜1 ⌟
       term => ⌜1 ⌟
        number => ⌜1⌟
      product => ⌜3⌟
       term => ⌜3⌟
        number => ⌜3⌟

If you want to know how a particular number was parsed, you can follow the path backwards by looking for lines above the current line that are indented less; for instance, the number 1 was parsed by token number, called from term, called from product, and so on.

We can verify this by raising an exception from token number:

    token number {
        (\d+)
        { die "how did I get here?" if $0 eq '1' }
    }

This indeed shows the call chain in the backtrace, with the most immediate context at the top:

how did I get here?
  in regex number at bt.p6 line 9
  in regex term at bt.p6 line 5
  in regex product at bt.p6 line 4
  in regex sum at bt.p6 line 3
  in regex group at bt.p6 line 6
  in regex term at bt.p6 line 5
  in regex product at bt.p6 line 4
  in regex sum at bt.p6 line 3
  in regex TOP at bt.p6 line 2
  in block <unit> at bt.p6 line 13

This grammar only uses tokens and rules, so there is no backtracking involved, and the grammar is a predictive parser. This is fairly typical. Many grammars work fine without backtracking, or with backtracking in just a few places.

Recursive Descent Parsing and Precedence

The MathExpression grammar has two rules which are structurally identical:

    rule sum { <product>+ %  '+' }
    rule product { <term>+ % '*' }

Instead, we could have written

rule  expression { <operator>+ % <term> }
token operator   {  '*' | '+' }

or even used the proto token construct discussed in the previous chapter to parse different operators. The reason I chose the first, more repetitive, approach is that it makes the match structure correspond to the precedence of the operators * and +.

When evaluating the mathematical expression 1 + 2 * 5, mathematicians and most programming languages evaluate the 2 * 5 first, because the * operator has tighter precedence than +. The result is then substituted back into the expression, leading to 1 + 10, and finally 11 as the result.

When parsing such expressions with the first version of the grammar, the structure of the parse tree expresses this grouping: it has — as the top level — a single sum, with the operands being 1 and 2 * 5.

This comes at a cost: we need a separate rule and name for each precedence level, and the nesting of the resulting match object has at least one level per precedence level. Furthermore, adding more precedence levels later on is not trivial, and very hard to do in a generic way. If you are not willing to accept these costs, you can instead use the flat model with a single token for parsing all operators. If you then need the structure in a way that reflects precedence, you can write code that transforms the list into a tree. This is commonly called an operator precedence parser.

Left Recursion and Other Traps

To avoid infinite recursion, you have to take care that each possible recursion cycle advances the cursor position by at least one character. In the MathExpression grammar, the only possible recursion cycle is sumproducttermgroupsum, and group can only match if it consumes an initial open parenthesis, (.

If recursion does not consume a character, it is called left recursion and needs special language support that Perl 6 does not offer. A case in point is

token a { <a>? 'a' }

which could match the same input as the regex a+, but instead loops infinitely without progressing.

A common technique to avoid left recursion is to have a structure where you can order regexes from generic (here sum) to specific (number). You only have to be careful and check for consumed characters when a regex deviates from that order (e.g. group calling sum).

Another potential source of infinite loops is when you quantify a regex that can match the empty string. This can happen when parsing a language that actually allows something to be empty. For instance, in UNIX shells, you can assign variables by potentially leaving the right-hand side empty:

VAR1=value
VAR2=

When writing a grammar for UNIX shell commands, it might be tempting to write a token string { \w* } that would potentially match an empty string. In a situation that allows for more than one string literal, <string>+ can then hang, because the effective regex, [\w*]+, tries to match a zero-width string infinitely many times.

Once you are aware of the problem, the solution is pretty simple: change the token to not allow an empty string (token string { \w+ }), and explicitly take care of situations where an empty string is allowed:

    token assignment {
        <variable> '=' <string>?
    }

Starting Simple

Even though a grammar works from the top down, developing a grammar works best from the bottom up. It is often not obvious from the start what the overall structure of a grammar will be, but you usually have a good idea about the terminal tokens: those that match text directly without calling other subrules.

In the earlier example of parsing mathematical expressions, you might not have known from the start how to arrange the rules that parse sums and products, but it’s likely that you knew you had to parse a number at some point, so you can start by writing:

grammar MathExpression {
    token number { \d+ }
}

This is not much, but it’s also not very complicated, and it’s a good way to get over the writer’s block that programmers sometimes face when challenged with a new problem area. Of course, as soon as you have a token, you can start to write some tests:

grammar MathExpression {
    token number { \d+ }
}

multi sub MAIN(Bool :$test!) {
    use Test;
    plan 2;
    ok MathExpression.parse('1234', :rule<number>),
        '<number> parses 1234';
    nok MathExpression.parse('1+4', :rule<number>),
        '<number> does not parse 1+4';
}

Now you can start to build your way up to more complex expressions:

grammar MathExpression {
    token number { \d+ }
    rule product { <number>+ % '*' }
}

multi sub MAIN(Bool :$test!) {
    use Test;
    plan 5;
    ok MathExpression.parse('1234', :rule<number>),
        '<number> parses 1234';
    nok MathExpression.parse('1+4', :rule<number>),
        '<number> does not parse 1+4';

    ok MathExpression.parse('1234', :rule<product>),
        '<product> can parse a simple number';
    ok MathExpression.parse('1*3*4', :rule<product>),
        '<product> can parse three terms';
    ok MathExpression.parse('1 * 3', :rule<product>),
        '<product> and whitespace';
}

It is worth it to include whitespace early on in the tests. The example above looks innocent enough, but the last test actually fails. There is no rule that matches the whitespace between the 1 and the *. Adding a space in the regex between the <number> and the + quantifier makes the tests pass again, because the whitespace inserts an implicit <.ws> call.

Such subtleties are easy to catch if you start really simple and catch them as soon as possible. If instead you give in to the temptation of writing down a whole grammar from top to bottom, you can spend many hours debugging why some seemingly simple thing such as an extra space makes the parse fail.

Assembling Complete Grammars

Once you have written the basic tokens for lexical analysis, you can progress to combining them. Typically the tokens do not parse whitespace at the borders of their matches, so the rules that combine them do that.

In the MathExpression example in the previous section, rule product directly called number, even though we now know that the final version uses an intermediate step, rule term, which can also parse an expression in parentheses. Introducing this extra step does not invalidate the tests we have written for product, because the strings it matched in the early version still match. Introducing more layers happens naturally when you start with a grammar that handles a subset of the language, which you later expand.

Debugging Grammars

There are two failure modes for a regex or a grammar: it can match when it’s not supposed to match (a false positive), or it can fail to match when it’s supposed to match (a false negative). Typically, false positives are easier to understand, because you can inspect the resulting match object and see which regexes matched which part of the string.

There is a handy tool for debugging false negatives: the Grammar::Tracer module. If you load the module in a file containing a grammar, running the grammar produces diagnostic information that can help you find out where a match went wrong.

Note that this is only a diagnostic tool for developers; if you want to give end users better error messages, please read Chapter 11 for improvement suggestions.

You need to install the Perl 6 module Grammar::Debugger, which also contains Grammar::Tracer. If you use the moritzlenz/perl6-regex-alpine docker image, this is already done for you. If you installed Perl 6 via another method, you need to run

zef install Grammar::Debugger

on the command line. If zef is not yet installed, follow the installation instructions on the zef GitHub page.

Let’s look at the Perl 6 module Config::INI by Tadeusz Sośnierz. It contains the following grammar (slightly reformatted here):

grammar INI {
    token TOP {
        ^ <.eol>* <toplevel>?  <sections>* <.eol>* $
            }
    token toplevel { <keyval>* }
    token sections { <header> <keyval>* }
    token header   { ^^ \h* '[' ~ ']' $<text>=<-[ \] \n ]>+
                     \h* <.eol>+ }
    token keyval   { ^^ \h* <key> \h* '=' \h* <value>? \h*
                     <.eol>+ }
    regex key      { <![#\[]> <-[;=]>+ }
    regex value    { [ <![#;]> \N ]+ }
    token eol      { [ <[#;]> \N* ]? \n }
}

Suppose we want to understand why it does not parse the following piece of input text:

a = b
[foo]
c: d

So, before the grammar, we insert the line

use Grammar::Tracer;

and after it, add a small piece of code that calls the .parse method of that grammar:

INI.parse(q:to/EOF/);
a = b
[foo]
c: d
EOF

This produces a sizable, but fairly informative piece of output.

Each entry consists of a name of a regex, like TOP or eol (for "end of line"), followed by the indented output of the regexes it calls. After each regex comes a line containing an asterisk (*) and either MATCH followed by the string segment that the regex matched, or FAIL if the regex failed.

Let’s look at the output piece by piece, even if it comes out in one chunk:

TOP
|  eol
|  * FAIL
|  toplevel
|  |  keyval
|  |  |  key
|  |  |  * MATCH "a "
|  |  |  value
|  |  |  * MATCH "b"
|  |  |  eol
|  |  |  * MATCH "\n"
|  |  |  eol
|  |  |  * FAIL
|  |  * MATCH "a = b\n"
|  |  keyval
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  * MATCH "a = b\n"

This tells us that TOP called eol, which failed to match. Since the call to eol is quantified with *, this does not cause the match of TOP to fail. TOP then calls key, which matches the text "a", and value, which matches "b". The eol regex then proceeds to match the newline character, fails on the second attempt (since there are no two newline characters in a row). This causes the initial keyval token to match successfully. A second call to keyval matches pretty quickly (in the call to key). Then, the match of token toplevel proceeds successfully, consuming the string "a = b\n".

So far this all looks as expected. Now let’s take a look at the second chunk of output:

|  sections
|  |  header
|  |  |  eol
|  |  |  * MATCH "\n"
|  |  |  eol
|  |  |  * FAIL
|  |  * MATCH "[foo]\n"
|  |  keyval
|  |  |  key
|  |  |  * MATCH "c: d\n"
|  |  * FAIL
|  * MATCH "[foo]\n"

TOP next calls sections, wherein token header successfully matches the string "[foo]\n". Then, keyval calls key, which matches the whole line "c: d\n". Wait, that’s not right, is it? We might have expected key to only match the c. I certainly wouldn’t have expected it to match a newline character at the end. The lack of an equals sign in the input causes the regex engine to never even call regex value. But since keyval is again quantified with the star * quantifier, the match of the calling regex sections succeeds in matching just the header "[foo]\n".

The last part of the Grammar::Tracer output follows:

|  sections
|  |  header
|  |  * FAIL
|  * FAIL
|  eol
|  * FAIL
* FAIL

It’s FAILs from here on. The second call to sections again tries to parse a header, but its next input is still "c: d\n", so it fails. As does the end-of-string anchor $ in token TOP, failing the overall match in method parse.

So we have learned that regex key matched the whole line c: d\n, but since no equals sign (=) follows it, token keyval cannot parse this line. Since no other regex (notably not header) matches it, this is where the match fails.

As you can see from this example run, Grammar::Tracer enables us to pinpoint where a parse failure happens, even though we had to look carefully through its output to locate it. When you run it in a terminal, you automatically get colored output, with FAIL having a red and MATCH a green background, and token names standing out in bold white (instead of the usual gray) output. This makes it easier to scan from the bottom (where a failed match usually leaves a trail of red FAILs) up to the trailing successful matches, and then look in the vicinity of the border between matches and failures.

Since debugging imposes a significant mental burden, and the output from Grammar::Tracer tends to grow quickly, it is generally advisable to reduce the failing input to a minimal specimen. In the case described above, we could have removed the first line of the input string and saved ten lines of Grammar::Tracer output to look through.

Parsing Whitespace and Comments

As said before, the idiomatic way to parse insignificant whitespace is by calling <.ws>, typically implicitly by using whitespace in a rule. The default ws implementation, <!ww>\s*, works well for many languages, but it has its limits.

In a surprising number of file formats and computer languages, there is significant whitespace that <.ws> would just gobble up. These include INI files (where a newline typically indicates a new key/value pair), Python and YAML (where indentation is used for grouping), CSV (where a newline signals a new record), and Makefiles (where indentation is required to be with a tabulator character).

In these cases, it is best practice to override ws in your own grammar to match only insignificant whitespace. Let’s take a look at a second, minimalistic INI parser, independently developed from the one described in the previous section:

grammar INIFile {
    token TOP { <section>* }
    token section {
        <header>
        <keyvalue>*
    }
    rule header {
        '['  <-[ \] \n ]>+ ']' <.eol>
    }
    rule keyvalue {
        ^^
        $<key>=[\w+]
        <[:=]>
        $<value>=[<-[\n;#]>*]
        <.eol>
    }
    token ws { <!ww> \h* }
    token eol {
        \n [\h*\n]*
    }
}

This parses simple INI configuration files like this:

[db]
driver: mysql
host: db01.example.com
port: 122
username: us123
password: s3kr1t

Take note how this grammar uses two paths for parsing whitespace: a custom ws token that only matches horizontal whitespace (blanks and tabs), and a separate token eol that matches (significant) line breaks. The eol token also gobbles up further lines consisting only of whitespace.

If a language supports comments, and you don’t want them to appear in your parse tree, you can parse them either in your ws token, or in eol (or your equivalent thereof). Which one it is depends on where comments are allowed. In INI files, they are only allowed after a key-value pair or in a line on their own, so eol would be the fitting place. In contrast, SQL allows comments in every place where whitespace is allowed, so it is natural to parse them in ws:

# comment parsing for SQL:
token ws { <!ww> \s* [ '--' \N* \n ]* }

# comment parsing for INI files:
token eol { [ [ <[#;]> \N* ]? \n ]+ }

Keeping State

Some of the more interesting data formats and languages require the parser to store things (at least temporarily) to be able to correctly parse them. A case in point is the C programming language, and others inspired by its syntax (such as C++ and Java). Such languages allow variable declarations of the form type variable = initial_value, like this:

int x = 42;

This is valid syntax, but only if the first word is a type name. In contrast, this would be invalid, because x is not a type:

int x = 42;
x y = 23;

From these examples it is pretty clear that the parser must have a record of all the types it knows. Since users can also declare types in their code files, the parser must be able to update this record.

Many languages also require that symbols (variables, types, and functions) be declared before they are referenced. This too requires the grammar to keep track of what has been declared and what hasn’t. This record of what has been declared (and what is a type or not, and possibly other meta information) is called a symbol table.

Instead of parsing the full C programming language, let’s consider a minimalist language that just allows assignments of lists of numbers, and variables to variables:

a = 1
b = 2
c = a, 5, b

If we don’t impose declaration rules, it’s pretty easy to write a grammar:

grammar VariableLists {
    token TOP        { <statement>* }
    rule  statement  { <identifier> '=' <termlist> \n }
    rule  termlist   { <term> * % ',' }
    token term       { <identifier> | <number> }
    token number     { \d+ }
    token identifier { <:alpha> \w* }
    token ws         { <!ww> \h* }
}

Now we demand that variables can only be used after they’ve been assigned to, so that the following input would be invalid, because b is not declared in the second line, where it’s used:

a = 1
c = a, 5, b
b = 2

To maintain a symbol table, we need three new elements: a declaration of the symbol table, some code that adds a variable name to the symbol table when the assignment has been parsed, and finally a check whether a variable has been declared at the time we come across it in a term list:

grammar VariableLists {
    token TOP {
        :my %*SYMBOLS;
        <statement>*
    }
    token ws { <!ww> \h* }
    rule statement {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True }
        '=' <termlist>
        \n
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

In the token TOP, :my %*SYMBOLS declares a variable. Declarations in regexes start with a colon (:), and end with a semicolon (;). In between they look like normal declarations in Perl 6. The % sigil signals that the variable is a hash — a mapping of string keys to values. The * makes it a dynamic variable — a variable that is not limited to the current scope but also visible to code (or regexes, which are also code) that is called from the current scope. Since this is an unusually large scope, it is custom to choose a variable in CAPITAL LETTERS.

The second part, adding a symbol to the symbol table, happens in the rule statement:

    rule statement {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True }
        '=' <termlist>
        \n
    }

Inside the curly braces is regular (non-regex) Perl 6 code, so we can use it to manipulate the hash %*SYMBOLS. The expression $<identifier> accesses the capture for the variable name2. Thus, if this rule parses a variable a, this statement sets %*SYMBOLS{ 'a' } = True.

The placement of the code block is relevant. Putting it before the call to termlist means that the variable is already known when the term list is parsed, so it accepts input like a = 2, a. If we call termlist first, this kind of input is rejected.

Speaking of rejection, this part happens in token variable. term now calls the new token variable (previously it called identifier directly), and variable validates that the symbol has been declared before:

    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }

You might remember from earlier examples that <?{ ... }> executes a piece of Perl 6 code, and fails the parse if it returns a false value. If $<identifier> is not in %*SYMBOLS, this is exactly what happens. At this time the non-backtracking nature of tokens is important. If the variable being parsed is abc, and a variable a is in %*SYMBOLS, backtracking would try shorter matches for <identifier> until it hits a, and then succeeds3.

Since %*SYMBOLS is declared in token TOP, you have to duplicate this declaration when you try to call rules other than TOP from outside the grammar. Without a declaration such as my %*SYMBOLS;, a call like

VariableLists.parse('abc', rule => 'variable');

dies with

Dynamic variable %*SYMBOLS not found

Implementing Lexical Scoping with Dynamic Variables

Many programming languages have the concept of a lexical scope. A scope is the area in a program where a symbol is visible. We call a scope lexical if the scope is determined solely by the structure of the text (and not, say, runtime features of the program).

Scopes can typically be nested. A variable declared in one scope is visible in this scope, and in all inner, nested scopes (unless an inner scope declares a variable of the same name, in which case the inner declaration hides the outer).

Coming back to the toy language of lists and assignments, we can introduce a pair of curly braces to denote a new scope, so this is valid:

a = 1
b = 2
{
    c = a, 5, b
}

but the next example is invalid, because it declares b only in an inner scope, and so it is not visible in the outer scope:

a = 1
{
    b = 2
}
c = a, 5, b

To implement these rules in a grammar, we can make use of an important observation: dynamic scoping in a grammar corresponds to lexical scoping in text it parses. If we have a regex block that parses both the delimiters of a scope and the things inside that scope, its dynamic scope is confined to all of the regexes it calls (directly and indirectly), and that is also the extent of the lexical scope it matches in the input text.

Let’s take a look at how we can implement dynamic scoping:

grammar VariableLists {
    token TOP {
        :my %*SYMBOLS;
        <statement>*
    }
    token ws { <!ww> \h* }
    token statement {
        | <declaration>
        |  <block>
    }
    rule declaration {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True; }
        '=' <termlist>
        \n
    }
    rule block {
        :my %*SYMBOLS = CALLERS::<%*SYMBOLS>;
        '{' \n*
            <statement>*
        '}' \n*
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

There are a few changes to the previous version of this grammar: the rule statement has been renamed to declaration and the new rule statement parses either a declaration or a block.

All the interesting bits happen in the block rule. The line :my %*SYMBOLS = CALLERS::<%*SYMBOLS>; declares a new dynamic variable %*SYMBOLS and initializes it with the previous value of that variable. CALLERS::<%*SYMBOLS> looks through the caller, and the caller’s caller, and so on for a variable %*SYMBOLS, and thus looks up the value corresponding to the outer scope. The initialization creates a copy of the hash, such that changes to one copy do not affect the other copies.

Let’s take a look at what happens when this grammar parses the following input:

a = 1
b = 2
{
    c = a, 5, b
}

After the first two lines, %*SYMBOLS has the value {a => True, b => True}. When rule block parses the opening curly bracket on the third line, it creates a copy of %*SYMBOLS. The declaration of c on the fourth line inserts the pair c => True into the copy of %*SYMBOLS. After rule block parses the closing curly brace on the last line, it exits successfully, and the copy of %*SYMBOLS goes out of scope. This leaves us with the earlier version of %*SYMBOLS (with only the keys a and b), which then goes out of scope when TOP exits.

Scoping Through Explicit Symbol Tables

Using dynamic variables for managing symbol tables usually works pretty well, but there are some edge cases where a more explicit approach works better. Such edge cases include those where there are so many symbols that copying becomes prohibitively expensive, or where more than the top-most scope must be inspected, or when copying the symbol table is impractical for other reasons.

Consequently, you can write a class for your symbol table (which in the simplest case uses an array as a stack of scopes) and explicitly call methods on it when entering and leaving scopes, when declaring a variable, and for checking whether a variable is known in a scope:

class SymbolTable {
    has @!scopes = {}, ;
    method enter-scope() {
        @!scopes.push({})
    }
    method leave-scope() {
        @!scopes.pop();
    }
    method declare($variable) {
        @!scopes[*-1]{$variable} = True
    }
    method check-declared($variable) {
        for @!scopes.reverse -> %scope {
            return True if %scope{$variable};
        }
        return False;
    }
}

grammar VariableLists {
    token TOP {
        :my $*ST = SymbolTable.new();
        <statement>*
    }
    token ws { <!ww> \h* }
    token statement {
        | <declaration>
        |  <block>
    }
    rule declaration {
        <identifier>
        { $*ST.declare( $<identifier> ) }
        '=' <termlist>
        \n
    }
    rule block {
        '{' \n*
            { $*ST.enter-scope() }
            <statement>*
            { $*ST.leave-scope() }
        '}' \n*
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ $*ST.check-declared($<identifier>) }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

The class SymbolTable has the private array attribute @!scopes which is initialized with a list containing a single, empty hash {}. Entering a scope means pushing an empty hash on top of this array, and when leaving the scope it is removed again through the pop method call. A variable declaration adds its name to the top-most hash, @!scopes[*-1].

Checking for the presence of a variable must not just consider the top-most hash, because variables are inherited to inner scopes. Here we go through the all scopes in reverse order, from inner-most to outer-most scope. The order of traversal is not relevant for a simple Boolean check, but if you need to look up information associated with the variable, it is important to adhere to this order to reference the correct one.

Token TOP creates a new object of class SymbolTable, declaration calls the declare method, and token variable calls method check-declared. The rule block calls enter-scope before parsing the statement list, and leave-scope afterwards. This works, but only if the statement list can be parsed successfully; if not, rule block fails before it manages to call leave-scope.

Perl 6 has a safety feature for such situations: if you prefix a statement with LEAVE, Perl 6 calls it for you at routine exit, in all circumstances where this is possible (even if an exception is thrown). Since the LEAVE phaser only works in regular code and not in regexes, we need to wrap the regex in a method:

    method block {
        $*ST.enter-scope();
        LEAVE $*ST.leave-scope();
        self.block_wrapped();
    }
    rule block_wrapped {
        '{' \n*
            <statement>*
        '}' \n*
    }

Now we have the same robustness as the approach with dynamic variables, and more flexibility to add extra code to the symbol table, at the cost of more code and increased effort.

Summary

Perl 6 grammars are a declarative way to write recursive descent parsers. Without backtracking, they are predictive; at each point, we know what list of tokens to expect.

The recursive nature of grammars comes with the risk of left recursion, a situation where a recursive path does not consume any characters, and so leads to an infinite loop.

In spite of the top-down nature of grammars, writing them typically happens from the bottom up: starting with lexical analysis, and then moving up to parsing larger structures.

Complex languages require additional state for successful and precise parsing. We have seen how you can use dynamic variables to hold state in grammars, how their scope corresponds to lexical scoping in the input, and how symbol tables can be written and integrated into grammars.


  1. Like a Swiss-army knife, but with much more power.

  2. At this point it is crucial that identifier does not parse its surrounding whitespace. Hence the principle that tokens do not care about whitespace, and the rules that call those tokens parse the whitespace.

  3. In this case, this would be harmless, because no other rule could match the rest of the variable, leading to a parse error nonetheless. But in more complicated cases, this kind of unintended backtracking can lead to errors that are very puzzling for the maintainer of the grammar.

Day 6 – Perl 6 Books — the Time is Ripe

One question we occasionally get on the phenomenal #perl6 IRC channel is about Perl 6 books. It turns out, some folks don’t want to learn just from tutorials, blog posts and docs. Last year, we didn’t have any good answers to that question.

A few months ago, there seemed to be a flurry of such questions, and at the same time, rumours about new book projects started to spread.

If I remember correctly, the first rumour was when Laurent Rosenfeld contacted me in June, asking for feedback on a manuscript. He had translated a good 200 pages of the Think Python book to Perl 6. This is primarily a book teaching programming to beginners. Later I was happy to hear that a major publisher has accepted the manuscript. The manuscript is now finished, and under review. So I guess we’ll see e-book and print versions of this book in due time.

Then brian d foy opened a kickstarter to raise funds for a “Learning Perl 6” book. By the time this is published, the kickstarter project should still be open, so I warmly encourage you to support this. Having a book by a prominent member of the Perl 5 community, and author of several good Perl books would surely be a boon for the Perl 6 community.

Before the publication of brian’s project, I’ve been mulling about writing my own Perl 6 book. In the past, I’ve tried that once already, and failed. But I had more success with another book project of mine, so I decided to give it another shot. The working title is Perl 6 by example. Content for the book is published in form of blog posts, and later turned into book chapters.

Later I learned that Ken Youens-Clark had written a book on metagenomics that spends about 150 pages explaining Perl 6. And it’s free, you can download the PDF, epub and mobi versions right away! If you’ve followed the advent calendar, you might have read Ken’s earlier post on Bioinformatics

In summary, we have one book starring Perl 6 content, one in progress with the manuscript in stage “feature complete”, one in the kick-starting phase which you can support, and a fourth being written right now, with parts being pre-published in the form of blog posts.

If you want to keep up-to-date with news on Perl 6 books, you can sign up for the low volume Perl 6 book mailing list (less than one mail per month).

I hope that in next year’s advent calendar, we’ll see four reviews of Perl 6 books.

Day 1 – The State of Perl 6 in 2015

Please fasten your seat belt for your annual dive into Perl 6.

As has been customary the last six years, we start with a brief overview of the state of Perl 6.

Last year's State of Perl 6 anticipated an upcoming production release of Perl 6. That is still scheduled for this Christmas. Last year's summary also identified major areas of work to-be-done for this release.

The 2015.05 release of Rakudo introduced NFG or "Normal Form Grapheme", which means that strings in Perl 6 are not based on Unicode codepoints anymore, and instead on grapheme clusters. Grapheme clusters can consist of base characters, like the latin lower-case c, and combining characters, like the "combining cedilla" character. Together, they form a single, visual character "c with cedilla", ç. This character also exists as a single codepoint, but other combinations of base character and combining character don't. Yet with this new feature, Perl 6 still treats the cluster of base character plus one (or more) combining characters as a single character, so regexes matches and substr won't tear them apart.

In September, Rakudo shipped with the GLR or Great List Refactoring in place. This mostly means that the rules for using and accessing nested data structures are now much simpler and more consistent. Under the hood we also have a sane and powerful iterator model, and a new type Seq for lazy value streams that don't necessarily memorize old values on iteration.

Finally, the September release introduced native, shaped arrays (or NSA). This allows you to write

    my int32 @matrx[4;5]

which allocates a continuous block of 20 32-bit values, but is still usable as a two-dimensional matrix in Perl 6 land. This paves the road towards memory efficient linear algebra (and other applications, of course).

But, development didn't stop there. The upcoming December release of Rakudo brings us automatic precompilation of modules and installation repositories.

Not only the compiler progresses. Where the Perl 6 Modules page showed around 270 modules a year ago (end of November 2014), we are now at about 460 modules. I'm also happy to report that there are two module installers now, panda and zef.

We also now have decent documentation coverage, at least on built-in types; other areas such as tutorials and material on language features are still a bit sparse. Other documentation efforts such as Learn Perl 6 in Y minutes and perl6intro.com have sprung up, and enjoy popularity.

I'm sure there has been more Perl 6 activity that would be worth reporting, but the excitment for the upcoming release makes a calm retrospective a bit harder than usual.

Stay tuned for more, and have fun!

Day 23 – Webscale sorting of the sleepy kind

In the emerging tradition of writing about sort on the 23rd, let me elaborate on a particularly silly kind of sorting algorithm: sleepsort.

The idea is that to sort a list of numbers, you spawn a separate process (or thread), sleep for as long as the number specifies, and then print the result. That way the numbers are printed in ascending order.

Rakudo on the MoarVM backend has threads, so let's put them to good use:

 use v6;
 
 my @unsorted = (1..10).pick(5);
 
 say "Unsorted: @unsorted[]";
 
 await @unsorted.map: -> $t {
     start {
         sleep $t;
         say $t;
     }
 };

If you run it, you should get output like this:

 $ ./perl6-m sleepsort-01.p6
 Unsorted: 1 7 5 8 3
 1
 3
 5
 7
 8

How did we get there? (1..10).pick(5) randomly picks (without replacing) five numbers from 1 to 10. start { ... } starts a thread and returns a Promise object. We do that for all the numbers, and await blocks until all promises that have passed to it are fullfilled (or broken, but that won't happen here). Or in other words: until all the newly spawned threads have finished.

Now let's make it web scale! Well, sort (sic) of.

First an obvious improvement is to not sleep for the whole second. Let's just scale it with a constant factor: sleep $t / 10, and we're about ten times as fast!

But there's another scaling problem: If you want to sort hundreds of thousands of numbers with sleep sort (you'd do that, right?), our program would spawn as many threads. So lots of memory wasted. Also sleep maps pretty much directly to the underlying libc call, and thus blocks the thread. We can do better:

 use v6;
 
 my @unsorted = (1..10).pick(5);
 
 await @unsorted.map: -> $t {
     Promise.in($t / 10 ).then({ say $t });
 };

Promise.in($s) creates a Promise that will be kept in $s seconds. .then creates a chained promise whose block is executed when the first one is kept.

This also removes the need to spawn processes explicitly. The standard scheduler takes care of that. We're basically web scale!

Now it's also time to address an issue of correctness. The sleepsort algorithm is a kind of divide-and-conquer approach, but there's no joining of the results. There is no single point in the program that can access the entire the sorted list. To fix that, we use a Channel, a thread-safe queue:

 use v6;
 my @unsorted = (1..10).pick(5);
 
 my $channel = Channel.new;
 
 await @unsorted.map: -> $t {
     Promise.in($t / 10 ).then({ $channel.send($t) });
 };
 $channel.close;
 
 say $channel.list;

This prints the sorted list all from the main thread.

So now it can be encapsulated in a nice subroutine.

 sub sleepsort(*@values, :$factor) {
     my $channel = Channel.new;
 
     await @values.map: -> $t {
         Promise.in($t * $factor ).then({ $channel.send($t) });
     };
     $channel.close;
 
     $channel.list;
 }

Another issue of correctness remains: Both sleep and Promise.in only specify a minimal time to be slept; implementation-dependent, longer sleep times are possible. If $factor is too small, the promises might executed out of the desired order.

So let's find the minimal factor with which it still works:

 my $previous = Inf;
 for 0.1, */1.5 ... * -> $factor {
     say "Trying $factor";
     my $success = [<=] sleepsort(:$factor, @unsorted);
     say $success ?? 'Success' !! 'Failed';
     unless $success {
         say "Last good factor: ", $previous;
         last;
     }
     $previous = $factor;
 }

On my machine, this produces the following output:

 Trying 0.1
 Success
 Trying 0.066667
 Success
 Trying 0.044444
 Success
 Trying 0.029630
 Success
 Trying 0.019753
 Success
 Trying 0.013169
 Success
 Trying 0.008779
 Success
 Trying 0.005853
 Failed
 Last good factor: 0.008779

So a :$factor of around 0.01 or 0.09 seems to work on my setup.

Your output/mileage may vary.

Over and out, sleeping until Christmas Eve.

Day 10 – Introspecting the Symbol Tables

Perl 6 is designed with extensibility in mind. And when you want to extend something, you often need to know as much as possible about the environment.

Today we’ll look at an aspect of finding out about the environment: introspecting symbol tables.

A symbol is something you refer to by name, so it could be a package, class, variable, routine, constant etc.

A symbol table is a data structure in which symbols can be looked up. Perl 6 has three main types of symbol tables, differentiated by scope: lexical symbols live in the lexical pad, package symbols live in a Stash, and methods in a method table.

Enough theory, time for action:

    $ perl6 -e 'my $x = 42; say MY::.keys'
    $x $! $/ $_ GLOBALish EXPORT $?PACKAGE ::?PACKAGE $=pod !UNIT_MARKER

MY is a pseudo package representing the current scope, and appending two colons gives us its symbol table. Which in turn roughly behaves like a Hash, so we can use a method like keys to find all symbols in that table. Or look up a string there:

    $ perl6 -e 'my $x = 42; say MY::<$x>'
    42

A complicated way to say $x, but you can use that to lookup symbols by name:

    $ perl6 -e 'my $x = 42; my $var = q[$x]; say MY::{$var}'
    42

Or if you don’t care if it comes from the current lexical scope, just from somewhere:

    $ perl6 -e 'my $x = 42; my $var = q[$x]; say ::($var)'
    42

No need to tell you that in general, this is a bad idea; just use it when you have exceptional circumstances.

But wait, what were all those other symbols from the first line? $!, $/ and $_ are the three “special” variables for errors, matches and the current topic; the rest is stuff that doesn’t need to be in lexical scope, but the compiler puts it there because it makes stuff easier.

Outer lexicals are accessible through the OUTER:: pseudo package:

    $ perl6 -e 'my $x = 1; { my $x = 2; say OUTER::<$x>; say $x }'
    1
    2

Instead of OUTER::<$x>, you can also say $OUTER::x. In general, you can move the sigil to the front and omit the angle brackets if the symbol name is a literal.

Package declarations and explicitly our-scoped symbols generally go into a stash, with the top-most being GLOBAL::

    class A {
        class B { }
    }
    say GLOBAL::<A>;    # (A)
    say A;              # (A)
    say A::.keys;       # B

Not only the double colon gives you the symbol table, you can also get it with .WHO (which also works for non-literals). So a long way to write A::B is

    say A::B === GLOBAL::<A>.WHO<B>;    # True

Finally, methods live in a classes or roles, and you can use the meta object to introspect them; those generally respect inheritance and role composition, so they are a bit more complicated than stashes.

    my $m = Str.^can('split')[0];
    say $m.^name;               # Method
    say $m('a1B', /\d/).perl;    # ("a", "B").list

Here .^can(name) returns a list of methods on that object with a given name. It’s a list, because it includes methods from superclasses, so there can be more than one. To get a list of all available methods, you can use .^methods and .^methods(:all):

    $ perl6-m -e 'say  Str.^methods(:all).grep: {.name ~~ /^m/ }'
    match match map min max minmax

.^methods(:all) includes methods from all superclasses, without the :all, methods from classes Cool, Any and Mu are excluded.

This concludes our short tour through the land of the symbol table. There is more to see (for example the EXPORT package, and various pseudo packages), and the daring can go to the design documents or play with Rakudo’s implementation, which is fairly robust.

Day 01 – The State of Perl 6 in 2014

Welcome to the 6th annual edition of the Perl 6 advent calendar, code name 2014.

In 2014, MoarVM has become the de facto standard backend for the Rakudo Perl 6 compiler. Parrot and JVM are still supported as backends, but MoarVM provides much lower memory usage, faster startup time, and is significantly faster than parrot at runtime.

Rakudo on MoarVM also has concurrency support, including reactive programming.

Much work has been done to improve performance, some in MoarVM, some in Rakudo and NQP. Maybe most notably is the JIT compiler for MoarVM that came out of the Google Summer of Code project by Bart Wiegmans.

Another Google Summer of Code project by Filip Sergot brought us HTTP::UserAgent, a versatile HTTP client library with SSL/TLS support.

During the Austrian Perl Workshop in fall 2014, many of the Perl 6 and Rakudo core contributors met, and identified three roadblocks for a "final" Perl 6.0 release: GLR, NFG and NSA.

Here GLR stands for the Grand List Refactoring, a plan to make the list-y types more transparent, list iteration faster, and more obvious when a list will flatten.

NFG is the Normal Form Grapheme, a plan for implementing grapheme-based string indexing.

Finally, our NSA has nothing to do with surveillance. Natively Shaped Arrays are a flexible feature for declaring typed, potentially multi-dimensional arrays, potentially with pre-defined dimensions. It will make memory-efficient matrix storage and operations possible, for example.

With these three blockers defined, Larry Wall submitted a talk to FOSDEM called Get ready to party!, predicting that 2015 will be the year that Perl 6 will get a production release.

Sorry, this was meant to become a summary of the current state of Perl 6, and it derailed into an outlook. To allow me to keep the "state" in the title, let me just tell you that Rakudo on the MoarVM is quite fun to work with. It's fast enough for small (and sometimes even mid-sized) tasks, module installation works fairly reliably, and the range of available modules has also increased.

Also I feel that any attempt to summarize the progress of this awesome community is bound to be very incomplete; I hope that my fellow Perl 6 hackers will fill in some details in the upcoming 23 posts.

Have a nice pre-Christmas time, and enjoy the show!

Day 23 – Unary Sort

Most languages or libraries that provide a generic sort routine allow you to specify a comparator, that is a callback that tells the sort routine how two given elements compare. Perl is no exception.

For example in Perl 5, which defaults to lexicographic ordering, you can request numeric sorting like this:

 use v5;
 my @sorted = sort { $a <=> $b } @values;

Perl 6 offers a similar option:

 use v6;
 my @sorted = sort { $^a <=> $^b }, @values;

The main difference is that the arguments are not passed through the global variables $a and $b, but rather as arguments to the comparator. The comparator can be anything callable, that is a named or anonymous sub or a block. The { $^a <=> $^b} syntax is not special to sort, I have just used placeholder variables to show the similarity with Perl 5. Other ways to write the same thing are:

 my @sorted = sort -> $a, $b { $a <=> $b }, @values;
 my @sorted = sort * <=> *, @values;
 my @sorted = sort &infix:«<=>», @values;

The first one is just another syntax for writing blocks, * <=> * use * to automatically curry an argument, and the final one directly refers to the routine that implements the <=> "space ship" operator (which does numeric comparison).

But Perl strives not only to make hard things possible, but also to make simple things easy. Which is why Perl 6 offers more convenience. Looking at sorting code, one can often find that the comparator duplicates code. Here are two common examples:

 # sort words by a sort order defined in a hash:
 my %rank = a => 5, b => 2, c => 10, d => 3;
 say sort { %rank{$^a} <=> %rank{$^b} }, 'a'..'d';
 #          ^^^^^^^^^^     ^^^^^^^^^^  code duplication

 # sort case-insensitively
 say sort { $^a.lc cmp $^b.lc }, @words;
 #          ^^^^^^     ^^^^^^  code duplication

Since we love convenience and hate code duplication, Perl 6 offers a shorter solution:

 # sort words by a sort order defined in a hash:
 say sort { %rank{$_} }, 'a'..'d';

 # sort case-insensitively
 say sort { .lc }, @words;

sort is smart enough to recognize that the code object code now only takes a single argument, and now uses it to map each element of the input list to new values, which it then sorts with normal cmp sort semantics. But it returns the original list in the new order, not the transformed elements. This is similar to the Schwartzian Transform, but very convenient since it's built in.

So the code block now acts as a transformer, not a comparator.

Note that in Perl 6, cmp is smart enough to compare strings with string semantics and numbers with number semantics, so producing numbers in the transformation code generally does what you want. This implies that if you want to sort numerically, you can do that by forcing the elements into numeric context:

 my @sorted-numerically = sort +*, @list;

And if you want to sort in reverse numeric order, simply use -* instead.

The unary sort is very convenient, so you might wonder why the Perl 5 folks haven't adopted it yet. The answer is that since the sort routine needs to find out whether the callback takes one or two arguments, it relies on subroutine (or block) signatures, something not (yet?) present in Perl 5. Moreover the "smart" cmp operator, which compares number numerically and strings lexicographically, requires a type system which Perl 5 doesn't have.

I strongly encourage you to try it out. But be warned: Once you get used to it, you'll miss it whenever you work in a language or with a library that lacks this feature.