Day 12 – The Year of Perl 6 Books

We can quibble all day about whether 2017 was the year of the Linux desktop, but there can be little doubt that it was the year of the Perl 6 book.

Perl 6 at a Glance

December 2016 brought us the ebook launch of Perl 6 at a Glance by Andrew Shitov, and then in 2017 the print version came out. It is an introduction to Perl 6 aimed at programmers already familiar with another language.

It is the first of a generation of “modern” Perl 6 books. There weren’t many Perl 6 books before; the most notable was “Perl 6 and Parrot Essentials”, written when Perl 6 was very much a language in flux. The December 2015 release of the Perl 6 language version v6.c (and the accompanying Rakudo Perl 6 compiler) finally offered enough stability to make Perl 6 books viable.

Think Perl 6

The next book released in 2017 was “Think Perl 6: How to Think Like a Computer Scientist”. It is a Perl 6 adaptation of Allen Downey’s great book Think Python: How to Think Like a Computer Scientist, lovingly ported to Perl 6 by Laurent Rosenfeld. It is available in print from O’Reilly, and freely available as an ebook from Green Tea Press. It is also available under an Open Source license in its source form (LaTeX) on GitHub.

“Think Perl 6” is an introduction to programming and computer science that happens to use Perl 6 as its primary tool. It targets the absolute beginner, and goes into a lot of detail on basic concepts such as branches, loops, variables, expressions, functions, recursion and so on.

Learning to Program with Perl 6

A book I didn’t have on my radar until it appeared for purchase on Amazon was Learning to program with Perl 6: First Steps: Getting into programming without leaving the command line by JJ Merelo. You can buy it on Amazon pretty cheaply, or check it out on GitHub, where you can find a musical as bonus material.

It mostly targets beginners, and also discusses some things related to programming, like the use of GitHub, some shell features, and SSH. It is a light-hearted introduction to computing and Perl 6.

Perl 6 Fundamentals

Perl 6 Fundamentals started its life as “Perl 6 by Example”, written by Moritz Lenz, aka yours truly. (Yes, authors write about themselves in the third person. That “About the Author” section in each book? Written by the author. In third person. Weird). When Apress acquired the book, it was renamed to Perl 6 Fundamentals: A Primer with Examples, Projects, and Case Studies. It is available from everywhere that you can buy books. At least I hope so :-)

Each chapter focuses on one (at least somewhat) practical example, and uses that as an excuse to talk about various Perl 6 features, including concurrency, functional programming, grammars, and calling Python libraries through Inline::Python. (You can read the chapter about Inline::Python over at perltricks.com.) It targets programmers with previous experience, though not necessarily Perl 6 (or Perl 5) experience.

Larry Wall has kindly written a foreword for the book.

Perl 6 Deep Dive

Andrew Shitov’s second Perl 6 book, Perl 6 Deep Dive, is, as the name suggests, a more comprehensive and, well, deeper dive than “Perl 6 at a Glance”, though somewhat similar in style. At more than 350 pages, it seems to have the largest coverage of Perl 6 features of any book so far.

Using Perl 6

Guess who’s released a third Perl 6 book within one year? That’s right, Andrew Shitov again. Somebody give that man a medal! Using Perl 6 is a collection of 100 programming challenges and their solutions in Perl 6. This is what Andrew wrote about it:

About a year ago, I decided to write a book about using Perl 6. Later, the plans changed and I published “Perl 6 at a Glance”, after which I wanted to write “Migrating to Perl 6” but instead wrote “Perl 6 Deep Dive” for Packt Publishing. Here and there, I was giving trainings on Perl 5, Python, and JavaScript, and was always suffering from finding a good list of easy tasks that a newcomer can use to train their skills in the language. Finally, I made up the list and solved it in Perl 6. This is the content of “Using Perl 6” — a book with solutions to 100 programming challenges, from simple to intermediate, with explanations of the Perl 6 constructions used.

Since his fourth book, Migrating to Perl 6, will be released in 2018, it doesn’t get its own section. Take that, Andy! :-) This is the “Perl 6 for Perl 5 programmers” book that people (mostly Perl 5 people) have been asking for on IRC and other media.

And of course I won’t mention Andrew’s kickstarter for a cookbook-style project, because that will further skew the stats. Ooops, I just did. Hmm. Well, go support the man!

Parsing with Perl 6 Regexes and Grammars

After writing a general Perl 6 book, I wanted to focus on a narrower topic. A non-representative poll on twitter confirmed my suspicion that regexes and grammars would be the best niche, and so Parsing with Perl 6 Regexes and Grammars: A Recursive Descent into Parsing was born.

It requires basic programming knowledge, but no prior exposure to regexes or Perl. It goes from the building blocks of regexes to fully featured parsers, including Abstract Syntax Tree generation and error reporting. A discussion of three very different example parsers concludes the nearly 200 pages, which could also have been titled far more than you ever wanted to know about parsing with Perl 6.

Right now, the ebook version is available for purchase, and I hope that the print version will be ready by Christmas. (And I’m talking about Christmas 2017, to be sure :)

Books in the Pipeline

I’d be remiss if I didn’t point out two more books that aren’t available yet, but might become available in the coming months or years.

brian d foy works on Learning Perl 6. In his last update, he shares that the first draft of the book is written, with a plan for things that need rewriting.

Gabor Szabo crowd-funded a book on Web Application Development in Perl 6 using the Bailador framework. The earlier chapters are mostly fleshed out, and the later chapters mostly exist as skeletons. Gabor expects it to be finished in 2018.

Keeping Track

The flood of Perl 6 books has made it hard for newcomers to decide which book to read, so I created the https://perl6book.com/ website that has one-line summaries, and a flow-chart for deciding which book to buy.

Even though I had input from other Perl 6 authors, it certainly reflects my biases. But, you can help to improve it!

Summary

With 7 Perl 6 books published in 2017, including titles from three major publishers, it’s been a fantastic year. I am also very happy with the diversity of the books, their target audiences, and their styles. I hope you are too!

A final plea: If you have read any of these books, please give the author some feedback. They put incredible amounts of work into those, and feedback helps the author’s learning process and motivation. And if you liked a book, maybe even give it 5 stars on Amazon and write a line or two about why you liked it.

Day 4 – Parsing with Grammars (Book Extract)

The following is an extract from the book Parsing with Perl 6 Regexes and Grammars: A Recursive Descent into Parsing, by Moritz Lenz, published 2017 by Apress Media, LLC. Reprinted with permission.

This book is being published right now. At least the ebook version should become available for purchase this month, and the print version can be pre-ordered from Amazon. It should ship in January 2018 at the latest, but with a bit of luck, it’ll be available by Christmas.

Below you can find chapter 9, Parsing with Grammars. The previous chapters discuss in great detail the building blocks of regexes, how regexes interact with Perl 6 code, match objects, regex mechanics, common regex techniques, and reusing and composing regexes. You can acquire some of this background by reading the official documentation on regexes.

Later chapters cover action classes and objects, how to report high-quality parse errors, Unicode support, and finally three case studies.

And now, enjoy!


Grammars are the proverbial Swiss-army chain saw for parsing.

In this chapter, we will explore them in more detail. Most importantly, we will discuss how to harness their power.

Understanding Grammars

Grammars implement a top-down approach to parsing. The entry point, usually the regex TOP, knows about the coarse-grained structure and calls further regexes that descend into the gory details. Recursion can be involved too. For example, if you parse a mathematical expression, a term can be an arbitrary expression inside a pair of parentheses.

This is a top-down structure, or more precisely a recursive descent parser. If no backtracking is involved, we call it a predictive parser, because at each position in the string, we know exactly what we’re looking for — we can predict what the next token is going to be (even if we can only predict that it might be one of a set of alternatives).

The resulting match tree corresponds in structure exactly to the call structure of regexes in the grammar. Let’s consider parsing a mathematical expression that only includes the operators *, +, and parentheses for grouping:

grammar MathExpression {
    token TOP    { <sum> }
    rule sum     { <product>+ % '+' }
    rule product { <term>+ % '*' }
    rule term    { <number> | <group> }
    rule group   { '(' <sum> ')' }
    token number { \d+ }
}

say MathExpression.parse('2 + 4 * 5 * (1 + 3)');

From the grammar itself you can already see the potential for recursion: sum calls product, which calls term, which calls group, which calls sum again. This allows parsing of nested expressions of arbitrary depth.

Parsing the example above produces the following match object:

⌜2 + 4 * 5 * (1 + 3)⌟
 sum => ⌜2 + 4 * 5 * (1 + 3)⌟
  product => ⌜2 ⌟
   term => ⌜2 ⌟
    number => ⌜2⌟
  product => ⌜4 * 5 * (1 + 3)⌟
   term => ⌜4 ⌟
    number => ⌜4⌟
   term => ⌜5 ⌟
    number => ⌜5⌟
   term => ⌜(1 + 3)⌟
    group => ⌜(1 + 3)⌟
     sum => ⌜1 + 3⌟
      product => ⌜1 ⌟
       term => ⌜1 ⌟
        number => ⌜1⌟
      product => ⌜3⌟
       term => ⌜3⌟
        number => ⌜3⌟

If you want to know how a particular number was parsed, you can follow the path backwards by looking for lines above the current line that are indented less; for instance, the number 1 was parsed by token number, called from term, called from product, and so on.

We can verify this by raising an exception from token number:

    token number {
        (\d+)
        { die "how did I get here?" if $0 eq '1' }
    }

This indeed shows the call chain in the backtrace, with the most immediate context at the top:

how did I get here?
  in regex number at bt.p6 line 9
  in regex term at bt.p6 line 5
  in regex product at bt.p6 line 4
  in regex sum at bt.p6 line 3
  in regex group at bt.p6 line 6
  in regex term at bt.p6 line 5
  in regex product at bt.p6 line 4
  in regex sum at bt.p6 line 3
  in regex TOP at bt.p6 line 2
  in block <unit> at bt.p6 line 13

This grammar only uses tokens and rules, so there is no backtracking involved, and the grammar is a predictive parser. This is fairly typical. Many grammars work fine without backtracking, or with backtracking in just a few places.

Recursive Descent Parsing and Precedence

The MathExpression grammar has two rules which are structurally identical:

    rule sum { <product>+ % '+' }
    rule product { <term>+ % '*' }

Instead, we could have written

rule  expression { <term>+ % <operator> }
token operator   {  '*' | '+' }

or even used the proto token construct discussed in the previous chapter to parse different operators. The reason I chose the first, more repetitive, approach is that it makes the match structure correspond to the precedence of the operators * and +.

When evaluating the mathematical expression 1 + 2 * 5, mathematicians and most programming languages evaluate the 2 * 5 first, because the * operator has tighter precedence than +. The result is then substituted back into the expression, leading to 1 + 10, and finally 11 as the result.

When parsing such expressions with the first version of the grammar, the structure of the parse tree expresses this grouping: it has — as the top level — a single sum, with the operands being 1 and 2 * 5.

This comes at a cost: we need a separate rule and name for each precedence level, and the nesting of the resulting match object has at least one level per precedence level. Furthermore, adding more precedence levels later on is not trivial, and very hard to do in a generic way. If you are not willing to accept these costs, you can instead use the flat model with a single token for parsing all operators. If you then need the structure in a way that reflects precedence, you can write code that transforms the list into a tree. This is commonly called an operator precedence parser.
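To make the trade-off concrete, here is a small sketch of the flat model (the FlatExpression grammar and its names are illustrative, not part of the MathExpression example above):

```perl6
# Flat model: one token parses all operators, and the match is a
# flat list of terms and operators instead of a nested tree.
grammar FlatExpression {
    token TOP        { <expression> }
    rule  expression { <term>+ % <operator> }
    token operator   { '*' | '+' }
    token term       { \d+ }
}

my $match = FlatExpression.parse('1 + 2 * 5');
say $match<expression><term>;       # three terms: 1, 2 and 5
say $match<expression><operator>;   # two operators: + and *
```

An operator precedence parser would then walk these two lists and group the operands according to a precedence table.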

Left Recursion and Other Traps

To avoid infinite recursion, you have to take care that each possible recursion cycle advances the cursor position by at least one character. In the MathExpression grammar, the only possible recursion cycle is sum → product → term → group → sum, and group can only match if it consumes an initial open parenthesis, (.

If recursion does not consume a character, it is called left recursion and needs special language support that Perl 6 does not offer. A case in point is

token a { <a>? 'a' }

which could match the same input as the regex a+, but instead loops infinitely without progressing.
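A minimal sketch of the standard workaround, which consumes a character before recursing so that each recursive call advances the cursor:

```perl6
grammar NotLeftRecursive {
    token TOP { <a> }
    # 'a' is consumed before the recursive call, so no infinite loop
    token a   { 'a' <a>? }
}
say NotLeftRecursive.parse('aaa').so;    # True
```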

A common technique to avoid left recursion is to have a structure where you can order regexes from generic (here sum) to specific (number). You only have to be careful and check for consumed characters when a regex deviates from that order (e.g. group calling sum).

Another potential source of infinite loops is when you quantify a regex that can match the empty string. This can happen when parsing a language that actually allows something to be empty. For instance, in UNIX shells, you can assign variables by potentially leaving the right-hand side empty:

VAR1=value
VAR2=

When writing a grammar for UNIX shell commands, it might be tempting to write a token string { \w* } that would potentially match an empty string. In a situation that allows for more than one string literal, <string>+ can then hang, because the effective regex, [\w*]+, tries to match a zero-width string infinitely many times.

Once you are aware of the problem, the solution is pretty simple: change the token to not allow an empty string (token string { \w+ }), and explicitly take care of situations where an empty string is allowed:

    token assignment {
        <variable> '=' <string>?
    }
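Putting both pieces together, a minimal sketch of such a grammar might look like this (the grammar and token names are made up for this example):

```perl6
grammar ShellAssignments {
    token TOP        { <assignment>+ % \n }
    # the optional <string> handles empty right-hand sides explicitly
    token assignment { <variable> '=' <string>? }
    token variable   { \w+ }
    token string     { \w+ }   # \w+, not \w*: never matches empty
}

say ShellAssignments.parse("VAR1=value\nVAR2=").so;   # True
```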

Starting Simple

Even though a grammar works from the top down, developing a grammar works best from the bottom up. It is often not obvious from the start what the overall structure of a grammar will be, but you usually have a good idea about the terminal tokens: those that match text directly without calling other subrules.

In the earlier example of parsing mathematical expressions, you might not have known from the start how to arrange the rules that parse sums and products, but it’s likely that you knew you had to parse a number at some point, so you can start by writing:

grammar MathExpression {
    token number { \d+ }
}

This is not much, but it’s also not very complicated, and it’s a good way to get over the writer’s block that programmers sometimes face when challenged with a new problem area. Of course, as soon as you have a token, you can start to write some tests:

grammar MathExpression {
    token number { \d+ }
}

multi sub MAIN(Bool :$test!) {
    use Test;
    plan 2;
    ok MathExpression.parse('1234', :rule<number>),
        '<number> parses 1234';
    nok MathExpression.parse('1+4', :rule<number>),
        '<number> does not parse 1+4';
}

Now you can start to build your way up to more complex expressions:

grammar MathExpression {
    token number { \d+ }
    rule product { <number>+ % '*' }
}

multi sub MAIN(Bool :$test!) {
    use Test;
    plan 5;
    ok MathExpression.parse('1234', :rule<number>),
        '<number> parses 1234';
    nok MathExpression.parse('1+4', :rule<number>),
        '<number> does not parse 1+4';

    ok MathExpression.parse('1234', :rule<product>),
        '<product> can parse a simple number';
    ok MathExpression.parse('1*3*4', :rule<product>),
        '<product> can parse three terms';
    ok MathExpression.parse('1 * 3', :rule<product>),
        '<product> and whitespace';
}

It is worth including whitespace in the tests early on. The example above looks innocent enough, but the last test actually fails: there is no rule that matches the whitespace between the 1 and the *. Adding a space in the regex between <number> and the + quantifier makes the tests pass again, because the whitespace inserts an implicit <.ws> call.

Such subtleties are easy to catch if you start really simple and catch them as soon as possible. If instead you give in to the temptation of writing down a whole grammar from top to bottom, you can spend many hours debugging why some seemingly simple thing such as an extra space makes the parse fail.

Assembling Complete Grammars

Once you have written the basic tokens for lexical analysis, you can progress to combining them. Typically the tokens do not parse whitespace at the borders of their matches, so the rules that combine them do that.

In the MathExpression example in the previous section, rule product directly called number, even though we now know that the final version uses an intermediate step, rule term, which can also parse an expression in parentheses. Introducing this extra step does not invalidate the tests we have written for product, because the strings it matched in the early version still match. Introducing more layers happens naturally when you start with a grammar that handles a subset of the language, which you later expand.

Debugging Grammars

There are two failure modes for a regex or a grammar: it can match when it’s not supposed to match (a false positive), or it can fail to match when it’s supposed to match (a false negative). Typically, false positives are easier to understand, because you can inspect the resulting match object and see which regexes matched which part of the string.

There is a handy tool for debugging false negatives: the Grammar::Tracer module. If you load the module in a file containing a grammar, running the grammar produces diagnostic information that can help you find out where a match went wrong.

Note that this is only a diagnostic tool for developers; if you want to give end users better error messages, please read Chapter 11 for improvement suggestions.

You need to install the Perl 6 module Grammar::Debugger, which also contains Grammar::Tracer. If you use the moritzlenz/perl6-regex-alpine docker image, this is already done for you. If you installed Perl 6 via another method, you need to run

zef install Grammar::Debugger

on the command line. If zef is not yet installed, follow the installation instructions on the zef GitHub page.

Let’s look at the Perl 6 module Config::INI by Tadeusz Sośnierz. It contains the following grammar (slightly reformatted here):

grammar INI {
    token TOP {
        ^ <.eol>* <toplevel>? <sections>* <.eol>* $
    }
    token toplevel { <keyval>* }
    token sections { <header> <keyval>* }
    token header   { ^^ \h* '[' ~ ']' $<text>=<-[ \] \n ]>+
                     \h* <.eol>+ }
    token keyval   { ^^ \h* <key> \h* '=' \h* <value>? \h*
                     <.eol>+ }
    regex key      { <![#\[]> <-[;=]>+ }
    regex value    { [ <![#;]> \N ]+ }
    token eol      { [ <[#;]> \N* ]? \n }
}

Suppose we want to understand why it does not parse the following piece of input text:

a = b
[foo]
c: d

So, before the grammar, we insert the line

use Grammar::Tracer;

and after it, add a small piece of code that calls the .parse method of that grammar:

INI.parse(q:to/EOF/);
a = b
[foo]
c: d
EOF

This produces a sizable, but fairly informative piece of output.

Each entry consists of a name of a regex, like TOP or eol (for "end of line"), followed by the indented output of the regexes it calls. After each regex comes a line containing an asterisk (*) and either MATCH followed by the string segment that the regex matched, or FAIL if the regex failed.

Let’s look at the output piece by piece, even if it comes out in one chunk:

TOP
|  eol
|  * FAIL
|  toplevel
|  |  keyval
|  |  |  key
|  |  |  * MATCH "a "
|  |  |  value
|  |  |  * MATCH "b"
|  |  |  eol
|  |  |  * MATCH "\n"
|  |  |  eol
|  |  |  * FAIL
|  |  * MATCH "a = b\n"
|  |  keyval
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  * MATCH "a = b\n"

This tells us that TOP called eol, which failed to match. Since the call to eol is quantified with *, this does not cause the match of TOP to fail. TOP then calls toplevel, which calls keyval; key matches the text "a" and value matches "b". The eol regex then matches the newline character, and fails on the second attempt (since there are no two newline characters in a row). This causes the initial keyval token to match successfully. A second call to keyval fails pretty quickly (in the call to key). Then, the match of token toplevel completes successfully, consuming the string "a = b\n".

So far this all looks as expected. Now let’s take a look at the second chunk of output:

|  sections
|  |  header
|  |  |  eol
|  |  |  * MATCH "\n"
|  |  |  eol
|  |  |  * FAIL
|  |  * MATCH "[foo]\n"
|  |  keyval
|  |  |  key
|  |  |  * MATCH "c: d\n"
|  |  * FAIL
|  * MATCH "[foo]\n"

TOP next calls sections, wherein token header successfully matches the string "[foo]\n". Then, keyval calls key, which matches the whole line "c: d\n". Wait, that’s not right, is it? We might have expected key to match only the c. I certainly wouldn’t have expected it to match a newline character at the end. The lack of an equals sign in the input means the regex engine never even calls regex value. But since keyval is again quantified with the * quantifier, the match of the calling regex sections still succeeds, matching just the header "[foo]\n".

The last part of the Grammar::Tracer output follows:

|  sections
|  |  header
|  |  * FAIL
|  * FAIL
|  eol
|  * FAIL
* FAIL

It’s FAILs from here on. The second call to sections again tries to parse a header, but its next input is still "c: d\n", so it fails. As does the end-of-string anchor $ in token TOP, failing the overall match in method parse.

So we have learned that regex key matched the whole line c: d\n, but since no equals sign (=) follows it, token keyval cannot parse this line. Since no other regex (notably not header) matches it, this is where the match fails.

As you can see from this example run, Grammar::Tracer enables us to pinpoint where a parse failure happens, even though we had to look carefully through its output to locate it. When you run it in a terminal, you automatically get colored output, with FAIL having a red and MATCH a green background, and token names standing out in bold white (instead of the usual gray) output. This makes it easier to scan from the bottom (where a failed match usually leaves a trail of red FAILs) up to the trailing successful matches, and then look in the vicinity of the border between matches and failures.

Since debugging imposes a significant mental burden, and the output from Grammar::Tracer tends to grow quickly, it is generally advisable to reduce the failing input to a minimal specimen. In the case described above, we could have removed the first line of the input string and saved ten lines of Grammar::Tracer output to look through.

Parsing Whitespace and Comments

As said before, the idiomatic way to parse insignificant whitespace is by calling <.ws>, typically implicitly by using whitespace in a rule. The default ws implementation, <!ww>\s*, works well for many languages, but it has its limits.

In a surprising number of file formats and computer languages, there is significant whitespace that <.ws> would just gobble up. These include INI files (where a newline typically indicates a new key/value pair), Python and YAML (where indentation is used for grouping), CSV (where a newline signals a new record), and Makefiles (where indentation must use a tab character).

In these cases, it is best practice to override ws in your own grammar to match only insignificant whitespace. Let’s take a look at a second, minimalistic INI parser, independently developed from the one described in the previous section:

grammar INIFile {
    token TOP { <section>* }
    token section {
        <header>
        <keyvalue>*
    }
    rule header {
        '['  <-[ \] \n ]>+ ']' <.eol>
    }
    rule keyvalue {
        ^^
        $<key>=[\w+]
        <[:=]>
        $<value>=[<-[\n;#]>*]
        <.eol>
    }
    token ws { <!ww> \h* }
    token eol {
        \n [\h*\n]*
    }
}

This parses simple INI configuration files like this:

[db]
driver: mysql
host: db01.example.com
port: 122
username: us123
password: s3kr1t

Take note how this grammar uses two paths for parsing whitespace: a custom ws token that matches only horizontal whitespace (blanks and tabs), and a separate token eol that matches (significant) line breaks. The eol token also gobbles up subsequent lines consisting only of whitespace.

If a language supports comments, and you don’t want them to appear in your parse tree, you can parse them either in your ws token, or in eol (or your equivalent thereof). Which one it is depends on where comments are allowed. In INI files, they are only allowed after a key-value pair or in a line on their own, so eol would be the fitting place. In contrast, SQL allows comments in every place where whitespace is allowed, so it is natural to parse them in ws:

# comment parsing for SQL:
token ws { <!ww> \s* [ '--' \N* \n ]* }

# comment parsing for INI files:
token eol { [ [ <[#;]> \N* ]? \n ]+ }
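As a quick check of the INI-style eol token, a toy grammar shows how it swallows trailing comments as well as comment-only lines (grammar and token names are made up for this example):

```perl6
grammar CommentedLines {
    token TOP { <key> <.eol> <key> <.eol> }
    token key { \w+ }
    # an optional comment, then the newline, repeated so that
    # comment-only lines are consumed as well
    token eol { [ [ <[#;]> \N* ]? \n ]+ }
}

say CommentedLines.parse("one# trailing\n; a comment-only line\ntwo\n").so;   # True
```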

Keeping State

Some of the more interesting data formats and languages require the parser to store things (at least temporarily) to be able to correctly parse them. A case in point is the C programming language, and others inspired by its syntax (such as C++ and Java). Such languages allow variable declarations of the form type variable = initial_value, like this:

int x = 42;

This is valid syntax, but only if the first word is a type name. In contrast, this would be invalid, because x is not a type:

int x = 42;
x y = 23;

From these examples it is pretty clear that the parser must have a record of all the types it knows. Since users can also declare types in their code files, the parser must be able to update this record.

Many languages also require that symbols (variables, types, and functions) be declared before they are referenced. This too requires the grammar to keep track of what has been declared and what hasn’t. This record of what has been declared (and what is a type or not, and possibly other meta information) is called a symbol table.

Instead of parsing the full C programming language, let’s consider a minimalist language that just allows assignments of lists of numbers and variables to variables:

a = 1
b = 2
c = a, 5, b

If we don’t impose declaration rules, it’s pretty easy to write a grammar:

grammar VariableLists {
    token TOP        { <statement>* }
    rule  statement  { <identifier> '=' <termlist> \n }
    rule  termlist   { <term> * % ',' }
    token term       { <identifier> | <number> }
    token number     { \d+ }
    token identifier { <:alpha> \w* }
    token ws         { <!ww> \h* }
}

Now we demand that variables can only be used after they’ve been assigned to, so that the following input would be invalid, because b is not declared in the second line, where it’s used:

a = 1
c = a, 5, b
b = 2

To maintain a symbol table, we need three new elements: a declaration of the symbol table, some code that adds a variable name to the symbol table when the assignment has been parsed, and finally a check whether a variable has been declared at the time we come across it in a term list:

grammar VariableLists {
    token TOP {
        :my %*SYMBOLS;
        <statement>*
    }
    token ws { <!ww> \h* }
    rule statement {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True }
        '=' <termlist>
        \n
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

In the token TOP, :my %*SYMBOLS; declares a variable. Declarations in regexes start with a colon (:) and end with a semicolon (;). In between, they look like normal declarations in Perl 6. The % sigil signals that the variable is a hash, a mapping of string keys to values. The * makes it a dynamic variable: a variable that is not limited to the current scope, but is also visible to code (or regexes, which are also code) called from the current scope. Since this is an unusually large scope, it is customary to give the variable a name in CAPITAL LETTERS.

The second part, adding a symbol to the symbol table, happens in the rule statement:

    rule statement {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True }
        '=' <termlist>
        \n
    }

Inside the curly braces is regular (non-regex) Perl 6 code, so we can use it to manipulate the hash %*SYMBOLS. The expression $<identifier> accesses the capture for the variable name. Thus, if this rule parses a variable a, this statement sets %*SYMBOLS{ 'a' } = True.

The placement of the code block is relevant. Putting it before the call to termlist means that the variable is already known when the term list is parsed, so it accepts input like a = 2, a. If we call termlist first, this kind of input is rejected.
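Assuming the VariableLists grammar above, this behavior is easy to verify:

```perl6
# The symbol is recorded before <termlist> runs, so a variable
# may be referenced in the very statement that declares it:
say VariableLists.parse("a = 2, a\n").so;   # True
say VariableLists.parse("a = 2, b\n").so;   # False: b is undeclared
```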

Speaking of rejection, this part happens in token variable. term now calls the new token variable (previously it called identifier directly), and variable validates that the symbol has been declared before:

    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }

You might remember from earlier examples that <?{ ... }> executes a piece of Perl 6 code and fails the parse if that code returns a false value. If $<identifier> is not in %*SYMBOLS, this is exactly what happens. Here the non-backtracking nature of tokens is important: if the variable being parsed is abc, and a variable a is in %*SYMBOLS, backtracking would try shorter matches for <identifier> until it hit a, and then erroneously succeed.

Since %*SYMBOLS is declared in token TOP, you have to duplicate this declaration when you try to call rules other than TOP from outside the grammar. Without a declaration such as my %*SYMBOLS;, a call like

VariableLists.parse('abc', rule => 'variable');

dies with

Dynamic variable %*SYMBOLS not found
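A declaration in the calling scope avoids the error; this sketch seeds the symbol table by hand:

```perl6
# Provide the dynamic variable that token variable looks up:
my %*SYMBOLS = abc => True;
say VariableLists.parse('abc', rule => 'variable').so;   # True
```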

Implementing Lexical Scoping with Dynamic Variables

Many programming languages have the concept of a lexical scope. A scope is the area in a program where a symbol is visible. We call a scope lexical if the scope is determined solely by the structure of the text (and not, say, runtime features of the program).

Scopes can typically be nested. A variable declared in one scope is visible in this scope, and in all inner, nested scopes (unless an inner scope declares a variable of the same name, in which case the inner declaration hides the outer).

Coming back to the toy language of lists and assignments, we can introduce a pair of curly braces to denote a new scope, so this is valid:

a = 1
b = 2
{
    c = a, 5, b
}

but the next example is invalid, because it declares b only in an inner scope, and so it is not visible in the outer scope:

a = 1
{
    b = 2
}
c = a, 5, b

To implement these rules in a grammar, we can make use of an important observation: dynamic scoping in a grammar corresponds to lexical scoping in the text it parses. If we have a regex that parses both the delimiters of a scope and the things inside that scope, its dynamic scope is confined to all of the regexes it calls (directly and indirectly), and that is also the extent of the lexical scope it matches in the input text.

Let’s take a look at how we can implement dynamic scoping:

grammar VariableLists {
    token TOP {
        :my %*SYMBOLS;
        <statement>*
    }
    token ws { <!ww> \h* }
    token statement {
        | <declaration>
        |  <block>
    }
    rule declaration {
        <identifier>
        { %*SYMBOLS{ $<identifier> } = True; }
        '=' <termlist>
        \n
    }
    rule block {
        :my %*SYMBOLS = CALLERS::<%*SYMBOLS>;
        '{' \n*
            <statement>*
        '}' \n*
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ %*SYMBOLS{ $<identifier> } }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

There are a few changes to the previous version of this grammar: the former rule statement has been renamed to declaration, and a new rule statement parses either a declaration or a block.

All the interesting bits happen in the block rule. The line :my %*SYMBOLS = CALLERS::<%*SYMBOLS>; declares a new dynamic variable %*SYMBOLS and initializes it with the previous value of that variable. CALLERS::<%*SYMBOLS> looks through the caller, and the caller’s caller, and so on for a variable %*SYMBOLS, and thus looks up the value corresponding to the outer scope. The initialization creates a copy of the hash, such that changes to one copy do not affect the other copies.

Let’s take a look at what happens when this grammar parses the following input:

a = 1
b = 2
{
    c = a, 5, b
}

After the first two lines, %*SYMBOLS has the value {a => True, b => True}. When rule block parses the opening curly bracket on the third line, it creates a copy of %*SYMBOLS. The declaration of c on the fourth line inserts the pair c => True into the copy of %*SYMBOLS. After rule block parses the closing curly brace on the last line, it exits successfully, and the copy of %*SYMBOLS goes out of scope. This leaves us with the earlier version of %*SYMBOLS (with only the keys a and b), which then goes out of scope when TOP exits.
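With the grammar above in scope, we can check both behaviours directly. This is just a verification sketch, assuming the grammar compiles as shown; the scoped example parses, while the one that uses b outside its scope does not:

```raku
# the valid example: b is declared in the outer scope, so it is
# visible inside the block
my $valid = q:to/END/;
a = 1
b = 2
{
    c = a, 5, b
}
END
say so VariableLists.parse($valid);    # True

# the invalid example: b is declared only inside the block, so the
# last line fails the symbol check
my $invalid = q:to/END/;
a = 1
{
    b = 2
}
c = a, 5, b
END
say so VariableLists.parse($invalid);  # False
```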

Scoping Through Explicit Symbol Tables

Using dynamic variables for managing symbol tables usually works pretty well, but there are some edge cases where a more explicit approach works better. Such edge cases include those where there are so many symbols that copying becomes prohibitively expensive, or where more than the top-most scope must be inspected, or when copying the symbol table is impractical for other reasons.

Consequently, you can write a class for your symbol table (which in the simplest case uses an array as a stack of scopes) and explicitly call methods on it when entering and leaving scopes, when declaring a variable, and for checking whether a variable is known in a scope:

class SymbolTable {
    has @!scopes = {}, ;
    method enter-scope() {
        @!scopes.push({})
    }
    method leave-scope() {
        @!scopes.pop();
    }
    method declare($variable) {
        @!scopes[*-1]{$variable} = True
    }
    method check-declared($variable) {
        for @!scopes.reverse -> %scope {
            return True if %scope{$variable};
        }
        return False;
    }
}
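Before wiring the class into the grammar, we can exercise the stack discipline on its own. This is just an illustration, not part of the grammar integration:

```raku
my $st = SymbolTable.new;
$st.declare('a');             # goes into the outer-most scope
$st.enter-scope;
$st.declare('b');             # goes into the new inner scope
say $st.check-declared('a');  # True: outer symbols stay visible
say $st.check-declared('b');  # True
$st.leave-scope;
say $st.check-declared('b');  # False: 'b' went out of scope
```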

grammar VariableLists {
    token TOP {
        :my $*ST = SymbolTable.new();
        <statement>*
    }
    token ws { <!ww> \h* }
    token statement {
        | <declaration>
        |  <block>
    }
    rule declaration {
        <identifier>
        { $*ST.declare( $<identifier> ) }
        '=' <termlist>
        \n
    }
    rule block {
        '{' \n*
            { $*ST.enter-scope() }
            <statement>*
            { $*ST.leave-scope() }
        '}' \n*
    }
    rule termlist { <term> * % ',' }
    token term { <variable> | <number> }
    token variable {
        <identifier>
        <?{ $*ST.check-declared($<identifier>) }>
    }
    token number { \d+ }
    token identifier { <:alpha> \w* }
}

The class SymbolTable has the private array attribute @!scopes which is initialized with a list containing a single, empty hash {}. Entering a scope means pushing an empty hash on top of this array, and when leaving the scope it is removed again through the pop method call. A variable declaration adds its name to the top-most hash, @!scopes[*-1].

Checking for the presence of a variable must not just consider the top-most hash, because variables from outer scopes are visible in inner scopes. Here we go through all the scopes in reverse order, from inner-most to outer-most. The order of traversal is not relevant for a simple Boolean check, but if you need to look up information associated with the variable, it is important to adhere to this order so that the inner-most declaration wins.
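If the scopes stored richer information than a Boolean (say, a type per variable), a lookup method would have to follow the same inner-most-first traversal, so that shadowing declarations win. A hypothetical sketch of such a method for the SymbolTable class above (the method name and the Nil fallback are illustrative, not from the original):

```raku
# hypothetical extension of SymbolTable: return whatever is stored
# with the inner-most declaration of $variable, or Nil if undeclared
method lookup($variable) {
    for @!scopes.reverse -> %scope {
        return %scope{$variable} if %scope{$variable}:exists;
    }
    return Nil;
}
```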

Token TOP creates a new object of class SymbolTable, declaration calls the declare method, and token variable calls method check-declared. The rule block calls enter-scope before parsing the statement list, and leave-scope afterwards. This works, but only if the statement list can be parsed successfully; if not, rule block fails before it manages to call leave-scope.

Perl 6 has a safety feature for such situations: if you prefix a statement with LEAVE, Perl 6 runs it for you when the routine exits, in all circumstances where this is possible (even if an exception is thrown). Since the LEAVE phaser only works in regular code, not in regexes, we need to wrap the regex in a method:

    method block {
        $*ST.enter-scope();
        LEAVE $*ST.leave-scope();
        self.block_wrapped();
    }
    rule block_wrapped {
        '{' \n*
            <statement>*
        '}' \n*
    }

Now we have the same robustness as the approach with dynamic variables, and more flexibility to add extra code to the symbol table, at the cost of more code and increased effort.

Summary

Perl 6 grammars are a declarative way to write recursive descent parsers. Without backtracking, they are predictive; at each point, we know what list of tokens to expect.

The recursive nature of grammars comes with the risk of left recursion, a situation where a recursive path does not consume any characters, and so leads to an infinite loop.

In spite of the top-down nature of grammars, writing them typically happens from the bottom up: starting with lexical analysis, and then moving up to parsing larger structures.

Complex languages require additional state for successful and precise parsing. We have seen how you can use dynamic variables to hold state in grammars, how their scope corresponds to lexical scoping in the input, and how symbol tables can be written and integrated into grammars.


  1. Like a Swiss-army knife, but with much more power.

  2. At this point it is crucial that identifier does not parse its surrounding whitespace. Hence the principle that tokens do not care about whitespace, and the rules that call those tokens parse the whitespace.

  3. In this case, this would be harmless, because no other rule could match the rest of the variable, leading to a parse error nonetheless. But in more complicated cases, this kind of unintended backtracking can lead to errors that are very puzzling for the maintainer of the grammar.

Day 6 – Perl 6 Books — the Time is Ripe

One question we occasionally get on the phenomenal #perl6 IRC channel is about Perl 6 books. It turns out, some folks don’t want to learn just from tutorials, blog posts and docs. Last year, we didn’t have any good answers to that question.

A few months ago, there seemed to be a flurry of such questions, and at the same time, rumours about new book projects started to spread.

If I remember correctly, the first rumour was when Laurent Rosenfeld contacted me in June, asking for feedback on a manuscript. He had translated a good 200 pages of the Think Python book to Perl 6. This is primarily a book teaching programming to beginners. Later I was happy to hear that a major publisher had accepted the manuscript. The manuscript is now finished and under review, so I guess we'll see e-book and print versions of this book in due time.

Then brian d foy started a Kickstarter campaign to raise funds for a "Learning Perl 6" book. By the time this is published, the campaign should still be open, so I warmly encourage you to support it. Having a book by a prominent member of the Perl 5 community, and author of several good Perl books, would surely be a boon for the Perl 6 community.

Before the publication of brian's project, I had been mulling over writing my own Perl 6 book. In the past, I tried that once already, and failed. But I had more success with another book project of mine, so I decided to give it another shot. The working title is Perl 6 by Example. Content for the book is published in the form of blog posts, and later turned into book chapters.

Later I learned that Ken Youens-Clark had written a book on metagenomics that spends about 150 pages explaining Perl 6. And it's free: you can download the PDF, epub and mobi versions right away! If you've followed the advent calendar, you might have read Ken's earlier post on Bioinformatics.

In summary, we have one book featuring substantial Perl 6 content, one in progress with the manuscript "feature complete", one in the kickstarting phase which you can support, and a fourth being written right now, with parts pre-published in the form of blog posts.

If you want to keep up-to-date with news on Perl 6 books, you can sign up for the low volume Perl 6 book mailing list (less than one mail per month).

I hope that in next year’s advent calendar, we’ll see four reviews of Perl 6 books.

Day 1 – The State of Perl 6 in 2015

Please fasten your seat belt for your annual dive into Perl 6.

As has been customary the last six years, we start with a brief overview of the state of Perl 6.

Last year's State of Perl 6 anticipated an upcoming production release of Perl 6; that is still scheduled for this Christmas. Last year's summary also identified major areas of work to be done for this release.

The 2015.05 release of Rakudo introduced NFG or "Normal Form Grapheme", which means that strings in Perl 6 are no longer based on Unicode codepoints, but on grapheme clusters. Grapheme clusters can consist of base characters, like the Latin lower-case c, and combining characters, like the "combining cedilla" character. Together, they form a single visual character, "c with cedilla", ç. This character also exists as a single codepoint, but other combinations of base and combining characters don't. With this new feature, Perl 6 still treats a cluster of a base character plus one (or more) combining characters as a single character, so regex matches and substr won't tear them apart.

In September, Rakudo shipped with the GLR or Great List Refactoring in place. This mostly means that the rules for using and accessing nested data structures are now much simpler and more consistent. Under the hood we also have a sane and powerful iterator model, and a new type Seq for lazy value streams that don't necessarily memorize old values on iteration.

Finally, the September release introduced native, shaped arrays (or NSA). This allows you to write

    my int32 @matrix[4;5]

which allocates a contiguous block of 20 32-bit values, but is still usable as a two-dimensional matrix in Perl 6 land. This paves the road towards memory-efficient linear algebra (and other applications, of course).

But development didn't stop there. The upcoming December release of Rakudo brings us automatic precompilation of modules and installation repositories.

Progress was not limited to the compiler: where the Perl 6 Modules page showed around 270 modules a year ago (end of November 2014), we are now at about 460 modules. I'm also happy to report that there are now two module installers, panda and zef.

We also now have decent documentation coverage, at least of the built-in types; other areas such as tutorials and material on language features are still a bit sparse. External documentation efforts such as Learn Perl 6 in Y minutes and perl6intro.com have sprung up, and enjoy popularity.

I'm sure there has been more Perl 6 activity worth reporting, but the excitement for the upcoming release makes a calm retrospective a bit harder than usual.

Stay tuned for more, and have fun!

Day 23 – Webscale sorting of the sleepy kind

In the emerging tradition of writing about sort on the 23rd, let me elaborate on a particularly silly kind of sorting algorithm: sleepsort.

The idea is that to sort a list of numbers, you spawn a separate process (or thread) for each one, sleep for as long as the number specifies, and then print the number. That way the numbers are printed in ascending order.

Rakudo on the MoarVM backend has threads, so let's put them to good use:

 use v6;
 
 my @unsorted = (1..10).pick(5);
 
 say "Unsorted: @unsorted[]";
 
 await @unsorted.map: -> $t {
     start {
         sleep $t;
         say $t;
     }
 };

If you run it, you should get output like this:

 $ ./perl6-m sleepsort-01.p6
 Unsorted: 1 7 5 8 3
 1
 3
 5
 7
 8

How did we get there? (1..10).pick(5) randomly picks (without replacement) five numbers from 1 to 10. start { ... } starts a thread and returns a Promise object. We do that for all the numbers, and await blocks until all promises that have been passed to it are fulfilled (or broken, but that won't happen here). In other words: until all the newly spawned threads have finished.

Now let's make it web scale! Well, sort (sic) of.

First an obvious improvement is to not sleep for the whole second. Let's just scale it with a constant factor: sleep $t / 10, and we're about ten times as fast!

But there's another scaling problem: if you want to sort hundreds of thousands of numbers with sleepsort (you'd do that, right?), our program would spawn as many threads, wasting lots of memory. Also, sleep maps pretty much directly to the underlying libc call, and thus blocks the thread. We can do better:

 use v6;
 
 my @unsorted = (1..10).pick(5);
 
 await @unsorted.map: -> $t {
     Promise.in($t / 10 ).then({ say $t });
 };

Promise.in($s) creates a Promise that will be kept in $s seconds. .then creates a chained promise whose block is executed when the first one is kept.

This also removes the need to spawn processes explicitly. The standard scheduler takes care of that. We're basically web scale!

Now it's also time to address an issue of correctness. The sleepsort algorithm is a kind of divide-and-conquer approach, but there's no joining of the results: no single point in the program can access the entire sorted list. To fix that, we use a Channel, a thread-safe queue:

 use v6;
 my @unsorted = (1..10).pick(5);
 
 my $channel = Channel.new;
 
 await @unsorted.map: -> $t {
     Promise.in($t / 10 ).then({ $channel.send($t) });
 };
 $channel.close;
 
 say $channel.list;

This prints the sorted list all from the main thread.

Now it can be encapsulated in a nice subroutine:

 sub sleepsort(*@values, :$factor) {
     my $channel = Channel.new;
 
     await @values.map: -> $t {
         Promise.in($t * $factor ).then({ $channel.send($t) });
     };
     $channel.close;
 
     $channel.list;
 }

Another issue of correctness remains: both sleep and Promise.in only specify a minimal time to sleep; longer, implementation-dependent sleep times are possible. If $factor is too small, the promises might be executed out of the desired order.

So let's find the minimal factor with which it still works:

 my $previous = Inf;
 for 0.1, */1.5 ... * -> $factor {
     say "Trying $factor";
     my $success = [<=] sleepsort(:$factor, @unsorted);
     say $success ?? 'Success' !! 'Failed';
     unless $success {
         say "Last good factor: ", $previous;
         last;
     }
     $previous = $factor;
 }

On my machine, this produces the following output:

 Trying 0.1
 Success
 Trying 0.066667
 Success
 Trying 0.044444
 Success
 Trying 0.029630
 Success
 Trying 0.019753
 Success
 Trying 0.013169
 Success
 Trying 0.008779
 Success
 Trying 0.005853
 Failed
 Last good factor: 0.008779

So a :$factor of around 0.009 or 0.01 seems to work on my setup.

Your output/mileage may vary.

Over and out, sleeping until Christmas Eve.

Day 10 – Introspecting the Symbol Tables

Perl 6 is designed with extensibility in mind. And when you want to extend something, you often need to know as much as possible about the environment.

Today we’ll look at an aspect of finding out about the environment: introspecting symbol tables.

A symbol is something you refer to by name, so it could be a package, class, variable, routine, constant etc.

A symbol table is a data structure in which symbols can be looked up. Perl 6 has three main types of symbol tables, differentiated by scope: lexical symbols live in the lexical pad, package symbols live in a Stash, and methods live in a method table.

Enough theory, time for action:

    $ perl6 -e 'my $x = 42; say MY::.keys'
    $x $! $/ $_ GLOBALish EXPORT $?PACKAGE ::?PACKAGE $=pod !UNIT_MARKER

MY is a pseudo package representing the current scope, and appending two colons gives us its symbol table, which in turn roughly behaves like a Hash. So we can use a method like keys to find all symbols in that table, or look up a string there:

    $ perl6 -e 'my $x = 42; say MY::<$x>'
    42

A complicated way to say $x, but you can use this to look up symbols by name:

    $ perl6 -e 'my $x = 42; my $var = q[$x]; say MY::{$var}'
    42

Or if you don’t care if it comes from the current lexical scope, just from somewhere:

    $ perl6 -e 'my $x = 42; my $var = q[$x]; say ::($var)'
    42

Needless to say, in general this is a bad idea; reserve it for exceptional circumstances.

But wait, what were all those other symbols from the first line? $!, $/ and $_ are the three "special" variables for errors, matches and the current topic; the rest are symbols that don't strictly need to be in lexical scope, but the compiler puts them there because it makes things easier.

Outer lexicals are accessible through the OUTER:: pseudo package:

    $ perl6 -e 'my $x = 1; { my $x = 2; say OUTER::<$x>; say $x }'
    1
    2

Instead of OUTER::<$x>, you can also say $OUTER::x. In general, you can move the sigil to the front and omit the angle brackets if the symbol name is a literal.
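For example, the earlier snippet can be rewritten with the sigil moved to the front:

```raku
# same lookup as OUTER::<$x> above, just with the sigil-first syntax
my $x = 1; { my $x = 2; say $OUTER::x }   # prints 1
```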

Package declarations and explicitly our-scoped symbols generally go into a stash, with the top-most being GLOBAL::

    class A {
        class B { }
    }
    say GLOBAL::<A>;    # (A)
    say A;              # (A)
    say A::.keys;       # B

The double colon is not the only way to get at the symbol table; you can also use .WHO (which also works for non-literal expressions). So a long way to write A::B is

    say A::B === GLOBAL::<A>.WHO<B>;    # True

Finally, methods live in classes or roles, and you can use the meta object to introspect them; these generally respect inheritance and role composition, so they are a bit more complicated than stashes.

    my $m = Str.^can('split')[0];
    say $m.^name;               # Method
    say $m('a1B', /\d/).perl;    # ("a", "B").list

Here .^can(name) returns a list of methods on that object with a given name. It’s a list, because it includes methods from superclasses, so there can be more than one. To get a list of all available methods, you can use .^methods and .^methods(:all):

    $ perl6-m -e 'say  Str.^methods(:all).grep: {.name ~~ /^m/ }'
    match match map min max minmax

.^methods(:all) includes methods from all superclasses; without the :all, methods from the classes Cool, Any and Mu are excluded.

This concludes our short tour through the land of the symbol table. There is more to see (for example the EXPORT package, and various pseudo packages), and the daring can go to the design documents or play with Rakudo’s implementation, which is fairly robust.

Day 01 – The State of Perl 6 in 2014

Welcome to the 6th annual edition of the Perl 6 advent calendar, code name 2014.

In 2014, MoarVM has become the de facto standard backend for the Rakudo Perl 6 compiler. Parrot and the JVM are still supported as backends, but MoarVM provides much lower memory usage, faster startup time, and is significantly faster than Parrot at runtime.

Rakudo on MoarVM also has concurrency support, including reactive programming.

Much work has been done to improve performance, some in MoarVM, some in Rakudo and NQP. Maybe most notable is the JIT compiler for MoarVM that came out of a Google Summer of Code project by Bart Wiegmans.

Another Google Summer of Code project by Filip Sergot brought us HTTP::UserAgent, a versatile HTTP client library with SSL/TLS support.

During the Austrian Perl Workshop in fall 2014, many of the Perl 6 and Rakudo core contributors met, and identified three roadblocks for a "final" Perl 6.0 release: GLR, NFG and NSA.

Here GLR stands for the Grand List Refactoring, a plan to make the list-y types more transparent, list iteration faster, and more obvious when a list will flatten.

NFG is the Normal Form Grapheme, a plan for implementing grapheme-based string indexing.

Finally, our NSA has nothing to do with surveillance. Natively Shaped Arrays are a flexible feature for declaring typed arrays, possibly multi-dimensional and with pre-defined dimensions. It will make memory-efficient matrix storage and operations possible, for example.

With these three blockers defined, Larry Wall submitted a talk to FOSDEM called Get ready to party!, predicting that 2015 will be the year that Perl 6 will get a production release.

Sorry, this was meant to be a summary of the current state of Perl 6, and it derailed into an outlook. To allow me to keep the "state" in the title, let me just tell you that Rakudo on MoarVM is quite fun to work with. It's fast enough for small (and sometimes even mid-sized) tasks, module installation works fairly reliably, and the range of available modules has also increased.

Also I feel that any attempt to summarize the progress of this awesome community is bound to be very incomplete; I hope that my fellow Perl 6 hackers will fill in some details in the upcoming 23 posts.

Have a nice pre-Christmas time, and enjoy the show!