Day 15 – Parsing Globs using Grammars

UPDATE 12/21/16: After some discussions here and at reddit, I realized I’d made a couple of important mistakes. Sorry about that. I have corrected GlobAction and a bug in the glob helper subroutine. I also replaced the use of “AST” with “parse tree” below, as that is more technically correct.

If you are at all familiar with Perl, you are likely to be familiar with regexes. In Perl 6, regexes have been extended to provide features and syntax allowing complex parsing tasks to be simplified and grouped together into grammars. To give a taste of this, let’s consider how to parse file globs. It’s simple, it’s well-defined, and I have some practice writing parsers for the task.

Introducing File Globs

In case you are not familiar with them, a file glob is a sort of pattern match for file names. For example, consider this glob:

*.txt

This will match any file ending in “.txt”. The asterisk (“*”) means “match as many characters as needed to make the match”. This matches “foo.txt” and “blah.blah.txt”, but not “z.html”.

Here’s another glob to consider:

.??*

This will match any file that starts with a period (“.”) followed by 2 or more characters. That means it will match a file named “.zshrc”, but not one named “..”, making it a popular heuristic for finding “hidden” files on a Unix file system.

Basic Grammar

What does a basic glob parsing grammar look like? Let me show a simple one and then I will explain the interesting bits:

grammar Glob {
    token TOP { <term>+ }
    token term {
        || <match>
        || <char>
    }
    token match { <match-any> | <match-one> }
    token match-any { '*' }
    token match-one { '?' }
    token char { . }
}

This is just about the simplest possible grammar for a glob that covers the two most popular forms. Let’s consider the parts.

A grammar is just a sort of class. Grammars may contain rules and tokens. Rules and tokens are like methods, but they are defined as regexes and are used to match the input to the grammar. Here we define a grammar named “Glob” to parse our glob.

Every grammar may define a special rule or token named “TOP”. This becomes the default regex to use when matching input with that grammar. Without it, any attempt to parse with the grammar must name the rule or token to begin with.
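For example, with the Glob grammar above, we can start matching at a specific token by passing the :rule named argument to parse (a tiny illustration of my own):

say Glob.parse('*', :rule<match-any>); # ｢*｣
say Glob.parse('?', :rule<match-one>); # ｢?｣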

Aside from that, each rule and token defines a regex to use when matching the input. For example, the match-any token here will match only an asterisk (“*”). Similarly, the char token will match any single character.

To be useful as a grammar, rules and tokens refer to one another using angle brackets (“<>”). Here, TOP matches one or more terms, and term matches either a match or a char (using || to try each alternative in the order given).

Whitespace Handling

Why do I only use the token keyword and never make use of rule? Because of whitespace. A rule inserts some implicit whitespace handling, which is helpful in keeping parsers readable. However, this handling is not helpful when parsing file globs.

Consider the following rule and token:

token { 'a' 'b'+ }
rule  { 'a' 'b'+ }

The token will match “abbbbb” and “ab” but not “a b” or “a bbbb”. The rule, however, will not match “abbbbb” or “ab” but will match “a b” and “a bbbb”. This is the whitespace handling at work. The rule above is implicitly similar to this token:

token { 'a' <.ws> 'b'+ <.ws> }
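We can check this behaviour directly by putting named versions of these into a grammar (a quick demonstration of my own; the names are mine):

grammar WS-Demo {
    token t { 'a' 'b'+ }
    rule  r { 'a' 'b'+ }
}

say so WS-Demo.parse('abbbb', :rule<t>); # True
say so WS-Demo.parse('a bbb', :rule<t>); # False
say so WS-Demo.parse('abbbb', :rule<r>); # False
say so WS-Demo.parse('a bbb', :rule<r>); # True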

That ws is a predefined token available in all grammars. The built-in definition matches a word break followed by zero or more spaces, newlines, or other whitespace.
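In current Rakudo that built-in definition is essentially the following (my paraphrase of the core source):

token ws { <!ww> \s* }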

Depending on what works better for you, you may redefine ws if you prefer something different:

token ws { "\n" } # whitespace is always newlines

Or you may just use tokens to avoid the implicit whitespace handling. This latter solution is what I believe makes the most sense for this grammar.

Parsing Input

Now that we have our basic grammar, how do we parse the input? Here’s a simple example:

my $ast = Glob.parse("*.?");
dd $ast;
# OUTPUT is an ugly, long parse tree
# Match $ast = Match.new(ast => Any, list => (), hash =>
# Map.new((:term([Match.new(ast => Any, list => (), hash =>
# Map.new((:match(Match.new(ast => Any, list => (), hash => Map.new(()), 
# orig => "*.?", to => 1, from => 0)))), orig => "*.?", to => 1, from => 
# 0), Match.new(ast => Any, list => (), hash => Map.new((:char(Match.new(
# ast => Any, list => (), hash => Map.new(()), orig => "*.?", to => 2, 
# from => 1)))), orig => "*.?", to => 2, from => 1), Match.new(ast => Any, 
# list => (), hash => Map.new((:match(Match.new(ast => Any, list => (), 
# hash => Map.new(()), orig => "*.?", to => 3, from => 2)))), orig => 
# "*.?", to => 3, from => 2)]))), orig => "*.?", to => 3, from => 0) 

Since we get output, the parse succeeded. Given how accepting our grammar is, everything but an empty string ought to succeed. If a parse fails, the result of parse is an undefined Any.
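We can check the failure case directly (my own quick check):

my $result = Glob.parse("");
say $result.defined; # False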

If you’ve used regexes in Perl 6 at all, this parse tree should look very familiar. Grammars are really nothing more than a little bit of extra structure around regexes. The parse trees they generate are just regex Match objects that may have additional matches nested within.

To do something useful with this parse tree, you have some options. You could write a tree walker that will go through the parse tree and do something interesting. However, Perl grammars provide a built-in mechanism to do that while in the middle of the parse, called an action class. Let’s use an action class instead.

Action Classes

An action class is a regular class that can be paired to a grammar to perform actions based upon each rule or token matched. The methods in the action class will mirror the names of the rules and tokens in the grammar and be called for each.

Here is an action class that is able to translate the parsed glob into a matcher class:

class GlobAction {
    method TOP($/) { make GlobMatcher.new(terms => $<term>».made) }
    method term($/) { make $/.values».made.first }
    method match($/) { make $/.values».made.first }
    method match-any($/) { make '*' }
    method match-one($/) { make '?' }
    method char($/) { make ~$/ }
}

Each method will receive a single argument which is the match object. If the argument list is not given, the implicit match variable $/ is used instead (here we make that explicit). The action class can perform any action in response to the match it wants. Often, this involves rewriting the return value of the match by using the make command. This sets the value returned by the .made method on the match (as you can see is used in term and match).
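Here is a minimal, self-contained illustration of the make/.made pairing, separate from the glob code (an example of my own):

grammar Digits { token TOP { \d+ } }
class DigitsAction { method TOP($/) { make $/.Int * 2 } }
say Digits.parse('21', :actions(DigitsAction.new)).made; # 42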

To put this action class into use, you pass the action class during parsing like so:

my $matcher = Glob.parse("*.txt", :actions(GlobAction.new)).made;
for <foo.txt blah.blah.txt z.html> -> $f {
    say "$f matches" if $f ~~ $matcher;
}

Now all we need is the magic defined within GlobMatcher, which does the real work of testing whether a given file name matches the parsed glob.

Putting it all together

Here is an implementation of matching. It uses the ACCEPTS method to define a smartmatch operation that can be applied to any defined string.

class GlobMatcher {
    has @.terms;

    method ACCEPTS(Str:D $s) {
        my @backtracks = [ \([ $s.comb ], [ @!terms ]) ];
        BACKTRACK: while @backtracks.pop -> (@letters, @terms) {
            LETTER: while @letters.shift -> $c {
                last LETTER unless @terms;

                my $this-term = @terms.shift;
                my $next-term = @terms[0];

                given $this-term {
                    when '*' {
                        # Continue to the next term if we can
                        if @terms 
                        and $next-term ne '*' | '?' 
                        and $c eq $next-term {
                            push @backtracks, 
                                ([ @letters ], [ '*', |@terms ]);
                            redo;
                        }

                        # Match anything, and try again next round
                        unshift @terms, $this-term;
                    }

                    # We have a letter, so we match!
                    when '?' { }

                    # Only match exactly
                    default {
                        # If not an exact match, we fail; 
                        # try again if we can
                        next BACKTRACK if $c ne $this-term;
                    }
                }
            }

            # If we matched everything, we succeed
            return True unless @terms;

            # Otherwise, try the next backtrack, if any
        }

        # We ran out of back tracks, so we fail
        False;
    }
}

This algorithm is kind of ugly and might be prettier if I’d implemented it via a DFA or translated it to regular expressions or something, but it works.
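For the curious, the “translate to regular expressions” idea might look roughly like the sketch below. This is a hypothetical helper of my own (glob-to-regex is an invented name), untested against edge cases:

sub glob-to-regex(@terms) {
    my $src = @terms.map({
        when '*' { '.*' }
        when '?' { '.' }
        default  { "'$_'" } # quote literal characters
    }).join(' ');
    regex { ^ <$src> $ } # interpolate the assembled regex source
}

say so 'foo.txt' ~~ glob-to-regex(['*', '.', 't', 'x', 't']); # True
say so 'z.html'  ~~ glob-to-regex(['*', '.', 't', 'x', 't']); # False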

Typing out the code to parse is a bit tedious, so we’ll build ourselves a little helper to make it cleaner looking:

sub glob($glob, :$globlang = Glob) { 
    $globlang.parse($glob, :actions(GlobAction.new)).made; 
}

for <foo.txt blah.blah.txt z.html> -> $f {
    next unless $f ~~ glob('*.txt');
    say "$f matches";
}

We could even use it in a nice grep:

<foo.txt blah.blah.txt z.html>.grep(glob('*.txt')).say;

Extending the Grammar

Now that we have a basic working grammar, let’s say we want to expand it to support a variant based on ANSI/ISO SQL’s LIKE operator. It is very much like a glob, but uses percent (“%”) in place of asterisk (“*”) and underscore (“_”) in place of question mark (“?”). We could implement this as a brand new grammar, but let’s make our new grammar a subclass instead, so we can avoid duplicating work.

grammar SQLGlob is Glob {
    token match-any { '%' }
    token match-one { '_' }
}

<foo.txt blah.blah.txt z.html>.grep(
    glob('%.txt', :globlang(SQLGlob))
).say;

That was too easy.

Expanding the Grammar

From here, we might also try to expand the grammar to handle character classes or curly-expansions:

[abc]*.{txt,html}

In BSD-like globs, the above will match any file starting with “a” or “b” or “c” that also ends with “.txt” or “.html”. In which case, we might consider using a proto to make it easy to expand the kinds of matches we allow. A full demonstration of this is beyond the scope of this post, but it could look something like this:

grammar Glob {
    ...

    token term {
        || <expansion>
        || <match>
        || <char>
    }
    proto token expansion { * }
    proto token match { * }

    ...
}

grammar BasicGlob is Glob {
    token match:any { '*' }
    token match:one { '?' }
}

grammar BSDGlob is BasicGlob {
    token expansion:list { '{' <term>+ % ',' '}' }
    token match:char-class { '[' <char>+ ']' }
}

grammar SQLGlob is Glob {
    token match:any { '%' }
    token match:one { '_' }
}

By using a proto token or proto rule we gain a nice way of handling alternative matches within grammars. This allows us to create a variety of grammar variants that don’t repeat themselves. It also gives developers using our grammars the flexibility to subclass and modify them to suit their custom needs without monkey patching.
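For example, a downstream user could graft an extra match type onto the BSD variant without touching the original (a hypothetical subclass of my own):

grammar MyGlob is BSDGlob {
    token match:home { '~' }
}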

A working example of something like this can be found in the source for IO::Glob. (Though much of that was written long enough ago that I’m not sure it is the best possible example of what to do, it is an example.)

Grammars are a really nifty way of grouping a set of regexes together to build a complete parser as a unit. They are easy to read, provide tools that allow you to avoid repeating yourself, and I find them to be an enjoyable tool to play with and use in Perl 6.

Cheers.

Day 14 – Targeting MoarVM, the Wrong Way

MoarVM is a virtual machine specifically designed to be a backend for the NQP compiler toolchain in general and the Rakudo Perl 6 compiler in particular.

It is not restricted to running Perl 6, though, and if anyone wants to implement their own language on top of it, Jonathan has been kind enough to provide free course material that walks you through the process. In particular, the code examples for PHPish and Rubyish are worth a quick look to see how things are supposed to work.

However, where there’s a Right Way of doing things, there’s also a Wrong Way, and that’s what we’re gonna look at today!

Generating Bytecode

MoarVM bytecode is generated from MAST trees, defined in lib/MAST/Nodes.nqp of your MoarVM checkout. The file states:

# This file contains a set of nodes that are compiled into MoarVM
# bytecode. These nodes constitute the official high-level interface
# to the VM. At some point, the bytecode itself will be declared
# official also. Note that no text-based mapping to/from these nodes
# will ever be official, however.

This has historical reasons: Parrot, the VM that Rakudo used to target, had an unhealthy overreliance on its textual intermediate representation PIR. Personally, I think it is a good idea to have some semi-official text-based bytecode representation – you just shouldn’t use it as the exchange format between compilation stages.

That’s where doing things the Wrong Way comes in: during the last two weeks, I’ve started writing an assembler targeting MAST and a compiler for a tiny low-level language targeting this assembly dialect, doing exactly what I just told you not to do.

Why did I? What I hope to accomplish eventually is providing a bootstrapped alternative to the NQP toolchain, and you have to start your bootstrapping process somewhere.

Currently, only a few bits and pieces have been implemented, but these bits and pieces are somewhat functional and you can do such useful things as echo input from stdin to stdout:

$ cat t/echo.tiny
fn main() {
    obj stdin = getstdin
    do {
        str line = readline stdin
        int len = chars line
        done unless len
        print line
        redo
    }
    exit 0
}

You can either run the code directly

$ ./moartl0 --run t/echo.tiny

compile it first

$ ./moartl0 --compile t/echo.tiny

$ moar t/echo.moarvm

or take a look at the generated assembly

$ ./moartl0 --dump t/echo.tiny
.hll tiny
.frame main
.label bra0_main
    .var obj v0_stdin
    getstdin $v0_stdin
.label bra1_do
    .var str v1_line
    readline_fh $v1_line $v0_stdin
    .var int v2_len
    chars $v2_len $v1_line
    unless_i $v2_len @ket1_do
    print $v1_line
    goto @bra1_do
.label ket1_do
    .var int i0
    const_i64 $i0 0
    exit $i0
.label ket0_main
# ok

There isn’t really anything fancy going on here: Text goes in, text goes out, we can explain that.

Note that the assembly language is not yet finalized, but so far I’ve opted for a minimalistic syntax that has VM instructions separated from their operands by whitespace, accompanied by assembler directives prefixed with a period (.).

Under the Hood

If you were to look at the source code of the compiler (as we probably should – this is supposed to be the Perl 6 advent calendar, after all), you might discover some useful idioms, like using a proto declaration

proto MAIN(|) {
    CATCH {
        ... # handle errors
        exit 1;
    }

    ... # preprocess @*ARGS
    {*}
}

to accompany our multi MAIN subs that define the command line interface.
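The multis themselves might then look something like this (my guess at the general shape, not the actual source):

multi MAIN(Str $file, Bool :$run!)     { ... } # run the program
multi MAIN(Str $file, Bool :$compile!) { ... } # write bytecode to disk
multi MAIN(Str $file, Bool :$dump!)    { ... } # print the assembly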

However, you would also come across things that might not necessarily be considered best practice.

For one, the compiler is not reentrant: In general, we’re supposed to pass state along the call chain either in the form of arguments (the implicit self parameter of methods is a special case of that) or possibly as dynamic variables. When writing compilers specifically, the latter tend to be useful to implement recursive declarations like nested lexical scopes: a lexical frame of the target language will correspond to a dynamic frame of the parser. If you don’t care about reentrancy, though, you can just go with global variables and use the temp prefix to achieve the same result.
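As a toy sketch of mine (not taken from the compiler), this is the kind of thing temp makes possible with a global:

my $scope-depth = 0; # global parser state; not reentrant

sub parse-block() {
    temp $scope-depth; # saved here, restored when the sub leaves
    $scope-depth += 1;
    say "in scope at depth $scope-depth";
    parse-block() if $scope-depth < 3; # recurse into nested scopes
}

parse-block(); # prints depths 1, 2, 3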

For another, the compiler also doesn’t use grammars, but instead, the body of the line-based parsing loop is a single regex, essentially

# next-line keeps track of line numbering and trims the string
while ($_ := next-line) !=:= IterationEnd { /^[
    | ['#'|$]                       # ignore comments and empty lines
    | (:s ld (\w+)'()' '{'${ ... }) # start of load frame definition
    | (:s fn (\w+)${ ... })         # forward declaration of a function
    | ...                           # more statements
    || {bailout}
]/ }

The blocks { ... } represent the actions that have been embedded into the regex after $ anchors terminating each line.

That’s not really a style of programming I’d be comfortable advocating for in general – but Perl being Perl, There’s More Than One Way to Do It: For better or worse, Perl 6 gives programmers a lot of freedom to structure code how they see fit. As the stage 0 compiler is supposed to be supplanted anyway, I decided to have some fun instead of crafting a proper architecture.

In comparison, the assembler implemented in NQP is far more vanilla, with state held by an actions object.

But… Why?

The grinches among you may ask, What is this even doing here? Is this just someone’s personal side project that just happens to be written in Perl 6, of no greater use to the community at large?

Well, potentially, but not necessarily:

First, I do plan on writing a disassembler for MoarVM bytecode, and that may come in handy for bug hunting, testing or when looking for optimization opportunities.

Second, when running on MoarVM, Perl 6 code may load and interact with compilation units written in our tiny language or even hand-optimized VM assembly. The benefit over something like NativeCall is that we never leave the VM sandbox, and in contrast to foreign code that has to be treated as black boxes, the JIT compiler will be able to do its thing.

Third, an expansion of the MoarVM ecosystem might attract the attention of language enthusiasts beyond the Perl community, with at least the chance that MoarVM could deliver on what Parrot promised.

However, for now that’s all just idle speculation – remember, all there is right now is a two-week-old toy I came up with when looking for something to write about for this advent calendar. It’s a very real possibility that this project will die a quiet death before amounting to anything. But on the off chance it does not, it’s nice to have a hot cup of your preferred beverage and dream about a future where MoarVM rises as a butterfly-winged phoenix from the ashes of a dead parrot….

Day 13 – Audio Streaming done Completely Wrong

Starting out

I made the audio streaming source client Audio::Libshout about a year and a half ago, and it works quite well: with the speed improvements in Rakudo I have been able to stream 320kb/s MP3 without a problem. But it always annoyed me that it was difficult to test properly, and even to test it at all required an Icecast server. Even with an Icecast server that I could stream to, it would be necessary to actually listen to the stream in order to determine whether it was functioning correctly.

This all somewhat came to a head earlier in the year when I discovered that even the somewhat rudimentary tests that I had been using for the asynchronous streaming support had failed to detect that the stream wasn’t being fed at all. What I really needed was a sort of dumb streaming server that could act in place of the real icecast and could be instrumented to determine whether a connection was being made and that the correct audio data was being received. How hard could it be? After all it was just a special kind of web server.

I should have known from experience that this was a slippery slope, but hey.

A little aside about Audio Streaming

An icecast streaming server is basically an HTTP server that feeds a continuous stream of encoded audio data to the client listeners, who connect with an HTTP GET request. The data to be streamed is typically provided by a source client, which connects to the server, probably authenticates using HTTP Authentication, and starts sending the data at a steady rate proportional to the bitrate of the stream. libshout connects with a custom request method of SOURCE, inherited from its earlier shoutcast origins, though icecast itself understands PUT as well for the source client. Because it is the responsibility of the listening client to supply the decoded audio data to the soundcard at exactly the right rate, and the encoded data contains the bitrate of the stream as transmitted from the source, the timing demands on the server are not too rigorous: it just has to be consistent and fast enough that a buffer on the client can be kept sufficiently full to supply the audio data to the sound card. Icecast does a little more in detail, for instance adjusting for clients that don’t seem to be reading fast enough and so forth, but in principle it’s all relatively simple.

Putting it together

As might be obvious by now, an audio streaming server differs from a typical HTTP server in that, rather than serving some content from disk or generated by the program itself, it needs to share data received on one client connection with one or more other client connections. In the simplest C implementation one might have a set of shared buffers, one of which is being populated from the source connection at any given time, whilst the others are being consumed by the client connections, alternating between filling and depletion. Whether the implementation settles on a non-blocking or threaded design, possibly the most critical part of the code will be the synchronisation between the source writer and the client readers, ensuring that a buffer is not being read from and written to at the same time.

In Perl 6 of course you’d hope you didn’t need to worry about these kind of annoying details as there are well thought out concurrency features that abstract away most of the nasty details.

From one to many

Perhaps the simplest program that illustrates how easy this might be would be this standalone version of the old Unix service chargen:

use v6.c;

my $supplier = Supplier.new;

start {
    loop {
        for ( 33 ... 126 ).map( { Buf.new($_) }) -> $c {
            sleep 0.05;
            $supplier.emit($c);
        }
    }
}

my $sig = signal(SIGPIPE).tap( -> $v {
    $sig.close;
});

react {
    whenever IO::Socket::Async.listen('localhost', 3333) -> $conn {
        my $p = Promise.new;
        CATCH {
            when /'broken pipe'/ {
                if $p.status !~~ Kept {
                    $p.keep: "done";
                }
            }
        }

        my $write = $supplier.Supply.tap( -> $v {
            if $p.status ~~ Planned {
                $conn.write: $v;
            }
            else {
                $write.close;
            }
        });
    }
}

In this the Supplier is being fed asynchronously to stand in for the possible source client in our streaming server, and each client that connects on port 3333 will receive the same stream of characters – if you connect with two clients (telnet or netcat, for instance) you will see they are getting the same data at roughly the same time.

The Supplier provides a shared sequence of data of which there can be many consumers, so each connection provided by the IO::Socket::Async will be fed the data emitted to the Supplier starting at the point the client connected.
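That live behaviour is easy to demonstrate on its own (my own snippet): a tap only sees values emitted after it was set up, just as a listener only hears the stream from the moment it connects.

my $supplier = Supplier.new;
my $supply   = $supplier.Supply;

$supply.tap: { say "A got $_" };
$supplier.emit(1);                 # only A sees this
$supply.tap: { say "B got $_" };
$supplier.emit(2);                 # both A and B see this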

The CATCH here is to deal with the client disconnecting: the first thing our code will know about a disconnect is when we try to write to the connection. We’re not expecting any input from the client that we could check, and besides, the characters becoming available to write may happen sooner than an attempt to read would register the close. So, while it may seem like a bit of a hack, protecting against further writes with a Promise is the most reliable and simple way of doing this. If, by way of experiment, you were to omit the CATCH, you would find that the server quits without warning the first time the first client disconnects.

I’ll gloss over the signal outside the react as that only seems necessary in the case where we didn’t get any input data on the first connection.

Making something nearly useful

The above example is almost all we need to make something that you might be able to use; all that remains is to handle an HTTP connection from the clients and to get a source of actual MP3 data into the Supply. To handle the HTTP parts we’ll just use the handy HTTP::Server::Tiny, which will conveniently take a Supply to feed the output data, so in fact we end up with quite a tiny program:

use HTTP::Server::Tiny;

sub MAIN(Str $file) {

    my $supplier = Supplier.new;

    my $handle = $file.IO.open(:bin);

    my $control-promise = start {
        while !$handle.eof {
            my $c = $handle.read(1024);
            $supplier.emit($c);
        }
    }

    sub stream(%env) {
        return 200, [ Content-Type => 'audio/mpeg', Pragma => 'no-cache', icy-name => 'My First Streaming Server'], $supplier.Supply;
    }
    HTTP::Server::Tiny.new(port => 3333).run(&stream);
}

The HTTP::Server::Tiny will run the stream subroutine for every request, and all we need to do is return the status code, some headers, and a supply from which the output data will be read; the client connection will be closed when the Supply is done (that is, when the done method is called on the source Supplier.) It couldn’t really be much simpler.

Just start the program with the path to a file containing MP3 audio and then point your favourite streaming client at port 3333 on your localhost and you should get the stream. I say should, as it makes no attempt to regulate the rate at which the audio data is fed to the client, but for constant bit-rate MP3 data and a client that will buffer as much as it can get, it works.

Of course a real streaming server would read the data frame by frame and adjust the rate according to the bit-rate of each frame. I have actually made (but not yet released) Audio::Format::MP3::Frame to help do this, but it would be overkill for this example.

Relaying a stream

Of course the original intent of this streaming server was to be able to test a streaming source client, so we are going to have to add another part that will recognise a source connection, read from that and relay it to the normal clients in a similar way to the above.

You’ll recall that the libshout client library will connect with a request method of SOURCE, so we can adjust the file streaming example to identify the source connection and feed the Supplier with the data read from it:

use HTTP::Server::Tiny;

sub MAIN() {

    my $supplier = Supplier.new;

    my Bool $got-source = False;

    sub stream(%env) {
        if %env<REQUEST_METHOD> eq 'GET' {
            if $got-source {
                return 200, [ Content-Type => 'audio/mpeg', Pragma => 'no-cache', icy-name => 'My First Streaming Server'], $supplier.Supply;
            }
            else {
                return 404, [ Content-Type => 'text/plain'], "no stream connected";
            }
        }
        elsif %env<REQUEST_METHOD> eq 'SOURCE' {
            my $connection = %env<p6sgix.io>;

            my $finish-promise = Promise.new;

            $connection.Supply(:bin).tap(-> $v {
                $supplier.emit($v);
            }, done => -> { $finish-promise.keep: "done" });

            $got-source = True;
            return 200, [ Content-Type => 'audio/mpeg' ], supply { whenever $finish-promise { done; } };
        }
    }
    HTTP::Server::Tiny.new(port => 3333).run(&stream);
}

You’ll see immediately that nearly all of the action is happening in the request handler subroutine: the GET branch is almost unchanged from the previous example (except that it will bail with a 404 if the source isn’t connected), and the SOURCE branch replaces the reading of the file. HTTP::Server::Tiny makes reading the streamed data from the source client really easy, as it provides the connected IO::Socket::Async to the handler in p6sgix.io (which I understand was originally there primarily to support the WebSocket module), the Supply of which is tapped to feed the shared Supplier that conveys the audio data to the clients. All else that is necessary is to return a Supply that is not intended to actually provide any data, but just to close when the client closes their connection.

Now all you have to do is run the script and feed it with some source client like the following:

use Audio::Libshout;

multi sub MAIN(Str $file, Int $port = 3333, Str $password = 'hackme', Str $mount = '/foo') {

    my $shout = Audio::Libshout.new(:$port, :$password, :$mount, format => Audio::Libshout::Format::MP3);
    $shout.open;
    my $fh = $file.IO.open(:bin);

    while not $fh.eof {
        my $buf = $fh.read(4096);
        say $buf.elems;
        $shout.send($buf);
        $shout.sync;
    }

    $fh.close;
    $shout.close;
}

(Providing the path to some MP3 file again.) Then if you connect a streaming audio player you will get some audio again.

You might notice that the script can take a password and a mount which aren’t used in this case; this is because Audio::Libshout requires them, and also because this is basically the script that I have been using to test streaming servers for the last year or so.

Surprisingly, this tiny streaming server works quite well: in testing I found that it ran out of threads before it got too bogged down handling the streams with multiple clients, showing how relatively efficient and well thought out the Perl 6 asynchronous model is, and how simple it is to put together a program that would probably require a lot more code in many other languages.

Where do we go from here

I’m pretty sure that you wouldn’t want to use this code to serve up a popular radio station, but it would definitely be sufficient for my original testing purposes with a little additional instrumentation.

Of course I couldn’t just stop there, so I worked up the as-yet-unreleased Audio::StreamThing, which uses the same basic design with shared supplies but works more like Icecast, having multiple mounts for individual streams, provision for authentication, and better exception handling.

If you’d find it useful I might even release it.

Postscript

I’d just like to get a mention in for the fabulous DJ Mike Stern, as I always use his recorded sets for testing this kind of stuff for some reason.

Have fun and make more noise.

Day 12 – Avoiding User Namespace Pollution with Perl 6 Modules

A bit of trickery is needed

As Perl 5 module authors, we were able to allow selective exporting easily by use of the @EXPORT_OK array where we listed all the objects that the user of our module could import by name, thereby giving the user full control over possible namespace clashes. However, in spite of all the new goodies provided by Perl 6, that feature is not yet available to us Perl 6 module authors. (For more information see issue #127305 on https://rt.perl.org.)

But the good news is there is a way to get the same effect, albeit with a bit more effort. The trick is in using the full power of tags with the export trait available with every object in a module.

One might get a little glassy-eyed when reading the Perl 6 docs concerning module import and export, of which a portion of code is shown here:

unit module MyModule;
my package EXPORT::DEFAULT {
    our sub foo { ... }
}
my package EXPORT::other {
    our sub bar { ... }
}

where each object to be exported is enclosed in a named export category block. That method has serious limitations (in addition to requiring another set of child blocks):

  • the objects cannot appear in more than one grouping and
  • the objects cannot be assigned other tags via the is export adjective on individual objects.

Consequently, to emulate Perl 5’s @EXPORT_OK array, each object would have to be wrapped in a unique block and could not have multiple tags. But then the docs say that all that ugly package exporting code can be replaced by this much simpler code for each item to be exported:

sub foo is export { ... }
sub bar is export(:other) { ... }

And therein is our solution: We will use export tags to (1) allow the user to import only what he or she desires to import or (2) import everything. The module author can also be helpful by adding other tags which will allow importing sets of objects (but such sets are not always so easily determined in practice).

Note one restriction before we continue: only valid identifier names are allowed in the tags, so we can’t use sigils or other special characters as we could in Perl 5.

For an example we use a simple, but unusual, Perl 6 module with multiple objects to be exported (and all have the same name, probably not a good practice, but handy for demonstration of the disambiguation problem):

unit module Foo;
sub bar() is export { say "bar" }
constant $bar is export = 99;
constant %bar is export = (a => 1);
constant @bar is export = <a b>;

We can then have access to all of those objects by merely using the module in a program file:

use Foo;
bar;
say $bar;
say %bar;
say @bar;

Executing it yields:

bar
99
{a => 1}
[a b]

as expected. Now let’s modify the module by adding an export tag to the bar routine:

sub bar() is export(:bar) { say "bar" }

and then, when our program is executed, we get:

===SORRY!=== Error while compiling...
Undeclared routine:
    bar used at line 7. Did you mean 'VAR', 'bag'?

so we modify our use line in the program to:

use Foo :bar;

and execute the program again to get:

===SORRY!=== Error while compiling...
Variable '$bar' is not declared. Did you mean '&bar'?

It seems that as soon as we, the users, restrict the use statement with a tag, the non-tagged objects are not available! Now we, the authors, have two choices:

  • tag all objects with the same tag or
  • tag them with separate tags.

If we tag them the same, then all will be available—which is probably not a problem. However, that would defeat our goal of having unique tags.

If we name them with unique tags, we will need some way to distinguish them (remember, we can’t use their sigils), which leads us to a possible convention we show here in the final module:

unit module Foo;
sub bar() is export(:bar :SUB) { say "bar" }
constant $bar is export(:bar-s :CONST) = 99;
constant %bar is export(:bar-h :CONST) = (a => 1);
constant @bar is export(:bar-a :CONST) = <a b>;

Notice the suffixes on the non-subroutine objects (all constants) have their kind indicated by the ‘-X’ where the ‘X’ is ‘s’ for scalar, ‘h’ for hash, and ‘a’ for array. We have also added an additional tag which groups the objects into the general categories of subroutines or constants so the user could import, say, just the constants as a set, e.g., use Foo :CONST;.
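For example (my own check, using the module above):

use Foo :CONST;
say $bar; # 99
say %bar; # {a => 1}
say @bar; # [a b]
# bar() was not imported; calling it would be a compile-time error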

Now we just change the use line to

use Foo :bar, :bar-s, :bar-h, :bar-a;

and get

bar
99
{a => 1}
[a b]

again.

Thus the good news is we can add a tag to the export trait that is the same as the name of the subroutine, but the sad news is we can’t use the appropriate sigil as in Perl 5 to disambiguate among objects with the same name. The solution to that is to use some convention as demonstrated above (akin to the ugly Hungarian notation in C) that will have the same effect.

Of course in this somewhat-contrived example, the user could have imported all the objects at once by using the special, automatically-defined :ALL tag:

use Foo :ALL;

but the author has provided users of the module complete flexibility in its application for them.

Conclusion

We now have a way to protect the user’s namespace by requiring him or her to selectively import objects by “name” (or perhaps other subsets of objects) unless the user chooses to import everything. The only downsides I see are:

  • the extra effort required by the module author to explicitly tag all exportable objects and
  • the restriction on selecting an appropriate tag for different objects with the same name.

I love Perl 6, and the old Perl motto TIMTOWTDI still holds true as we’ve just seen!

Merry Christmas! // Happy Holidays! I hope you will enjoy the gift of Perl 6 throughout the coming year!


Note: Thanks to Moritz Lenz (IRC #perl6: user moritz) for constructive comments about the bad practice of exporting variables—hence the example now exports constants instead.

 

Day 11 — Perl 6 Core Hacking: It Slipped Through The QASTs

One of the great things about Perl 6 is how accessible to regular users the compiler is to hack on. Easy bugs require nothing but knowledge of Perl 6 itself, since a lot of the Rakudo compiler is written in Perl 6. Slightly tougher bugs require knowledge of NQP, and tougher still are bugs involving Grammar and Actions. Things progress further in difficulty from there, going as far as assembly hacking on VM level, but today, we’ll stick around in Rakudo land. We have a task at hand!

Santa is having some difficulties generating his Naughty-or-Nice list due to a bug in Perl 6. He traced it down to the use of the S/// substitution operator with :g modifier that for some reason returns an empty list instead of the original string if no matches were made:

say S:g/naughty// with 'only nice list';
# OUTPUT: ()

Time to dig in and fix this, if anyone is to get any presents this year!

The Bots To The Rescue

The first thing to do when fixing a bug is to find out when it first appeared. AlexDaniel++ and MasterDuke++ implemented several IRC bots that make this task extremely simple. They are available in #perl6 and #perl6-dev IRC channels, and you can play with them in #zofbot IRC channel, without annoying anyone. We’ll be using bisectable6 bot to find out when the S/// operator got broken:

<Zoffix> bisectable6, help
<bisectable6> Zoffix, Like this: bisectable6: old=2015.12 new=HEAD exit
    1 if (^∞).grep({ last })[5] // 0 == 4 # RT128181

Just give the bot a piece of code, optionally specifying the starting and ending commits, and it’ll bisect by either the exit code, or failing at that, by output.

<Zoffix> bisectable6, S:g/d// given 'abc'
<bisectable6> Zoffix, Bisecting by output (old=2016.10 new=524368c)
    because on both starting points the exit code is 0
<bisectable6> Zoffix, bisect log:
    https://gist.github.com/c2cf9c3a7b6d13a43c34f64b96090e31
<bisectable6> Zoffix, (2016-10-23)
    https://github.com/rakudo/rakudo/commit/b7201a8f22338a906f2d8027a21387e8f5c77f41

The last link is the interesting bit, it tells us the S:g/// was working fine until that commit. The commit does seem related—it’s the refactor lizmat++ did to make .match 150%–1400% faster—but it’s quite big and it’s not obvious how it’s linked to the workings of the S/// operator. Let’s find out, shall we?

How Do You Spell That?

We can specify the --target command line argument to the perl6 executable to ask it for the output of a particular stage of the program (run perl6 --stagestats -e '' to see the names of all stages). Let’s output the parse stage, to find out which tokens in the Grammar we should look into:

zoffix@VirtualBox:~/CPANPRC/rakudo$ ./perl6 --target=parse -e 'S:g/d//'
- statementlist: S:g/d//
  - statement: 1 matches
    - EXPR: S:g/d//
      - value: S:g/d//
        - quote: S:g/d//
          - sym: S
          - rx_adverbs: :g
            - quotepair: 1 matches
              - identifier: g
          - sibble: /d//
            - right:
            - babble:
              - B:
            - left: d
              - termseq: d
                - termaltseq: d
                  - termconjseq: 1 matches
                    - termalt: 1 matches
                      - termconj: 1 matches
                        - termish: 1 matches
                          - noun: 1 matches
                            - atom: d

There are some general tokens in the output—such as statementlist, statement, EXPR, and value—we can just gloss over. They are about statements, and we want stuff for the operator itself, so the interesting bits start with this:

        - quote: S:g/d//
          - sym: S
          - rx_adverbs: :g
            - quotepair: 1 matches
              - identifier: g

Let’s pop open the Grammar in our text editor and locate a token called quote. It can also be a rule, regex or method, but tokens are most common. The first thing we can locate is this:

proto token quote { <...> }
token quote:sym<apos>  {
    :dba('single quotes') "'" ~ "'"    
    <nibble(self.quote_lang(%*LANG<Quote>, "'", "'", ['q']))>
}
token quote:sym<sapos> {
    :dba('curly single quotes') "‘" ~ "’"
    <nibble(self.quote_lang(%*LANG<Quote>, "‘", "’", ['q']))>
}

The Grammar that parses Perl 6 isn’t much different from the grammar you’d use as a user of Perl 6, so most of it probably looks familiar to you. The quote token is a proto regex, so looking further at the output --target=parse gave us, we see we need :sym<S> variant of it.

Scrolling a bit through the quote‘s candidates, we finally come across :sym<s> that sets the <sym> capture to either s or S:

token quote:sym<s> {
    <sym=[Ss]> (s)**0..1
    :my %*RX;
    :my $*INTERPOLATE := 1;
    :my $*SUBST_LHS_BLOCK;
    :my $*SUBST_RHS_BLOCK;
    {
        %*RX<s> := 1 if $/[0]
    }
    <.qok($/)>
    <rx_adverbs>
    <sibble(%*RX<P5> ?? %*LANG<P5Regex> !! %*LANG<Regex>, %*LANG<Quote>, ['qq'])>
    [ <?{ $<sibble><infixish> }> || <.old_rx_mods>? ]

}

So this token handles both s/// and S/// operators and its body is unimpressive: it seems to be no more than some setup work. With the name of the token in hand, we now know what to look for in the Actions: method quote:sym<s>.

While finding it in actions is easy… it’s quite a biggie, with 177 lines of code to its name. However, someone nice left us a comment that fits into our puzzle:

method quote:sym<s>($/) {
    # We are emulating Str.subst/subst-mutate here, by calling match,
    # assigning the result to a temporary variable etc.
    ...

Recall bisectable6‘s results? The commit it pointed out was work on the .match method, and according to the comment for the S/// operator, it uses .match to do its stuff. Let’s execute that method on builds before and after the commit bisectable6 found for us. There’s another handy bot to do that for us: committable6.

Give it a commit SHA or one of the release tags along with code to run and it’ll give you output for that code on that commit:

<Zoffix> committable6, 2016.11 say 'abc'.match: :g, /d/
<committable6> Zoffix, ¦«2016.11»: ()

<Zoffix> committable6, 2016.10 say 'abc'.match: :g, /d/
<committable6> Zoffix, ¦«2016.10»: ()

We ran the code on the 2016.11 and 2016.10 releases and the output indicates there’s no difference… or is there? The problem with using say as a debugging tool is it often omits things we may find useful. A better alternative is the dd routine, a Rakudo-specific dumper sub that’s not part of the standard Perl 6 language. Give it some args and it’ll dump them out. Let’s give it a spin:

<Zoffix> committable6, 2016.11 dd 'abc'.match: :g, /d/
<comittable6> Zoffix, ¦«2016.11»: slip()

<Zoffix> committable6, 2016.10 dd 'abc'.match: :g, /d/
<committable6> Zoffix, ¦«2016.10»: ()

Aha! Another puzzle piece! When the :g adverb is in use, on failed matches .match used to return an empty list, but after lizmat++’s .match improvements, it started to return Empty, which is an empty Slip. Slips tend to flatten themselves out into the outer container, so perhaps that’s causing an issue in the S///‘s action method? Let’s take a closer look at it.
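The flattening difference is easy to see outside the compiler (my own snippet):

my @with-slip = 1, Empty, 2;
dd @with-slip; # Array @with-slip = [1, 2]; the Slip flattened away
my @with-list = 1, (), 2;
dd @with-list; # Array @with-list = [1, (), 2]; the empty List stays put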

Slippety Slip

A bird’s-eye view of the method quote:sym<s> action shows it does some setup work and then codegens a QAST (“Q” Abstract Syntax Tree). It’d be helpful to take a look at what it generates.

One method of doing so is using the same --target feature we’ve used to get the parse stage, except we’d use the ast stage. So the command would be this:

perl6 --target=ast -e 'S:g/d//'

If you actually run that, you’ll get a text wall of QAST, and it may be tough to spot which are the bits actually generated by the S/// operator. Luckily, there’s a better way! The QAST node objects have a .dump method that dumps them in the same style as what you see in --target=ast output. So check out the compiler’s repo if you haven’t already done so, pop open the src/Perl6/Actions.nqp file, go to the end of method quote:sym<s> and stick note($past.dump) in there to print the dump of the QAST generated for the S/// operator:

    ...
    );
    $past.annotate('is_S', $<sym> eq 'S');
    note($past.dump); # <----------------------- like that
    make WANTED($past, 's///');  # never carp about s/// in sink context
}

(Why is it called $past and not $qast? Historical reasons: QAST used to be PAST, for Parrot Abstract Syntax Tree).

Now, compile Rakudo:

perl Configure.pl --gen-moar --gen-nqp --backends=moar
make
make test
make install

And execute our buggy S/// match to make the line we added print out S///‘s QAST:

zoffix@VirtualBox:~/CPANPRC/rakudo$ ./perl6 -e 'S:g/d// given "abc"'
- QAST::Op(locallifetime)  :is_S<?> S:g/d//
  - QAST::Stmt
    - QAST::Var(local subst_result_1 :decl(var))
    - QAST::Op(bind)
      - QAST::Var(local subst_result_1)
      - QAST::Op(callmethod match)  S:g/d//
        - QAST::Var(lexical $_) <wanted>
        - QAST::WVal(Regex)  :code_object<?> :past_block<?>
        - QAST::IVal+{QAST::SpecialArg}(1 :named<g>)
    - QAST::Op(p6store)
      - QAST::Op(call &infix:<,>)
        - QAST::Var(lexical $/)
      - QAST::Var(local subst_result_1)
    - QAST::Op(if)
      - QAST::Op(unless)
        - QAST::Op(istype)
          - QAST::Var(local subst_result_1)
          - QAST::WVal(Match)
        - QAST::Op(if)
          - QAST::Op(istype)
            - QAST::Var(local subst_result_1)
            - QAST::WVal(Positional)
          - QAST::Op(callmethod elems)
            - QAST::Var(local subst_result_1)
      - QAST::Op(call &infix:<=>)
        - QAST::Var(lexical $/) <wanted>
        - QAST::Op(callmethod dispatch:<!>)
          - QAST::Op(callmethod Str)
            - QAST::Var(lexical $_) <wanted>
          - QAST::SVal(APPLY-MATCHES)
          - QAST::WVal(Str)
          - QAST::Var(local subst_result_1)
          - QAST::Op(p6capturelex)  :code_object<?> :past_block<?>
            - QAST::Op(callmethod clone)
              - QAST::WVal(Code)  :code_object<?> :past_block<?>
          - QAST::Var(lexical $/)
          - QAST::IVal(1)
          - QAST::IVal(0)
          - QAST::IVal(0)
          - QAST::IVal(0)
          - QAST::IVal(0)
      - QAST::Op(p6store)
        - QAST::Op(call &infix:<,>)
          - QAST::Var(lexical $/)
        - QAST::Var(lexical $_) <wanted>
    - QAST::Stmt
    - QAST::Var(lexical $/)

There are docs for types of QAST you see here, or we can just wing it.

We callmethod match and bind the result to subst_result_1:

    - QAST::Var(local subst_result_1 :decl(var))
    - QAST::Op(bind)
      - QAST::Var(local subst_result_1)
      - QAST::Op(callmethod match)  S:g/d//
        - QAST::Var(lexical $_) <wanted>
        - QAST::WVal(Regex)  :code_object<?> :past_block<?>
        - QAST::IVal+{QAST::SpecialArg}(1 :named<g>)

We call nqp::p6store (p6* ops are documented in Rakudo’s repo), giving it the result of infix:<,>($/) as container and the return of .match call as value:

    - QAST::Op(p6store)
      - QAST::Op(call &infix:<,>)
        - QAST::Var(lexical $/)
      - QAST::Var(local subst_result_1)

We check if anything matched (for :g matches, we check for a Positional that has any .elems in it):

    - QAST::Op(if)
      - QAST::Op(unless)
        - QAST::Op(istype)
          - QAST::Var(local subst_result_1)
          - QAST::WVal(Match)
        - QAST::Op(if)
          - QAST::Op(istype)
            - QAST::Var(local subst_result_1)
            - QAST::WVal(Positional)
          - QAST::Op(callmethod elems)
            - QAST::Var(local subst_result_1)

If we did have matches, call Str!APPLY-MATCHES:

      - QAST::Op(call &infix:<=>)
        - QAST::Var(lexical $/) <wanted>
        - QAST::Op(callmethod dispatch:<!>)
          - QAST::Op(callmethod Str)
            - QAST::Var(lexical $_) <wanted>
          - QAST::SVal(APPLY-MATCHES)
          - QAST::WVal(Str)
          - QAST::Var(local subst_result_1)
          - QAST::Op(p6capturelex)  :code_object<?> :past_block<?>
            - QAST::Op(callmethod clone)
              - QAST::WVal(Code)  :code_object<?> :past_block<?>
          - QAST::Var(lexical $/)
          - QAST::IVal(1)
          - QAST::IVal(0)
          - QAST::IVal(0)
          - QAST::IVal(0)
          - QAST::IVal(0)

If we didn’t have matches, call nqp::p6store, storing the $_ (this is our original string S/// works on) in the $/:

      - QAST::Op(p6store)
        - QAST::Op(call &infix:<,>)
          - QAST::Var(lexical $/)
        - QAST::Var(lexical $_) <wanted>

Since we know the commit bisectable6 found makes .match return an empty slip for failed matches, it’s that last bit of QAST that should look suspicious, since slips flatten themselves out. We’ll return to why we’re storing into an &infix:<,>($/) rather than into $/ directly, but first, let’s write the NQP equivalent of such a setup.

We have two variables: $/ with Empty and $_ with our original string. The QAST::Op node maps out to an nqp op with the same name, so our suspicious bit looks something like this:

use nqp;

$_ = 'abc';
$/ = Empty;
nqp::p6store( &infix:<,>($/), $_);

Yet another helpful bot, camelia, lets us run a piece of code straight from IRC. Just use the m: trigger with some code. Let’s try it out:

<Zoffix> m: use nqp; $_ = 'abc'; $/ = Empty;
    nqp::p6store( &infix:<,>($/), $_); dd $/;
<camelia> rakudo-moar ea2884: OUTPUT«Slip $/ = slip$()␤»

<Zoffix> m: use nqp; $_ = 'abc'; $/ = List.new;
    nqp::p6store( &infix:<,>($/), $_); dd $/;
<camelia> rakudo-moar ea2884: OUTPUT«Str $/ = "abc"␤»

The results show that when $/ is an Empty, it ends up still being Empty after the p6store, while if $/ is an empty List, it happily takes a string. We finally connected the S/// operator with the commit that introduced the bug and found why it occurs (although slips behaving like that may be a bug of its own). Let’s trace where that Empty in Str.match comes from and why it’s there.

What Sourcery Is This?

There’s another bot (it’s the future! people have lots of bots!), SourceBaby, that can give you a link to the source code for a routine. It uses the CoreHackers::Sourcery module under the hood and takes arguments to give to its sourcery routine. Trigger it with the s: trigger:

<Zoffix> s: 'abc', 'match', \(/d/, :g)
<SourceBaby> Zoffix, Sauce is at
    https://github.com/rakudo/rakudo/blob/164eb42/src/core/Str.pm#L946

We gave it an object to call a method on (a Str), a string with the method name (match), and a Capture with which arguments the method is to be called. In return, it gave a URL to the multi that handles those args:

multi method match(Regex:D $pattern, :global(:$g)!, *%_) {
    nqp::if(
      nqp::elems(nqp::getattr(%_,Map,'$!storage')),
      self!match-cursor(nqp::getlexcaller('$/'),
        $pattern($cursor-init(Cursor,self,:0c)), 'g', $g, %_),
      nqp::if(
        $g,
        self!match-list(nqp::getlexcaller('$/'),
          $pattern($cursor-init(Cursor,self,:0c)),
          CURSOR-GLOBAL, POST-MATCH),
        self!match-one(nqp::getlexcaller('$/'),
          $pattern($cursor-init(Cursor,self,:0c)))
      )
    )
}

No Empty here, but we can see that when $g is true, we call self!match-list. It’s a private method, so SourceBaby would not be able to help with it. Let’s find it by searching the same source file:

# Create list from the appropriate Sequence given the move
method !match-list(\slash, \cursor, \move, \post) {
    nqp::decont(slash = nqp::if(
      nqp::isge_i(nqp::getattr_i(cursor,Cursor,'$!pos'),0),
      Seq.new(POST-ITERATOR.new(cursor, move, post)).list,
      Empty,
    ))
}

And there’s our Empty! The commit message doesn’t mention why we changed from an empty List to an Empty, there are no comments in the source code explaining it, so we’ll have to resort to the most technologically non-advanced debugging tool in our arsenal… asking people.

The Dev IRC Channel

If you have questions about core development, join #perl6-dev IRC channel on Freenode. In this case, we can ask lizmat++ if she remembers whether there was a reason for that Empty.

If the person you’re trying to reach isn’t currently online, you can use the messaging bot with the .tell trigger, followed by the person’s nick, followed by the message. When the bot sees the person talk, it will deliver the message.

<babydrop> .ask stmuk_ so is `zef` now the installer being
    shipped with R*? I notice our REPL message still
    references panda; wondering if that should read zef now
<yoleaux2> babydrop: I'll pass your message to stmuk_.

After the discussion about the Empty, there doesn’t appear to be any specific reason to return it in this case, so we’ll change it to return an empty List instead, restoring its old behavior, and that will also fix our bug. The new !match-list then looks like this:

method !match-list(\slash, \cursor, \move, \post) {
    nqp::decont(slash = nqp::if(
      nqp::isge_i(nqp::getattr_i(cursor,Cursor,'$!pos'),0),
      Seq.new(POST-ITERATOR.new(cursor, move, post)).list,
      List.new,
    ))
}

Compile the compiler; this time we can just run make install, since everything is already configured and pre-built from the last time we compiled:

make install

Check the change did fix the bug:

zoffix@VirtualBox:~/CPANPRC/rakudo$ ./perl6 -e 'say S:g/naughty// with "only nice list"'
only nice list

And run the test suite:

TEST_JOBS=6 make spectest

The TEST_JOBS env var lets you run multiple test files at once and the optimal value to set it at is around 1.3 times the number of cores in your computer. If you have a very meaty box (or endless patience), you can run make stresstest instead, for a more thorough test run.

With the spectest passing all of its tests, we are ready to finish off our work.

Test It!

The test suite is located in t/spec and is automatically checked out from its repo when you run make spectest. You can also simply delete that directory and clone your own fork as t/spec instead.

Usually, it’s easy to locate the file where the test can go into by running tree -f | grep 'some search term'. We fixed an issue with substitutions, so let’s go for this:

zoffix@VirtualBox:~/CPANPRC/rakudo/t/spec$ tree -f | grep subst
│   ├── ./integration/substr-after-match-in-gather-in-for.t
├── ./S05-substitution
│   ├── ./S05-substitution/67222.t
│   ├── ./S05-substitution/match.t
│   ├── ./S05-substitution/subst.rakudo.moar
│   └── ./S05-substitution/subst.t
│   ├── ./S32-str/substr-eq.t
│   ├── ./S32-str/substr-rw.rakudo.moar
│   ├── ./S32-str/substr-rw.t
│   ├── ./S32-str/substr.t

The ./S05-substitution/subst.t file looks like a decent candidate; pop it open. Bump the plan at the top of the file by the number of tests you’re adding, then add the test at the end of the file (or a more appropriate spot):

plan 185;

...

is-deeply (S:g/FAIL// with 'foo'), 'foo',
    'S:g/// returns original string on failure to match';

Run the test file, to ensure everything passes:

make t/spec/S05-substitution/subst.t

And commit! We’re done! Santa’s Naughty-or-Nice list shall work fine from now on.

The Final Mystery

Recall that &infix:<,>($/) thing that was causing the bug when $/ contained an Empty? So what is that all about?

If you don’t know something about Perl 6, just come to our #perl6 IRC channel and ask. This is what I did when I couldn’t understand the purpose of that infix thing, and after a long discussion, finding old bug tickets, and testing old bugs… we came to the conclusion that these are no longer needed here!

So along with a bug fix, we also cleaned up codegen. At least that’s the theory; perhaps by doing so we created another bug that will send us on yet another great hunting journey.

Conclusion

It’s easy to give a helping hand to the core developers of Perl 6 by fixing some of the bugs. Starting from easy things that require nothing more than knowledge of Perl 6, you can progressively learn more about the internals and fix tougher problems.

The perl6 compiler comes with a number of useful output methods like --target=ast and --target=parse that can aid in debugging. An army of IRC bots makes it easy to navigate source code both in space and time, by either giving you a link to an implementation or producing output of some particular commit.

Lastly, a very valuable resource we have available is the people of Perl 6, who can help you out. Whether you’re digging deep into the guts of the compiler, or just starting out with computer programming.

Join us. We have… bugs to fix!

Day 10 — Let’s learn and try double-ended priority queue with Perl 6

Hello! My name is Itsuki Toyota. I’m a web developer in Japan.

In this post, let me introduce Algorithm::MinMaxHeap.

Introduction

Algorithm::MinMaxHeap is a Perl 6 implementation of a double-ended priority queue (i.e. a min-max heap).1

Fig.1 shows an example of this data structure:

[Fig. 1: min-max heap diagram]

This data structure has an interesting property: it contains both a max-heap and a min-heap in a single tree.

More precisely, the min-max ordering is maintained in a single tree as described below:

  • values stored at nodes on even levels (i.e. min levels) are smaller than or equal to the values stored at their descendants (if any)
  • values stored at nodes on odd levels (i.e. max levels) are greater than or equal to the values stored at their descendants (if any)
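
Nothing below depends on it, but min-max heaps are conventionally stored as implicit binary trees in a flat array; under that assumption (mine, not something this post states), the level bookkeeping behind this ordering is a one-liner on the node index:

# a minimal sketch, assuming the usual implicit-array heap layout:
# the node at 0-based index $i sits on level floor(log2($i + 1)),
# and even levels are min levels
sub heap-level(Int $i --> Int)    { ($i + 1).msb }
sub on-min-level(Int $i --> Bool) { heap-level($i) %% 2 }

say on-min-level(0); # True  (the root is on a min level)
say on-min-level(1); # False (its children are on a max level)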

Algorithm::MinMaxHeap Internals

I’ll briefly explain what this double-ended priority queue does internally when some frequently used methods (e.g. insert, pop-max, pop-min) are called.2
These methods maintain the min-max ordering by the following operations:

insert($item)

1. Pushes the given item at the first available leaf position.
2. Checks whether it maintains the min-max ordering by traversing the nodes from the leaf to the root. If it finds a violating node, it swaps this node with a proper node and continues the checking operation.

pop-max

1. Extracts the maximum value node.
2. Fills the vacant position with the last node of the queue.
3. Checks whether it maintains the min-max ordering by traversing the nodes from the filled position to the leaf. If it finds a violating node, it swaps this node with a proper node and continues the checking operation.

pop-min

1. Extracts the minimum value node.
2. Fills the vacant position with the last node of the queue.
3. Checks whether it maintains the min-max ordering by traversing the nodes from the filled position to the leaf. If it finds a violating node, it swaps this node with a proper node and continues the checking operation.
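
Taken together, these operations are what makes the queue double-ended: one heap cheaply serves requests at both extremes. Here is a minimal sketch using the module’s API (shown in detail in the examples below):

use Algorithm::MinMaxHeap;

my $q = Algorithm::MinMaxHeap[Int].new;
$q.insert($_) for 3, 1, 4, 1, 5, 9, 2, 6;
say $q.pop-max; # 9
say $q.pop-min; # 1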

Let’s try!

Now let me illustrate how to use Algorithm::MinMaxHeap.

Example 1

The first example is the simplest one.

Source:

use Algorithm::MinMaxHeap;

my $queue = Algorithm::MinMaxHeap[Int].new; # (#1)
my Int @items = (^10).pick(*);
@items.say; # [1 2 6 4 5 9 0 3 7 8]
$queue.insert($_) for @items; # (#2)
$queue.pop-max.say; # 9 (#3)

In this example, Algorithm::MinMaxHeap[Int].new creates a queue object where all of its nodes are of type Int. (#1)

$queue.insert($_) for @items inserts the numbers in the list. (#2)

$queue.pop-max pops a node which stores the maximum value. (#3)
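
As a small continuation of this example (my sketch, not part of the original), repeated pops drain the queue in descending order; is-empty is the same predicate used later in Example 3:

$queue.pop-max.say until $queue.is-empty; # 8, 7, 6, ... down to 0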

Example 2

The second example defines a user-defined type constraint using subset:

Source (chimera.p6):

use Algorithm::MinMaxHeap;

my subset ChimeraRat of Cool where Num|Rat; # (#1)
my $queue = Algorithm::MinMaxHeap[ChimeraRat].new;

$queue.insert(10e0); # ok
$queue.insert(1/10); # ok
$queue.insert(10); # die # (#2)

Output:

$ perl6 chimera.p6
Type check failed in assignment to @!nodes; expected ChimeraRat but got Int (10)
in method insert at /home/itoyota/.rakudobrew/moar-nom/install/share/perl6/site/sources/240192C19BBAACD01AB9686EE53F67BC530F8545 (Algorithm::MinMaxHeap) line 12
in block <unit> at chimera.p6 line 8

In this example, the subset ChimeraRat is a Cool that only allows Num or Rat. (#1)

Therefore, when you insert 10 into the queue, it prints the error message above and dies, because 10 is an Int object. (#2)
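
If you did want to store that value, one option (my suggestion, not something the module prescribes) is to coerce it explicitly so that it satisfies the subset:

# hypothetical fix: coerce explicitly before inserting
$queue.insert(10.Num); # ok, now a Num
$queue.insert(10.Rat); # ok, now a Rat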

Example 3

The third example inserts user-defined classes to the queue.

Source (class.p6):

use Algorithm::MinMaxHeap;

my class State {
    also does Algorithm::MinMaxHeap::Comparable[State]; # (#1)
    has DateTime $.time;
    has $.payload;
    submethod BUILD(:$!time, :$!payload) { }
    method compare-to(State $s) { # (#2)
        if $!time == $s.time {
            return Order::Same;
        }
        if $!time > $s.time {
            return Order::More;
        }
        if $!time < $s.time {
            return Order::Less;
        }
    }
}

my @items;

@items.push(State.new(time => DateTime.new(:year(1900), :month(6)),
    payload => "Rola"));
@items.push(State.new(time => DateTime.new(:year(-300), :month(3)),
    payload => "Taro"));
@items.push(State.new(time => DateTime.new(:year(1963), :month(12)),
    payload => "Hanako"));
@items.push(State.new(time => DateTime.new(:year(2020), :month(6)),
    payload => "Jack"));

my Algorithm::MinMaxHeap[Algorithm::MinMaxHeap::Comparable] $queue .= new;
$queue.insert($_) for @items;
$queue.pop-max.say until $queue.is-empty;

Output:

 $ perl6 class.p6
 State.new(time => DateTime.new(2020,6,1,0,0,0), payload => "Jack")
 State.new(time => DateTime.new(1963,12,1,0,0,0), payload => "Hanako")
 State.new(time => DateTime.new(1900,6,1,0,0,0), payload => "Rola")
 State.new(time => DateTime.new(-300,3,1,0,0,0), payload => "Taro")

In this example, the class State does the role Algorithm::MinMaxHeap::Comparable[State], which declares the stub method compare-to as: (#1)

method compare-to(State) returns Order:D { ... };

Therefore, in this case, the State class must define a method compare-to so that it accepts a State object and returns an Order:D object. (#2)
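
Since DateTime objects already compare with cmp, which returns an Order, the three-way branch above could be compressed to one line; this is an equivalent sketch, not something the role requires:

method compare-to(State $s) returns Order:D { $!time cmp $s.time }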

Footnotes

  • 1 Atkinson, Michael D., et al. “Min-max heaps and generalized priority queues.” Communications of the ACM 29.10 (1986): 996-1000.
  • 2 For more exact information, please read the above paper.

Thank you for reading my post!

Day 9 – A preview of the ‘hackable’ JIT compiler

Among programming languages Perl 6 has an unfortunate but historically well-deserved reputation for being slow, even compared to languages such as Perl 5, Ruby or Python. This is not really an accident, however. From the outset, perl6 was designed to be extensible by the programmer in many different ways, including defining custom operators, dynamic containers for variables, and a powerful meta-object protocol that allows you to modify the behaviour of objects and classes from regular Perl code. In the future, we might well add macros to that list. Add to that a rich set of control structures and phasers. Such features are useful, powerful and fun.

But such features also make it relatively difficult to execute perl6 efficiently, because they introduce overhead and polymorphism. Even before the language was released last year, work had started to improve the efficiency of Rakudo Perl 6. Because Rakudo is complex and layered, this work involves many aspects and many individuals, from core routine efficiency improvements to precompilation of modules. MoarVM also tries to specialize ‘hot’ code for specific object types at runtime, which reduces the polymorphism of that code, and allows simpler operations to be substituted for complex operations. For example, object attribute access can in many cases be reduced to a simple memory read instruction.

Since late summer 2014, MoarVM has contained a JIT compiler that I developed. At its core, JIT compilation reduces the overhead from interpretation. That is to say, rather than fetch new instructions from a bytecode stream and dispatch to execute them, a JIT compiler generates the code to execute these instructions in a series. Ideally, this allows the executed code to ‘flow’ directly through the program without interruption. In practice, there is still the machinery of the VM, for example for garbage collection, to take into account. Also, note that in many VM implementations, the JIT compiler and the type specialization code are combined, whereas in MoarVM they are distinct components.
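
As a toy illustration in plain Perl 6 (nothing like MoarVM’s actual code), the fetch-and-dispatch in the loop below is precisely the per-instruction overhead that a JIT removes by emitting the op bodies back to back:

# toy bytecode interpreter: fetch each op, then branch on it;
# that fetch-and-branch is the interpretation overhead
my @bytecode = <push push add>;
my @stack;
for @bytecode -> $op {
    given $op {
        when 'push' { @stack.push(1) }
        when 'add'  { @stack.push(@stack.pop + @stack.pop) }
    }
}
say @stack; # [2]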

Since summer 2015 (yes, that long) I’ve been working on an improved backend for that compiler, which was funded by the Perl Foundation. One of the goals for the development of that backend (which I misleadingly call the ‘expression JIT compiler’) was to enable extension points for specialized object types and instructions. These extension points are designed to enable a greater range of type specialization and to improve code generation gradually. The best part, I think, is that they should require relatively little knowledge of compiler internals or assembly language. For this advent calendar post, I’d like to demonstrate those extension points.

First, consider the following innocent-looking perl6 code.

sub foo(@a) {
    @a[0] = @a.elems;
}

In case this (obviously critical) `foo` subroutine is invoked hundreds of times, MoarVM will start to optimize it, first by recording what @a really is. In most cases, that will be an instance of the Array class. If it is always the same, and supposing the elems method is sufficiently small (which it is), that method call will be inlined into the foo routine. The assignment of @a[0] is somewhat more involved than it looks, and than it would be in a language like Java. In Perl 6, an array holds a sequence of containers, and it is the container to which the value is assigned. So the primitive sequence of operations in ‘foo’ after MoarVM has optimized it is similar to:

my $i := nqp::elems(@a);
my $c := nqp::atpos_o(@a, 0);
nqp::assign($c, $i);

The ‘nqp’ namespace signifies that these are primitive operations in the interpreter. However, they are still polymorphic, because arrays can be represented by multiple different stores. For example, arrays used by the NativeCall foreign function interface use a different representation (for compatibility reasons) than the arrays natively used by the VM (which are called MVMArray internally). Thus, these operations are polymorphic, and MoarVM needs a dispatch table to find the correct implementation.
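
To make the dispatch-table idea concrete, here is a toy analogy in plain Perl 6 (the names are invented; MoarVM does this in C at the representation level):

# one logical 'elems' operation, one implementation per array
# representation, reached through a table lookup; this indirection
# is what a specialized instruction gets to skip
my %elems-impl =
    'VMArray' => -> @a { @a.elems },
    'CArray'  => -> @a { @a.elems }; # pretend this body differs
my $repr = 'VMArray';               # discovered at runtime
say %elems-impl{$repr}([4, 8, 15]); # 3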

If the exact implementation type of @a is known, then this polymorphism is unnecessary. In principle, the type specializer could insert a specialized non-polymorphic instruction for each of those operations for the types found during the recording phase. That is not very practical though as there are many different types and each type would have to support many different operations.

However, with the expression compiler it will be relatively easy to add specialized code for a specific implementation. See for instance the following (somewhat simplified) implementations of the ‘elems’ and ‘atpos_o’ MoarVM instructions, which would work on MVMArray objects:

(template: elems (^getf $1 MVMArray body.elems))

(template: atpos_o 
  (load (index (^getf $1 MVMArray body.slots.o) $2 8)
        (&SIZEOF MVMObject*)))

Such ‘expression templates’ are textual representations of the templates used in the compiler to convert relatively high-level MoarVM bytecode to low-level code which is easier to compile and optimize. Although the code above might look complex, it is actually pretty straightforward:

  • Expressions are grouped by parentheses and can contain subexpressions, exactly as in LISP.
  • The ‘template:’ keyword declares a template, and ‘elems’ declares that this template represents the ‘elems’ MoarVM instruction.
  • Numbers with a ‘$’ sigil represent the (input) operands to the MoarVM instruction. In the case of elems, the first operand is the MVMArray object.
  • ‘load’ and ‘index’ are simple expression nodes, which represent a particular low-level operation.
  • ‘^getf’ is a template macro, which expands to a more complex expression of ‘load’ and ‘addr’ nodes which compute the address of the ‘body.elems’ field in the MVMArray structure and load a value from it.
  • &SIZEOF is an expression that is translated – prior to inclusion by MoarVM – into a C sizeof expression. With the ‘&’ syntax, any C macro can be referenced, and with that any compile-time expression may be evaluated.

Such templates can be added without any knowledge of assembler whatsoever, although some knowledge of C structures and VM internals is probably preferable. I hope that by making this simple enough I can convince others to share the task of JIT compiler development with me :-).

For those who are feeling more adventurous yet, it is also simple to hook into the data-driven code generator known as the tiler. The tiler picks the CPU instructions to implement the low-level code generated from expression templates. Each CPU instruction is represented by a ‘tile’, which ‘covers’ part of the expression graph. The tiler tries to pick the optimal instructions so that the entire code graph is covered as cheaply as possible. As the x86 instruction set is very large, it is nearly impossible for me to write tiles for every possible instruction. Adding a tile is not very hard, though. Suppose for instance that we’d be writing a program to compute the XOR of all values in the array:

my $x = 0;
for @a -> $i {
    $x +^= $i;
}

In reality, there are still multiple barriers to implementing this as a tight loop, but suppose that we have taken them down. (We can; it is a matter of time and energy.) Then you might find that the sequence of INDEX, LOAD and XOR operations is inefficient, and you could optimize it into a single instruction. You can declare a tile for that instruction as follows:

(tile: xor_load_idx 
   (xor reg (load (index reg reg $stride) $size)) reg)

  • This uses the same syntax as the expression templates, which is convenient for parsing, but also makes the correspondence between templates and tiles clear.
  • The ‘tile:’ keyword now declares a tile, and ‘xor_load_idx’ is the name of the tile implementation.
  • The words with a ‘$’ sigil now specify constants. The ‘xor’, ‘load’, and ‘index’ words specify expression nodes (operations).
  • ‘reg’ specifies that this tile consumes values in registers (in the first expression) and yields a value in a register (as the last word in the expression).

The tile implementation would look as follows:

MVM_JIT_TILE_DECL(xor_load_idx) {
    MVMint8 out  = tile->values[0];
    MVMint8 base = tile->values[1];
    MVMint8 idx  = tile->values[2];
    /* this only works on 8-byte sized indexes (for now) */
    ASSERT(tile->args[0] == 8 && tile->args[1] == 8);
    | xor Rq(out), qword [Rq(base)+Rq(idx)*8];
}

The `MVM_JIT_TILE_DECL` macro declares the tile function with all required parameters. The ‘tile’ parameter, which is declared and passed automatically, contains operational details such as the registers used (in ‘tile->values’) and constant parameters (in ‘tile->args’). The assembly-code fragment that follows the ‘|’ symbol declares a piece of machine code that is to be assembled at runtime. For this we use dynasm, with a few patches to support picking ‘dynamic’ registers on x86-64. The result of this addition would be that the aforementioned sequence of instructions is compiled to a single one, and your (hypothetical) program would run a few percent faster. Although small, such improvements can add up.

Unfortunately, all of this is not done yet. Although in a way the main body of the compiler exists, some essential pieces (especially the register allocator) need further development, and there are quite a few bugs and oversights that need to be fixed, too. So unfortunately you can’t play with this – yet. However, I do hope that I’ve given you a taste of what you can do when it is finished, which I hope will be soon. And with that, I wish you a happy holiday.