Day 18: Perl 6 powered workflow

Staying in flow while coding can be a challenge. Distractions and pesky syntactic bugs are potential flow stoppers.

Then there is the 7+/-2 short term memory limit that we all have to juggle. Unlike computers, we can’t just add more hardware to increase the size of the brain’s working memory buffer – at least not yet. Keeping in flow requires managing this buffer to avoid blowouts. Fortunately we have computers to help.

The idea of using a computer to extend your memory has been around since the dawn of computing. Way back in 1945 Vannevar Bush envisioned a Memex (MEMory EXtender), an “enlarged intimate supplement to one’s memory”.

In 2017, the humble text file can act like a poor man’s memex. The text file contains a timeline with three sections: Past, Now and Next. It’s kind of like a changelog but with a future too. The past section fills up over time and contains completed tasks and information for later recall. The now section helps focus on the task at hand and the next section queues tasks up to do in the future.

Tasks move through three states: do (+next), doing (!now) and done (-past).

To stay in flow you sometimes need to quickly recall something, log a task to do in the future and focus on making progress in the now. Keeping a 123.do file helps you to offload cognitive overhead while coding.

The format of a 123.do file is simple, so you can hack on it directly with your $EDITOR; it’s described by a Perl 6 grammar.
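
The actual grammar lives in the repository linked below; a minimal sketch of the idea (the task-line syntax here is my assumption, not the shipped grammar) might be:

grammar DoFile {
    token TOP         { <task>* }
    token task        { <state> \s* <description> \n }
    token state       { '+' | '!' | '-' }   # do (next), doing (now), done (past)
    token description { \N+ }
}

# three tasks, one in each state
say DoFile.parse("- wrote advent post\n! write more code\n+7 Merry Christmas\n");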

Here is the Perl 6 command line utility that drives it.

[demo: 123-advent-demo]

To install it just:

shell> git clone https://github.com/nige123/app.123.do.git
shell> cd app.123.do
shell> export PATH=$PATH:bin
shell> 123 +7 Merry Christmas
shell> 123 +13 Happy New Year

Day 17: Something about messaging (but I couldn’t think of a snappier title.)

Why messaging?

When I first started thinking about writing an Advent article this year, I reflected that I hadn’t really written a great deal of Perl 6 in the past twelve months, in comparison to the previous years when I appear to have written a large number of modules. What I have been doing (in my day job at least) is thinking about and implementing applications that make heavy use of some messaging system. So I thought it would be interesting to bring some of those ideas to Perl 6.

Perl has always had a strong reputation as a “glue language” and Perl 6 has features that take off and run with that, most prominently the reactive and concurrent features, making it ideally suited to creating message based integration services.

What messaging?

At my feet right now is the excellent Enterprise Integration Patterns, which I’d recommend to anyone who has an interest in (or works in) the field, despite it being nearly 15 years old now. However, it is a weighty tome (literally: it weighs in at nearly one and a half kilograms in hardback), so I’m using it as a reminder to myself not to attempt to be exhaustive on the subject, lest this turn into a book itself.

There are quite a large number of managed messaging systems, both free and commercial, using a range of protocols both open and proprietary, but I am going to limit myself to RabbitMQ, which I know quite well and which is supported in Perl 6 by Net::AMQP.

If you want to try the examples yourself you will need access to a RabbitMQ broker (which is available as a package for most operating system distributions), or you can use the Docker image, which appears to work quite well.

You will also need to install Net::AMQP which can be done with:

zef install Net::AMQP

In the examples I will be using the default connection details for the RabbitMQ server (that is, the broker is running on localhost and the default guest account is active); if you need to supply different details then you can alter the constructor for Net::AMQP to reflect the appropriate values:

my $n = Net::AMQP.new(
  host => 'localhost',
  port => 5672,
  login => 'guest',
  password => 'guest',
  vhost => '/'
);

A couple of the examples may require other modules but I’ll introduce them as I go along.

Obligatory Hello, World

RabbitMQ implements the rich broker architecture described by the AMQP v0.9 specification; the more recent v1.0 specification, as implemented by ActiveMQ, does away with much of the prescribed broker semantics, to the extent that it is basically a different protocol that shares a similar wire format.

Possibly the simplest example of sending a message (a producer) would be:

use Net::AMQP;

my $n = Net::AMQP.new;

await $n.connect;
my $channel = $n.open-channel(1).result;
my $exchange = $channel.exchange.result;
$exchange.publish(routing-key => "hello", body => "Hello, World".encode);
await $n.close("", "");

This demonstrates most of the core features of RabbitMQ and Net::AMQP.

Firstly, you will notice that many of the methods return a Promise that will (mostly) be kept with the actual returned value; this reflects the asynchronous nature of the broker, which sends (in most cases, but not all) a confirmation message (a “method” in AMQP parlance) when the operation has been completed on the server.

The connect here establishes the network connection to the broker and negotiates certain parameters, returning a Promise which will be kept with a true value if successful, or broken if the network connection fails, the supplied credentials are incorrect, or the server declines the connection for some other reason.

The open-channel method opens a logical broker communication channel in which the messages are exchanged; you may use more than one channel in an application. The returned Promise will be kept with an initialised Net::AMQP::Channel object when confirmed by the server.

The exchange method on the channel object returns a Net::AMQP::Exchange object. In the AMQP model, all messages are published to an exchange, from which the broker may route each message to one or more queues, depending on the definition of the exchange, from whence it may be consumed by another client. In this simple example we are going to use the default exchange (named amq.default).

The publish method is called on the exchange object. It has no return value as it is simply fire and forget: the broker doesn’t confirm receipt, and the delivery (or otherwise) to a queue is decoupled from the act of publishing the message. The routing-key parameter is, as the name suggests, used by the broker to determine which queue (or queues) to route the message to. In the case of the default exchange used in this example, the type of the exchange is direct, which basically means that the message is delivered to exactly one consumer of a queue with a name matching the routing key.

The body is always a Buf and can be of an arbitrary length; in this case we are using an encoded string, but it could equally be encoded JSON, a MessagePack or BSON blob, or whatever suits the consuming application. You can in fact supply content-type and content-encoding parameters, which will be passed on with the message delivered to a consumer if the design of your application requires it, but the broker itself is totally agnostic to the content of the payload. There are other optional parameters, but none are required in this example.
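
For instance, if the payload were JSON, the publish call might look something like this (a sketch using the parameters described above):

$exchange.publish(
    routing-key      => "hello",
    content-type     => "application/json",
    content-encoding => "utf-8",
    body             => '{ "greeting" : "Hello, World" }'.encode,
);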

Of course we also need something to read the messages that we are publishing (a consumer):

use Net::AMQP;

my $n = Net::AMQP.new;

my $connection = $n.connect.result;

react {
    whenever $n.open-channel(1) -> $channel {
        whenever $channel.declare-queue("hello") -> $queue {
            $queue.consume;
            whenever $queue.message-supply.map( -> $v { $v.body.decode }) -> $message {
                say $message;
                $n.close("", "");
                done();
            }
        }
    }
}

Here, rather than operating on an exchange as we did in the producer, we are using a named queue; declare-queue will cause the queue to be created if it doesn’t already exist, and the broker will, by default, bind this queue to the default exchange. “Binding” essentially means that messages sent to the exchange can be routed to the queue, depending on the exchange type, the routing key of the messages, and possibly other metadata from the message. In this case the “direct” type of the default exchange will cause the messages to be routed to a queue whose name matches the routing key (if one exists; the message will be silently dropped if it doesn’t).

The consume method is called when you are ready to start receiving messages; it returns a Promise that will be kept with the “consumer tag” that uniquely identifies the consumer to the server, but, as we don’t need it here, we can ignore it.

Once we have called consume (and the broker has sent the confirmation), the messages that are routed to our queue will be emitted to the Supply returned by message-supply as Net::AMQP::Queue::Message objects. However, as we aren’t interested in the message metadata in this example, map is used to create a new Supply with the decoded bodies of the messages. This is safe where, as in this case, you can guarantee that you will be receiving utf-8 encoded data, but in a real world application you may want to be somewhat more robust about handling the body if you aren’t in control of the sender (which is often the case when integrating with third party applications). The content-type and content-encoding supplied when publishing the message are available in the headers attribute (a Hash) of the Message object, but they aren’t required to be set, so you may want to consider an alternative scheme as suitable for your application.
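
As a sketch of what more robust handling might look like (assuming only that the body may not always be valid utf-8), the inner whenever could become:

whenever $queue.message-supply -> $message {
    # try returns Nil if the bytes are not valid utf-8
    with try $message.body.decode('utf-8') -> $text {
        say $text;
    }
    else {
        warn "skipping a message with an undecodable body";
    }
}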

In this example the connection is closed and the react block exited after the first message is received, but in reality you may want to remove the lines:

$n.close("", "");
done();

from the inner whenever, and if you want to exit on a signal, for example, add:

whenever signal(SIGINT) {
    $n.close("", "");
    done();
}

within the top level of the react block. However you choose to exit your program, you should always call close on the connection object, as failing to do so will cause a warning message in the broker logs that might upset the person administering the server.

We could of course have used the react syntax in the producer example in a similar way, but it would have added verbosity for little benefit. However, in a larger program where you are, say, processing messages from a Supply, it can work quite nicely:

    use Net::AMQP;
      
    my $supply = Supply.from-list("Hello, World", "Bonjour le monde", "Hola Mundo");
    my $n = Net::AMQP.new;

    react {
        whenever $n.connect {
            whenever $n.open-channel(1) -> $channel {
                whenever $channel.exchange -> $exchange {
                    whenever $supply.map(-> $v { $v.encode }) -> $body {
                        $exchange.publish(routing-key => "hello", :$body );
                        LAST {
                            $n.close("", "");
                            done();
                        }
                    }
                }
            }
        }
    }

Something a bit more useful

You’re probably thinking “that’s all very well, but that’s nothing I couldn’t do with, say, an HTTP client and a small web-server”. Well, you’re getting reliable queuing, persistence of unread messages and so forth, but yes, it could be overkill for a simple application, until, say, you add a requirement to send the messages to multiple, possibly unknown, consumers. This kind of pattern is a use for the “fanout” exchange type, which will deliver a message to all the queues that are bound to the exchange.

In this example we need to declare our own exchange, in order that we can specify its type, but the producer doesn’t become much more complicated:

use Net::AMQP;

my $n = Net::AMQP.new;
my $con =  await $n.connect;
my $channel = $n.open-channel(1).result;
my $exchange = $channel.declare-exchange('logs', 'fanout').result;
$exchange.publish(body => 'Hello, World'.encode);
await $n.close("", "");

The only major difference here is that we use declare-exchange rather than exchange on the channel to obtain the exchange to which we send the message. This has the advantage of causing the exchange to be created on the broker, with the specified type, if it doesn’t already exist, which is useful here as we don’t need to rely on the exchange having been created beforehand (with the command line tool rabbitmqctl or via the web management interface); it similarly returns a Promise that will be kept with the exchange object. You probably also noticed that here no routing-key is being passed to the publish method; this is because for a fanout exchange the routing key is ignored and the messages are delivered to all the consuming queues that are bound to the exchange.

The consumer code is likewise not dissimilar to our original consumer:

use Net::AMQP;

my $n = Net::AMQP.new;

my $connection = $n.connect.result;

react {
    whenever $n.open-channel(1) -> $channel {
        whenever $channel.declare-exchange('logs', 'fanout') -> $exchange {
            whenever $channel.declare-queue() -> $queue {
                whenever $queue.bind('logs') {
                    $queue.consume;
                    whenever $queue.message-supply.map( -> $v { $v.body.decode }) -> $message {
                        say $*PID ~ " : " ~ $message;
                    }
                }
                whenever signal(SIGINT) {
                    say $*PID ~ " exiting";
                    $n.close("", "");
                    done();
                }

            }
        }
    }
}

The exchange is declared in the same way as in the producer; this is really a convenience so you don’t have to worry about which order to start the programs in, as the first one run will create the exchange. However, if you run the producer before the consumer is started, the messages sent will be dropped, as there is nowhere by default to route them. Here we are also declaring a queue without providing a name; this creates an “anonymous” queue (the name is made up by the broker), because the name of the queue doesn’t play a part in the routing of the messages in this case.

You could provide a queue name but if there are duplicate names then the messages will be routed to the queues with the same names on a “first come, first served” basis, which is possibly not the expected behaviour (though it is possible and may have a use.)

Also in this case the queue has to be explicitly bound to the exchange we have declared; in the first example the binding to the default exchange was performed by the broker automatically, but in most other cases you will have to use bind on the queue with the name of the exchange. bind, like many of the methods, returns a Promise that will be kept when the broker confirms that the operation has been completed (though in this case the value isn’t important).

You should be able to start as many of the consumers as you want and they will all receive all the messages in the same order that they are sent. Of course, in a real world application the consumers may be completely different programs written in a variety of different languages.

Keeping on Topic

A common pattern is a set of consumers that are only interested in some of the messages published to a particular exchange; a classic example might be a logging system with consumers specialised to different log levels. AMQP provides a topic exchange type that allows for the routing of messages to a particular queue by pattern matching on the producer supplied routing key.

The simplest producer might be:

	use Net::AMQP;

	multi sub MAIN(Str $message = 'Hello, World', Str $level = 'application.info') {
		my $n = Net::AMQP.new;
		my $con =  await $n.connect;
		my $channel = $n.open-channel(1).result;
		my $exchange = $channel.declare-exchange('topic-logs', 'topic').result;
		$exchange.publish(routing-key => $level, body => $message.encode);
		await $n.close("", "");
	}

This should now be fairly clear from the previous examples, except in this case we declare the exchange as the topic type and also provide the routing key that will be used by the broker to match the consuming queues.

The consumer code itself is again fairly similar to the previous examples, except it will take a list of patterns on the command line that will be used to match the routing key sent to the exchange:

use Net::AMQP;

multi sub MAIN(*@topics ) {
    my $n = Net::AMQP.new(:debug);
    unless @topics.elems {
        say "will be displaying all the messages";
        @topics.push: '#';
    }
    my $connection = $n.connect.result;
    react {
        whenever $n.open-channel(1) -> $channel {
            whenever $channel.declare-exchange('topic-logs', 'topic') -> $exchange {
                whenever $channel.declare-queue() -> $queue {
                    for @topics -> $topic {
                        await $queue.bind('topic-logs', $topic);
                    }
                    $queue.consume;
                    my $body-supply = $queue.message-supply.map( -> $v { [ $v.routing-key, $v.body.decode ] }).share;
                    whenever $body-supply -> ( $topic , $message ) {
                            say $*PID ~ " : [$topic]  $message";
                    }
                }
            }
        }
    }
}

Here essentially the only difference from the previous consumer example (aside from the type supplied to the exchange declaration) is that a topic is supplied to the bind method. The topic can be a simple pattern: a # will match any supplied routing key, giving the same behaviour as a fanout exchange, while a * can be used in any part of the binding topic as a wildcard which will match exactly one dot-separated part of the routing key, so in this example application.* will match messages sent with the routing key application.info or application.debug for instance.

If there is more than one queue bound with the same pattern, they too will behave as if they were bound to a fanout exchange. If the bound pattern contains neither a hash nor an asterisk character then the queue will behave as if it was bound to a direct exchange as a queue with that name (that is to say it will have the messages delivered on a first come, first served basis.)
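
To make the matching rules concrete, here are some hypothetical bindings and what they would receive:

# '#' matches any routing key: behaves like a fanout exchange
await $queue.bind('topic-logs', '#');

# '*' matches exactly one dot-separated word: receives 'application.info'
# and 'application.debug', but not 'application.info.verbose'
await $queue.bind('topic-logs', 'application.*');

# no wildcards: behaves like a direct exchange
await $queue.bind('topic-logs', 'application.info');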

But there’s more to life than just AMQP

Of course. The beauty of the Perl 6 reactive model is that various sources feeding Supplies can be integrated into your producer code, as touched on above, and similarly a consumer can push a message on to another transport mechanism.

I was delighted to discover while I was thinking about the examples for this that the following just works:

	use EventSource::Server;
	use Net::AMQP;
	use Cro::HTTP::Router;
	use Cro::HTTP::Server;

	my $supply = supply { 
		my $n = Net::AMQP.new;
		my $connection = $n.connect.result;
		whenever $n.open-channel(1) -> $channel {
			whenever $channel.declare-queue("hello") -> $queue {
				$queue.consume;
				whenever $queue.message-supply.map( -> $v { $v.body.decode }) -> $data {
					emit EventSource::Server::Event.new(type => 'hello', :$data);
				}
			}
		}
	};

	my $es = EventSource::Server.new(:$supply);

	my $application = route {
		get -> 'greet', $name {
			content 'text/event-stream; charset=utf-8', $es.out-supply;
		}
	}
	my Cro::Service $hello = Cro::HTTP::Server.new:
		:host<localhost>, :port<10000>, :$application;
	$hello.start;

	react whenever signal(SIGINT) { $hello.stop; exit; }

This is a variation of an example in the EventSource::Server distribution; you could of course alter it to use any of the exchange types as discussed above. It should work fine with the producer code from the first example. And (if you were so persuaded) you could consume the events with a small piece of node.js code (or some browser oriented javascript):

	var EventSource = require('eventsource');

	var event = process.argv[2] || 'message';

	console.info(event);
	var v = new EventSource('http://127.0.0.1:10000');

	v.addEventListener(event, function(e) {
		console.info(e);

	}, false);

Wrapping it up

I concluded after typing the first paragraph of this that I would never be able to do this subject justice in a short article, so I hope you consider this an appetizer; I don’t think I’ll ever find the time to write the book that it probably deserves. But I do have all the examples based on the RabbitMQ tutorials, so check that out and feel free to contribute.


Day 16 – 🎶 Deck The Halls With Perf Improvements 🎶

In the UK our lack of Thanksgiving leaves us with Christmas as a period of giving thanks and reflection up to the New Year. To that end I wanted to put together several bits and pieces I’ve been sitting on for a while around the state of Perl 6 performance, pieces that highlight just how much effort is going into this. I’m not sure the wider programming community appreciates the pace and volume of the effort that’s happening.

I’m not a core dev, but I have been a humble user of Perl 6 since just before the 2010 introduction of Rakudo*. Frequently the effort that’s already gone into Rakudo is overshadowed by the perceived effort yet to come. This is especially true of people taking a fresh look at Rakudo Perl 6, who might imagine a fly-by look is what next Christmas will be like. But Perl 6 has historically proven things always improve by next Christmas, for any Christmas you choose.

All the way back in Christmas 2014 I wrote an advent post about why I thought Perl 6 was great for doing Bioinformatics work. What was left out of that post was why the implementation of Perl 6 on Rakudo was not at all ready for doing any serious Bioinformatics. The performance was really not there at all! My first attempts in Perl 6 (when the Parrot VM was in full force) left me with simple operations taking tens of minutes to execute where I’d expect millisecond-level perf. This is unfortunately anecdotal, because I didn’t keep good track of timings then. But it was certainly not a great starting place.

However, fast forwarding to 2014 and MoarVM, I felt comfortable writing the advent post because I was cognisant of how much things had improved in my four years of being a user, but also aware that back then all development was focused on finishing the language definition and a correct implementation. I am, however, a user who has been waiting for perf to get there. That time, I think, has mostly arrived. For this I have to give thanks for the tremendous daily hard work put in by all the core devs. It’s been incredible and motivating to watch it unfold. For me this Christmas is the goal Christmas; it’s arrived. 👏🏻🎊

I have been running and timing the tests for my BioInfo module, which does some basic manipulations of biological sequence data, for many years now. It does this in a really terrible way: lots of mistakes in allocation and dropping of hashes in tight loops, etc. But I’ve left this code alone, by now, for the better part of half a decade, quietly benchmarking in private, and occasionally applauding efforts on the IRC channel when a quantum leap in perf was visible. Sub-10s was a big one! It happened suddenly, from 30/40s. That jump came after I hinted on IRC at a place my code was especially slow, found from profiling!

This is a bit of a long term view; if I zoom in on just this last year, you can see that performance is still improving by integer factors, if not by large quantities of time.

Keep in mind that all of these profiles are not from released versions of the Rakudo compiler, but from HEAD on the given day. So occasionally there is the odd performance regression, as you can see above, that usually isn’t left in for a release.

So what’s going on? How are things getting better? There are several reasons. Many of the algorithmic choices and core built-in functions in Perl 6 have been progressively and aggressively optimised at a source level (more later). But the MoarVM virtual machine backing Rakudo has also increased in its ability to optimise, JIT down to native code, and inline specialised versions of code. This is in part thanks to the --profile option available with Rakudo Perl 6 since 2014, which provides all of this info.

In the above plot of how MoarVM has treated the code frames of my compiled Perl 6 tests, it should hopefully be clear that since this summer there are considerably more frames being JIT compiled, fewer interpreted, and almost all of the specialised frames (orange) end up as native JIT (green). If you want to know more about the recent work on the “spesh” MoarVM code specializer you can read about it in Jonathan Worthington’s 4-part posting on his blog. Bart Wiegmans also has a blog outlining his work on the JIT compiler, and recently presented a nice talk about lots of new functionality that’s yet to land, which should hopefully let many new developers pile on and help improve the JIT. So if that feels interesting as a challenge to you, I recommend checking out the above links.

So that’s my benchmark and my goals, most of which revolve around data structure creation and parsing. But what about other stuff, like numeric work? Has that kept up too, without anyone pushing, as I pushed my view of where things could be improved? The answer is yes!

Once upon a time, back in 2013 a gentleman by the name of Tim King took an interest in finding prime numbers in Perl 6. Tim was fairly upset with the performance he discovered. Rightly so. He started out with the following pretty code:
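
(His exact snippet was embedded in the original posts; the following is my reconstruction in the same junction-based spirit, not necessarily Tim’s exact code.)

# a number is prime if it is divisible by none of 2 .. n-1
my @primes = (2..*).grep(-> $n { $n %% none(2 ..^ $n) });
say @primes[^1000];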

Find any prime via a junction on the definition of a prime: really a nice, elegant solution! But Tim was aghast that Junctions were slow, with the above code taking 11s to show the first 1000 primes. Today that super high level code takes 0.96s.

Being unhappy with how slow the junction based code was, Tim went on to try the more standard iterative approaches. Tim vanished from online shortly after these posts, but he left a legacy that I continued. His code for the prime benchmarks, and my adaptation with results through time, can be found in this gist. Below is the punchline, with another graph showing the average time taken to find the first 1000 primes over 100 trials each. The vertical lines in 2015 indicate a higher standard deviation.

Again, here is a zoomed-in view of more recent history (with the latest data point worrying me a little that I screwed up somehow…)

The convergence to a point above is the overhead of starting and stopping the Rakudo runtime and MoarVM. Finding primes isn’t the effort it once was; it is now only marginally slower than Rakudo just starting up, and at least an order of magnitude faster regardless of how high level and elegant the code solution you choose.

Ok, so we’ve seen MoarVM got some shiny new moving parts. But huge effort has also been put in by developers like Liz, jnthn, Zoffix and, more recently in the world of strings, Samcv, to improve what MoarVM and Rakudo are actually doing under the hood algorithmically.

Sidenote: I’m sure I am not doing most other devs justice at all, especially by ignoring JVM efforts in this post. I would recommend everyone goes and checks out the commit log to see just how many people are now involved in making Rakudo faster, better, stronger. I’m sure they would like to see your thanks at the bottom of this article too!

So, saving you the job of checking out the commit log, I’ve done some mining of my own, looking at commits since last Christmas related to perf gains: things that are N% or Nx faster, like the following:

3c6277c77 Have .codes use nqp::codes op. 350% faster for short strings

ee4593601 Make Baggy (^) Baggy about 150x faster

Those two commits on their own would be an impressive boost for a programming project in the timescale of a year’s core development. But they are just two of hundreds of commits this year alone.

Below are some histograms of the numbers of commits, and the % and x multiplier increases in performance they mentioned. You can grep the logs yourself with the code above. There are some even more exciting gains during 2016 worth checking out.

These really are the perf improvement commits for 2017 alone, with more landing almost daily. This doesn’t even include many of the I/O perf gains from Zoffix’s grant, as they were not always benchmarked before/after. 2016 is equally dense, with some crazy >1000x improvements. Around ten commits this year alone mention a 40x improvement! This is really impressive to see, at least to me. I think it’s also not obvious to many on the project how much they’re accomplishing. Remember these are singular commits; some even compound improvements over the year!

I will leave it here. But really thank you core devs, all of you. It’s been a great experience watching and waiting. But now it’s time for me to get on with some Perl 6 code in 2018! It’s finally Christmas.

Day 15 – A Simple Web Spider With Promises

Promises, Promises

Last summer, I applied for a programming job and the interviewer asked me to write a program that would crawl a given domain, only following links in that domain, and find all the pages that it referenced. I was allowed to write the program in any language, but I chose to perform the task in the Go language because that is the primary language that this company uses. This is an ideal task for concurrent programming, and Go has very good, modern, if somewhat low-level, concurrency support. The main work in a web spider, which will be performed as many times as there are unique anchor links discovered in the domain, is to do an HTTP GET on each page and parse the page text for new links. This task may safely be done in parallel because there is no likelihood (unless you do it very badly) that any invocation of the crawling code will interfere with any other invocation of it.

The creators of Go and Perl 6 were inspired by Sir Antony Hoare’s seminal 1978 work “Communicating Sequential Processes”; indeed, the Go designers invariably refer to their constructs as “concurrency primitives”. It is notable, though, that Perl 6 code tends to be more concise, and therefore easier to tuck into a blog post: the concurrent spider code in Go that I wrote for my job application came in at about 200 lines, versus rather less than half that size in Perl 6.

So let’s look at how a simple web crawler may be implemented in Perl 6. The built-in Promise class allows you to start, schedule and examine the results of asynchronous computations. All you need to do is give a code reference to the Promise.start method, then call await, which blocks until the promise has finished executing. You may then test the status method to find out if the promise has been Kept or Broken.
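
For example, a minimal sketch:

my $p = Promise.start({ 6 * 7 });  # run the block on the thread pool
await $p;                          # block until the promise is resolved
say $p.status;                     # Kept
say $p.result;                     # 42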

You can run the code in this posting by saving it to a local file, e.g. web-spider.p6. Use zef to install HTML::Parser::XML and HTTP::UserAgent, as well as IO::Socket::SSL if you wish to crawl https sites. I will warn you that SSL support seems a little ropey at present, so it is best to stick to http sites. The MAIN sub in a Perl 6 program, when present, indicates a stand-alone program, and this is where execution will start. The arguments to MAIN represent command line parameters. I wrote this program so that it will spider the Perlmonks site by default, but you can override that as follows:

$ perl6 web-spider.p6 [--domain=http://example.com]

Simple Perl 6 Domain Spider

use HTML::Parser::XML;
use XML::Document;
use HTTP::UserAgent;

sub MAIN(:$domain="http://www.perlmonks.org") {

    my $ua =  HTTP::UserAgent.new;
    my %url_seen = $domain => True;  # mark the start page as seen so it is not re-crawled
    my @urls = ($domain);

    loop {
        my @promises;
        while ( @urls ) {
            my $url = @urls.shift;
            my $p = Promise.start({crawl($ua, $domain, $url)});
            @promises.push($p);
        }
        await Promise.allof(@promises);
        for @promises -> $p {
            if $p.status ~~ Kept {
                my @results =  $p.result;
                for @results {
                    unless %url_seen{$_} {
                        @urls.push($_);
                        %url_seen{$_}++;
                    }
                }
            }
        }
        # Terminate if no more URLs to crawl
        if @urls.elems == 0 {
            last;
        }
    }
    say %url_seen.keys;
}

# Get page and identify urls linked to in it. Return urls.
sub crawl($ua, $domain, $url) {
    my $page = $ua.get($url);
    my $p = HTML::Parser::XML.new;
    my XML::Document $doc = $p.parse($page.content);
    # URLs to crawl
    my %todo;
    my @anchors = $doc.elements(:TAG<a>, :RECURSE);
    for @anchors -> $anchor {
        next unless $anchor.defined;
        my $href =  $anchor.attribs<href>;

        # Convert relative to absolute urls
        if $href.starts-with('/') or $href.starts-with('?') {
            $href = $domain ~ $href;
        }

        # Get unique urls from page
        if $href.starts-with($domain) {
              %todo{$href}++;
        }
    }
    my @urls = %todo.keys;

    return @urls;
}

In Conclusion

Concurrent programming will always have many pitfalls, from race conditions to resource starvation and deadlocks, but I think it’s clear that Perl 6 has gone quite some way towards making this form of programming much more accessible to everyone.

Day 14 – The Little Match Girl: Building and Testing Big Grammars in Perl 6

Perl 6 Grammars are great, but what is it like working with them in a project? Here is a bittersweet story of my experience before Christmas, and after Christmas. You can find the repository here. I do not come from a computer science background, so perhaps it will seem humble, but here are my pitfalls and triumphs as I learned Perl 6 Grammars.

The First Match

Like the Little Match Girl, our story takes place before Christmas. The Little Match Girl was tasked with selling a bundle of match sticks on Christmas Eve (New Year’s, actually. I did go back and read the story. Christmas just fits better with Perl 6), while I was tasked with extracting annotations from Modelica models to render as vector graphics. Now, Modelica is a wonderful object oriented modeling language, and I am going to completely gloss over it, except to mention that it has a very nice specification document (pdf) that contains a Concrete Syntax section in the appendix. Perusing this section, I realized that the “syntactic meta symbols” and “lexical units” looked suspiciously like the Perl 6 Grammars that I had recently read a blog post about, and had been anxious to try out.

Example from Modelica Concrete Syntax:

class-definition :
[ encapsulated ] class-prefixes
class-specifier

Example of Perl 6 rule:

rule class_definition {
  [<|w>'encapsulated'<|w>]? 
  <class_prefixes>
  <class_specifier>
}

It was like the Little Match Girl striking the first match, and seeing for the first time a wonderful world beyond her stark reality. A warm little stove. And then it went out.

It was so close that I plopped it into a text editor and replaced the not-Perl 6 bits with some Perl 6 bits to see if it would run. It didn’t run. I hacked away at it; I pointed TOP at different bits to tackle smaller chunks. There were whitespace symbols everywhere, regexes, tokens, rules. I was able to parse some parts; others mysteriously didn’t work. Looking back, it must have been awful. In the meantime, we hacked together a traditional regular expression to extract the annotations, and I placed my Grammar on the shelf.

The Second Match

Not long after, the Grammar::Profiler and Grammar::Debugger were published, and I was inspired to give it another go. I was granted great insights into where my rules were behaving unexpectedly. I was able to drill down through the grammar deeper than before. The second match had been lit, and I was presented with a feast. And then it went out.

In the debugger, I dove into an abyss of backtracking. The profiler ran forever as it dove down into the morass, again and again. I was able to get much farther, but eventually ran into a wall. Success seemed so close, but there were too many missing pieces, in my own experience and in the documentation, for me to get past it.

The Third Match

Time passed, and Christmas came. I had a new position, with time for personal projects. I had the ever improving Grammar documentation to guide me. I had read the book Working Effectively With Legacy Code. It was enough to warrant charging the hill once more.

Object orientation

This was the biggest breakthrough for me. When I understood from the documentation that tokens, rules and regexes are funny looking methods, I suddenly had all of the pieces. When I got home, I immediately checked whether I could override TOP, and whether I could put the Grammar methods into a role. Both worked delightfully, and I was in business. Rather than having one monolithic, all-or-nothing Grammar, I could break it up into chunks. This greatly improved the organization and testability of the code.

One particularly bodacious thing was that I was able to neatly split the grammar up into roles corresponding to those found in the Modelica specification.

lib
----Grammar
--------Modelica
------------LexicalConventions.pm6
------------ClassDefinition.pm6
------------Extends.pm6
------------ComponentClause.pm6
------------Modification.pm6
------------Equations.pm6
------------Expressions.pm6
--------Modelica.pm6
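
To make the composition concrete, here is a minimal sketch of a grammar built from one such role (with placeholder rule bodies, not the real Modelica rules):

role Grammar::Modelica::Expressions {
    rule expression { <simple_expression> }
    rule simple_expression { \d+ }
}

grammar Grammar::Modelica does Grammar::Modelica::Expressions {
    rule TOP { ^ <expression> $ }
}

say Grammar::Modelica.parse('42');  # parses via the role-provided rules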

Unit testing: one layer at a time

Object orientation opened up a sensible scheme of unit testing, and saved me from the nonsense of ad hoc testing by passing bits of Modelica into the Grammar. You can inherit from and override grammars as you would any other class. This allows you to test each rule or token separately, splitting your grammar up into bite-sized layers. You just override TOP with the rule or token to test, and override any dependencies with placeholder methods.

Definition of expression from Expressions.pm6:

rule expression {
  [
  <|w>'if'<|w> <expression> <|w>'then'<|w> <expression> [
  <|w>'elseif'<|w> <expression> <|w>'then'<|w> <expression>
  ]*
  <|w>'else'<|w> <expression>
  ]
  ||
  <simple_expression>
}

Here we see that expression depends on itself and simple_expression. In order to test, we replace the usual simple_expression rule with a placeholder. In this case it just matches the string 'simple_expression'.

Overridden test Grammar from Expressions.t:

grammar TestExpression is Grammar::Modelica {
    rule TOP { ^ <expression> $ }
    rule simple_expression { 'simple_expression' }
}
ok TestExpression.parse('simple_expression');
...

Regression testing is also much more pleasant when you can isolate the problematic portion of code, and create an overridden Grammar that targets it specifically.

<|w> is your friend

In my first efforts, trying to get things like Modelica reserved words working properly was one of the “banes of my existence”. That changed after I found the word boundary matching token <|w>. When I slap one on each side, it works, whether next to white space or a punctuation character.

From ComponentClause.pm6:

rule type_prefix {
  [<|w>[ 'flow' || 'stream' ]<|w>]?
  [<|w>[ 'discrete' || 'parameter' || 'constant' ]<|w>]?
  [<|w>[ 'input' || 'output' ]<|w>]?
}

Token, rule and regex

There is good documentation for these now, but I, too, will briefly contribute a description of my experience. I found that rule, with its :sigspace magic, was the best choice most of the time. token was helpful where tight control of the format was needed.

regex is for backtracking. For Modelica, I found backtracking to be unhelpful, likely because Modelica was designed to be a single-pass language. token and rule work in all the places I thought I needed it. All of my unit tests passed after I removed the regexes, and the Grammar succeeded on four more Modelica Standard Library files. Only use regex when you need it.

End With the Beginning

Another bit that was frustrating to me was the class definition syntax. Modelica uses the form some_identifier ... end some_identifier for its classes. Ensuring that the same identifier was used at the beginning and the end was troublesome for me. Fortunately, Perl 6 allows you to use a capture inside the Grammar methods. The (<IDENT>) capture below populates $0, which can then be used to ensure that our long_class_specifier ends with the proper identifier.

rule long_class_specifier {
  [(<IDENT>) <string_comment> <composition> <|w>'end'<|w> $0 ]
  ||
  [<|w>'extends'<|w> (<IDENT>) <class_modification>? <string_comment> <composition> <|w>'end'<|w> $0 ]
}

Integration Testing: lighting all the matches at once

After my unit tests were all passing, I felt a little trepidation. Sure it can parse my contrived test cases, but how will it do with real Modelica? With trembling hand, I fed it some of Michael Tiller’s example code from his Modelica e-book. It worked! No fiddling around with subtle things that I overlooked, no funny parsing bugs or eternal backtracking. Just success.

Now, stars do occasionally align. Miracles do happen. Sufficiently clever unit tests can be remarkably good at preventing bugs; I have been around the block enough times to verify that. Recalling a presentation by Damian Conway, I decided to run it against the entire Modelica Standard Library. Not exactly all of CPAN, but 305 files is better than the mere two example models I had tried so far.

I wrote the script, pointed it at the Modelica directory, and fired it up. It churned through the library and wheezed to a stop: 150 failures. Now that is familiar territory. After several iterations, I am down to 66 failures when I run it on my parse_modelica_library branch. I just go through a file that is failing, isolate the code that is having issues, and write a regression test for it.

So, in the end the Little Match Girl lit all the rest of her bundle. Then, she died. Don’t die, but you can light all 305 matches with me, in parallel, with examples/parseThemAll.p6:

#!perl6

use v6;
use Test;
use lib '../lib';
use Grammar::Modelica;


plan 305;

sub light($file) {
  my $fh = open $file, :r;
  my $contents = $fh.slurp-rest;
  $fh.close;

  my $match = Grammar::Modelica.parse($contents);
  say $file;
  ok $match;
}

sub MAIN($modelica-dir) {
    say "directory: $modelica-dir";
    die "Can't find directory" if ! $modelica-dir.IO.d;

    # modified from the lovely docs at
    # https://docs.perl6.org/routine/dir
    my @stack = $modelica-dir.IO;
    my @files;
    while @stack {
      for @stack.pop.dir -> $path {
        # collect the files here; they are lit in parallel below
        @files.push: $path if $path.f && $path.extension.lc eq 'mo';
        @stack.push: $path if $path.d;
      }
    }
    # faster to do in parallel
    @files.race.map({light($_)});
}

I will see how many more I can persuade to pass before Christmas. Then perhaps I will figure out how to write some rules to build a QAST.

Merry Christmas!

Day 13 – Mining Wikipedia with Perl 6

Introduction

Hello, everyone!

Today, let me introduce how to mine Wikipedia Infobox with Perl 6.

The Wikipedia Infobox plays a very important role in Natural Language Processing, and there are many applications that leverage it:

  • Building a Knowledge Base (e.g. DBpedia [0])
  • Ranking the importance of attributes [1]
  • Question Answering [2]

Among them, I’ll focus on the infobox extraction issues and demonstrate how to parse the sophisticated structures of the infoboxes with Grammar and Actions.

Are Grammar and Actions difficult to learn?

No, they aren’t!

You only need to know just five things:

  • Grammar
    • token is the most basic one. You may normally use it.
    • rule makes whitespace significant.
    • regex makes the match engine backtrackable.
  • Actions
    • make prepares an object to return when made is called on the match.
    • made is called on a match object and returns the object prepared by make.

For more info, see: https://docs.perl6.org/language/grammars

What is Infobox?

Have you ever heard the word “Infobox”?

For those who haven’t heard it, I’ll explain it briefly.

An easy way to understand Infobox is by using a real example:

[screenshot perl6infobox: the infobox of the Japanese Wikipedia article on Perl 6]

As you can see, the infobox displays the attribute-value pairs of the page’s subject at the top-right side of the page. For example, in this one, it says the designer (ja: 設計者) of Perl 6 is Larry Wall (ja: ラリー・ウォール).

For more info, see: https://en.wikipedia.org/wiki/Help:Infobox

First Example: Perl 6

First of all, I’ll demonstrate the parsing techniques using Japanese Wikipedia rather than English Wikipedia.

The main reason is that parsing Japanese Wikipedia is my $dayjob :)

The second reason is that I want to show how easily Perl 6 can handle Unicode strings.

Then, let’s start parsing the infobox in the Perl 6 article!

There are three problematic portions in the code of the article written in wiki markup:

  1. There are superfluous elements after the infobox block, such as the template {{プログラミング言語}} and the lead sentence starting with '''Perl 6'''.
  2. We have to discriminate three types of tokens: anchor text (e.g. [[Rakudo]]), raw text (e.g. Rakudo Star 2016.04), weblink (e.g. [https://perl6.org/ Perl6.org]).
  3. The infobox doesn’t start at the top position of the article. In this example, {{Comb-stub}} is at the top of the article.

OK, then, I’ll show how to solve the above problems in the order of Grammar, Actions, Caller (i.e. the portion of the code that calls the Grammar and Actions).

Grammar

The key points of the Grammar code are:

  • Solutions to the problem 1:
    • Use .+ to match superfluous portions. (#1)
  • Solutions to the problem 2:
    • Prepare three types of tokens: anchortext (#2), weblink (#3), and rawtext (#4).
      • The tokens may be separated by a delimiter (e.g. ,), so prepare the token delimiter. (#5)
    • Represent the token value-content as an arbitrary length sequence of the four tokens (i.e. anchortext, weblink, rawtext, delimiter). (#6)
  • Solutions to the problem 3:
    • There are no particular things to mention.
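
Putting those points together, a heavily simplified sketch of such a grammar might look like this (the token bodies are my illustrative guesses, not the original code from the post):

grammar InfoboxGrammar {
    token TOP           { <infobox> .+ }   # .+ matches the superfluous trailing elements (#1)
    token infobox       { '{{Infobox' \N* \n <property>+ '}}' }
    token property      { '|' <key> '=' <value-content> \n }
    token key           { <-[ = \n ]>+ }
    # an arbitrary length sequence of the four tokens (#6)
    token value-content { [ <anchortext> || <weblink> || <rawtext> || <delimiter> ]* }
    token anchortext    { '[[' <-[ \] ]>+ ']]' }                      # e.g. [[Rakudo]] (#2)
    token weblink       { '[' <-[ \s \] ]>+ [ \s <-[ \] ]>+ ]? ']' }  # e.g. [https://perl6.org/ Perl6.org] (#3)
    token rawtext       { <-[ \[ \] \{ \} \| \n ]>+ }                 # e.g. Rakudo Star 2016.04 (#4)
    token delimiter     { ',' \s* }                                   # separator between tokens (#5)
}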

Actions

The key points of the Actions code are:

  • Solutions to the problem 2:
    • Make the token value-content consist of the three keys: anchortext, weblink, and rawtext.
  • Solutions to problems 1 and 3:
    • There are no particular things to mention.
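
A matching sketch of the Actions class (again an assumed shape, not the original code):

class InfoboxActions {
    method property($/) {
        # attribute => value pairs
        make( $<key>.trim => $<value-content>.made );
    }
    method value-content($/) {
        # a hash with the three keys described above
        make %(
            anchortext => ($<anchortext> // ()).map(*.Str).list,
            weblink    => ($<weblink>    // ()).map(*.Str).list,
            rawtext    => ($<rawtext>    // ()).map(*.Str).list,
        );
    }
    method TOP($/) {
        make $<infobox><property>.map(*.made).list;
    }
}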

Caller

The key points of the Caller code are:

  • Solutions to the problem 3:
    • Read the article line-by-line and make a chunk which contains the lines from the current line to the last line. (#1)
    • If the parser determines that:
      • The chunk doesn’t contain the infobox: it returns an undefined value. One of the good ways to receive an undefined value is to use the $ sigil. (#2)
      • The chunk contains the infobox: it returns a defined value. Use the @() contextualizer and iterate over the result. (#3)
  • Solutions to problems 1 and 2:
    • There are no particular things to mention.
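
Finally, a sketch of the Caller along the lines described (InfoboxGrammar and InfoboxActions are the sketches above; the input file name is hypothetical):

my @lines = 'perl6-article.txt'.IO.slurp.lines;
for ^@lines -> $i {
    # a chunk containing the lines from the current line to the last line (#1)
    my $chunk = @lines[$i .. *].join("\n") ~ "\n";
    # $match stays undefined while the chunk doesn't start with the infobox (#2)
    my $match = InfoboxGrammar.parse($chunk, :actions(InfoboxActions));
    next without $match;
    # iterate the result with the @() contextualizer (#3)
    for @($match.made) -> $property {
        say $property;
    }
    last;
}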

Running the Parser

Are you ready?
It’s time to run the first example!

The example we have seen may be too easy for you. Let’s try a harder one!

Second Example: Albert Einstein

As the second example, let’s parse the infobox of Albert Einstein.

As you can see from the code of the article written in wiki markup, there are five new problems here:

  1. Some of the templates
    1. contain newlines; and
    2. are nested (e.g. {{nowrap|{{仮リンク|...}}...}})
  2. Some of the attribute-value pairs are empty.
  3. Some of the value-sides of the attribute-value pairs
    1. contain break tag; and
    2. consist of different types of the tokens (e.g. anchortext and rawtext).
      So you need to add positional information to represent the dependency between tokens.

I’ll show how to solve the above problems in the order of Grammar, Actions.

The code of the Caller is the same as the previous one.

Grammar

The code for Grammar is:

  • Solutions to the problem 1.1:
    • Create the token value-content-list-nl, which is the newline separated version of the token value-content-list. It is useful to use the modified quantifier % to represent this kind of sequence. (#1)
    • Create the token template. In this one, define a sequence that represents Plainlist template. (#2)
  • Solutions to the problem 1.2:
    • Enable the token template to call the token value-content-list. This modification triggers recursive calls and captures the nesting structure, because the token value-content-list contains the token template. (#3)
  • Solutions to the problem 2:
    • In the token property, define a sequence whose value-side is empty (i.e. a sequence that ends with ‘=’). (#4)
  • Solutions to the problem 3.1:
    • Create the token br (#5)
    • Let the token br follow the token value-content in the two tokens:
      • The token value-content-list (#6)
      • The token value-content-list-nl (#7)

Actions

The code for Actions is:

  • Solutions to the problem 3.2:
    • Use Match.from and Match.to to get the match starting position and the match ending position respectively when calling make. (#1 ~ #4)

Running the Parser

It’s time to run!

Conclusion

I demonstrated parsing techniques for infoboxes. I highly recommend creating your own parser if you have a chance to use Wikipedia as a resource for NLP. It will deepen your knowledge not only of Perl 6 but also of Wikipedia.

See you again!

Citations

[0] Lehmann, Jens, et al. “DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia.” Semantic Web 6.2 (2015): 167-195.

[1] Ali, Esraa, Annalina Caputo, and Séamus Lawless. “Entity Attribute Ranking Using Learning to Rank.”

[2] Morales, Alvaro, et al. “Learning to answer questions from wikipedia infoboxes.” Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016.

License

All of the materials from Wikipedia are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.


Itsuki Toyota
A web developer in Japan.

Day 12 – The Year of Perl 6 Books

We can quibble all day about whether 2017 was the year of the Linux desktop, but there can be little doubt that it was the year of the Perl 6 book.

Perl 6 at a Glance

December 2016 brought us the ebook launch of Perl 6 at a Glance by Andrew Shitov, and then in 2017 the print version came out. It is an introduction to Perl 6 that targets programmers already familiar with another language.

It is the first of a generation of “modern” Perl 6 books. There weren’t many Perl 6 books before; the most notable was “Perl 6 and Parrot Essentials”, which was written when Perl 6 was very much a language in flux. The December 2015 release of the Perl 6 language version v6.c (and the accompanying Rakudo Perl 6 compiler) finally offers enough stability to make Perl 6 books work.

Think Perl 6

The next book released in 2017 was “Think Perl 6: How to Think Like a Computer Scientist”. It is a Perl 6 adaptation of Allen Downey’s great book Think Python: How to Think Like a Computer Scientist, lovingly ported to Perl 6 by Laurent Rosenfeld. It is available in print from O’Reilly, and freely available as an ebook from Green Tea Press. It is also available under an Open Source license in its source form (LaTeX) on GitHub.

“Think Perl 6” is an introduction to programming and computer science that happens to use Perl 6 as its primary tool. It targets the absolute beginner, and goes into a lot of detail on basic concepts such as branches, loops, variables, expressions, functions, recursion and so on.

Learning to Program with Perl 6

A book I didn’t have on my radar until it was available for purchase on Amazon was Learning to program with Perl 6: First Steps: Getting into programming without leaving the command line by JJ Merelo. You can buy it on Amazon pretty cheaply, or check it out on GitHub, where you can find a musical as bonus material.

It mostly targets beginners, and also discusses some things related to programming, like the use of GitHub, some shell features, and SSH. It is a light-hearted introduction into computing and Perl 6.

Perl 6 Fundamentals

Perl 6 Fundamentals started its life as “Perl 6 by Example”, written by Moritz Lenz, aka yours truly. (Yes, authors write about themselves in the third person. That “About the Author” section in each book? Written by the author. In third person. Weird). When Apress acquired the book, it was renamed to Perl 6 Fundamentals: A Primer with Examples, Projects, and Case Studies. It is available from everywhere that you can buy books. At least I hope so :-)

Each chapter focuses on one (at least somewhat) practical example, and uses that as an excuse to talk about various Perl 6 features, including concurrency, functional programming, grammars, and calling Python libraries through Inline::Python. (You can read the chapter about Inline::Python over at perltricks.com.) It targets programmers with previous experience, though not necessarily Perl 6 (or Perl 5) experience.

Larry Wall has kindly written a foreword for the book.

Perl 6 Deep Dive

Andrew Shitov’s second Perl 6 book, Perl 6 Deep Dive, is, as the name suggests, more comprehensive and, well, a deeper dive than “Perl 6 at a Glance”, though somewhat similar in style. With more than 350 pages, it seems to have the largest coverage of Perl 6 features of any book so far.

Using Perl 6

Guess who’s released a third Perl 6 book within one year? That’s right, Andrew Shitov again. Somebody give that man a medal! Using Perl 6 is a collection of 100 programming challenges/problems and their solution in Perl 6. This is what Andrew wrote about it:

About a year ago, I decided to write a book about using Perl 6. Later, the plans changed and I published “Perl 6 at a Glance”, after which I wanted to write “Migrating to Perl 6” but instead wrote “Perl 6 Deep Dive” for Packt Publishing. Here and there, I was giving trainings on Perl 5, Python, and JavaScript, and was always suffering from finding a good list of easy tasks that a newcomer can use to train their skills in the language. Finally, I made up the list and solved it in Perl 6. This is the content of “Using Perl 6” — a book with solutions to 100 programming challenges, from simple to intermediate, with explanations of the Perl 6 constructions used.

Since his fourth book, Migrating to Perl 6, will be released in 2018, it doesn’t get its own section. Take that, Andy! :-) This is the “Perl 6 for Perl 5 programmers” book that people (Perl 5 people, mostly) have been asking for on IRC and some other media.

And of course I won’t mention Andrew’s kickstarter for a cookbook-style project, because that will further skew the stats. Ooops, I just did. Hmm. Well, go support the man!

Parsing with Perl 6 Regexes and Grammars

After writing a general Perl 6 book, I wanted to focus on a narrower topic. A non-representative poll on twitter confirmed my suspicion that regexes and grammars would be the best niche, and so Parsing with Perl 6 Regexes and Grammars: A Recursive Descent into Parsing was born.

It requires basic programming knowledge to read, but no prior exposure to regexes or to Perl. It goes from the building blocks of regexes to fully-featured parsers, including Abstract Syntax Tree generation and error reporting. A discussion of three very different example parsers concludes the nearly 200 pages, which could also have been titled “far more than you ever wanted to know about parsing with Perl 6”.

Right now, the ebook version is available for purchase, and I hope that the print version will be ready by Christmas. (And I’m talking about Christmas 2017, to be sure :)

Books in the Pipeline

I’d be remiss if I didn’t point out two more books that aren’t available yet, but might be in the next months or years.

brian d foy works on Learning Perl 6. In his last update, he shares that the first draft of the book is written, with a plan for things that need rewriting.

Gabor Szabo crowd-funded a book on Web Application Development in Perl 6 using the Bailador framework. The earlier chapters are mostly fleshed out, and the later chapters mostly exist as skeletons. Gabor expects it to be finished in 2018.

Keeping Track

The flood of Perl 6 books has made it hard for newcomers to decide which book to read, so I created the https://perl6book.com/ website that has one-line summaries, and a flow-chart for deciding which book to buy.

Even though I had input from other Perl 6 authors, it certainly reflects my biases. But, you can help to improve it!

Summary

With 7 Perl 6 books published by three major publishers in 2017, it’s been a fantastic year. I am also very happy with the diversity of the books, their target audience and styles. I hope you are too!

A final plea: If you have read any of these books, please give the author some feedback. They put incredible amounts of work into those, and feedback helps the author’s learning process and motivation. And if you liked a book, maybe even give it 5 stars on Amazon and write a line or two about why you liked it.