Promises, Promises
Last summer, I applied for a programming job and the interviewer asked me to write a program that would crawl a given domain, following only links within that domain, and find all the pages it referenced. I was allowed to write the program in any language, but I chose Go because that is the primary language this company uses. This is an ideal task for concurrent programming, and Go has very good modern, if somewhat low-level, concurrency support. The main work in a web spider, which is performed as many times as there are unique anchor links discovered in the domain, is to do an HTTP GET on each page and parse the page text for new links. This work can safely be done in parallel because there is no likelihood (unless you do it very badly) that any one invocation of the crawling code will interfere with any other.
The creators of both Go and Perl 6 were inspired by Sir Anthony Hoare’s seminal 1978 work “Communicating Sequential Processes”. Indeed, the Go designers invariably refer to their constructs as “concurrency primitives”, whereas Perl 6 code tends to be more concise and is therefore easier to tuck into a blog post: the concurrent spider code in Go that I wrote for my job application came in at about 200 lines, versus rather less than half that size in Perl 6.
So let’s look at how a simple web crawler may be implemented in Perl 6. The built-in Promise class allows you to start, schedule and examine the results of asynchronous computations. All you need to do is pass a block of code to the Promise.start method, then call await on the resulting promise, which blocks until the computation has finished. You can then check the status method to find out whether the promise was Kept or Broken, and read the value it produced from the result method.
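As a minimal sketch of that pattern on its own (the sleep and the string are just stand-in work, not part of the spider):

my $promise = Promise.start({
    sleep 1;           # pretend to do something slow
    'page contents';   # the block's return value becomes the promise's result
});

await $promise;        # block until the computation has finished
say $promise.status;   # Kept (it would be Broken if the block had died)
say $promise.result if $promise.status ~~ Kept;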
You can run the code in this posting by saving it to a local file, e.g. web-spider.p6. Use zef to install HTML::Parser::XML and HTTP::UserAgent, as well as IO::Socket::SSL if you wish to crawl https sites (a one-line install command is shown below). I will warn you that SSL support seems a little ropey at present, so it is best to stick to http sites. The MAIN sub in a Perl 6 program, when present, marks a stand-alone program and is where execution starts; its arguments represent command line parameters. I wrote this program so that it spiders the Perlmonks site by default, but you can override that as follows:
$ perl6 web-spider.p6 [--domain=http://example.com]
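For reference, the modules mentioned above can be installed in one go with zef (IO::Socket::SSL is only needed if you intend to crawl https sites):

$ zef install HTML::Parser::XML HTTP::UserAgent IO::Socket::SSL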
Simple Perl 6 Domain Spider
use HTML::Parser::XML;
use XML::Document;
use HTTP::UserAgent;

sub MAIN(:$domain="http://www.perlmonks.org") {
    my $ua = HTTP::UserAgent.new;
    my %url_seen;
    my @urls = ($domain);

    loop {
        my @promises;
        while ( @urls ) {
            my $url = @urls.shift;
            my $p = Promise.start({ crawl($ua, $domain, $url) });
            @promises.push($p);
        }

        await Promise.allof(@promises);

        for @promises.kv -> $index, $p {
            if $p.status ~~ Kept {
                my @results = $p.result;
                for @results {
                    unless %url_seen{$_} {
                        @urls.push($_);
                        %url_seen{$_}++;
                    }
                }
            }
        }

        # Terminate if no more URLs to crawl
        if @urls.elems == 0 {
            last;
        }
    }

    say %url_seen.keys;
}

# Get page and identify urls linked to in it. Return urls.
sub crawl($ua, $domain, $url) {
    my $page = $ua.get($url);
    my $p = HTML::Parser::XML.new;
    my XML::Document $doc = $p.parse($page.content);

    # URLs to crawl
    my %todo;
    my @anchors = $doc.elements(:TAG<a>, :RECURSE);
    for @anchors -> $anchor {
        next unless $anchor.defined;
        my $href = $anchor.attribs<href>;
        next unless $href.defined;   # skip anchors that have no href attribute

        # Convert relative to absolute urls
        if $href.starts-with('/') or $href.starts-with('?') {
            $href = $domain ~ $href;
        }

        # Get unique urls from page
        if $href.starts-with($domain) {
            %todo{$href}++;
        }
    }
    my @urls = %todo.keys;
    return @urls;
}
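One detail worth calling out: Promise.allof returns a promise that is kept once every promise passed to it has resolved, whether each individual promise ended up Kept or Broken. That is why the loop above checks each promise’s status before touching its result. A small sketch of the behaviour (the die is just a stand-in for a failed page fetch):

my $good = Promise.start({ 'some links' });
my $bad  = Promise.start({ die 'could not fetch page' });

await Promise.allof($good, $bad);   # returns once both have resolved
say $good.status;   # Kept
say $bad.status;    # Broken
say $good.result;   # some links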
In Conclusion
Concurrent programming will always have many pitfalls, from race conditions to resource starvation and deadlocks, but I think it’s clear that Perl 6 has gone quite some way towards making this form of programming much more accessible to everyone.