Day 15 – A Simple Web Spider With Promises

Promises, Promises

Last summer, I applied for a programming job and the interviewer asked me to write a program that would crawl a given domain, following only links within that domain, and find all the pages it referenced. I was allowed to write the program in any language, but I chose Go because that is the primary language the company uses. This is an ideal task for concurrent programming, and Go has very good modern, if somewhat low-level, concurrency support. The main work in a web spider, performed as many times as there are unique anchor links discovered in the domain, is to do an HTTP GET on each page and parse the page text for new links. This work can safely be done in parallel because there is no likelihood (unless you do it very badly) that one invocation of the crawling code will interfere with another.

The creators of Go and Perl 6 were both inspired by Sir Anthony Hoare’s seminal 1978 work “Communicating Sequential Processes”; indeed, the Go designers invariably refer to their constructs as “concurrency primitives”. Perl 6 code, however, tends to be more concise, and therefore easier to tuck into a blog post: the concurrent spider code in Go that I wrote for my job application came in at about 200 lines, versus rather less than half that in Perl 6.

So let’s look at how a simple web crawler may be implemented in Perl 6. The built-in Promise class allows you to start, schedule and examine the results of asynchronous computations. All you need to do is pass a code block to the Promise.start method, then call the await method, which blocks until the promise has finished executing. You can then check the status method to find out whether the promise was Kept or Broken, and, if it was Kept, call the result method to retrieve the value it produced.
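
Here is a minimal sketch of that workflow (the code and values are purely illustrative): start a computation asynchronously, wait for it to finish, then inspect its status before reading the result.

# Start the block on a worker thread; a Promise is returned immediately
my $promise = Promise.start({ 21 * 2 });

# allof() yields a promise that is kept once $promise has finished,
# whether it was Kept or Broken, so awaiting it will not rethrow
await Promise.allof($promise);

if $promise.status ~~ Kept {
    say $promise.result;                   # 42
}
else {
    say "Broken: { $promise.cause }";      # the exception that broke it
}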

You can run the code in this post by saving it to a local file, e.g. web-spider.p6. Use zef to install HTML::Parser::XML and HTTP::UserAgent, as well as IO::Socket::SSL if you wish to crawl https sites. I will warn you that SSL support seems a little ropey at present, so it is best to stick to http sites. The MAIN sub in a Perl 6 program, when present, marks it as a stand-alone program and is where execution starts. The arguments to MAIN represent command-line parameters. I wrote this program so that it spiders the Perlmonks site by default, but you can override that as follows:

$ perl6 web-spider.p6 [--domain=http://example.com]
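
As a side note, the named parameters of MAIN map directly onto command-line options, and Perl 6 will generate a usage message for you when the supplied arguments do not match. A tiny, hypothetical sketch (the file name cli-demo.p6 and the --verbose flag are made up for illustration):

# cli-demo.p6 -- named MAIN parameters become optional command-line switches
sub MAIN(:$domain = "http://www.perlmonks.org", Bool :$verbose = False) {
    say "Crawling $domain";
    say "Verbose mode on" if $verbose;
}

# $ perl6 cli-demo.p6                                 # uses the default domain
# $ perl6 cli-demo.p6 --domain=http://example.com --verbose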

Simple Perl 6 Domain Spider

use HTML::Parser::XML;
use XML::Document;
use HTTP::UserAgent;

sub MAIN(:$domain="http://www.perlmonks.org") {

    my $ua = HTTP::UserAgent.new;
    my %url_seen;
    my @urls = ($domain);
    %url_seen{$domain}++;   # don't re-crawl the start page if a page links back to it

    loop {
        my @promises;
        while ( @urls ) {
            my $url = @urls.shift;
            my $p = Promise.start({crawl($ua, $domain, $url)});
            @promises.push($p);
        }
        await Promise.allof(@promises);
        for @promises -> $p {
            if $p.status ~~ Kept {
                my @results = $p.result;
                for @results {
                    unless %url_seen{$_} {
                        @urls.push($_);
                        %url_seen{$_}++;
                    }
                }
            }
        }
        # Terminate if no more URLs to crawl
        if @urls.elems == 0 {
            last;
        }
    }
    say %url_seen.keys;
}

# Get page and identify urls linked to in it. Return urls.
sub crawl($ua, $domain, $url) {
    my $page = $ua.get($url);
    my $p = HTML::Parser::XML.new;
    my XML::Document $doc = $p.parse($page.content);
    # URLs to crawl
    my %todo;
    my @anchors = $doc.elements(:TAG<a>, :RECURSE);
    for @anchors -> $anchor {
        next unless $anchor.defined;
        my $href = $anchor.attribs<href>;
        next unless $href.defined;

        # Convert relative to absolute urls
        if $href.starts-with('/') or $href.starts-with('?') {
            $href = $domain ~ $href;
        }

        # Get unique urls from page
        if $href.starts-with($domain) {
            %todo{$href}++;
        }
    }
    my @urls = %todo.keys;

    return @urls;
}

In Conclusion

Concurrent programming will always have many pitfalls, from race conditions to resource starvation and deadlocks, but I think it’s clear that Perl 6 has gone quite some way towards making this form of programming much more accessible to everyone.

5 thoughts on “Day 15 – A Simple Web Spider With Promises”

  1. There are two bugglets in the code:
    1) `%urls_seen` needs to be `%url_seen`
    2) `my $href = $anchor.Str;` needs to be `my $href = $anchor.attribs<href>;`

  2. I see a few issues in this code:
    1. Brute-force memory usage model. URLs to be processed are converted into promises far too early, even if there is no chance they will get threadpool access soon.
    2. Not utilizing the full potential of the machine, because a whole batch of URLs must complete before the next batch is started. One slow site remaining in a batch will leave CPU cores unused.
    3. Lack of control over the parallelization level.

    IMO a more efficient approach would be:

    
    my @tasks = 'http://main-site.com';
    my @workers;

    repeat {

        if @tasks {

            # start promise ALAP (solves issue #1)
            push @workers, start { fetch_page( @tasks.shift ) };

            # there are still slots on the threadpool
            # (also allows a custom parallelization level if needed)
            next if +@workers < $*SCHEDULER.max_threads - 1;
        }

        # analyze any finished thread result immediately (solves issue #2)
        await Promise.anyof( @workers );

        my @remaining_workers;
        for @workers -> $worker {

            # make new tasks reported by finished workers
            # available for processing immediately
            # (uniqueness would still have to be added here)
            if $worker.status ~~ Kept {
                push @tasks, |$worker.result;
            }
            # worker still running
            else {
                push @remaining_workers, $worker;
            }
        }

        @workers = @remaining_workers;

    } while @workers or @tasks;
    

    TIMTOWTDI, so if your crawler was supposed to “just work”
