Day 15 – Bioinformatics and the joy of Perl 6

On the 15th day of Christmas Perl6 said let there be life-sciences, and so there was.

A lot of coverage and testament is made to Perl’s role in developing dynamic content for the early web. As well as its slow decline. However, there is an equally important story that goes alongside the tale of Perl as the language of CGI. That is the story of Perl as the scientific glue that saved the human genome project, there is even a published journal article in Perl’s honour.

It’s not an accident that Perl did well and continues to do well in the field of Bioinformatics. Few languages at the time provided first class semantics that matched the problem domain such as: regular expression literals, good string handling, trivial data structure creation, sensible automatic type conversion and XS for native code wrapping when speed was important. That Perl replaced a lot of sed+awk use cases for one liners is also a welcome feature for a busy bioinformatician. Perl persists today as a de facto scripting language due to the large community effort known as BioPerl. For those not in the know the API of BioPerl inspired a whole host of work in most other languages such as Biopython, BioJava, BioRuby etc. To date none are quite as feature complete as BioPerl though each has its unique additional features. Biopython is getting there from a lot of work done through GSoC action in recent years. I also use ete2 from Python on a regular basis and recommend people check that out for phylogenetics work. However, this post is about the future and I think Perl 6 has a lot to offer everyone working in bioinformatics.

XS the Perl5 system of integrating native code into a Perl module has done fantastically well in providing scientists with the ability to integrate high performance C code with the simplicity of Perl’s syntax. But XS in the Perl6 world has been replaced with NativeCall. This library makes it incredibly simple to integrate C code with a Perl6 class or subroutine. I’m not going to cover this here though, the repo has plenty of docs and examples and two previous advent articles demonstrate use cases.

Instead everything on show here are core features available from writing ‘perl6’ at a command prompt. You don’t need any external dependencies so any system with vanilla Rakudo Perl6 can do everything you’re about to see.

The Grammar of life

Hopefully most people have at least heard about DNA, RNA and perhaps even proteins. All of these are what are referred to in the biz as biopolymers, or translated into programming parlance “strings of life”. Essentially each molecule is a big chain of smaller building blocks you can denote in a textual form with a letter. In DNA we have the alphabet GCTA , in RNA we never see a T instead we see a U in its place. Proteins have their own alphabet that is unrelated to DNA and RNA between 21 and 23 characters long, the last two are very quirky and relatively rare.

So in this section I’m going to outline the hip way to parse sequence files in Perl6. This is how sequencing data usually comes into a bioinformatician. A giant text file! I’ll be using our own class to hold the sequence and a well defined grammar for the well loved (not) FASTA format. So to start here is the Grammar as if it fell from heaven as a shooting star this wintery night:

grammar FASTA::Grammar {
    token TOP { <record>+ }

    token record { ">"<id><comment>?"\n"<sequence> }

    token id { <-[\ \n]>+ }

    token comment { " "<-[\n]>+ }

    proto rule sequence {*}
          rule sequence:sym<dna> { <[ACGTRYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<rna> { <[ACGURYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<aa> { <[A..Z\*\-\n]>+ }
}

Cool we have a super formal grammar defined for FASTA and it even looks like we are doing something special for different types of sequence! I guess I had better explain a little, keep in mind that Grammar is just a huge extension to regular expressions which you are hopefully familiar with.

First up we have token and rule. The difference is that token is like a hungry hippo, it consumes the input sequence and is quite greedy. So if a token matches at all it consumes the string and then lets parsing continue. If there is a choice of tokens the longest wins, which can be thought of as the winner in a game of hungry hippos!
A rule is a bit more tricksome, it has backtracking and can regurgitate and give up. Specifically we have a set of rules called sequence with the more abstract one defined with the keyword proto. The reason we have three rules is for each type of biopolymer, they are ordered by the most to least specific. Unfortunately DNA and RNA can both look like protein. So our Grammar is limited in its ability to detect the difference, but so would a human that’s how rubbish this major data format is!
With rule unlike token if the start of a sequence matches DNA but the rest of it looks like protein we back out of matching DNA because the grammar is smart enough to realise it can get a longer match from the other rule, but has to spew up the letters already consumed first. If we had used token we might have had a really short DNA token then a protein one etc. This might all sound a bit complicated but I’ve made a trace using Grammar::Debugger of what happens matching a test.fa file in this Gist.
If you do look at the trace you will see the final important fact about Grammars. We need a TOP token or rule, this is where we start the thought process and parsing of data. So a FASTA file is made of more than one record, that’s our TOP. A record is made of an id, a comment and a sequence and so the logic flows. The Perl6 regular expression syntax is way out of scope but Masak has a nice intro.

Now lets get this Grammar rolling along doing something. If we want to use a sequence in our code we need a basic type to represent it. Below is probably the simplest thing you would want:

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";
}

It’s not the most inspired or interesting class in the world but it will do!

Now how do we hook up our Seq class to our FASTA Grammar to produce some nice objects from our FASTA file? The answer is we need to take action! But not the political kind, just create an Actions class for the Grammar to use when parsing. By default we could call .parse on our grammar above and we would get all the text chunked up into nicely named labels as a giant tree of Match variables. Some of the time that’s what you want. Action classes have methods that match the names of tokens and rules in a grammar. Each method is called at the corresponding point in the Grammar’s parse tree and given the chance to make something more interesting, replacing all the crummy Match variables in that part of the output. So what do we want:

class FASTA::Actions {
    method TOP($/) {
        make $<record>>>.ast;
    }
    method record($/) {
        make Seq.new(
            id => ~$<id>,
            comment => ($<comment>) ?? (~$<comment>).trim !! '',
            sequence => $<sequence>.subst("\n","", :g)
        );
    }
}

So the logic above is whenever you parse out the whole of a record we want to make a new Seq object and shove it in the resulting match tree for later. You can see that we do some trimming and cleaning up of white space in the action before creating a clean Seq object. Finally the TOP action basically just says that the final product of .parse should be all the record matches, which we just transformed into Seq objects, so it’s literally a list of Seq! Nice! Lets give it a whirl on that test file I mentioned:

Input

>hello
GCTATATAAGC
>world prot
TATAKEKEKELKL

Code

my $grammar = FASTA::Grammar.new;
for $grammar.parsefile("test.fa", actions => FASTA::Actions).ast -> $seq {
    say $seq.id;
}

Output

hello
world

Ultimately we didn’t need the added complexity of matching a sequence as DNA/RNA/protein but it was kind of cool, and at a later date we might want to do something different and have fancier Seq objects. I’ll leave exploration of this as a task to the interested reader :P

A Bag full of sequence

So I’m concerned that our Seq class is kind of crap still. It might as well just be a hash of values at this point. Sure we get a type which is nice, but it just doesn’t have anything cool about it. I think a nice addition is if we made our class work a little better with the rest of the Perl6 family of operations, letting more of the languages magic get at our data. The results of this simple change might actually impress you!

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";

    multi method Numeric (Seq:D: --> Numeric) {
        $.sequence.codes;
    }

    multi method Bag (Seq:D: --> Bag) {
        bag $.sequence.comb;
    }

    multi method Str (Seq:D: --> Str) {
        $.sequence;
    }

    multi method gist (Seq:D: --> Str) {
        ">$.id $.comment\n$.sequence";
    }
}

Above I’ve introduced some snazzy new methods specific to our class. Anyone familiar with toString from Java should get what both the Str and gist methods are upto. They are well observed methods in Perl6 that deal with interpreting an object as a string. But it’s probably not clear at all why you would want two of them? Well simply put the Str method is there for the pure string interpretation of our object. If I use operators that expect Str this would be called to automagic our Seq into a string. The gist is kind of a neat one, if we use the higher level say function this calls the .gist on our object and prints that since it’s already assumed you want something a bit more human readable. Even more neat is if you are using the command line REPL the .gist is immediately called on an object that resulted from an expression. So when we create a new Seq in the Perl6 REPL or just give the variable name on its own we get FASTA ready to copy and paste! Take a look at the REPL dump below and see if you can work out where we are implicitly using $seq as a Str rather than Seq:

> my $seq = Seq.new(id=>'hello',comment=>'world',sequence=>'GCTACCTATAAAGC');
>hello world
GCTACCTATAAAGC
> say $seq
>hello world
GCTACCTATAAAGC
> "junk" ~ $seq
junkGCTACCTATAAAGC
> my $big-seq = Seq.new(id=>'whole',comment=>'new world',sequence=> $seq x 3);
>whole new world
GCTACCTATAAAGCGCTACCTATAAAGCGCTACCTATAAAGC
> print $seq
GCTACCTATAAAGC>

But wait! There’s more. For the low low price of just two more methods we get numeric wrangling and statistical modelling! The Numeric method does what it says on the tin, and we defined this as the length of the sequence which feels reasonable as a numerical representation.

> $seq + $big-seq
56
> ($seq + $big-seq) / $seq
4

Bag creates a Bag type object from our sequence. Basically the Bag we create is an immutable counting dictionary of all the letters in the sequence. So we now have the first step in some frequency analysis, letters and a count of how often they occurred. But what is even more awesome about Bags in Perl6 are the pick and roll methods. These allow you to sample the keys of the Bag based on their frequency! So if you wanted to create some dummy sequences to do statistical tests with it’s easy as pie! Roll keeps the letters in the bag as you sample whereas pick removes them and then alters the distribution of the next pick. Below I give an example where we .pick strings that are as long the input sequence but will be randomly different but with the same distribution of characters. This sort of dummy sequence is useful for calculating the statistical significance of our sequence.

> $seq.Bag
bag(G(2), C(4), T(3), A(5))
> $seq.Bag.pick($seq).join
CAGATCGTAAATCC
> $seq.Bag.pick($seq).join
GAATCCCGTACATA

Look at that we even use $seq as the length to pick to! Perhaps not the most obvious or readable thing in the world, but for playing on the REPL it’s a nice feature and not utterly unintuitive to people used to context in Perl.

A Range of possibilities

A common task in bioinformatics is dealing with what is commonly referred to as “annotation” type data. This can be anything from specifying the location of a gene in a whole genome, to specifying a patterned site within a protein sequence. In BioPerl there is a specific and special Bio::Range class for dealing with the sort of data that represents a segment on a number line with a start and an end position. In Perl6 this is built right into the language with most of the features any bioinformatician would be happy with. Plus they can be wrangled to Sets and use those operators. So creating a Range couldn’t be easier and more obvious, you’ve all done it before!

> my $range = 1..10;
> $range.WHAT.say;
(Range)

This is the exact same syntax and object used if you wanted to create a for loop over a given range both in Perl5 and Perl6.

To show what we can do with this basic type in Perl6 I’ve taken some data about the gene p53 which is a well known one for the development of many cancers. It’s all real data and you can see a coloured block diagram of the protien annotation data here. You think of each block as being single Range object you get some idea of what I’m harping on about. So lets start out with some DNA and some protein features in code but remember we could just makes these from a file just as easily:

#Genomic chromosome coordinates of the p53 gene in human
my $p53_gene = 7_565_097 .. 7_590_863;

#A single nucleotide polymorphism AKA a point mutation
my $snp = 7_575_996;

#Some annotated protein domains
my %domains = p53-TF => 97..287, p53-Tetramer => 319..357;

#Predicted interaction site
my $binding_site = 322..355;

#A known protein Phosphorylation site
my $phospho_site = 99;

So good so far, we have some one off integer values some ranges and a cool hash of ranges! The bonus Perl6 features come from operators that accept a Range as their operands. For example most importantly set operations work as you might expect:

#Work out if the SNP is within the gene
say "MUTATION!!!" if $p53_gene (cont) $snp;

#Find out which domains contain the phosphorylation site
my %phosphorylated = %domains.grep({ .value (cont) $phospho_site })

#Find out which domains are in the binding site and what percentage is binding
my %binding = %domains.map({
                            $_.key => 100 / $_.value.elems * ($_.value (&) $binding_site)
                      })
                      .grep(*.value > 0.0);

Hopefully a lot of the above is almost self explanatory once I introduce the set operators for contains (cont) or ∋ and intersection (&) or ∩. I chose the ASCII variants, mostly because I don’t have a ∩ key on my keyboard and I wouldn’t want my code to be trashed if someone used a non Unicode editor! The contains returns a Bool result of if the right operand contains the left and the intersection operator returns the set of numbers that intersect both of the Ranges. The final hash wrangling is quite nice, it keeps the keys of the %domains hash but replaces the values with the percentage coverage with our binding site and only gives a result if there was any coverage between the two Ranges. Hopefully almost any bioinformatician can see how nice these sort of primitive types and operations are to have in the core of a language.

A Curio

If any of the above has you at least a little curious if not excited to see what Perl6 can do within Bioinformatics then you can check out more at either Chris Fields official BioPerl6 repo or my own playpen. I’ll leave you with one final code nugget to chew on. I found this by accident just writing a method to wrangle DNA sequence to RNA sequence in Perl6. Something you might like to add to our Seq class. This is approximately the session I had at the REPL:

> my $dna = ‘GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA’;
> my $rna = $dna ~~ tr/GCAT/GCAU/;
> $rna.WHAT.say
(StrDistance)

WAT?! So I was originally expecting to just have the translated string. As an aside you should use the method $dna.trans(‘T’=>’U’) if you do want that. Instead I got back this odd object StrDistance… What could it be I thought, so I did some basic inspection to take a look:

> $rna.perl
StrDistance.new(before => "GATG…TAAAA", after => "GAUG…UAAAA")
> $rna.^methods
Bool Numeric Int <anon> <anon>
> $rna.^attributes
Str $!before Str $!after Int $!distance

So in the .perl we see that .before and .after are the two strings the DNA and the newly translated RNA. But there is also this other guff like .Numeric and .Int. Turns out its something awesome!

> say +$rna;
13

Lucky number 13… or the Hamming distance between the two strings! Anyone from the world of Bioinformatics will instantly see this as quite a handy feature. Perl6 wasn’t created or invented for Bioinformatics, but it has so many accidentally great features laying around relatively uncommented on.

Sorry for the blowhard post but I thought it might be nice to put these ideas all in one place. Have a happy Christmas and New Year!

Day 5 – Act with great Responsibility

The past year in Rakudo-land has been very exciting. It has brought us a new, full-fledged backend in the form of MoarVM. And with that came extended support for asynchronous execution of code. This means you can now more easily try out all these asynchronous features, without having to suffer from the long startup time of the JVM backend.

The “Supply” is still the main mechanism to use the Reactive Programming paradigm. Here is a simple (silly) example of using a Supply:

1: my $s = Supply.new;
2: $s.act: { print " Fizz" if $_ %% 3 }
3: $s.act: { print " Buzz" if $_ %% 5 }
4: $s.act: { print " $_" unless $_ %% 3 || $_ %% 5 }
5: $s.emit($_) for 1..20;
======================
1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 Fizz Buzz 16 17 Fizz 19 Buzz

Let’s take this example apart. Line 1 just creates a new Supply. By itself, it doesn’t do anything at all, except for creating the object, of course. Lines 2, 3 and 4 set up code (closures) to be executed every time a new value arrives in the Supply $s.

Line 5 sends 20 consecutive values to the Supply. This in turn causes the closures to be executed (in turn for each value emitted). The result is a nice FizzBuzz.

But but…

Careful watchers of this code may wonder about two things. First of all: what’s emit ? Is that something new? No, it is not (really). It used to be called more. At the Austrian Perl Workshop last October it was decided to rename this method to emit (as nobody was really happy with any more). So what happens if you have code that uses more? Well, it will still work, but you will get a deprecation warning after your process finishes:

Saw 1 call to deprecated code during execution.
===============================================
Method more (from Supply) called at:
t/spec/integration/advent2014-day05.t, line 13
Deprecated since v2014.10, will be removed with release v2015.10!
Please use emit instead.
-----------------------------------------------

Note that even though the same method was called 20 times, it is only mentioned once. And that’s because all the calls happened at the same location in the code. If you have seen deprecation messages before, you might notice that it now also mentions when it was deprecated, and when the feature will most likely be removed.

The second thing that a careful watcher might wonder about is: why the use of act? Is that a also a rename (from tap)? No, it is not. The act method was added last April. It ensures that your closure will be the only one running on that Supply, even if the emits are coming from different threads. That is a much safer default than tap, where multiple threads can be executing the closure you provided at the same time.

Sharing everything comes with great power and great responsibility

So why is this important? This is important because all variables visible in a specific scope, are readable and changeable by all threads simultaneously. Reading a variable from different threads at the same time, is not an issue in Perl 6 (apart from some minor known issues to be resolved after the Great List Refactor). Changing a variable from multiple threads simultaneously, is an issue that may cause data loss, memory corruption and segfaults. To give an example:

1: my $times = 100000;
2: my $a = 0;
3: $*SCHEDULER.cue: { $a++ } for ^$times; # this may segfault because of unguarded changes
4: sleep 5; # poor man’s way to wait for all code to finish
5: say “Missed {$times - $a} updates”;
===================================
Missed 9 updates

Line 1 sets up the number of iterations we’re going to do. Line 2 initializes the variable that’s going to be incremented. Although ++ is magic, we initialize the variable explicitly, so that each increment will follow the same internal code path. Line 3 cues the increment to be executed that many times (in a very low level way using the underlying scheduler). Line 4 waits for the execution to be finished. And line 5 shows the result (if we didn’t segfault, which we might).

So please use act rather than tap, unless you are sure that closures will not be updating any shared variables simultaneously. Which could be hard to tell if you’re using external modules inside the closure.

FizzBuzzing some more

Coming back to the first example: should we have used act there, or could we have used tap? We could have used tap there, because print is threadsafe. It becomes a different story if we would have used a (shared) array instead of print:

1: my @seen;
2: my $s = Supply.new;
3: $s.act: { @seen.push: "Fizz" if $_ %% 3 }
4: $s.act: { @seen.push: "Buzz" if $_ %% 5 }
5: $s.act: { @seen.push: $_ unless $_%%3 || $_%%5 }
6: await do for 1..20 {
7:     start { rand.sleep; $s.emit($_) }
8: }
9: say @seen;
==========
11 14 4 Fizz 17 19 Fizz Buzz 16 13 Buzz Fizz Buzz 7 Fizz 8 2 Buzz 1 Fizz Fizz

Hmmm… that has the right number of Fizz’s and Buzz’s, but clearly the order is wrong. But because we have used act rather than tap, we are at least sure that no push was executed simultaneously on the @seen array. So we didn’t lose any values, nor did we segfault.

You might ask: what are that lines 6-8 about? Well, that’s making sure the emit’s are done in a random order, spread out in time (because the 0 to 1 second random sleep from rand.sleep, and the start making it run in a separate thread).

So, how could you make this work in a parallel way? Well generally, if you have some kind of identifier for each value emitted, then you can use that to store the result in a shared data-structure. In this case, $_ is exactly that. So with a slight modification:

1: my @seen;
2: my $s = Supply.new;
3: $s.act: { @seen[$_] = "Fizz" if $_ %% 3 }
4: $s.act: { @seen[$_] ~= "Buzz" if $_ %% 5 }
5: $s.act: { @seen[$_] //= $_ }
6: await do for 1..20 {start{rand.sleep;$s.emit($_)}}
7: say @seen[1..20];
=================
1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz 16 17 Fizz 19 Buzz

By assigning directly to an element in the array (lines 3, 4, 5), we automatically put the information in the right place. And because we are (implicitely) guaranteed that act’s on the same item are always executed in the order they were created, we can simplify the non-FizzBuzz case in line 5.

Conclusion

When using Supplies, please use act on the Supply if you’re not sure whether the closure is going to change shared variables. Only use tap if you’re really 100% sure.

Further reading

Last year saw 2 async execution related posts, the latter showing uses of tap. Please read up on them if the above was not telling you all you needed to know:

Tests

The above code can also be found in roast at t/spec/integration/advent2014-day05.t

Day 24 — Subs are Always Better in multi-ples

Hey look, it’s Christmas Eve! (Also, the palindrome of 42!) And today, we’re going to learn about multi subs, which are essentially synonyms (like any natural language would have). Let’s get started!

An Informative Introduction

multi subs are simply subroutines (or anything related to it, such as methods, macros, etc.) that start with the multi keyword and are then allowed to have the same name as another sub before, provided that sub starts with a multi (or proto — that’s later) keyword. What has to be different between these subs is their signature, or list of formal parameters.

Sound complicated? It isn’t, just take a look at the example below:

multi sub steve(Str $name) {
    return "Hello, $name";
}

multi sub steve(Int $number) {
    return "You are person number $number to use this sub!";
}

Every sub was started with the multi keyword, and has the same name of “steve”, but its parameters are different. That’s how Perl 6 knows which steve to use. If I were to later type steve("John"), then the first steve gets called. If, however, I were to type steve(35), then I’d get the second steve sub.

Equal Footing with built-ins

When you write a multi sub, and it happens to have the same name as some other built-in, your sub is on equal footing with the compiler’s. There’s no preferring Perl 6’s multi sub over yours, so if you write a multi sub with the same name as a built-in and with the same signature, say

multi sub slurp($filename) {
    say "Yum! $filename was tasty. Got another one?";
}

And then try calling it with something like slurp("/etc/passwd"), I get this:

Ambiguous dispatch to multi 'slurp'. Ambiguous candidates had signatures:
:(Any $filename)
:(Any $filename)

Why? Because Perl 6 found two equally valid choices for slurp("/etc/passwd"), my sub and its own, and was unable to decide. That’s the easiest way I know to demonstrate equal footing.

A Fun Conclusion

Now, since it’s Christmas, let’s try writing another open sub, but unlike the built-in open sub, which opens files, this one open presents! Here’s our Present class for this example:

class Present {
    has $.item;
    has $.iswrapped = True;

    method look() {
        if $.iswrapped {
            say "It's wrapped.";
        }
        else {
            say $.item;
        }
    }

    method unwrap() {
        $!iswrapped = False;
    }
}

Now, our open multi sub looks like this:

multi sub open(Present $present) {
    $present.unwrap;
    say "You unwrap the present and find...!";
}

The signature is vastly different from Perl 6’s own open sub, which is a good thing. And here’s the rest of the code, which makes a Present and uses our new multi sub:

my $gift = Present.new(:item("sock"));
$gift.look;
open($gift);
$gift.look;

But wait!

Running this gets us an error in the latest pull of Rakudo:

$ ./perl6 present.p6
It's wrapped.
This type cannot unbox to a native string
⋮

This means that Perl 6’s original open sub is being used, so perhaps it’s being interpreted as an only sub (only subs are the default — only sub unique() {...} and sub unique() {...} mean the same thing). No matter, let’s try adding a proto sub line before our multi sub:

proto sub open(|$) {*}

A proto sub allows you to specify the commonalities between multi subs of the same name. In this case, |$ means “every possible argument”, and {*} means “any kind of code”. It also turns any sub with that name into a multi sub (unless explicitly defined as something other than multi). This is useful if you’re, say, importing a &color sub from a module that isn’t defined as multi (or explicitly as only) and you want to have your own &color sub as well.

After adding this before our multi sub open, we get this result:

$ ./perl6 present.p6
It's wrapped.
You unwrap the present and find...!
sock

It works! Well, that’s it for multi subs. For all the nitty-gritty details, see the most current S06. Enjoy your multi subs and your Christmas Eve!

Day 16: Time in Perl6

It’s the 0x10th day of Christmas, and it’s time for you to learn of time. The synopsis S32::Temporal has been heavily revised in the past year, and today’s post details some of the basics of time as it is implemented in Perl 6.

time and now

The two terms that give the current time (at least what your system thinks is the current time) are time; and now. Here’s a quick example:
> say time; say now;
1292460064
Instant:2010-12-16T00:41:4.873248Z

The first (obvious) difference is that time returns POSIX time, as an integer. now returns an object known as an Instant. Use now if  you want fractions of a second and recognition of leap seconds. time won’t give you fractions of a second or leap seconds, because it returns POSIX time. Which one you use all depends on what you need.

DateTime and friend

Most of the time, you will want to store dates other than now. For this, the DateTime object is what you need. If you want to store the current time, you can use:

my $moment = DateTime.new(now); # or DateTime.new(time)

Otherwise, there are two ways of creating a DateTime object:

my $dw = DateTime.new(:year(1963), :month(11), :day(23), :hour(17), :minute(15));

This is in UTC, if you want to enter it in in another timezone, use the :timezone adverb. Here, only :year is required, the rest defaults to midnight on January 1 of the year.

This way is also pretty tedious. You could instead create a DateTime object by inputting an ISO 8601 timestamp, as a string.

my $dw = DateTime.new("1963-11-23T17:15:00Z");

The Z denotes UTC. To change that, replace Z with +hhmm or -hhmm, where ‘hh’ is the number of hours offset and ‘mm’ the number of minutes.

There is also a Date object, which is created in a similar way, but without hours, minutes, or seconds. For example:

my $jfk = Date.new("1963-11-22"); # you can also use :year and so on

The introduction of the Date object is actually a lesson from CPAN’s DateTime module: sometimes you want to treat days without having to worry about things like time zones and leap seconds. It’s inherently easier to handle pure dates. For example Date has built-in .succ and .pred methods, so you can increment and decrement them in your code.

$jfk++; # Day after JFK

Finally…

That’s about it for Time in P6. To see all the gritty details go to the Temporal specification or ask about it in the community!

Day 13 – The Perl6 Community

Today is the halfway point of the Perl 6 Advent Calendar (and indeed any advent calendar). For 12 days you have learned various nuggets of information, and most likely experimented with those various bits of knowledge. You feel great, wonderful, stuff works!

And then it didn’t work.

You may have found your way to some documentation (available at http://perl6.org/documentation by the way), and potentially even some talk of Real People™ out there ready to help you. Whatever you’ve done, today’s post is given to you to display the two main methods of helping in the Perl 6 Universe.

The Mailing Lists

The first of two major communication channels is the mailing lists. If you head on over to http://perl6.org/community, you can see on the right hand side a list of the several lists you can subscribe to/peruse the archives of. If you have question s about the language, such as “How do I add up all the numers in an array without a loop?”, go for perl6-users. The mailing list is great for those who can’t use IRC or want more permanent discussions. Have an idea for the language? Try perl6-language, where people discuss the language itself. Things like “I think there should be a beat-object-with-catfish operator in the language” go here.

Now, the mailing lists aren’t the only option. Your question could be simple-when-you-know-it, or maybe that feature you really really want is already there! For these reasons, you may want to run your comments by the IRC community.

The #perl6 IRC channel

If you want to use something more instantaneous instead, or you hate the idea of signing up to a mailing list for one question, head over to the channel #perl6 on irc.freenode.net. Here, the community is reputed as being one of the nicest on the Internet (so if you’ve been burned by IRC channels before and vowed “Never again shall I use IRC!”, try #perl6.)

#perl6 is like an eccentric incarnation of The Doctor, multifaceted in its nature. We have punfests, technotalk, lolz, above-your-head language design, and more (thanks sorear for reminding me of that)! So come and ask away!

Final Words

These are the two ways of communication in the Perl 6 universe. There are also blogs, screencasts, and more. Go to http://perl6.org to check it all out!

Day 3 – File operations

Directories

Instead of opendir and friends, in Perl 6 there is a single dir subroutine, returning a list of the files in a specified directory, defaulting to the current directory. A piece of code speaks a thousand words (some result lines are line-wrapped for better readability):

    # in the Rakudo source directory
    > dir
    build parrot_install Makefile VERSION parrot docs Configure.pl 
    README dynext t src tools CREDITS LICENSE Test.pm
    > dir 't'
    00-parrot 02-embed spec harness 01-sanity pmc spectest.data

dir has also an optional named parameter test, used to grep the results


    > dir 'src/core', test => any(/^C/, /^P/)
    Parcel.pm Cool.pm Parameter.pm Code.pm Complex.pm
    CallFrame.pm Positional.pm Capture.pm Pair.pm Cool-num.pm Callable.pm Cool-str.pm

Directories are created with mkdir, as in mkdir('foo')

Files

The easiest way to read a file in Perl 6 is using slurp. slurp returns the contents of a file, as a String,


    > slurp 'VERSION'
    2010.11

The good, old way of using filehandles is of course still available

    > my $fh = open 'CREDITS'
    IO()<0x1105a068>
    > $fh.getc # reads a single character
    =
    > $fh.get # reads a single line
    pod
    > $fh.close; $fh = open 'new', :w # open for writing
    IO()<0x10f3e704>
    > $fh.print('foo')
    Bool::True
    > $fh.say('bar')
    Bool::True
    > $fh.close; say slurp('new')
    foobar

File tests

Testing the existence and types of files is done with smartmatching (~~). Again, the code:

    > 'LICENSE'.IO ~~ :e # does the file exist?
    Bool::True
    > 'LICENSE'.IO ~~ :d # is it a directory?
    Bool::False
    > 'LICENSE'.IO ~~ :f # a file then?
    Bool::True

Easy peasy.

File::Find

When the standard features are not enough, modules come in handy. File::Find (available in the File::Tools package) traverses the directory tree looking for the files you need, and generates a lazy lists of the found ones. File::Find comes shipped with Rakudo Star, and can be easily installed with neutro if you have just a bare Rakudo.

Example usage? Sure. find(:dir<t/dir1>, :type<file>, :name(/foo/)) will generate a lazy list of files (and files only) in a directory named t/dir1 and with a name matching the regex /foo/. Notice how the elements of a list are not just plain strings: they’re objects which strinigify to the full path, but also provide accessors for the directory they’re in (dir) and the filename itself (name). For more info please refer to the documentation.

Useful idioms

Creating a new file
    open('new', :w).close
"Anonymous" filehandle
    given open('foo', :w) {
        .say('Hello, world!');
        .close
    }

Day 10: A Regex Story

On this tenth day of advent, we have the gift of a story …

Once upon a time in a land closer than you might think, an apprentice Perl 6 programmer named Tim was working on a simple parsing problem. His boss (let’s just call him Mr. C) had asked him to parse log files containing inventory information to make sure that there were only valid lines within the file. Each valid line within the file looked like this:

    <part number> <quantity> <color> <description>

So the Perl 6 apprentice, having some familiarity with regular expressions wrote a nice little regex that could be used to identify valid lines. The code that validated each line looked like this:

    next unless $line ~~ / ^^ \d+ \s+ \d+ \s+ \S+ \s+ \N* $$ /

The ~~ operator causes the regex on the right hand side to be matched against the scalar on the left hand side. In the regex itself, ^^ matches the beginning of a line, \d+ matches one or more digits (as the part number and quantity were so composed), \S+ matches one or more non- whitespace characters, \N* matches zero or more non-newline characters, \s+ matches whitespace in between each of these and $$ matches the end of a line. This being Perl 6, these individual components of the regex could be spread out a bit with spaces in between so that it could be more readable. All was good.

But then Mr. C decided that it would be nicer if the individual pieces of information could also be extracted from each in addition to validating it. Tim thought, “No problem, I’ll just use capturing parentheses”. And that’s just what he did:

    next unless $line ~~ / ^^ (\d+) \s+ (\d+) \s+ (\S+) \s+ (\N*) $$ /

After a successful pattern match, each parenthesized portion is available either as part of the match object itself ($/) via $/[0], $/[1], $/[2], or $/[3]. Or it could be accessed via the special variables $0, $1, $2, or $3. Tim was happy. Mr. C was happy.

But then it was discovered that some of the lines didn’t separate the color from the description and that these lines should be considered valid too. Lines where the color was integrated into the description were written a special way. They were always of the form:

    <part number> <quantity> <description> (<color>)

Where description, as before, could be any number of characters including spaces. “Blah!” thought Tim, “now this simple parser suddenly seems more complicated.” Luckily, Tim knew of a place to ask for help. He quickly logged on to irc.freenode.org, joined the #perl6 channel, and asked for help. Someone suggested that he should name the individual parts of his regex to make things easier and then use an alternation to match one or the other alternatives for the last part of the regex.

First, Tim tried naming the parts of his regex. Looking at the synopsis for Perl 6 regex, Tim found he could assign into the match object, so that’s what he did:

    next unless $line ~~ / ^^ $<product>=(\d+) \s+ $<quantity>=(\d+) \s+ $<color>=(\S+) \s+ $<description>=(\N*) $$ /

Now, after a successful match, the individual pieces are available via the match object or via special variables $<product>, $<quantity>, $<color>, and $<description>. This was turning out easier than expected and Tim was feeling quite confident. Next he added the alternation to distinguish between the two different valid lines:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' $<color>=(\S+) ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

In order to isolate the alternation from the rest of the regex, Tim surrounded it with grouping brackets ([ and ]). These group a portion of a regex much like parentheses only without capturing into $0 and friends. Since he needed to match literal parentheses, Tim took advantage of another useful Perl 6 regex feature: quoted strings are matched literally. And because of the assignments within the regex, $<color> and $<description> always contain the appropriate part of the string.

Tim was elated! He showed his code to Mr. C and Mr. C was elated too! “Well done Tim!” said Mr. C.

Everybody was happy. Tim beamed with pride.

However, after the initial glow of success faded, Tim started looking at his work with a more critical eye. For some of the lines where the color was at the end of the description, it was described as “( color)” or “( color )” or “( color )”. His current regex worked, but it would include the color as part of the description and wouldn’t set $<color> at all. That hardly seemed appropriate. Tim initially fixed this by adding more \s*:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' \s* $<color>=(\S+) \s* ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

This worked well, but the regex was starting to look a little cluttered. Again, Tim turned to #perl6 for help.

This time someone named PerlJam piped up, “Why don’t you put your regex in a grammar? That’s what you’re effectively doing by assigning each piece to a variable within the match object.” Wha??? Tim had no idea what PerlJam was talking about. After a brief exchange, Tim thought he understood and knew where to look if he needed more information. After thanking PerlJam, he went back to coding. This time the regex virtually disappeared as it turned into a grammar. Here’s what that grammar and matching code looked like:

grammar Inventory {
    regex product { \d+ }
    regex quantity { \d+ }
    regex color { \S+ }
    regex description { \N* }
    regex TOP { ^^ <product> \s+ <quantity>  \s+
                [
                | <description> \s+ '(' \s* <color> \s*  ')'
                | <color> \s+ <description>
                ]
                $$
    }
}

# ... and then later where the match happens
next unless Inventory.parse($line);

“Well,” thought Tim, “it is certainly more organized.”

Each of his variables in the previous incarnation of the regex simply became named regex within the grammar. Within Perl 6 regex, named regex are matched by simply enclosing the name within angle brackets (< and >). The specially named regex TOP is used when Grammar.parse is called with the scalar to match against. And the behavior is exactly the same as before because when a named regex matches as part of another regex, the text that was matched is saved in the match object and referenced by that name.

And though there was still room for improvement, both Tim and Mr. C were very happy with the result.

The End

* At the time of posting, Rakudo cannot correctly use this grammar to parse lines in the format

    <part number> <quantity> <description> (<color>)