Day 15 – Bioinformatics and the joy of Perl 6

On the 15th day of Christmas Perl6 said let there be life-sciences, and so there was.

A lot of coverage and testament is made to Perl’s role in developing dynamic content for the early web. As well as its slow decline. However, there is an equally important story that goes alongside the tale of Perl as the language of CGI. That is the story of Perl as the scientific glue that saved the human genome project, there is even a published journal article in Perl’s honour.

It’s not an accident that Perl did well and continues to do well in the field of Bioinformatics. Few languages at the time provided first class semantics that matched the problem domain such as: regular expression literals, good string handling, trivial data structure creation, sensible automatic type conversion and XS for native code wrapping when speed was important. That Perl replaced a lot of sed+awk use cases for one liners is also a welcome feature for a busy bioinformatician. Perl persists today as a de facto scripting language due to the large community effort known as BioPerl. For those not in the know the API of BioPerl inspired a whole host of work in most other languages such as Biopython, BioJava, BioRuby etc. To date none are quite as feature complete as BioPerl though each has its unique additional features. Biopython is getting there from a lot of work done through GSoC action in recent years. I also use ete2 from Python on a regular basis and recommend people check that out for phylogenetics work. However, this post is about the future and I think Perl 6 has a lot to offer everyone working in bioinformatics.

XS the Perl5 system of integrating native code into a Perl module has done fantastically well in providing scientists with the ability to integrate high performance C code with the simplicity of Perl’s syntax. But XS in the Perl6 world has been replaced with NativeCall. This library makes it incredibly simple to integrate C code with a Perl6 class or subroutine. I’m not going to cover this here though, the repo has plenty of docs and examples and two previous advent articles demonstrate use cases.

Instead everything on show here are core features available from writing ‘perl6’ at a command prompt. You don’t need any external dependencies so any system with vanilla Rakudo Perl6 can do everything you’re about to see.

The Grammar of life

Hopefully most people have at least heard about DNA, RNA and perhaps even proteins. All of these are what are referred to in the biz as biopolymers, or translated into programming parlance “strings of life”. Essentially each molecule is a big chain of smaller building blocks you can denote in a textual form with a letter. In DNA we have the alphabet GCTA , in RNA we never see a T instead we see a U in its place. Proteins have their own alphabet that is unrelated to DNA and RNA between 21 and 23 characters long, the last two are very quirky and relatively rare.

So in this section I’m going to outline the hip way to parse sequence files in Perl6. This is how sequencing data usually comes into a bioinformatician. A giant text file! I’ll be using our own class to hold the sequence and a well defined grammar for the well loved (not) FASTA format. So to start here is the Grammar as if it fell from heaven as a shooting star this wintery night:

grammar FASTA::Grammar {
    token TOP { <record>+ }

    token record { ">"<id><comment>?"\n"<sequence> }

    token id { <-[\ \n]>+ }

    token comment { " "<-[\n]>+ }

    proto rule sequence {*}
          rule sequence:sym<dna> { <[ACGTRYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<rna> { <[ACGURYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<aa> { <[A..Z\*\-\n]>+ }
}

Cool we have a super formal grammar defined for FASTA and it even looks like we are doing something special for different types of sequence! I guess I had better explain a little, keep in mind that Grammar is just a huge extension to regular expressions which you are hopefully familiar with.

First up we have token and rule. The difference is that token is like a hungry hippo, it consumes the input sequence and is quite greedy. So if a token matches at all it consumes the string and then lets parsing continue. If there is a choice of tokens the longest wins, which can be thought of as the winner in a game of hungry hippos!
A rule is a bit more tricksome, it has backtracking and can regurgitate and give up. Specifically we have a set of rules called sequence with the more abstract one defined with the keyword proto. The reason we have three rules is for each type of biopolymer, they are ordered by the most to least specific. Unfortunately DNA and RNA can both look like protein. So our Grammar is limited in its ability to detect the difference, but so would a human that’s how rubbish this major data format is!
With rule unlike token if the start of a sequence matches DNA but the rest of it looks like protein we back out of matching DNA because the grammar is smart enough to realise it can get a longer match from the other rule, but has to spew up the letters already consumed first. If we had used token we might have had a really short DNA token then a protein one etc. This might all sound a bit complicated but I’ve made a trace using Grammar::Debugger of what happens matching a test.fa file in this Gist.
If you do look at the trace you will see the final important fact about Grammars. We need a TOP token or rule, this is where we start the thought process and parsing of data. So a FASTA file is made of more than one record, that’s our TOP. A record is made of an id, a comment and a sequence and so the logic flows. The Perl6 regular expression syntax is way out of scope but Masak has a nice intro.

Now lets get this Grammar rolling along doing something. If we want to use a sequence in our code we need a basic type to represent it. Below is probably the simplest thing you would want:

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";
}

It’s not the most inspired or interesting class in the world but it will do!

Now how do we hook up our Seq class to our FASTA Grammar to produce some nice objects from our FASTA file? The answer is we need to take action! But not the political kind, just create an Actions class for the Grammar to use when parsing. By default we could call .parse on our grammar above and we would get all the text chunked up into nicely named labels as a giant tree of Match variables. Some of the time that’s what you want. Action classes have methods that match the names of tokens and rules in a grammar. Each method is called at the corresponding point in the Grammar’s parse tree and given the chance to make something more interesting, replacing all the crummy Match variables in that part of the output. So what do we want:

class FASTA::Actions {
    method TOP($/) {
        make $<record>>>.ast;
    }
    method record($/) {
        make Seq.new(
            id => ~$<id>,
            comment => ($<comment>) ?? (~$<comment>).trim !! '',
            sequence => $<sequence>.subst("\n","", :g)
        );
    }
}

So the logic above is whenever you parse out the whole of a record we want to make a new Seq object and shove it in the resulting match tree for later. You can see that we do some trimming and cleaning up of white space in the action before creating a clean Seq object. Finally the TOP action basically just says that the final product of .parse should be all the record matches, which we just transformed into Seq objects, so it’s literally a list of Seq! Nice! Lets give it a whirl on that test file I mentioned:

Input

>hello
GCTATATAAGC
>world prot
TATAKEKEKELKL

Code

my $grammar = FASTA::Grammar.new;
for $grammar.parsefile("test.fa", actions => FASTA::Actions).ast -> $seq {
    say $seq.id;
}

Output

hello
world

Ultimately we didn’t need the added complexity of matching a sequence as DNA/RNA/protein but it was kind of cool, and at a later date we might want to do something different and have fancier Seq objects. I’ll leave exploration of this as a task to the interested reader :P

A Bag full of sequence

So I’m concerned that our Seq class is kind of crap still. It might as well just be a hash of values at this point. Sure we get a type which is nice, but it just doesn’t have anything cool about it. I think a nice addition is if we made our class work a little better with the rest of the Perl6 family of operations, letting more of the languages magic get at our data. The results of this simple change might actually impress you!

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";

    multi method Numeric (Seq:D: --> Numeric) {
        $.sequence.codes;
    }

    multi method Bag (Seq:D: --> Bag) {
        bag $.sequence.comb;
    }

    multi method Str (Seq:D: --> Str) {
        $.sequence;
    }

    multi method gist (Seq:D: --> Str) {
        ">$.id $.comment\n$.sequence";
    }
}

Above I’ve introduced some snazzy new methods specific to our class. Anyone familiar with toString from Java should get what both the Str and gist methods are upto. They are well observed methods in Perl6 that deal with interpreting an object as a string. But it’s probably not clear at all why you would want two of them? Well simply put the Str method is there for the pure string interpretation of our object. If I use operators that expect Str this would be called to automagic our Seq into a string. The gist is kind of a neat one, if we use the higher level say function this calls the .gist on our object and prints that since it’s already assumed you want something a bit more human readable. Even more neat is if you are using the command line REPL the .gist is immediately called on an object that resulted from an expression. So when we create a new Seq in the Perl6 REPL or just give the variable name on its own we get FASTA ready to copy and paste! Take a look at the REPL dump below and see if you can work out where we are implicitly using $seq as a Str rather than Seq:

> my $seq = Seq.new(id=>'hello',comment=>'world',sequence=>'GCTACCTATAAAGC');
>hello world
GCTACCTATAAAGC
> say $seq
>hello world
GCTACCTATAAAGC
> "junk" ~ $seq
junkGCTACCTATAAAGC
> my $big-seq = Seq.new(id=>'whole',comment=>'new world',sequence=> $seq x 3);
>whole new world
GCTACCTATAAAGCGCTACCTATAAAGCGCTACCTATAAAGC
> print $seq
GCTACCTATAAAGC>

But wait! There’s more. For the low low price of just two more methods we get numeric wrangling and statistical modelling! The Numeric method does what it says on the tin, and we defined this as the length of the sequence which feels reasonable as a numerical representation.

> $seq + $big-seq
56
> ($seq + $big-seq) / $seq
4

Bag creates a Bag type object from our sequence. Basically the Bag we create is an immutable counting dictionary of all the letters in the sequence. So we now have the first step in some frequency analysis, letters and a count of how often they occurred. But what is even more awesome about Bags in Perl6 are the pick and roll methods. These allow you to sample the keys of the Bag based on their frequency! So if you wanted to create some dummy sequences to do statistical tests with it’s easy as pie! Roll keeps the letters in the bag as you sample whereas pick removes them and then alters the distribution of the next pick. Below I give an example where we .pick strings that are as long the input sequence but will be randomly different but with the same distribution of characters. This sort of dummy sequence is useful for calculating the statistical significance of our sequence.

> $seq.Bag
bag(G(2), C(4), T(3), A(5))
> $seq.Bag.pick($seq).join
CAGATCGTAAATCC
> $seq.Bag.pick($seq).join
GAATCCCGTACATA

Look at that we even use $seq as the length to pick to! Perhaps not the most obvious or readable thing in the world, but for playing on the REPL it’s a nice feature and not utterly unintuitive to people used to context in Perl.

A Range of possibilities

A common task in bioinformatics is dealing with what is commonly referred to as “annotation” type data. This can be anything from specifying the location of a gene in a whole genome, to specifying a patterned site within a protein sequence. In BioPerl there is a specific and special Bio::Range class for dealing with the sort of data that represents a segment on a number line with a start and an end position. In Perl6 this is built right into the language with most of the features any bioinformatician would be happy with. Plus they can be wrangled to Sets and use those operators. So creating a Range couldn’t be easier and more obvious, you’ve all done it before!

> my $range = 1..10;
> $range.WHAT.say;
(Range)

This is the exact same syntax and object used if you wanted to create a for loop over a given range both in Perl5 and Perl6.

To show what we can do with this basic type in Perl6 I’ve taken some data about the gene p53 which is a well known one for the development of many cancers. It’s all real data and you can see a coloured block diagram of the protien annotation data here. You think of each block as being single Range object you get some idea of what I’m harping on about. So lets start out with some DNA and some protein features in code but remember we could just makes these from a file just as easily:

#Genomic chromosome coordinates of the p53 gene in human
my $p53_gene = 7_565_097 .. 7_590_863;

#A single nucleotide polymorphism AKA a point mutation
my $snp = 7_575_996;

#Some annotated protein domains
my %domains = p53-TF => 97..287, p53-Tetramer => 319..357;

#Predicted interaction site
my $binding_site = 322..355;

#A known protein Phosphorylation site
my $phospho_site = 99;

So good so far, we have some one off integer values some ranges and a cool hash of ranges! The bonus Perl6 features come from operators that accept a Range as their operands. For example most importantly set operations work as you might expect:

#Work out if the SNP is within the gene
say "MUTATION!!!" if $p53_gene (cont) $snp;

#Find out which domains contain the phosphorylation site
my %phosphorylated = %domains.grep({ .value (cont) $phospho_site })

#Find out which domains are in the binding site and what percentage is binding
my %binding = %domains.map({
                            $_.key => 100 / $_.value.elems * ($_.value (&) $binding_site)
                      })
                      .grep(*.value > 0.0);

Hopefully a lot of the above is almost self explanatory once I introduce the set operators for contains (cont) or ∋ and intersection (&) or ∩. I chose the ASCII variants, mostly because I don’t have a ∩ key on my keyboard and I wouldn’t want my code to be trashed if someone used a non Unicode editor! The contains returns a Bool result of if the right operand contains the left and the intersection operator returns the set of numbers that intersect both of the Ranges. The final hash wrangling is quite nice, it keeps the keys of the %domains hash but replaces the values with the percentage coverage with our binding site and only gives a result if there was any coverage between the two Ranges. Hopefully almost any bioinformatician can see how nice these sort of primitive types and operations are to have in the core of a language.

A Curio

If any of the above has you at least a little curious if not excited to see what Perl6 can do within Bioinformatics then you can check out more at either Chris Fields official BioPerl6 repo or my own playpen. I’ll leave you with one final code nugget to chew on. I found this by accident just writing a method to wrangle DNA sequence to RNA sequence in Perl6. Something you might like to add to our Seq class. This is approximately the session I had at the REPL:

> my $dna = ‘GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA’;
> my $rna = $dna ~~ tr/GCAT/GCAU/;
> $rna.WHAT.say
(StrDistance)

WAT?! So I was originally expecting to just have the translated string. As an aside you should use the method $dna.trans(‘T’=>’U’) if you do want that. Instead I got back this odd object StrDistance… What could it be I thought, so I did some basic inspection to take a look:

> $rna.perl
StrDistance.new(before => "GATG…TAAAA", after => "GAUG…UAAAA")
> $rna.^methods
Bool Numeric Int <anon> <anon>
> $rna.^attributes
Str $!before Str $!after Int $!distance

So in the .perl we see that .before and .after are the two strings the DNA and the newly translated RNA. But there is also this other guff like .Numeric and .Int. Turns out its something awesome!

> say +$rna;
13

Lucky number 13… or the Hamming distance between the two strings! Anyone from the world of Bioinformatics will instantly see this as quite a handy feature. Perl6 wasn’t created or invented for Bioinformatics, but it has so many accidentally great features laying around relatively uncommented on.

Sorry for the blowhard post but I thought it might be nice to put these ideas all in one place. Have a happy Christmas and New Year!

4 thoughts on “Day 15 – Bioinformatics and the joy of Perl 6

  1. Thanks to Andrew Dalke who reminded me my often flippant yet jovial personality doesn’t often do too well with a cold read on the internet. I have updated the post to sound less dismissive of other BioX projects. That certainly wasn’t the message or attitude I wished people to take away from the post. I’m sorry if anyone felt their efforts were belittled from a misplaced word or two. Andrew also points out the OBF (http://www.open-bio.org/wiki/Main_Page) is in place to organise all of the work I alluded to via GSoC.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s