Day 16 – 🎶 Deck The Halls With Perf Improvements 🎶

In the UK our lack of Thanksgiving leaves us with Christmas as a period of giving thanks and reflection in the run-up to the New Year. To that end I wanted to put together several bits and pieces I’ve been sitting on for a while about the state of Perl 6 performance, which highlight just how much effort is going into this. I’m not sure the wider programming community appreciates the pace and volume of the work that’s happening.

I’m not a core dev, but I have been a humble user of Perl 6 since just before the 2010 introduction of Rakudo*. Frequently the effort that’s already gone into Rakudo is overshadowed by the perceived effort yet to come. This is especially true for people taking a fresh look at Rakudo Perl 6, who might imagine that a fly-by glance today is what next Christmas will look like too. But Perl 6 has historically proven that things always improve by next Christmas, for any Christmas you choose.

All the way back in Christmas 2014 I wrote an advent post about why I thought Perl 6 was great for doing Bioinformatics work. What I left out of that post was why the Rakudo implementation of Perl 6 was not at all ready for doing any serious Bioinformatics. The performance was really not there at all! My first attempts in Perl 6 (when the Parrot VM was in full force) left me with simple operations taking tens of minutes to execute where I’d expect millisecond-level perf. This is unfortunately anecdotal, because I didn’t keep good track of timings then. But it was certainly not a great starting place.

However, fast forwarding to 2014 and MoarVM, I felt comfortable writing the advent post, because I was cognisant of how much things had improved in my four years of being a user, and because back then all development effort was going into finishing the language definition and implementing it correctly. I am, however, a user who has been waiting for perf to get there, and I think that time has mostly arrived. For this I have to give thanks for the tremendous daily hard work put in by all the core devs. It’s been incredible and motivating to watch it unfold. For me this Christmas is the goal Christmas; it’s arrived. 👏🏻🎊

I have been running and timing the tests for my BioInfo module, which does some basic manipulations of biological sequence data, for many years now. It does this in a really terrible way: lots of mistakes in allocating and dropping hashes in tight loops, etc. But by now I’ve left this code alone for the better part of half a decade, quietly benchmarking in private, and occasionally applauding efforts on the IRC channel when a quantum leap in perf was visible. Sub-10s was a big one! It happened suddenly, dropping from 30-40s. That jump came after I hinted on IRC at a spot that profiling showed was especially slow in my code!

This is a bit of a long-term view; if I zoom in on just this last year, you can see that performance is still improving by integer factors, if no longer by large absolute amounts of time.

Keep in mind that all of these profiles are not from released versions of the Rakudo compiler but from HEAD on that day. So occasionally there is the odd performance regression, as you can see above, which usually isn’t left in place for a release.

So what’s going on? How are things getting better? There are several reasons. Many of the algorithmic choices and core built-in functions in Perl 6 have been progressively and aggressively optimised at a source level (more on that later). But the MoarVM virtual machine backing Rakudo has also increased in its ability to optimise, to JIT down to native code, and to inline specialised versions of code. This is in part thanks to the --profile option, available with Rakudo Perl 6 since 2014, which provides all of this info.

In the above plot of how MoarVM has treated the code frames of my compiled Perl 6 tests, it should hopefully be clear that since this summer considerably more frames are being JIT compiled, fewer are interpreted, and almost all of the specialised frames (orange) end up as native JIT (green). If you want to know more about the recent work on the “spesh” MoarVM code specializer you can read about it in Jonathan Worthington’s 4-part posting on his blog. Bart Wiegmans also has a blog outlining his work on the JIT compiler, and recently presented a nice talk about lots of new functionality that’s yet to land, which should hopefully let many new developers pile on and help improve the JIT. So if that feels like an interesting challenge to you, I recommend checking out the above links.
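Speaking of the --profile option mentioned above: if you’ve never tried it, profiling a script is a one-liner. A minimal sketch (the script name here is invented):

# Rakudo writes an interactive HTML profile report to the current
# directory, which you can then open in a browser
perl6 --profile myscript.p6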

So that’s my benchmark and my goals, most of which revolve around data structure creation and parsing. But what about other stuff, like numeric work? Has that kept up too, without anyone pushing for it the way I pushed my view of where things could be improved? The answer is yes!

Once upon a time, back in 2013, a gentleman by the name of Tim King took an interest in finding prime numbers in Perl 6. Tim was fairly upset with the performance he discovered. Rightly so. He started out with the following pretty code:

sub primes-any (Int $max) {
    my Int @primes;
    @primes.push($_) unless $_ %% any(@primes) for 2 .. $max;
    return @primes;
}

Finding any prime via a junction over the very definition of a prime: a really nice, elegant solution! But Tim was aghast at how slow Junctions were, with the above code taking him 11s to see the first 1000 primes. Today that super high level code takes 0.96s.
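If you want to reproduce that timing at home, here’s a minimal sketch using the INIT phaser to time the run (7919 is the 1000th prime, so it makes a handy cutoff):

my @primes = primes-any(7919);    # the first 1000 primes
say "Found {+@primes} primes in {now - INIT now} seconds";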

Being unhappy with how slow the junction-based code was, Tim went on to try the more standard iterative approaches. Tim vanished from the internet shortly after these posts, but he left a legacy that I continued. His code for the prime benchmarks, and my adaptation with results through time, can be found in this gist. Below is the punchline, with another graph showing the average time taken to find the first 1000 primes over 100 trials each. The vertical lines in 2015 indicate a higher standard deviation.

Again, here is a zoomed-in view of more recent results (with the latest data point worrying me a little that I screwed up somehow…)

The above convergence to a point is the overhead of starting and stopping the Rakudo runtime and MoarVM. Finding primes isn’t the effort it once was; it’s now only marginally slower than Rakudo just starting up. That’s at least an order of magnitude faster, regardless of how high level and elegant a code solution you choose.
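You can measure that baseline for yourself from the shell; Rakudo starting and stopping with no user code at all looks like this:

time perl6 -e ''    # startup and shutdown cost only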

OK, so we’ve seen that MoarVM has some shiny new moving parts. But huge effort has also been put in by developers like Liz, jnthn, Zoffix and, more recently in the world of strings, samcv, to improve what MoarVM and Rakudo are actually doing under the hood algorithmically.

Sidenote: I’m sure I am not doing most other devs justice at all, especially by ignoring JVM efforts in this post. I would recommend everyone go and check out the commit log to see just how many people are now involved in making Rakudo faster, better, stronger. I’m sure they would like to see your thanks at the bottom of this article too!

So, saving you the job of checking out the commit log, I’ve done some mining of my own, looking at commits since last Christmas that relate to perf gains: things described as N% or Nx faster, like the following:

3c6277c77 Have .codes use nqp::codes op. 350% faster for short strings

ee4593601 Make Baggy (^) Baggy about 150x faster

Those two commits on their own would be an impressive boost for a programming project in the timescale of a year’s core development. But they are just two of hundreds of commits this year alone.

git log --since 2016-12-25 --oneline | perl6 -e 'say lines().grep(/(\d+\.?\d*?)\%(" "|$)/).map({/(\d+\.?\d*?)\%(" "|$)/; ~$/[0]}).sort.join("\n")' > %_increase.txt
git log --since 2016-12-25 --oneline | perl6 -e 'say lines().grep(/(\d+\.?\d*?)x(" "|$)/).map({/(\d+\.?\d*?)x(" "|$)/; ~$/[0]}).sort.join("\n")' > x_increase.txt

Below are some histograms of the number of commits against the % and x-multiplier performance increases they mentioned. You can grep the logs yourself with the code above. There are some even more exciting gains during 2016 worth checking out.

These really are the perf improvement commits for 2017 alone, with more landing almost daily. This doesn’t even include many of the I/O perf gains from Zoffix’s grant, as they were not always benchmarked before/after. 2016 is equally dense, with some crazy >1000x improvements. Around ten commits this year alone claim a 40x improvement! This is really impressive to see, at least to me. I think it’s also not obvious to many on the project how much they’re accomplishing. Remember these are singular commits; some even compound improvements made earlier in the year!

I will leave it here. But really thank you core devs, all of you. It’s been a great experience watching and waiting. But now it’s time for me to get on with some Perl 6 code in 2018! It’s finally Christmas.

Day 15 – Bioinformatics and the joy of Perl 6

On the 15th day of Christmas Perl6 said let there be life-sciences, and so there was.

A lot of coverage and testament is given to Perl’s role in developing dynamic content for the early web, as well as to its slow decline there. However, there is an equally important story that goes alongside the tale of Perl as the language of CGI: the story of Perl as the scientific glue that saved the Human Genome Project. There is even a published journal article in Perl’s honour.

It’s not an accident that Perl did well and continues to do well in the field of Bioinformatics. Few languages at the time provided first-class semantics that matched the problem domain: regular expression literals, good string handling, trivial data structure creation, sensible automatic type conversion, and XS for wrapping native code when speed was important. That Perl replaced a lot of sed+awk use cases for one-liners is also a welcome feature for a busy bioinformatician. Perl persists today as a de facto scripting language thanks to the large community effort known as BioPerl. For those not in the know, the API of BioPerl inspired a whole host of work in most other languages, such as Biopython, BioJava, BioRuby etc. To date none are quite as feature complete as BioPerl, though each has its own unique additional features. Biopython is getting there thanks to a lot of work done through GSoC action in recent years. I also use ete2 from Python on a regular basis and recommend people check that out for phylogenetics work. However, this post is about the future, and I think Perl 6 has a lot to offer everyone working in bioinformatics.

XS, the Perl 5 system for integrating native code into a Perl module, has done fantastically well in providing scientists with the ability to combine high-performance C code with the simplicity of Perl’s syntax. In the Perl 6 world XS has been replaced with NativeCall. This library makes it incredibly simple to integrate C code with a Perl 6 class or subroutine. I’m not going to cover it here though; the repo has plenty of docs and examples, and two previous advent articles demonstrate use cases.

Instead, everything on show here is a core feature available from just typing ‘perl6’ at a command prompt. You don’t need any external dependencies, so any system with vanilla Rakudo Perl 6 can do everything you’re about to see.

The Grammar of life

Hopefully most people have at least heard about DNA, RNA and perhaps even proteins. All of these are what are referred to in the biz as biopolymers, or translated into programming parlance: “strings of life”. Essentially each molecule is a big chain of smaller building blocks that you can denote in textual form with a letter. In DNA we have the alphabet GCTA; in RNA we never see a T, instead seeing a U in its place. Proteins have their own alphabet, unrelated to DNA and RNA, of between 21 and 23 characters; the last two are very quirky and relatively rare.

So in this section I’m going to outline the hip way to parse sequence files in Perl 6. This is how sequencing data usually comes to a bioinformatician: a giant text file! I’ll be using our own class to hold the sequence, and a well-defined grammar for the well-loved (not) FASTA format. So to start, here is the Grammar, as if it fell from heaven as a shooting star this wintery night:

grammar FASTA::Grammar {
    token TOP { <record>+ }

    token record { ">"<id><comment>?"\n"<sequence> }

    token id { <-[\ \n]>+ }

    token comment { " "<-[\n]>+ }

    proto rule sequence {*}
          rule sequence:sym<dna> { <[ACGTRYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<rna> { <[ACGURYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<aa> { <[A..Z\*\-\n]>+ }
}

Cool, we have a super formal grammar defined for FASTA, and it even looks like we are doing something special for different types of sequence! I guess I had better explain a little; keep in mind that a Grammar is just a huge extension to the regular expressions you are hopefully familiar with.

First up we have token and rule. A token is like a hungry hippo: it consumes the input string greedily, and once it has swallowed something it never gives it back (tokens don’t backtrack). So if a token matches at all, it consumes that part of the string and then lets parsing continue. If there is a choice of tokens the longest match wins, which can be thought of as the winner in a game of hungry hippos!
A rule is a token with one extra twist: whitespace inside the pattern is significant, standing in for optional whitespace in the input (strictly speaking neither token nor rule backtracks; if you want backtracking you declare a regex). Specifically we have a set of rules called sequence, with the more abstract one defined with the keyword proto. The reason we have three rules is one per type of biopolymer, ordered from most to least specific. Unfortunately DNA and RNA can both look like protein. So our Grammar is limited in its ability to detect the difference, but so would a human be; that’s how rubbish this major data format is!
When the grammar tries the sequence candidates, if the start of a sequence matches as DNA but the whole of it matches as protein, the DNA candidate loses out: longest-token matching picks whichever alternative can consume the most input, so we never end up with a really short DNA match followed by a protein one. This might all sound a bit complicated, but I’ve made a trace using Grammar::Debugger of what happens matching a test.fa file in this Gist.
If you do look at the trace you will see the final important fact about Grammars: we need a TOP token or rule. This is where the thought process, and the parsing of data, starts. So a FASTA file is made of one or more records; that’s our TOP. A record is made of an id, a comment and a sequence, and so the logic flows. The Perl 6 regular expression syntax is way out of scope for this post, but Masak has a nice intro.
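As a quick aside, here’s a toy sketch (the grammar names are invented for illustration) of the whitespace difference between token and rule that I described above:

grammar WithToken { token TOP { 'a' 'b' } }
grammar WithRule  { rule  TOP { 'a' 'b' } }

say so WithToken.parse("ab");     # True: whitespace in a token pattern is ignored
say so WithToken.parse("a b");    # False
say so WithRule.parse("a b");     # True: each pattern space became <.ws>
say so WithRule.parse("ab");      # False: <.ws> wants a word boundary between 'a' and 'b'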

Now lets get this Grammar rolling along doing something. If we want to use a sequence in our code we need a basic type to represent it. Below is probably the simplest thing you would want:

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";
}

It’s not the most inspired or interesting class in the world but it will do!
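Just to prove it works, a quick REPL one-liner (the values are invented):

> Seq.new(id => "demo", sequence => "GCTA").sequence
GCTA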

Now, how do we hook up our Seq class to our FASTA Grammar to produce some nice objects from our FASTA file? The answer is we need to take action! Not the political kind; we just need to create an Actions class for the Grammar to use when parsing.

By default we could call .parse on our grammar above and we would get all the text chunked up, under nicely named labels, into a giant tree of Match variables. Some of the time that’s what you want. Action classes have methods that match the names of tokens and rules in a grammar. Each method is called at the corresponding point in the Grammar’s parse tree and given the chance to make something more interesting, replacing all the crummy Match variables in that part of the output. So what do we want:

class FASTA::Actions {
    method TOP($/) {
        make $<record>>>.ast;
    }
    method record($/) {
        make Seq.new(
            id => ~$<id>,
            comment => ($<comment>) ?? (~$<comment>).trim !! '',
            sequence => $<sequence>.subst("\n","", :g)
        );
    }
}

So the logic above is: whenever we parse out the whole of a record, we make a new Seq object and shove it into the resulting match tree for later. You can see that we do some trimming and cleaning up of whitespace in the action before creating a clean Seq object. Finally, the TOP action basically just says that the final product of .parse should be all the record matches, which we just transformed into Seq objects; so it’s literally a list of Seq! Nice! Let’s give it a whirl on that test file I mentioned:

Input

>hello
GCTATATAAGC
>world prot
TATAKEKEKELKL

Code

my $grammar = FASTA::Grammar.new;
for $grammar.parsefile("test.fa", actions => FASTA::Actions).ast -> $seq {
    say $seq.id;
}

Output

hello
world

Ultimately we didn’t need the added complexity of matching a sequence as DNA/RNA/protein, but it was kind of cool, and at a later date we might want to do something different and have fancier Seq objects. I’ll leave exploration of this as a task for the interested reader :P
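To sketch just one possibility (TypedSeq and the action class below are invented for illustration, not part of the code above): each proto candidate can be given its own action method, so the record action can see which kind of sequence matched:

class TypedSeq is Seq {
    has Str $.type = "";
}

class FASTA::TypedActions {
    method TOP($/) { make $<record>>>.ast }

    # One action per proto candidate, tagging which rule matched
    method sequence:sym<dna>($/) { make 'dna' }
    method sequence:sym<rna>($/) { make 'rna' }
    method sequence:sym<aa>($/)  { make 'aa' }

    method record($/) {
        make TypedSeq.new(
            id       => ~$<id>,
            comment  => ($<comment>) ?? (~$<comment>).trim !! '',
            sequence => $<sequence>.subst("\n", "", :g),
            type     => $<sequence>.made,   # set by the sequence:sym<*> actions above
        );
    }
}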

A Bag full of sequence

So, I’m concerned that our Seq class is still kind of crap. It might as well just be a hash of values at this point. Sure, we get a type, which is nice, but it just doesn’t have anything cool about it. I think a nice addition is to make our class work a little better with the rest of the Perl 6 family of operations, letting more of the language’s magic get at our data. The results of this simple change might actually impress you!

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";

    multi method Numeric (Seq:D: --> Numeric) {
        $.sequence.codes;
    }

    multi method Bag (Seq:D: --> Bag) {
        bag $.sequence.comb;
    }

    multi method Str (Seq:D: --> Str) {
        $.sequence;
    }

    multi method gist (Seq:D: --> Str) {
        ">$.id $.comment\n$.sequence";
    }
}

Above I’ve introduced some snazzy new methods specific to our class. Anyone familiar with toString from Java should get what both the Str and gist methods are up to. They are well-observed methods in Perl 6 that deal with interpreting an object as a string. But it’s probably not clear at all why you would want two of them!

Simply put, the Str method is there for the pure string interpretation of our object. If I use operators that expect Str, this is what gets called to automagic our Seq into a string. The gist is kind of a neat one: the higher-level say function calls .gist on our object and prints that, since it’s already assumed you want something a bit more human readable. Even more neat: in the command line REPL, .gist is immediately called on any object that results from an expression. So when we create a new Seq in the Perl 6 REPL, or just give the variable name on its own, we get FASTA ready to copy and paste! Take a look at the REPL dump below and see if you can work out where we are implicitly using $seq as a Str rather than a Seq:

> my $seq = Seq.new(id=>'hello',comment=>'world',sequence=>'GCTACCTATAAAGC');
>hello world
GCTACCTATAAAGC
> say $seq
>hello world
GCTACCTATAAAGC
> "junk" ~ $seq
junkGCTACCTATAAAGC
> my $big-seq = Seq.new(id=>'whole',comment=>'new world',sequence=> $seq x 3);
>whole new world
GCTACCTATAAAGCGCTACCTATAAAGCGCTACCTATAAAGC
> print $seq
GCTACCTATAAAGC>

But wait! There’s more. For the low, low price of just two more methods we get numeric wrangling and statistical modelling! The Numeric method does what it says on the tin; we defined it as the length of the sequence, which feels reasonable as a numerical representation.

> $seq + $big-seq
56
> ($seq + $big-seq) / $seq
4

Bag creates a Bag type object from our sequence. Basically, the Bag we create is an immutable counting dictionary of all the letters in the sequence. So we now have the first step in some frequency analysis: letters and a count of how often they occurred.

But what is even more awesome about Bags in Perl 6 are the pick and roll methods, which allow you to sample the keys of the Bag based on their frequency! So if you want to create some dummy sequences to do statistical tests with, it’s easy as pie. Roll keeps the letters in the bag as you sample, whereas pick removes them, altering the distribution of the next pick. Below I give an example where we .pick strings that are as long as the input sequence; they will be randomly different, but with the same distribution of characters. This sort of dummy sequence is useful for calculating the statistical significance of our sequence.

> $seq.Bag
bag(G(2), C(4), T(3), A(5))
> $seq.Bag.pick($seq).join
CAGATCGTAAATCC
> $seq.Bag.pick($seq).join
GAATCCCGTACATA

Look at that: we even use $seq as the number of letters to pick, thanks to the Numeric method! Perhaps not the most obvious or readable thing in the world, but for playing on the REPL it’s a nice feature, and not utterly unintuitive to people used to context in Perl.
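And if you want a dummy sequence longer than the original, .roll samples with replacement for as long as you like; a quick sketch (I’ve omitted the output since it’s random each time):

> $seq.Bag.roll(30).join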

A Range of possibilities

A common task in bioinformatics is dealing with what is commonly referred to as “annotation” data. This can be anything from specifying the location of a gene in a whole genome to specifying a patterned site within a protein sequence. In BioPerl there is a specific and special Bio::Range class for dealing with this sort of data, which represents a segment on a number line with a start and an end position. In Perl 6 this is built right into the language, with most of the features any bioinformatician would be happy with. Plus, Ranges can be coerced to Sets and used with the set operators. And creating a Range couldn’t be easier or more obvious; you’ve all done it before!

> my $range = 1..10;
> $range.WHAT.say;
(Range)

This is the exact same syntax and object you use when creating a for loop over a given range, in both Perl 5 and Perl 6.
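For instance, a throwaway sketch:

for 1..5 -> $i { print $i }    # prints 12345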

To show what we can do with this basic type in Perl 6, I’ve taken some data about the gene p53, which is well known for its role in the development of many cancers. It’s all real data, and you can see a coloured block diagram of the protein annotation data here. If you think of each block as being a single Range object, you get some idea of what I’m harping on about. So let’s start out with some DNA and some protein features in code, remembering that we could just as easily read these from a file:

#Genomic chromosome coordinates of the p53 gene in human
my $p53_gene = 7_565_097 .. 7_590_863;

#A single nucleotide polymorphism AKA a point mutation
my $snp = 7_575_996;

#Some annotated protein domains
my %domains = p53-TF => 97..287, p53-Tetramer => 319..357;

#Predicted interaction site
my $binding_site = 322..355;

#A known protein Phosphorylation site
my $phospho_site = 99;

So far so good: we have some one-off integer values, some ranges, and a cool hash of ranges! The bonus Perl 6 features come from operators that accept a Range as their operands. Most importantly, set operations work as you might expect:

#Work out if the SNP is within the gene
say "MUTATION!!!" if $p53_gene (cont) $snp;

#Find out which domains contain the phosphorylation site
my %phosphorylated = %domains.grep({ .value (cont) $phospho_site })

#Find out which domains are in the binding site and what percentage is binding
my %binding = %domains.map({
                            $_.key => 100 / $_.value.elems * ($_.value (&) $binding_site)
                      })
                      .grep(*.value > 0.0);

Hopefully a lot of the above is almost self explanatory once I introduce the set operators for contains, (cont) or ∋, and intersection, (&) or ∩. I chose the ASCII variants, mostly because I don’t have a ∩ key on my keyboard and I wouldn’t want my code to be trashed if someone used a non-Unicode editor! The contains operator returns a Bool telling you whether the left operand contains the right, and the intersection operator returns the set of numbers common to both Ranges. The final hash wrangling is quite nice: it keeps the keys of the %domains hash but replaces the values with the percentage coverage by our binding site, and only gives a result if there was any coverage between the two Ranges. Hopefully almost any bioinformatician can see how nice these sorts of primitive types and operations are to have in the core of a language.
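For the curious, here’s a tiny sketch showing the Unicode spellings really are the same operators (reusing the variables defined above):

say $p53_gene (cont) $snp;                    # True
say $p53_gene ∋ $snp;                         # identical, Unicode spelling
say %domains<p53-Tetramer> ∩ $binding_site;   # the Set of positions common to both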

A Curio

If any of the above has you at least a little curious, if not excited, to see what Perl 6 can do within Bioinformatics, then you can check out more at either Chris Fields’ official BioPerl6 repo or my own playpen. I’ll leave you with one final code nugget to chew on. I found this by accident just writing a method to wrangle DNA sequence into RNA sequence in Perl 6; something you might like to add to our Seq class. This is approximately the session I had at the REPL:

> my $dna = 'GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA';
> my $rna = $dna ~~ tr/GCAT/GCAU/;
> $rna.WHAT.say
(StrDistance)

WAT?! I was originally expecting to just get the transliterated string back. As an aside, you should use the method $dna.trans('T' => 'U') if that’s what you want. Instead I got back this odd StrDistance object… What could it be, I thought? So I did some basic inspection to take a look:

> $rna.perl
StrDistance.new(before => "GATG…TAAAA", after => "GAUG…UAAAA")
> $rna.^methods
Bool Numeric Int <anon> <anon>
> $rna.^attributes
Str $!before Str $!after Int $!distance

So in the .perl output we see that .before and .after are the two strings: the DNA and the newly transliterated RNA. But there is also this other guff like .Numeric and .Int. It turns out it’s something awesome!

> say +$rna;
13

Lucky number 13… or the Hamming distance between the two strings! Anyone from the world of Bioinformatics will instantly see this as quite a handy feature. Perl 6 wasn’t created or invented for Bioinformatics, but it has so many accidentally great features lying around, relatively uncommented on.
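As a sanity check, you can compute the same Hamming distance by hand with a zip. A small sketch, assuming we stash the original sequence in a hypothetical $before, since tr/// mutated $dna in place:

> my $before = 'GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA';
> ($before.comb Z $dna.comb).grep({ .[0] ne .[1] }).elems
13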

Sorry for the blowhard post, but I thought it might be nice to put these ideas all in one place. Have a happy Christmas and New Year!