Day 2 – Bioinformatics with Perl 6

Summer of 2016 and I’m preparing to help teach a class on metagenomics with my boss, Dr. Bonnie Hurwitz, at The University of Arizona.  She and I are both alumi of Dr. Lincoln Stein’s lab when he was at Cold Spring Harbor where we worked on Gramene, a comparative genomics platform for plants.  Way back in 1999, Lincoln created an intensive two-week course at CSHL called “Programming for Biologists” to teach everything from Unix to Perl to BLAST and CGI.  I was fortunate to be a teaching assistant several times over the years and came to enjoy helping bench scientists learn computational tools and techniques.  In the fall 2015 debut class of Bonnie’s class, we used Lincoln’s material to teach Perl 5, but I was ready to push into a new language.

Over the years, I’ve played with Python, Lisp, Ruby, Haskell, Prolog, and others.  Python would have been the obvious choice to teach our students, but I felt like I already knew an interpreted, dynamically typed language.  While I admire the type safety of Haskell, I can’t imagine being able to teach command-line Unix, HPC, metagenomic analysis, and Haskell to absolute beginners in one semester.  I decided to spend a month over the summer writing in Perl 6, and it didn’t take long until I was hooked.

There were no print books available and only a handful of online resources like https://docs.perl6.org/http://perl6intro.com/, and https://learnxinyminutes.com/docs/perl6/.  None of these were tailored to beginning scientists, so I started writing my own.  My idea was to teach solutions to common problems in biology using the various styles and strengths of the language.  For each task, I present maybe 2 or sometimes 6 versions, usually getting shorter as I teach more powerful techniques gained by functional programming techniques or object-oriented programming.

As an example, our students are writing their final program for the semester, wrapping up a re-analysis of the skin microbiome (Grice et al 2011).  Over these last few months, they’ve followed a series of detailed protocols to download the raw reads, push them through quality control measures, assemble into “contigs” (contiguous bits of sequence), call putative genes, estimate taxonomy, and annotate gene to determine the functions of the organisms at the various body sites (e.g., armpit vs toe web).  Their last assignment is to annotate the contigs with KEGG categories.

Since KEGG supports itself with yearly subscriptions, among other funding, they don’t make it exactly easy to get all the data our students need.  We used uproc to attach KEGG identifiers, and we want to group those identifiers into categories to find the top 5 functions happening in each student’s subset of samples. I found a handy shell script to use the KEGG API to download KEGG-IDs-to-pathways and pathways-to-categories.  So now we need to link KEGG IDs (e.g., “K01425”) to the pathway category (e.g., “GABAergic synapse”) via the “path:map04727”:

$ head -1 path:map04727.ko
path:map04727 ko:K01425
$ grep map04727 pathway.list
path:map04727 GABAergic synapse

Here’s a script that accomplishes the task:

#!/usr/bin/env perl6

sub MAIN (
    Str  :$pathway-list where *.IO.f = 'pathway.list',
    Str  :$pathway-dir  where *.IO.d = 'pathways',
    Str  :$out-file = 'kegg-cats.txt',
    Bool :$force    = False,
) {
    if $out-file.IO.f && $force == False {
        my $overwrite = prompt "'$out-file' exists. Overwrite [yN]? ";
        exit unless $overwrite.lc.starts-with('y');
    }

    my $out-fh = open $out-file, :w;
    my %map    = $pathway-list.IO.lines.map(*.split("\t")).flat;

    for dir($pathway-dir).grep(/'.ko'$/).kv -> $i, $ko {
        printf "%3d: %s\n", $i + 1, $ko.basename;

        for $ko.IO.lines.map(*.split("\t")) -> ($map-id, $ko-id) {
            my $cat     = %map{ $map-id } or next;
            my $kegg-id = $ko-id.subst(/^'ko:'/, '');

            if $kegg-id.match(/^ K \d ** 5 $/) {
                $out-fh.put(join "\t", $kegg-id, $cat);
            }
        }
    }

    put "Done, see '$out-file'.";
}

I’d like to break this down to explain all the lovely goodies.  First and thing-that-completely-sold-me-on-Perl-6 is the special MAIN subroutine and all the goodness packed into signatures.  Named arguments, e.g., “–pathway-list,” can defined by creating Pairs, which is as simple as putting a “:” in front of the variable that will hold the argument’s value.  I can assign a default value with “=”, constrain the value with a Type like “Str” or “Bool,” and even add arbitrary conditions like checking if the argument is an existing file (“*.IO.f”) or directory (“*.IO.d”).  If I see myself reusing those same checks (e.g., I take multiple “file/dir” arguments), I can easily create my own “subset”:

subset File of Str where *.IO.f;
sub MAIN (File :$file1, File :$file2) { ... }

I like to use reasonable defaults for my arguments, and here I wanted to show how I could check if the output file already exists and how to ask the user via “prompt” if they wish to overwrite it.  In Perl 5, I would have written:

exit unless lc($overwrite) =~ /^y/;

A direct translation to Perl 6 would use “.lc” (lowercase) as a Str method and using the ~~ “smart match” operator:

exit unless $overwrite.lc ~~ /^y/;

But here I wanted to show a more Python-ish use of “starts-with” so as to avoid freaking out beginners with regular expressions.  (I like that Perl has a long history of borrowing/stealing the best features of other languages!)

I am very happy with the new “open” routine as the Perl 5 way always seems backwards:

open my $fh, '<', $file; # Perl 5 open $file for read
my $fh = open $file;     # Perl 6 open $file for read

Here, I need to open the output file for writing, so I pass the “:w” (writable) flag which is just a short-hand for the Pair w => True.

The next line is an extremely condensed way to read a tab-delimited file of two columns directly into a hash.  Here is what my “pathway.list” file looks like:

$ head -3 pathway.list
path:map00010 Glycolysis / Gluconeogenesis
path:map00020 Citrate cycle (TCA cycle)
path:map00030 Pentose phosphate pathway

From that, I want a list of the “path:map” string to the category on the right.  So, let’s use the REPL (the second thing that totally sold me on Perl 6) to see how this is built:

$ head -3 pathway.list > test
$ perl6
To exit type 'exit' or '^D'
> 'test'.IO.lines
(path:map00010 Glycolysis / Gluconeogenesis path:map00020 Citrate cycle (TCA cycle) path:map00030 Pentose phosphate pathway)
> 'test'.IO.lines.map(*.split("\t").join('=')).join(', ')
path:map00010=Glycolysis / Gluconeogenesis, path:map00020=Citrate cycle (TCA cycle), path:map00030=Pentose phosphate pathway

File I/O is very easy (IMHO) — I can coerce/cast a variable with “.IO” and then call “lines” (or “words” or even “comb” if I wanted letters).  Here I want to process each line, using our trusty old “map” (from functional programming) to apply the “split” function to the “*” (Whatever) to get a list which, for visual purposes, I join on “=”.  The result of the lines/map is itself a list which I joined on commas so you can see the two lists together.

Something that is completely different between Perl 5 and 6 is how they handle lists of lists.  Perl 5 would flatten them:

  DB x ((1,2), (3,4))
0 1
1 2
2 3
3 4
 DB x scalar ((1,2), (3,4))
0 4

But Perl 6 allows nesting:

> ((1,2), (3,4))
((1 2) (3 4))
> ((1,2), (3,4)).elems
2

But I can’t create a hash from a list of lists, only from a list of pairs:

> my %h = ((1,2), (3,4))
{1 2 => (3 4)}
> my %h = 1 => 2, 3 => 4
{1 => 2, 3 => 4}

I need to flatten my list-of-lists:

> my %h = ((1,2), (3,4)).flat
{1 => 2, 3 => 4}

And that is how I get my hash-from-a-tab-file:

> my %h = 'test'.IO.lines.map(*.split("\t")).flat
{path:map00010 => Glycolysis / Gluconeogenesis, path:map00020 => Citrate cycle (TCA cycle), path:map00030 => Pentose phosphate pathway}

There are a few things I’d like to explain in this line:

for dir($pathway-dir).grep(/'.ko'$/).kv -> $i, $ko

The “dir” routine gives you a directory listing, and, since my type constraint already checked that the variable is an existing directory, I can rest assured that this will work.  The result of “dir” is a list of IO::Path objects, and I can “grep” (“filter” in some other languages) for those that smart-match to the pattern of the string “.ko” anchored to the end (“$”).  (I could have also used the Python-like “ends-with” method.)

The result of the dir/grep call is a list, and as I wrote in another post, you can call call “kv” (as well as “pairs“) on any list to get both the position and value of each member:

> ("foo", "bar").kv
(0 foo 1 bar)
> ("foo", "bar").pairs
(0 => foo 1 => bar)

The other bit of awesomeness is that Perl 6 allows you to consume a variable number of elements of  a list with optional defaults if the list is not evenly divisible by the number of elements you’ve requested:

> for 1..4 -> $i { $i.say }
1
2
3
4
> for 1..4 -> $i, $j { put "$i, $j" }
1, 2
3, 4
> for 1..4 -> $i, $j="NA", $k="NA" { put "$i, $j, $k" }
1, 2, 3
4, NA, NA

Heck, you don’t even have to declare them as we have implicit variables:

> for 1..4 { put "$^a, $^b" }
1, 2
3, 4

So then I have a (zero-offset which is why I add 1) file number and name which I can use to print my progress using “printf.”  In Perl 5, I would have to establish my counter $i before the loop (which means it would continue to be visible after the loop) and be sure to increment the counter while understanding the nuances of ++$i vs $i++:

# Perl 5
my $i = 0;
for my $word (qw[foo bar]) {
    printf "%3d: %s\n", ++$i, $word;
}

While trying similar things in Haskell, I learned to zip (Z) an infinite list of integers (Haskell, like Perl 6, is lazy!) with my list of interest like so:

> for 1..* Z "foo", "bar" -> ($i, $word) { put "$i = $word" }
1 = foo
2 = bar

OK, that was a long tangent, so let’s return to the program.  We left off with this statement to find all the pathway/KEGG ID files:

for dir($pathway-dir).grep(/'.ko'$/).kv -> $i, $ko

As before, I use “.IO.lines” to read each mapping file, also running each line through a similar “split” on tabs to break the lines into the mapping ID and KO ID:

for $ko.IO.lines.map(*.split("\t")) -> ($map-id, $ko-id)

The long-hand way to do this would be:

for $ko.IO.lines -> $line {
    my ($map-id, $ko-id) = $line.split("\t");
}

As in Perl 5, “next” and “last” are still part of control flow, and I skip over mapping IDs that I didn’t find in the mapping file:

my $cat = %map{ $map-id } or next;

The KO ID looks like “ko:K00001”, so I need to remove the leading “ko:” which is easily accomplished with the “subst” (substitute) operation:

my $kegg-id = $ko-id.subst(/^'ko:'/, '');

In Perl 5 I would have done this, which I feel is unquestionably more opaque:

(my $kegg_id = $ko_id) =~ s/^ko://;

It’s worth noting that in Perl 5 I might also have made the decision to reuse and mutate the $ko_id like so:

$ko_id =~ s/^ko://;

But in Perl 6, the my $ko-id variable is not actually variable — by default it is read-only.  In the following example, you see that “subst” returns a new string with the substitutions requested, but the string itself remains unchanged:

> for "foo", "bar" -> $s { put $s.subst(/<[aeiou]>/, 'X', :g); put $s }
fXX
foo
bXr
bar

If I try to call the mutator method, I get an error:

> for "foo", "bar" -> $s { $s.subst-mutate(/<[aeiou]>/, 'X', :g); put $s }
Cannot resolve caller subst-mutate(Str: Regex, Str, :g); the following candidates
match the type but require mutable arguments:
 (Cool:D $self is rw: |c is raw)
 (Str:D $self is rw: $matcher, $replacement, :ii(:$samecase), :ss(:$samespace), :mm(:$samemark), *%options)
 in block  at  line 1

I have to explicitly say the variable is a copy:

> for "foo", "bar" -> $s is copy { $s.subst-mutate(/<[aeiou]>/, 'X', :g); put $s }
fXX
bXr

All this is a nod to the very good influence of purely functional programming languages where data is immutable.  It’s not nice to always be constrained by this idea, but here the application, I believe, enforces a very good programming practice.

Some of the KEGG IDs might be “KO” (KEGG Orthology), so I want to ensure that I’m only dealing with strings that look like “K00001” — a capital “K” followed by five digits:

if $kegg-id.match(/^ K \d ** 5 $/) {

It’s also possible to write that with the smart-match operator.  Both will return Match if successful:

> 'K00001' ~~ /^ K \d ** 5 $/
「K00001」
> say ('K00001' ~~ /^ K \d ** 5 $/).WHAT
(Match)
> my $m = 'K00001' ~~ /^ K \d ** 5 $/
「K00001」
> dd $m
Match $m = Match.new(ast => Any, list => (), hash => Map.new(()), orig => "K00001", to => 6, from => 0)

The ecosystem of native Perl 6modules currently includes BioInfo and BioPerl6.  What’s amazing, though, is that you can use Inline::Perl5 to bring in Perl 5’s BioPerl:

$ cat seqs.fa
>GON5MYK01BC2ZU
CTGTCTGTAACCTCTGCCAGCATGCAGGCGCGTGCTGCGTACCTGAACGAAGGTTGGAAC
TGCATGAACTTGGTGAAGAACATCACCAAGCAAGACCCGCTAGAGTTTGTCGCTAACCGC
CTCACTAGCTATTGGCAGCGCGTAGCCCAGCGTCGCGCTATTCGCCACCACTATCGGCAT
CTACAACGAAAATGTT
>GON5MYK01AXF84
CTTGAGCAAGACCTGCGCCGACTTGTCCTGTTTTCAACTAACTTAGCACCTACGTTACGA
ATAAATCTATCAAATTGTGGTGTGCCAAAAATATCTGATATGCTTGGTCTTTCTGTTTGT
TCTGTAACTTTATCTGTAGGTTTTTC
>GON5MYK01DJ0V1
CATTACTGAAATCTGTCCAGTATTCCACATACAAGAAAGGTCTTGTATCGGAGCTGTTGT
CTACACCTTTAATAGTAAAGTGACGTGTCAATATCTGAACCAGTGATCCAACCTTGTGCT
TATGTCTGACTTCATCATCAAGTGTTAGTGTCGTG
>GON5MYK01DR4LG
AGAAGTAAAAATCAGCAGAGACTTCGACATCGAAAGATTAATTGGTCAAGATATTACAGC
TTTAACATCTCTATTCGATCAACAAGTCATTAGATAGAGACGAATTTAGAGATATTTTAG
TTCAAGGTG
$ cat inline.p6
#!/usr/bin/env perl6

# This is using p5 BioPerl Bio::SeqIO
use Bio::SeqIO:from<Perl5>;

my $file = @*ARGS.shift;

# Note: left side needs quotes; keys are not automaically strings in p6
my $in = Bio::SeqIO.new('-format' => 'fasta', '-file' => $file);

my $ct = 0;
while $in.next_seq -> $record {
    $ct++;
    say $record.display_id;
}

say "Count: $ct";
$ ./inline.p6 seqs.fa
GON5MYK01BC2ZU
GON5MYK01AXF84
GON5MYK01DJ0V1
GON5MYK01DR4LG
Count: 4

Like Perl 5, version 6 still (IMO) makes easy things easy and hard things possible, and I hope this article makes you want to use Perl 6 today.  I will continue to add chapters and examples to my book and Github repo in the hopes that people will steal as much as they need to get started.

Happy Hacking to one and all!

6 thoughts on “Day 2 – Bioinformatics with Perl 6

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s