Day 25 – Calling Numbers Names

This school semester I took my first proof-based class titled “Intro to Mathematical Proof Workshop”. After having taken other math classes (Calculus, Matrix Algebra, etc.), I felt that I didn’t have that much of a mathematical foundation and up to this point, all I had been doing was purely computational mathematics sprinkled with some proofs here and there. Looking back, I found the class quite enjoyable and learning about different theorems and their proofs, mostly from number theory, has given me a new perspective of mathematics.

“How is this related to Perl 6?”, you might be asking. As I mentioned, most of the proofs that were discussed either in class or left for homework were related to number theory. If there’s one thing Perl 6 and number theory have in common is their accessibility. Similar to how the content of the elementary theory of numbers can be tangible and familiar, Perl 6 can be quite approachable to beginners. In fact, beginners are encouraged to write what’s known as “baby Perl”.

Continue reading “Day 25 – Calling Numbers Names”

Day 24 – Topic Modeling with Perl 6

Hi, everyone.

Today, let me introduce Algorithm::LDA.
This module is a Latent Dirichlet Allocation (i.e., LDA) implementation for topic modeling.

Introduction

What’s LDA? LDA is one of the popular unsupervised machine learning methods.
It models document generation process and represents each document as a mixture of topics.

So, what does “a mixture of topics” mean? Fig. 1 shows an article in which some of the words are highlighted in three colors: yellow, pink, and blue. Words about genetics are marked in yellow; words about evolutionary biology are marked in pink; words about data analysis are marked in blue. Imagine all of the words in this article are colored, then we can represent this article as a mixture of topics (i.e., colors).

Fig. 1:
Fig. 1
(This image is from “Probabilistic topic models.” (David Blei 2012))

OK, then I’ll demonstrate how to use Algorithm::LDA in the next section.

Modeling Quotations

In this article, we explore Wikiquote. Wikiquote is a cloud-sourced platform providing sourced quotations.
By using Wikiquote API, we get quotations that are used for LDA estimation. After that, we execute LDA and plot the result.
Finally, we create an information retrieval application using the resulting model.

Preliminary

Wikiquote API

Wikiquote has action API that provides means for getting Wikiquote resources.
For example, you can get content of the Main Page as follows:

$ curl "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content&format=json"

The result of the above command is:

{"batchcomplete":"","warnings":{"main":{"*":"Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application."},"revisions":{"*":"Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."}},"query":{"pages":{"1":{"pageid":1,"ns":0,"title":"Main Page","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"
\n
{{Main page header}}
\n
{{Main Page Quote of the day}}
\n</div>\n\n
\n{{Main Page Selected pages}}\n{{Main categories}}\n
\n\n
\n{{New pages}}\n{{Main Page Community}}\n
\n\n
\n==Wikiquote's sister projects==\n{{otherwiki}}\n\n==Wikiquote languages==\n{{Wikiquotelang}}\n
\n__NOTOC__ __NOEDITSECTION__\n{{noexternallanglinks:ang|simple}}\n[[Category:Main page]]"}]}}}}

WWW

WWW by Zoffix Znet is a library which provides easy-to-use API for fetching and parsing json very simply.
For instance, as the README says, you can easily get content by jget(URL)<HASHKEY> style:

say jget('https://httpbin.org/get?foo=42&bar=x')<args><foo>;

To install WWW:

$ zef install WWW

Chart::Gnuplot

Chart::Gnuplot by titsuki is a bindings for gnuplot.

To install Chart::Gnuplot:

$ zef install Chart::Gnuplot

In this article, we use this module; however, if you unfamiliar with gnuplot there are many alternatives: SVG::Plot, Graphics::PLplot, Call matplotlib functions via Inline::Python.

Stopwords from NLTK

NLTK is a toolkit for natural language processing.
Not only APIs, it also provides corpus.
You can get stopwords for English via “70. Stopwords Corpus”: http://www.nltk.org/nltk_data/

Exercise 1: Get Quotations and Create Cleaned Documents

At the beginning, we have to get quotations from Wikiquote and create clean documents.

The main goal of this section is to create documents according to the following format:

<docid> <personid> <word> <word> <word> ...
<docid> <personid> <word> <word> <word> ...
<docid> <personid> <word> <word> <word> ...

The whole source code is:

use v6.c;
use WWW;
use URI::Escape;

sub get-members-from-category(Str $category --> List) {
    my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
    @(jget($member-url)<query><categorymembers>.map(*<title>));
}

sub get-pages(Str @members, Int $batch = 50 --> List) {
    my Int $start = 0;
    my @pages;
    while $start < @members {
        my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
        my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
        @pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
        $start += $batch;
    }
    @pages;
}

sub create-documents-from-pages(@pages, @members --> List) {
    my @documents;
    for @pages -> $page {
        my @quotations = $page<body>.split("\n")\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst("[", "[", :g))\
        .map(*.subst("]", "]", :g))\
        .map(*.subst("&amp;", "&", :g))\
        .map(*.subst("&nbsp;", "", :g))\
        .map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, " ", :g))\
        .grep(/^\*<-[*]>/)\
        .map(*.subst(/^\*\s+/, ""));

        # Note: The order of array wikiquote API returned is agnostic.
        my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
        @documents.push(%(body => $_, personid => $index)) for @quotations;
    }
    @documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}

my Str @members = get-members-from-category("Category:1954_births");
my @pages = get-pages(@members);
my @documents = create-documents-from-pages(@pages, @members);

my $docfh = open "documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join(" ")) for @documents;
$docfh.close;

my $memfh = open "members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;

First, we get the members listed in the “Category:1954_births” page. I choosed the year that the Perl 6 designer was born in:

my Str @members = get-members-from-category("Category:1954_births");

where get-members-from-category gets members via Wikiquote API:

sub get-members-from-category(Str $category --> List) {
    my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
    @(jget($member-url)<query><categorymembers>.map(*<title>));
}

Next, call get-pages:

my @pages = get-pages(@members);

get-pages is a subroutine that gets pages of the given titles (i.e., members):

sub get-pages(Str @members, Int $batch = 50 --> List) {
    my Int $start = 0;
    my @pages;
    while $start < @members {
        my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
        my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
        @pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
        $start += $batch;
    }
    @pages;
}

where @members[$start..^List($start + $batch, +@members).min] is a slice of length $batch, and the elements of the slice are percent encoded by uri_escase and joint by %7C (i.e., percent encoded pipe symbol).
In this case, one of the resulting $list is:

Mumia%20Abu-Jamal%7CRene%20Balcer%7CIain%20Banks%7CGerard%20Batten%7CChristie%20Brinkley%7CJames%20Cameron%20%28director%29%7CEugene%20Chadbourne%7CJackie%20Chan%7CChang%20Yu-hern%7CLee%20Child%7CHugo%20Ch%C3%A1vez%7CDon%20Coscarelli%7CElvis%20Costello%7CDaayiee%20Abdullah%7CThomas%20H.%20Davenport%7CGerardine%20DeSanctis%7CAl%20Di%20Meola%7CKevin%20Dockery%20%28author%29%7CJohn%20Doe%20%28musician%29%7CF.%20J.%20Duarte%7CIain%20Duncan%20Smith%7CHerm%20Edwards%7CAbdel%20Fattah%20el-Sisi%7CRob%20Enderle%7CRecep%20Tayyip%20Erdo%C4%9Fan%7CAlejandro%20Pe%C3%B1a%20Esclusa%7CHarvey%20Fierstein%7CCarly%20Fiorina%7CGary%20L.%20Francione%7CAshrita%20Furman%7CMary%20Gaitskill%7CGeorge%20Galloway%7C%C5%BDeljko%20Glasnovi%C4%87%7CGary%20Hamel%7CFran%C3%A7ois%20Hollande%7CKazuo%20Ishiguro%7CJean-Claude%20Juncker%7CAnish%20Kapoor%7CGuy%20Kawasaki%7CRobert%20Francis%20Kennedy%2C%20Jr.%7CLawrence%20M.%20Krauss%7CAnatoly%20Kudryavitsky%7CAnne%20Lamott%7CJoep%20Lange%7CAng%20Lee%7CLi%20Bin%7CRay%20Liotta%7CPeter%20Lipton%7CJames%20D.%20Macdonald%7CKen%20MacLeod

Note that get-pages subroutine uses hash contextualizer %() for creating a sequence of hash:

@pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });

After that, we call create-documents-from-pages:

my @documents = create-documents-from-pages(@pages, @members);

create-documents-from-pages creates documents from each page:

sub create-documents-from-pages(@pages, @members --> List) {
    my @documents;
    for @pages -> $page {
        my @quotations = $page<body>.split("\n")\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst("[", "[", :g))\
        .map(*.subst("]", "]", :g))\
        .map(*.subst("&amp;", "&", :g))\
        .map(*.subst("&nbsp;", "", :g))\
        .map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, " ", :g))\
        .grep(/^\*<-[*]>/)\
        .map(*.subst(/^\*\s+/, ""));

        # Note: The order of array wikiquote API returned is agnostic.
        my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
        @documents.push(%(body => $_, personid => $index)) for @quotations;
    }
    @documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}

where .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) and .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) are coverting commands that extract texts for displaying and delete texts for internal linking from anchor texts. For example, [[Perl]] is reduced into Perl. For more syntax info, see: https://docs.perl6.org/language/regexes#Named_captures or https://docs.perl6.org/routine/subst

After some cleaning operations (.e.g., .map(*.subst("[", "[", :g))), we extract quotation lines.
.grep(/^\*<-[*]>/) finds lines starting with single asterisk because most of the quotations appear in such kind of lines.

Next, .map(*.subst(/^\*\s+/, "")) deletes each asterisk since asterisk itself isn’t a constituent of each quotation.

Finally, we save the documents and members (i.e., titles):

my $docfh = open "documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join(" ")) for @documents;
$docfh.close;

my $memfh = open "members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;

Exercise 2: Execute LDA and Visualize the Result

In the previous section, we saved the cleaned documents.
In this section, we use the documents for LDA estimation and visualize the result.
The goal of this section is to plot a document-topic distribution and write a topic-word table.

The whole source code is:

use v6.c;
use Algorithm::LDA;
use Algorithm::LDA::Formatter;
use Algorithm::LDA::LDAModel;
use Chart::Gnuplot;
use Chart::Gnuplot::Subset;

sub create-model(@documents --> Algorithm::LDA::LDAModel) {
    my $stopwords = "stopwords/english".IO.lines.Set;
    my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
    my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
    my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
    my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
    $model
}

sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
    my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
    my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
    .grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/}).map(*<docid>).head;

    note("@documents[$docid] is selected");
    my ($row-size, $col-size) = $model.document-topic-matrix.shape;
    my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
    my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
    $gnu.command("set boxwidth 0.5 relative");
    my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
    $gnu.legend(:off);
    $gnu.xlabel(:label("Topic"));
    $gnu.ylabel(:label("P(z|theta,d)"));
    $gnu.xtics(:tics(@tics));
    $gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
    $gnu.dispose;
}

sub write-nbest($model) {
  my $topics := $model.nbest-words-per-topic(10);
  for ^(10/5) -> $part-i {
    say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
    say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
    for ^10 -> $rank {
        say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
            take @($topics)[$topic;$rank].key;
        }.join("|") ~ "|";
    }
    "".say;
  }
}

sub save-model($model) {
  my @document-topic-matrix := $model.document-topic-matrix;
  my ($document-size, $topic-size) = @document-topic-matrix.shape;
  my $doctopicfh = open "document-topic.txt", :w;

  $doctopicfh.say: ($document-size, $topic-size).join(" ");
  for ^$document-size -> $doc-i {
    $doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
  }
  $doctopicfh.close;

  my @topic-word-matrix := $model.topic-word-matrix;
  my ($, $word-size) = @topic-word-matrix.shape;
  my $topicwordfh = open "topic-word.txt", :w;

  $topicwordfh.say: ($topic-size, $word-size).join(" ");
  for ^$topic-size -> $topic-i {
    $topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
  }
  $topicwordfh.close;

  my @vocabulary := $model.vocabulary;
  my $vocabfh = open "vocabulary.txt", :w;

  $vocabfh.say($_) for @vocabulary;
  $vocabfh.close;
}

my @documents = "documents.txt".IO.lines;
my $model = create-model(@documents);
my @members = "members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);
write-nbest($model);
save-model($model);

First, we load the cleaned documents and call create-model:

my @documents = "documents.txt".IO.lines;
my $model = create-model(@documents);

create-model creates a LDA model by loading given documents:

sub create-model(@documents --> Algorithm::LDA::LDAModel) {
    my $stopwords = "stopwords/english".IO.lines.Set;
    my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
    my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
    my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
    my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
    $model
}

where $stopwords is a set of English stopwords from NLTK (I mentioned preliminary section), and &tokenizer is a custom tokenizer for Algorithm::LDA::Formatter.from-plain. The tokenizer converts given sentence as follows:

    1. Splits given sentence by whitespace and makes a list of tokens.
    1. Replaces each characters of the token with lower-case characters.
    1. Deletes token that exists in the stopwords list or one-length token categorized as Symbol or Punctuation.

Algorithm::LDA::Formatter.from-plain creates numerical native documents (i.e., each word in a document is mapped to its corresponding vocabulary id, and this id is represented by C int32) and vocabulary from a list of texts.

After creating an Algorithm::LDA instance using the above numerical documents, we can start LDA estimation by Algorithm::LDA.fit. In this example, we set the number of topics to 10, and the number of iterations to 100, and the seed for srand to 2018.

Next, we plot a document-topic distribution. Before this plotting, we load the saved members:

my @members = "members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);

plot-topic-distribution plots topic distribution with Chart::Gnuplot:

sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
    my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
    my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
    .grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/}).map(*<docid>).head;

    note("@documents[$docid] is selected");
    my ($row-size, $col-size) = $model.document-topic-matrix.shape;
    my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
    my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
    $gnu.command("set boxwidth 0.5 relative");
    my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
    $gnu.legend(:off);
    $gnu.xlabel(:label("Topic"));
    $gnu.ylabel(:label("P(z|theta,d)"));
    $gnu.xtics(:tics(@tics));
    $gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
    $gnu.dispose;
}

In this example, we plot topic distribution of a Larry Wall’s quotation (“Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.”):
"Although the Perl Slogan is There's More Than One Way to Do It, I hesitate to make 10 ways to do something."

After the plotting, we call write-nbest:

write-nbest($model);

In LDA, what topic XXX represents is expressed as a list of words. write-nbest writes a markdown style topic-word distribution table:

sub write-nbest($model) {
  my $topics := $model.nbest-words-per-topic(10);
  for ^(10/5) -> $part-i {
    say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
    say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
    for ^10 -> $rank {
        say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
            take @($topics)[$topic;$rank].key;
        }.join("|") ~ "|";
    }
    "".say;
  }
}

The result is:

topic 0 topic 1 topic 2 topic 3 topic 4
would scotland black could one
it’s country mr. first work
believe one lot law new
one political play college human
took world official basic process
much need new speak business
don’t must reacher language becomes
ever national five every good
far many car matter world
fighting us road right knowledge
topic 5 topic 6 topic 7 topic 8 topic 9
apple united people like */
likely war would one die
company states i’m something und
jobs years know think quantum
even would think way play
steve american want things noble
life president get perl home
like human going long dog
end must say always student
small us go really ist

As you can see, the quotation of “Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.” contains “one”, “way” and “perl”. This is the reason why this quotation is mainly composed of topic 8.

For the next section, we save the model by save-model subroutine:

sub save-model($model) {
  my @document-topic-matrix := $model.document-topic-matrix;
  my ($document-size, $topic-size) = @document-topic-matrix.shape;
  my $doctopicfh = open "document-topic.txt", :w;

  $doctopicfh.say: ($document-size, $topic-size).join(" ");
  for ^$document-size -> $doc-i {
    $doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
  }
  $doctopicfh.close;

  my @topic-word-matrix := $model.topic-word-matrix;
  my ($, $word-size) = @topic-word-matrix.shape;
  my $topicwordfh = open "topic-word.txt", :w;

  $topicwordfh.say: ($topic-size, $word-size).join(" ");
  for ^$topic-size -> $topic-i {
    $topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
  }
  $topicwordfh.close;

  my @vocabulary := $model.vocabulary;
  my $vocabfh = open "vocabulary.txt", :w;

  $vocabfh.say($_) for @vocabulary;
  $vocabfh.close;
}

Exercise 3: Create Quotation Search Engine

In this section, we create a quotation search engine which uses the model created in the previous section.
More specifically, we create LDA-based document model (Xing Wei and W. Bruce Croft 2006) and make a CLI tool that can search quotations. (Note that the words “token” and “word” are interchangable in this section)

The whole source code is:

use v6.c;

sub MAIN(Str :$query!) {
    my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
    my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
    my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
    my ($, $word-size) = topic-word-iter.pull-one.words;

    my Num @document-topic[$document-size;$topic-size];
    my Num @topic-word[$topic-size;$word-size];

    for ^$document-size -> $doc-i {
        my \maybe-line := doc-topic-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @document-topic[$doc-i;$_] = @line[$_];
        }
    }

    for ^$topic-size -> $topic-i {
        my \maybe-line := topic-word-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @topic-word[$topic-i;$_] = @line[$_];
        }
    }

    my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
    my @members = "members.txt".IO.lines;
    my @documents = "documents.txt".IO.lines;
    my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
    my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;
    my @query = $query.words.map(*.lc);

    my @sorted-list = gather for ^$document-size -> $doc-i {
        my Num $log-prob = gather for @query -> $token {
            my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
            my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
            take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
        }.sum;
        take %(doc-i => $doc-i, log-prob => $log-prob);
    }.sort({ $^b<log-prob> <=> $^a<log-prob> });

    for ^10 {
        my $docid = @sorted-list[$_]<doc-i>;
        sprintf("\"%s\" by %s %f", @docbodies[$docid], @members[%doc-to-person{$docid}], @sorted-list[$_]<log-prob>).say;
    }

}

sub Pml(@docbodies, $doc-i, $token --> Num) {
    my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i^ $token $/ }).elems;
    my Int $total-tokens = @docbodies[$doc-i].words.elems;
    return -100e0 if $total-tokens == 0 or $num-tokens == 0;
    log($num-tokens) - log($total-tokens);
}

sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
    gather for ^$topic-size -> $topic {
        if %vocabulary{$token}:exists {
            take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
        } else {
            take -100e0;
        }
    }.reduce(&log-sum);
}

sub log-sum(Num $log-a, Num $log-b --> Num) {
    if $log-a < $log-b {
        return $log-b + log(1 + exp($log-a - $log-b))
    } else {
        return $log-a + log(1 + exp($log-b - $log-a))
    }
}

At the beginning, we load the saved model and prepare @document-topic, @topic-word, %vocabulary, @documents, @docbodies, %doc-to-person and @members:

    my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
    my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
    my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
    my ($, $word-size) = topic-word-iter.pull-one.words;

    my Num @document-topic[$document-size;$topic-size];
    my Num @topic-word[$topic-size;$word-size];

    for ^$document-size -> $doc-i {
        my \maybe-line = doc-topic-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @document-topic[$doc-i;$_] = @line[$_];
        }
    }

    for ^$topic-size -> $topic-i {
        my \maybe-line = topic-word-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @topic-word[$topic-i;$_] = @line[$_];
        }
    }

    my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
    my @members = "members.txt".IO.lines;
    my @documents = "documents.txt".IO.lines;
    my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
    my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;

Next, we set @query using option :$query:

my @query = $query.words.map(*.lc);

After that, we compute the probability of P(query|document) based on Eq. 9 of the aforementioned paper (Note that we use logarithm to avoid undeflow and set the parameter mu to zero) and sort them.

    my @sorted-list = gather for ^$document-size -> $doc-i {
        my Num $log-prob = gather for @query -> $token {
            my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
            my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
            take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
        }.sum;
        take %(doc-i => $doc-i, log-prob => $log-prob);
    }.sort({ $^b<log-prob> <=> $^a<log-prob> });

Plda adds logarithmic topic given document probability (i.e., lnP(topic|theta,document)) and word given topic probability (i.e., lnP(word|phi,topic)) for each topic, and sums them by .reduce(&log-sum);:

sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
    gather for ^$topic-size -> $topic {
        if %vocabulary{$token}:exists {
            take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
        } else {
            take -100e0;
        }
    }.reduce(&log-sum);
}

and Pml (ml means Maximum Likelihood) counts $token and normalizes it by the number of the total tokens in the document (Note that this computation is also conducted in log space):

sub Pml(@docbodies, $doc-i, $token --> Num) {
    my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i^ $token $/ }).elems;
    my Int $total-tokens = @docbodies[$doc-i].words.elems;
    return -100e0 if $total-tokens == 0 or $num-tokens == 0;
    log($num-tokens) - log($total-tokens);
}

OK, then let’s execute!

query “perl”:

$ perl6 search-quotation.p6 --query="perl"
"Perl will always provide the null." by Larry Wall -3.301156
"Perl programming is an *empirical* science!" by Larry Wall -3.345189
"The whole intent of Perl 5's module system was to encourage the growth of Perl culture rather than the Perl core." by Larry Wall -3.490238
"I dunno, I dream in Perl sometimes..." by Larry Wall -3.491790
"At many levels, Perl is a 'diagonal' language." by Larry Wall -3.575779
"Almost nothing in Perl serves a single purpose." by Larry Wall -3.589218
"Perl has a long tradition of working around compilers." by Larry Wall -3.674111
"As for whether Perl 6 will replace Perl 5, yeah, probably, in about 40 years or so." by Larry Wall -3.684454
"Well, I think Perl should run faster than C." by Larry Wall -3.771155
"It's certainly easy to calculate the average attendance for Perl conferences." by Larry Wall -3.864075

query “apple”:

$ perl6 search-quotation.p6 --query="apple"
"Steve Jobs is the"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.841538
"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." by Carly Fiorina -3.904489
"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -3.940359
"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.152145
"[W]hen Apple wanted the name "iPhone" and it was owned by Cisco, Steve Jobs just took it, and his legal team executed so he could keep it. It turned out that doing this was surprisingly inexpensive. And, as the Apple Watch showcased, the Apple Phone likely would not have sold anywhere near as well as the iPhone." by Rob Enderle -4.187223
"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.318575
"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.380863
"Apple no longer owns the tablet market, and will likely lose dominance this year or next. ... this level of sustained dominance doesn't appear to recur with the same vendor even if it launched the category." by Rob Enderle -4.397954
"Apple is becoming more and more like a typical tech firm — that is, long on technology and short on magic. ... Apple is drifting closer and closer to where it was back in the 1990s. It offers advancements that largely follow those made by others years earlier, product proliferation, a preference for more over simple elegance, and waning excitement." by Rob Enderle -4.448473
"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.469852 Ronald McDonald of Apple, he is the face." by Rob Enderle -3.822949
"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.849055
"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." by Carly Fiorina -3.895163
"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -4.052616
"*** The previous line contains the naughty word '$&'.\n if /(ibm|apple|awk)/; # :-)" by Larry Wall -4.088445
"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.169533
"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.197869
"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.204618
"Today's tech companies aren't built to last, as Apple's recent earnings report shows all too well." by Rob Enderle -4.209901
"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.238582

Conclusions

In this article, we explored Wikiquote and created a LDA model using Algoritm::LDA.
After that we built an information retrieval application.

Thanks for reading my article! See you next time!

Citations

  • Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.
  • Wei, Xing, and W. Bruce Croft. “LDA-based document models for ad-hoc retrieval.” Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006.

License

Day 23 – Blin, it’s Christmas soon!

I’ve already mentioned Bisectable in one of the advent posts two years ago, but since then a lot has changed, so I think it’s time to give a brief history of the bisectable bot and its friends.

First of all, let’s define the problem that is being solved. Sometimes it happens that a commit introduces an unintended change in behavior (a bug). Usually we call that a regression, and in some cases the easiest way to figure out what went wrong and fix it is to first find which commit introduced the regression.

There are exactly 9000 commits between Rakudo 2015.12 and 2018.12, and even though it’s not over 9000, that’s still a lot.

Continue reading “Day 23 – Blin, it’s Christmas soon!”

Day 22 – Testing Cro HTTP APIs

A good amount of my work time this year has been spent on building a couple of Perl 6 applications. After a decade of contributing to Perl 6 compiler and runtime development, it feels great to finally be using it to deliver production solutions solving real-world problems. I’m still not sure whether writing code in an IDE I founded, using a HTTP library I designed, compiled by a compiler I implemented large parts of, and running on a VM that I play architect for, makes me one of the world’s worst cases of “Not Invented Here”, or just really Full Stack.

Whatever I’m working on, I highly value automated testing. Each passing test is something I know works – and something that I won’t break as I evolve the software in question. Even with automated tests, bugs happen, but adding a test to cover the bug at least means I’ll make different bugs in the future, which is perhaps a bit more forgivable.
Continue reading “Day 22 – Testing Cro HTTP APIs”

Day 21 – A Red Secret Santa

The year is ending and we have a lot to celebrate! What is a better way to celebrate the end of the year than with our family and friends? To help achieve that, here at my home, we decided to run a Secret Santa Game! So, my goal is to write a Secret Santa Program! That’s something where I can use this wonderful project called Red.

Red is an ORM (Object Relational Model) for perl6 still under development and not published as a module yet. But it’s growing and it is close to a release.

So let’s create our first table: a table that will store the people participating in our Secret Santa. To the code:

Continue reading “Day 21 – A Red Secret Santa”

Day 20 – Propositional combinatorics in Perl 6

Advent is an exciting time, a time of anticipation. And not only for us humans — it is the time when elves become most inventive. Today, I want to take some leisure time out of the Christmas stress to report about some pioneering work that is being done in the area of gift wrapping. Even if you didn’t anticipate any news from there, this report might still help you improve your technique, as — I don’t have to remind you — Christmas is approaching fast.

Do you know which presents small children like most? Large presents. Therefore, the Present Enlargement Research Lab at Northpole is tasked with finding practical ways to make presents larger. Now, “large” can mean multiple things. I will admit that the 6th unit is bending the meaning a bit, but their work is by far the most interesting: they increase the volume of presents, by increasing the dimension of the gift boxes.

Continue reading “Day 20 – Propositional combinatorics in Perl 6”

Day 19 – Interactive Desktop Apps

I am a big fan of roleplaying games like Dungeons and Dragons. Most of these games have screens to help you hide what you’re doing when running the game and give you some of the charts used in the game to reduce looking stuff up in the books.

My game collection is extensive though and I’d much rather use my laptop to not only hide behind and track information but I could also automate dice rolls and chart usage. Whilst I could cobble some stuff together with text editors and command line magic I’d much rather have some snazzy Desktops Apps that I can show off to people.

Enter GTK::Simple, a wrapper around the gtk3 UI library used by the Linux Gnome desktop but also available on Windows and Mac. This library gives you a simple to use interface, via the power of NativeCall, to let you create simple desktop applications.

Continue reading “Day 19 – Interactive Desktop Apps”

Day 18 – An SVG Christmas Tree

Christmas trees are a traditional symbol that date back more than four hundred years in Europe, so what could be better for an advent article than something about creating Christmas tree images.

The typical, simplified, representation of the tree is several triangles of decreasing size stacked on top of each other with a small over-lap, so are fairly easy to create with a computer program.

Here I’ll use Scalable Vector Graphics (SVG) to draw the image as, given the description above, it seems perfectly suited to the task.

Continue reading “Day 18 – An SVG Christmas Tree”

Day 17 – Compiling our way to happiness

Our mission, should we choose to accept it, is to solve the SEND + MORE = MONEY problem in code. No, hold on, let me put it like this instead:

    S E N D
+   M O R E
-----------
  M O N E Y

It means the same, but putting it up like this is more visually evocative, especially since many of us did it this way in school.

The ground rules are simple.

  • Each letter represents a digits between 0 and 9.
  • The letters represent distinct digits; two letters may not share the same digit.
  • Leading digits (in our puzzle, S and M) can’t be zero. Then they wouldn’t be leading digits!

Given these constraints, there’s a unique solution to the puzzle above.

I encourage you to find the solution. Write a bit of code, live a little! In this post, we’ll do that, but then (crucially) not be satisfied with that, and end up in a nested-doll situation where code writes code until something really neat emerges. The conclusion will spell out the ultimate vision — hold on, I’m being informed in real-time by the Plurality Committee that the correct term is “an ultimate vision” — for Perl 6.

Let’s do this.

Marcus Junius Brute Force (The Younger)

Our first language of the day, with its corresponding solution, is Perl 6 itself. There’s no finesse here; we just charge right through the solution space like an enraged bull, trying everything. In fact, we make sure not to attempt any cleverness with this one, just try to express the solution as straightforwardly as possible.

for 0..9 -> int $d {
    for 0..9 -> int $e {
        next if $e == $d;

        my int $y = ($d + $e) % 10;
        my int $_c1 = ($d + $e) div 10;

        for 0..9 -> int $n {
            next if $n == $d;
            next if $n == $e;
            next if $n == $y;

            for 0..9 -> int $r {
                next if $r == $d;
                next if $r == $e;
                next if $r == $y;
                next if $r == $n;

                next unless ($_c1 + $n + $r) % 10 == $e;
                my int $_c2 = ($_c1 + $n + $r) div 10;

                for 0..9 -> int $o {
                    next if $o == $d;
                    next if $o == $e;
                    next if $o == $y;
                    next if $o == $n;
                    next if $o == $r;

                    next unless ($_c2 + $e + $o) % 10 == $n;
                    my int $_c3 = ($_c2 + $e + $o) div 10;

                    for 1..9 -> int $s {
                        next if $s == $d;
                        next if $s == $e;
                        next if $s == $y;
                        next if $s == $n;
                        next if $s == $r;
                        next if $s == $o;

                        for 1..9 -> int $m {
                            next if $m == $d;
                            next if $m == $e;
                            next if $m == $y;
                            next if $m == $n;
                            next if $m == $r;
                            next if $m == $o;
                            next if $m == $s;

                            next unless ($_c3 + $s + $m) % 10 == $o;
                            my int $_c4 = ($_c3 + $s + $m) div 10;

                            next unless $_c4 % 10 == $m;

                            say "$s$e$n$d + $m$o$r$e == $m$o$n$e$y";
                        }
                    }
                }
            }
        }
    }
}

Again, it’s not pretty, but it works. This is the kind of indentation level your mother warned you about. If you ask me, though, I’m more annoyed about the indentation being there at all. We have one for every variable whose search space we need to scan through. (Only with Y do we get to take a shortcut.)

Though it’s a detour for today’s buffet, MJD once blogged about this and then I blogged about it too. Those blog posts were very much about “removing the indentation”, in a sense. Today’s post is where my thinking has taken me, three years later.

I took the path less traveled (and all the other paths, too)

Our second language is still mostly Perl 6, but with a neat hypothetical extension called amb, but spelled (evocatively) <-. It gets rid of all the explicit for loops and levels of indentation.

my $d <- 0..9;
my $e <- 0..9;
guard $e != any($d);
my $y = ($d + $e) % 10;
my $_c1 = ($d + $e) div 10;

my $n <- 0..9;
guard $n != any($d, $e, $y);
my $r <- 0..9;
guard $r != any($d, $e, $y, $n);
guard ($_c1 + $n + $r) % 10 == $e;
my $_c2 = ($_c1 + $n + $r) div 10;

my $o <- 0..9;
guard $o != any($d, $e, $y, $n, $r);
guard ($_c2 + $e + $o) % 10 == $n;
my $_c3 = ($_c2 + $e + $o) div 10;

my $s <- 1..9;
guard $s != any($d, $e, $y, $n, $r, $o);
my $m <- 1..9;
guard $m != any($d, $e, $y, $n, $r, $o, $s);
guard ($_c3 + $s + $m) % 10 == $o;
my $_c4 = ($_c3 + $s + $m) div 10;

guard $_c4 % 10 == $m;

say "$s$e$n$d + $m$o$r$e == $m$o$n$e$y";

This solution is shorter, more compact, and feels less “noisy” and aggravating just by ridding us of the for loops. (I suspect this has something to do with that imperative↔declarative spectrum people mention sometimes. We’re not so interested in looping as such, only seeing it get done.)

I know it won’t completely make up for the fact that Perl 6 doesn’t have the amb operator and guard implemented in core (or even in module space), but here’s a short script that will convert the above program to today’s first version:

my $indent = 0;
constant SPACE = chr(0x20);
sub indent { SPACE x 4 * $indent }

for lines() {
    when /^ my \h+ ('$' \w) \h* '<-' \h* (\d+ \h* '..' \h* \d+) ';' $/ {
        say indent, "for $1 -> int $0 \{";
        $indent++;
    }

    when /^ guard \h+ ('$' \w) \h* '!=' \h* 'any(' ('$' \w)+ % [\h* ',' \h*] ')' \h* ';' $/ {
        say indent, "next if $0 == $_;"
            for $1;
        say "";
    }

    when /^ guard \h+ ([<!before '=='> .]+ '==' <-[;]>+) ';' $/ {
        say indent, "next unless $0;";
    }

    when /^ my \h+ ('$' \w+) \h* '=' \h* (<-[;]>+) ';' $/ {
        say indent, "my int $0 = $1;";
    }

    when /^ \h* $/ {
        say "";
    }

    when /^ say \h+ (<-[;]>+) ';' $/ {
        say indent, $_;
    }

    default {
        die "Couldn't match $_";
    }
}

while $indent-- {
    say indent, "\}";
}

But we’ll not be satisfied here either. Oh no.

Thinking in equations

The third language takes us even further into the declarative, getting rid of all the guard clauses that simply state that the variables should be distinct.

ALL_DISTINCT

$d in 0..9
$e in 0..9
$n in 0..9
$r in 0..9
$o in 0..9
$s in 1..9
$m in 1..9

$y = ($d + $e) % 10
$_c1 = ($d + $e) div 10

($_c1 + $n + $r) % 10 == $e
$_c2 = ($_c1 + $n + $r) div 10

($_c2 + $e + $o) % 10 == $n
$_c3 = ($_c2 + $e + $o) div 10

($_c3 + $s + $m) % 10 == $o
$_c4 = ($_c3 + $s + $m) div 10

$_c4 % 10 == $m

We’re completely in the domain of constraint programming now, and it would be disingenuous not to mention this. We’ve left the imperative aspects of Perl 6 behind, and we’re focusing solely on describing the constraints of the problem we’re solving.

The most imperative aspect of the above program is when we do an assignment. Even this is mostly an optimization, in the cases when we know we can compute the value of a variable directly instead of searching for it.

Even in this case, we could translate back to the previous solution. I’ll leave out such a translator for now, though.

I’m going to come back to this language in the conclusion, because it turns out in many ways, it’s the most interesting one.

The fourth language

Having gotten this far, what more imperative complexity can we peel off? Specifically, where do those equations come from that are specified in the previous solution? How can we express them more succinctly?

You’ll like this, I think. The fourth language just expresses the search like this:

    S E N D
+   M O R E
-----------
  M O N E Y

Hang on, what again? Yes, you read that right. The most declarative solution to this problem is just an ASCII layout of the problem specification itself! Don’t you just love it when the problem space and the solution space meet up like that?

From this layout, we can again translate back to the constraint programming solution, weaving equations out of the manual algorithm for addition that we learn in school.

So, not only don’t we have to write those aggravating for loops; if we’re tenacious enough, we can have code generation all the way from the problem to the solution. We just need to find the appropriate languages to land on in-between.

Conclusion

My exploration with 007 has led me to think about things like the above: translating programs. Perl 6 already exposes one part of the compilation process very well: parsing. We can use grammars both in userland and within the Perl 6 toolchain itself.

I’ve come to believe we need to do that to all aspects of the compilation pipeline. Here, let me put it as a slogan or a declaration of sorts:

Perl 6 will have reached its full potential when all features we bring to bear manipulating text/data can also be turned inwards, to the compilation process itself.

Those translators I wrote (or imagined) between my different languages, they work in a pinch but they’re also fragile and a bit of a waste. The problem is to a large extent that we drop down to text all the time. We should be doing this at the AST level, where all the structure is readily available.

The gains from such a mind shift cannot be overstated. This is where we will find Lispy enlightenment in Perl 6.

For example, the third language with the equations doesn’t have to be blindly translated into code. It can be optimized, the equations massaged into narrower and more precise ones. As can be seen on Wikipedia, it’s possible to do such a good job of optimizing that there’s no searching left once the program runs.

My dream: to be able to do the above transformations, not between text files but between slangs within Perl 6. And to be able to do the optimization step as well. All without leaving the comfort of the language.

Day 15 – Building a (little) Spacecraft with Perl 6

Showing off long ears

In the previous post, we have encountered some kind of special elves’ magic:

enum Fuel <Solid Liquid Gas>;

class SpeedChoice does ASNChoice {
    method ASN-choice { { mph => (1 => Int), kmph => (0 => Int) } }
}

class Rocket does ASNSequence {
    has Str $.name is UTF8String;
    has Str $.message is default-value("Hello World") is UTF8String;
    has Fuel $.fuel;
    has SpeedChoice $.speed is optional;
    has Str @.payload is UTF8String;

    method ASN-order { <$!name $!message $!fuel $!speed @!payload> }
}

my $rocket = Rocket.new(
    name => 'Falcon',
    fuel => Solid,
    speed => SpeedChoice.new((mph => 18000)),
    payload => [ "Car", "GPS" ]);

my $rocket-bytes = ASN::Serializer.serialize($rocket, :mode(Implicit));

#`[ Result:
      Blob.new(
          0x30, 0x1B, # Outermost SEQUENCE
          0x0C, 0x06, 0x46, 0x61, 0x6C, 0x63, 0x6F, 0x6E, # NAME, MESSAGE is missing
          0x0A, 0x01, 0x00, # ENUMERATED
          0x81, 0x02, 0x46, 0x50, # CHOICE
          0x30, 0x0A, # SEQUENCE OF UTF8String
              0x0C, 0x03, 0x43, 0x61, 0x72,  # UTF8String
              0x0C, 0x03, 0x47, 0x50, 0x53); # UTF8String
]

say ASN::Parser.new(:type(Rocket)).parse($rocket-bytes) eqv $rocket; # Certainly true!

However, as an elf I know once quoted, ‘It’s hard to tell the difference between mastered technique and magic’. So the mystery can be resolved? Let’s glance over the way it all works.

Continue reading “Day 15 – Building a (little) Spacecraft with Perl 6”