Day 24 – Topic Modeling with Perl 6

Hi, everyone.

Today, let me introduce Algorithm::LDA.
This module is a Latent Dirichlet Allocation (i.e., LDA) implementation for topic modeling.

Introduction

What’s LDA? LDA is a popular unsupervised machine-learning method.
It models the document generation process and represents each document as a mixture of topics.

So, what does “a mixture of topics” mean? Fig. 1 shows an article in which some of the words are highlighted in three colors: yellow, pink, and blue. Words about genetics are marked in yellow; words about evolutionary biology are marked in pink; words about data analysis are marked in blue. If we imagine that all of the words in this article are colored, then we can represent this article as a mixture of topics (i.e., colors).

Fig. 1: an article with words highlighted by topic colors
(This image is from “Probabilistic topic models.” (David Blei 2012))

OK, then I’ll demonstrate how to use Algorithm::LDA in the next section.

Modeling Quotations

In this article, we explore Wikiquote. Wikiquote is a crowd-sourced platform providing sourced quotations.
Using the Wikiquote API, we get quotations to use for LDA estimation. After that, we run LDA and plot the result.
Finally, we create an information retrieval application using the resulting model.

Preliminary

Wikiquote API

Wikiquote has an action API that provides a means of getting Wikiquote resources.
For example, you can get the content of the Main Page as follows:

$ curl "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content&format=json"

The result of the above command is:

{"batchcomplete":"","warnings":{"main":{"*":"Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application."},"revisions":{"*":"Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."}},"query":{"pages":{"1":{"pageid":1,"ns":0,"title":"Main Page","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"
\n
{{Main page header}}
\n
{{Main Page Quote of the day}}
\n</div>\n\n
\n{{Main Page Selected pages}}\n{{Main categories}}\n
\n\n
\n{{New pages}}\n{{Main Page Community}}\n
\n\n
\n==Wikiquote's sister projects==\n{{otherwiki}}\n\n==Wikiquote languages==\n{{Wikiquotelang}}\n
\n__NOTOC__ __NOEDITSECTION__\n{{noexternallanglinks:ang|simple}}\n[[Category:Main page]]"}]}}}}

WWW

WWW by Zoffix Znet is a library that provides an easy-to-use API for fetching and parsing JSON.
For instance, as the README says, you can get content in jget(URL)<HASHKEY> style:

say jget('https://httpbin.org/get?foo=42&bar=x')<args><foo>;

To install WWW:

$ zef install WWW

Chart::Gnuplot

Chart::Gnuplot by titsuki is a binding for gnuplot.

To install Chart::Gnuplot:

$ zef install Chart::Gnuplot

In this article, we use this module; however, if you are unfamiliar with gnuplot, there are many alternatives: SVG::Plot, Graphics::PLplot, or calling matplotlib functions via Inline::Python.

Stopwords from NLTK

NLTK is a toolkit for natural language processing.
It provides not only APIs but also corpora.
You can get stopwords for English via “70. Stopwords Corpus”: http://www.nltk.org/nltk_data/
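
For example, once the stopword file is saved locally (below I assume the path stopwords/english, the same path used later in this article), filtering is a one-liner. A minimal sketch:

# Load the NLTK English stopword list into a Set for fast membership tests
my $stopwords = "stopwords/english".IO.lines.Set;

# Keep only the non-stopword tokens
say "this is a sample sentence".words.grep({ $stopwords !(cont) $_ });
# expected output (assuming the NLTK list): (sample sentence)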

Exercise 1: Get Quotations and Create Cleaned Documents

To begin with, we have to get quotations from Wikiquote and create cleaned documents.

The main goal of this section is to create documents according to the following format:

<docid> <personid> <word> <word> <word> ...
<docid> <personid> <word> <word> <word> ...
<docid> <personid> <word> <word> <word> ...
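
For illustration, a few lines of the resulting documents.txt might look like this (hypothetical docids, personids, and quotation bodies):

0 0 this is the first quotation of person zero
1 0 this is another quotation of person zero
2 1 a quotation of person one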

The whole source code is:

use v6.c;
use WWW;
use URI::Escape;

sub get-members-from-category(Str $category --> List) {
    my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
    @(jget($member-url)<query><categorymembers>.map(*<title>));
}

sub get-pages(Str @members, Int $batch = 50 --> List) {
    my Int $start = 0;
    my @pages;
    while $start < @members {
        my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
        my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
        @pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
        $start += $batch;
    }
    @pages;
}

sub create-documents-from-pages(@pages, @members --> List) {
    my @documents;
    for @pages -> $page {
        my @quotations = $page<body>.split("\n")\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst("[", "[", :g))\
        .map(*.subst("]", "]", :g))\
        .map(*.subst("&amp;", "&", :g))\
        .map(*.subst("&nbsp;", "", :g))\
        .map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, " ", :g))\
        .grep(/^\*<-[*]>/)\
        .map(*.subst(/^\*\s+/, ""));

        # Note: The order of the array the Wikiquote API returns is not guaranteed.
        my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
        @documents.push(%(body => $_, personid => $index)) for @quotations;
    }
    @documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}

my Str @members = get-members-from-category("Category:1954_births");
my @pages = get-pages(@members);
my @documents = create-documents-from-pages(@pages, @members);

my $docfh = open "documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join(" ")) for @documents;
$docfh.close;

my $memfh = open "members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;

First, we get the members listed in the “Category:1954_births” page. I chose this category because 1954 is the year the designer of Perl 6 was born:

my Str @members = get-members-from-category("Category:1954_births");

where get-members-from-category gets members via Wikiquote API:

sub get-members-from-category(Str $category --> List) {
    my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
    @(jget($member-url)<query><categorymembers>.map(*<title>));
}
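
A quick sanity check of what comes back (the first few members match the $list shown later in this section; the exact list depends on Wikiquote at the time of the request):

my Str @members = get-members-from-category("Category:1954_births");
say @members[0..2];
# e.g. (Mumia Abu-Jamal Rene Balcer Iain Banks)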

Next, call get-pages:

my @pages = get-pages(@members);

get-pages is a subroutine that gets pages of the given titles (i.e., members):

sub get-pages(Str @members, Int $batch = 50 --> List) {
    my Int $start = 0;
    my @pages;
    while $start < @members {
        my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
        my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
        @pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
        $start += $batch;
    }
    @pages;
}

where @members[$start..^List($start + $batch, +@members).min] is a slice of at most $batch members, whose elements are percent-encoded by uri_escape and joined with %7C (i.e., the percent-encoded pipe symbol).
In this case, one of the resulting $list values is:

Mumia%20Abu-Jamal%7CRene%20Balcer%7CIain%20Banks%7CGerard%20Batten%7CChristie%20Brinkley%7CJames%20Cameron%20%28director%29%7CEugene%20Chadbourne%7CJackie%20Chan%7CChang%20Yu-hern%7CLee%20Child%7CHugo%20Ch%C3%A1vez%7CDon%20Coscarelli%7CElvis%20Costello%7CDaayiee%20Abdullah%7CThomas%20H.%20Davenport%7CGerardine%20DeSanctis%7CAl%20Di%20Meola%7CKevin%20Dockery%20%28author%29%7CJohn%20Doe%20%28musician%29%7CF.%20J.%20Duarte%7CIain%20Duncan%20Smith%7CHerm%20Edwards%7CAbdel%20Fattah%20el-Sisi%7CRob%20Enderle%7CRecep%20Tayyip%20Erdo%C4%9Fan%7CAlejandro%20Pe%C3%B1a%20Esclusa%7CHarvey%20Fierstein%7CCarly%20Fiorina%7CGary%20L.%20Francione%7CAshrita%20Furman%7CMary%20Gaitskill%7CGeorge%20Galloway%7C%C5%BDeljko%20Glasnovi%C4%87%7CGary%20Hamel%7CFran%C3%A7ois%20Hollande%7CKazuo%20Ishiguro%7CJean-Claude%20Juncker%7CAnish%20Kapoor%7CGuy%20Kawasaki%7CRobert%20Francis%20Kennedy%2C%20Jr.%7CLawrence%20M.%20Krauss%7CAnatoly%20Kudryavitsky%7CAnne%20Lamott%7CJoep%20Lange%7CAng%20Lee%7CLi%20Bin%7CRay%20Liotta%7CPeter%20Lipton%7CJames%20D.%20Macdonald%7CKen%20MacLeod
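
To see the batching idiom in isolation, here is a minimal sketch with a toy member list and $batch = 2:

use URI::Escape;

my @members = "Larry Wall", "Damian Conway", "Audrey Tang";
my Int $batch = 2;
my Int $start = 0;
while $start < @members {
    # a slice of at most $batch members, percent-encoded and joined by %7C
    say @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
    $start += $batch;
}
# OUTPUT:
# Larry%20Wall%7CDamian%20Conway
# Audrey%20Tang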

Note that the get-pages subroutine uses the hash contextualizer %() for creating a sequence of hashes:

@pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
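
The %( ) contextualizer builds a Hash from the given pairs, so the map above yields one hash per page. A tiny illustration:

my @pages = (1, 2).map({ %(body => "content $_", title => "page $_") });
say @pages[0]<title>; # OUTPUT: page 1
say @pages[1]<body>;  # OUTPUT: content 2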

After that, we call create-documents-from-pages:

my @documents = create-documents-from-pages(@pages, @members);

create-documents-from-pages creates documents from each page:

sub create-documents-from-pages(@pages, @members --> List) {
    my @documents;
    for @pages -> $page {
        my @quotations = $page<body>.split("\n")\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
        .map(*.subst("[", "[", :g))\
        .map(*.subst("]", "]", :g))\
        .map(*.subst("&amp;", "&", :g))\
        .map(*.subst("&nbsp;", "", :g))\
        .map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, " ", :g))\
        .grep(/^\*<-[*]>/)\
        .map(*.subst(/^\*\s+/, ""));

        # Note: The order of the array the Wikiquote API returns is not guaranteed.
        my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
        @documents.push(%(body => $_, personid => $index)) for @quotations;
    }
    @documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}

where .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) and .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) are substitutions that strip the [[...]] link markup from anchor texts, keeping $<text> and discarding the link portion. For example, [[Perl]] is reduced to Perl. For more syntax info, see: https://docs.perl6.org/language/regexes#Named_captures or https://docs.perl6.org/routine/subst
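
A minimal demonstration of these two substitutions on a toy wikitext line:

my $line = "* [[Perl]] was created by [[Larry Wall|Larry]].";
say $line.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)\
         .subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g);
# OUTPUT: * Perl was created by Larry Wall.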

After some cleaning operations (e.g., .map(*.subst("&#91;", "[", :g))), we extract the quotation lines.
.grep(/^\*<-[*]>/) finds lines starting with a single asterisk, because most of the quotations appear in such lines.

Next, .map(*.subst(/^\*\s+/, "")) deletes the leading asterisk and whitespace, since they aren’t part of the quotation itself.
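
Putting these two steps together, here is a minimal sketch of how quotation lines are selected and cleaned:

my @lines = "* This is a quotation.", "** This is a sub-bullet.", "== A section header ==";
say @lines.grep(/^\*<-[*]>/).map(*.subst(/^\*\s+/, ""));
# OUTPUT: (This is a quotation.)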

Finally, we save the documents and members (i.e., titles):

my $docfh = open "documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join(" ")) for @documents;
$docfh.close;

my $memfh = open "members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;

Exercise 2: Execute LDA and Visualize the Result

In the previous section, we saved the cleaned documents.
In this section, we use the documents for LDA estimation and visualize the result.
The goal of this section is to plot a document-topic distribution and write a topic-word table.

The whole source code is:

use v6.c;
use Algorithm::LDA;
use Algorithm::LDA::Formatter;
use Algorithm::LDA::LDAModel;
use Chart::Gnuplot;
use Chart::Gnuplot::Subset;

sub create-model(@documents --> Algorithm::LDA::LDAModel) {
    my $stopwords = "stopwords/english".IO.lines.Set;
    my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
    my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
    my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
    my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
    $model
}

sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
    my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
    my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
    .grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/}).map(*<docid>).head;

    note("@documents[$docid] is selected");
    my ($row-size, $col-size) = $model.document-topic-matrix.shape;
    my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
    my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
    $gnu.command("set boxwidth 0.5 relative");
    my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
    $gnu.legend(:off);
    $gnu.xlabel(:label("Topic"));
    $gnu.ylabel(:label("P(z|theta,d)"));
    $gnu.xtics(:tics(@tics));
    $gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
    $gnu.dispose;
}

sub write-nbest($model) {
  my $topics := $model.nbest-words-per-topic(10);
  for ^(10/5) -> $part-i {
    say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
    say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
    for ^10 -> $rank {
        say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
            take @($topics)[$topic;$rank].key;
        }.join("|") ~ "|";
    }
    "".say;
  }
}

sub save-model($model) {
  my @document-topic-matrix := $model.document-topic-matrix;
  my ($document-size, $topic-size) = @document-topic-matrix.shape;
  my $doctopicfh = open "document-topic.txt", :w;

  $doctopicfh.say: ($document-size, $topic-size).join(" ");
  for ^$document-size -> $doc-i {
    $doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
  }
  $doctopicfh.close;

  my @topic-word-matrix := $model.topic-word-matrix;
  my ($, $word-size) = @topic-word-matrix.shape;
  my $topicwordfh = open "topic-word.txt", :w;

  $topicwordfh.say: ($topic-size, $word-size).join(" ");
  for ^$topic-size -> $topic-i {
    $topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
  }
  $topicwordfh.close;

  my @vocabulary := $model.vocabulary;
  my $vocabfh = open "vocabulary.txt", :w;

  $vocabfh.say($_) for @vocabulary;
  $vocabfh.close;
}

my @documents = "documents.txt".IO.lines;
my $model = create-model(@documents);
my @members = "members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);
write-nbest($model);
save-model($model);

First, we load the cleaned documents and call create-model:

my @documents = "documents.txt".IO.lines;
my $model = create-model(@documents);

create-model creates a LDA model by loading given documents:

sub create-model(@documents --> Algorithm::LDA::LDAModel) {
    my $stopwords = "stopwords/english".IO.lines.Set;
    my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
    my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
    my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
    my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
    $model
}

where $stopwords is a set of English stopwords from NLTK (mentioned in the Preliminary section), and &tokenizer is a custom tokenizer for Algorithm::LDA::Formatter.from-plain. The tokenizer converts a given sentence as follows (see the sketch after this list):

    1. Splits the given sentence on whitespace and makes a list of tokens.
    2. Replaces the characters of each token with lower-case characters.
    3. Deletes any token that exists in the stopwords list or consists solely of Symbol or Punctuation characters.
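
A minimal sketch of these three steps, reusing the same stopwords/english file (the exact output depends on the NLTK list):

my $stopwords = "stopwords/english".IO.lines.Set;
my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
say tokenizer("There is More Than One Way To Do It !");
# e.g. (one way): "there", "is", "more", etc. are stopwords, and "!" is pure punctuation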

Algorithm::LDA::Formatter.from-plain creates numerical documents in a native format (i.e., each word in a document is mapped to its corresponding vocabulary id, and this id is represented by a C int32) and a vocabulary from a list of texts.

After creating an Algorithm::LDA instance using the above numerical documents, we can start LDA estimation with Algorithm::LDA.fit. In this example, we set the number of topics to 10, the number of iterations to 500, and the seed for srand to 2018.

Next, we plot a document-topic distribution. Before this plotting, we load the saved members:

my @members = "members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);

plot-topic-distribution plots topic distribution with Chart::Gnuplot:

sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
    my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
    my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
    .grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/}).map(*<docid>).head;

    note("@documents[$docid] is selected");
    my ($row-size, $col-size) = $model.document-topic-matrix.shape;
    my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
    my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
    $gnu.command("set boxwidth 0.5 relative");
    my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
    $gnu.legend(:off);
    $gnu.xlabel(:label("Topic"));
    $gnu.ylabel(:label("P(z|theta,d)"));
    $gnu.xtics(:tics(@tics));
    $gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
    $gnu.dispose;
}

In this example, we plot the topic distribution of one of Larry Wall’s quotations (“Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.”). The resulting chart is saved as topics.png.

After the plotting, we call write-nbest:

write-nbest($model);

In LDA, what each topic represents is expressed as a list of words. write-nbest writes a markdown-style topic-word table:

sub write-nbest($model) {
  my $topics := $model.nbest-words-per-topic(10);
  for ^(10/5) -> $part-i {
    say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
    say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
    for ^10 -> $rank {
        say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
            take @($topics)[$topic;$rank].key;
        }.join("|") ~ "|";
    }
    "".say;
  }
}

The result is:

|topic 0|topic 1|topic 2|topic 3|topic 4|
|----|----|----|----|----|
|would|scotland|black|could|one|
|it’s|country|mr.|first|work|
|believe|one|lot|law|new|
|one|political|play|college|human|
|took|world|official|basic|process|
|much|need|new|speak|business|
|don’t|must|reacher|language|becomes|
|ever|national|five|every|good|
|far|many|car|matter|world|
|fighting|us|road|right|knowledge|

|topic 5|topic 6|topic 7|topic 8|topic 9|
|----|----|----|----|----|
|apple|united|people|like|*/|
|likely|war|would|one|die|
|company|states|i’m|something|und|
|jobs|years|know|think|quantum|
|even|would|think|way|play|
|steve|american|want|things|noble|
|life|president|get|perl|home|
|like|human|going|long|dog|
|end|must|say|always|student|
|small|us|go|really|ist|

As you can see, the quotation “Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.” contains “one”, “way” and “perl”. This is why this quotation is mainly composed of topic 8.

For the next section, we save the model with the save-model subroutine:

sub save-model($model) {
  my @document-topic-matrix := $model.document-topic-matrix;
  my ($document-size, $topic-size) = @document-topic-matrix.shape;
  my $doctopicfh = open "document-topic.txt", :w;

  $doctopicfh.say: ($document-size, $topic-size).join(" ");
  for ^$document-size -> $doc-i {
    $doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
  }
  $doctopicfh.close;

  my @topic-word-matrix := $model.topic-word-matrix;
  my ($, $word-size) = @topic-word-matrix.shape;
  my $topicwordfh = open "topic-word.txt", :w;

  $topicwordfh.say: ($topic-size, $word-size).join(" ");
  for ^$topic-size -> $topic-i {
    $topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
  }
  $topicwordfh.close;

  my @vocabulary := $model.vocabulary;
  my $vocabfh = open "vocabulary.txt", :w;

  $vocabfh.say($_) for @vocabulary;
  $vocabfh.close;
}
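
Each matrix file stores its dimensions on the first line, followed by one row per line. Schematically (not actual values), document-topic.txt looks like:

<document-size> <topic-size>
<log-prob of topic 0 in document 0> <log-prob of topic 1 in document 0> ...
<log-prob of topic 0 in document 1> <log-prob of topic 1 in document 1> ...

topic-word.txt follows the same layout with topics as rows and vocabulary ids as columns, and vocabulary.txt stores one word per line, so the line number serves as the vocabulary id.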

Exercise 3: Create Quotation Search Engine

In this section, we create a quotation search engine which uses the model created in the previous section.
More specifically, we create an LDA-based document model (Xing Wei and W. Bruce Croft 2006) and make a CLI tool that can search quotations. (Note that the words “token” and “word” are interchangeable in this section.)

The whole source code is:

use v6.c;

sub MAIN(Str :$query!) {
    my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
    my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
    my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
    my ($, $word-size) = topic-word-iter.pull-one.words;

    my Num @document-topic[$document-size;$topic-size];
    my Num @topic-word[$topic-size;$word-size];

    for ^$document-size -> $doc-i {
        my \maybe-line := doc-topic-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @document-topic[$doc-i;$_] = @line[$_];
        }
    }

    for ^$topic-size -> $topic-i {
        my \maybe-line := topic-word-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @topic-word[$topic-i;$_] = @line[$_];
        }
    }

    my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
    my @members = "members.txt".IO.lines;
    my @documents = "documents.txt".IO.lines;
    my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
    my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;
    my @query = $query.words.map(*.lc);

    my @sorted-list = gather for ^$document-size -> $doc-i {
        my Num $log-prob = gather for @query -> $token {
            my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
            my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
            take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
        }.sum;
        take %(doc-i => $doc-i, log-prob => $log-prob);
    }.sort({ $^b<log-prob> <=> $^a<log-prob> });

    for ^10 {
        my $docid = @sorted-list[$_]<doc-i>;
        sprintf("\"%s\" by %s %f", @docbodies[$docid], @members[%doc-to-person{$docid}], @sorted-list[$_]<log-prob>).say;
    }

}

sub Pml(@docbodies, $doc-i, $token --> Num) {
    my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i^ $token $/ }).elems;
    my Int $total-tokens = @docbodies[$doc-i].words.elems;
    return -100e0 if $total-tokens == 0 or $num-tokens == 0;
    log($num-tokens) - log($total-tokens);
}

sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
    gather for ^$topic-size -> $topic {
        if %vocabulary{$token}:exists {
            take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
        } else {
            take -100e0;
        }
    }.reduce(&log-sum);
}

sub log-sum(Num $log-a, Num $log-b --> Num) {
    if $log-a < $log-b {
        return $log-b + log(1 + exp($log-a - $log-b))
    } else {
        return $log-a + log(1 + exp($log-b - $log-a))
    }
}

At the beginning, we load the saved model and prepare @document-topic, @topic-word, %vocabulary, @documents, @docbodies, %doc-to-person and @members:

    my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
    my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
    my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
    my ($, $word-size) = topic-word-iter.pull-one.words;

    my Num @document-topic[$document-size;$topic-size];
    my Num @topic-word[$topic-size;$word-size];

    for ^$document-size -> $doc-i {
        my \maybe-line = doc-topic-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @document-topic[$doc-i;$_] = @line[$_];
        }
    }

    for ^$topic-size -> $topic-i {
        my \maybe-line = topic-word-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @topic-word[$topic-i;$_] = @line[$_];
        }
    }

    my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
    my @members = "members.txt".IO.lines;
    my @documents = "documents.txt".IO.lines;
    my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
    my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;
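
where the %vocabulary line inverts the file’s line order: .pairs yields index => word pairs and >>.antipair flips each one into word => index, so a token can be mapped back to its vocabulary id. A tiny illustration:

say <apple banana cherry>.pairs>>.antipair.hash;
# OUTPUT: {apple => 0, banana => 1, cherry => 2}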

Next, we set @query from the :$query option:

my @query = $query.words.map(*.lc);

After that, we compute the probability P(query|document) based on Eq. 9 of the aforementioned paper (note that we use logarithms to avoid underflow, and that we set the parameter mu to zero) and sort the documents by it:

    my @sorted-list = gather for ^$document-size -> $doc-i {
        my Num $log-prob = gather for @query -> $token {
            my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
            my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
            take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
        }.sum;
        take %(doc-i => $doc-i, log-prob => $log-prob);
    }.sort({ $^b<log-prob> <=> $^a<log-prob> });
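
What the inner gather computes is, for each query token w, the logarithm of a linear mixture of a maximum-likelihood estimate and an LDA estimate; this is Eq. 9 of Wei and Croft with the Dirichlet-smoothing parameter mu set to zero, which reduces the smoothed term to the maximum-likelihood one:

P(w|d) = 0.2 * Pml(w|d) + 0.8 * Plda(w|d)
log P(query|d) = sum over all w in query of log P(w|d)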

Plda adds, for each topic, the log probability of the topic given the document (i.e., ln P(topic|theta,document)) and the log probability of the word given the topic (i.e., ln P(word|phi,topic)), and sums over the topics with .reduce(&log-sum):

sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
    gather for ^$topic-size -> $topic {
        if %vocabulary{$token}:exists {
            take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
        } else {
            take -100e0;
        }
    }.reduce(&log-sum);
}

and Pml (ml stands for Maximum Likelihood) counts the occurrences of $token and normalizes by the total number of tokens in the document (note that this computation is also conducted in log space):

sub Pml(@docbodies, $doc-i, $token --> Num) {
    my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i^ $token $/ }).elems;
    my Int $total-tokens = @docbodies[$doc-i].words.elems;
    return -100e0 if $total-tokens == 0 or $num-tokens == 0;
    log($num-tokens) - log($total-tokens);
}
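
Finally, log-sum is the usual log-sum-exp trick: it computes ln(exp(a) + exp(b)) without leaving log space, factoring out the larger argument so that exp is only ever applied to a non-positive number and cannot overflow:

log-sum(a, b) = max(a, b) + log(1 + exp(min(a, b) - max(a, b)))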

OK, then let’s execute!

query “perl”:

$ perl6 search-quotation.p6 --query="perl"
"Perl will always provide the null." by Larry Wall -3.301156
"Perl programming is an *empirical* science!" by Larry Wall -3.345189
"The whole intent of Perl 5's module system was to encourage the growth of Perl culture rather than the Perl core." by Larry Wall -3.490238
"I dunno, I dream in Perl sometimes..." by Larry Wall -3.491790
"At many levels, Perl is a 'diagonal' language." by Larry Wall -3.575779
"Almost nothing in Perl serves a single purpose." by Larry Wall -3.589218
"Perl has a long tradition of working around compilers." by Larry Wall -3.674111
"As for whether Perl 6 will replace Perl 5, yeah, probably, in about 40 years or so." by Larry Wall -3.684454
"Well, I think Perl should run faster than C." by Larry Wall -3.771155
"It's certainly easy to calculate the average attendance for Perl conferences." by Larry Wall -3.864075

query “apple”:

$ perl6 search-quotation.p6 --query="apple"
"Steve Jobs is the"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.841538
"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." by Carly Fiorina -3.904489
"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -3.940359
"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.152145
"[W]hen Apple wanted the name "iPhone" and it was owned by Cisco, Steve Jobs just took it, and his legal team executed so he could keep it. It turned out that doing this was surprisingly inexpensive. And, as the Apple Watch showcased, the Apple Phone likely would not have sold anywhere near as well as the iPhone." by Rob Enderle -4.187223
"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.318575
"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.380863
"Apple no longer owns the tablet market, and will likely lose dominance this year or next. ... this level of sustained dominance doesn't appear to recur with the same vendor even if it launched the category." by Rob Enderle -4.397954
"Apple is becoming more and more like a typical tech firm — that is, long on technology and short on magic. ... Apple is drifting closer and closer to where it was back in the 1990s. It offers advancements that largely follow those made by others years earlier, product proliferation, a preference for more over simple elegance, and waning excitement." by Rob Enderle -4.448473
"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.469852 Ronald McDonald of Apple, he is the face." by Rob Enderle -3.822949
"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.849055
"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." by Carly Fiorina -3.895163
"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -4.052616
"*** The previous line contains the naughty word '$&'.\n if /(ibm|apple|awk)/; # :-)" by Larry Wall -4.088445
"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.169533
"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.197869
"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.204618
"Today's tech companies aren't built to last, as Apple's recent earnings report shows all too well." by Rob Enderle -4.209901
"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.238582

Conclusions

In this article, we explored Wikiquote and created an LDA model using Algorithm::LDA.
After that, we built an information retrieval application.

Thanks for reading my article! See you next time!

Citations

  • Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.
  • Wei, Xing, and W. Bruce Croft. “LDA-based document models for ad-hoc retrieval.” Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006.

License