Hi, everyone.
Today, let me introduce Algorithm::LDA.
This module is a Latent Dirichlet Allocation (i.e., LDA) implementation for topic modeling.
Introduction
What’s LDA? LDA is a popular unsupervised machine learning method.
It models the document generation process and represents each document as a mixture of topics.
So, what does “a mixture of topics” mean? Fig. 1 shows an article in which some of the words are highlighted in three colors: yellow, pink, and blue. Words about genetics are marked in yellow; words about evolutionary biology are marked in pink; words about data analysis are marked in blue. Imagine that all of the words in this article are colored; then we can represent this article as a mixture of topics (i.e., colors).
Fig. 1:
(This image is from “Probabilistic topic models.” (David Blei 2012))
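To make the generative story concrete, here is a minimal Perl 6 sketch of it (the three toy topics and all probabilities below are my own made-up illustration, not part of Algorithm::LDA): each word slot first draws a topic from the document’s topic mixture, then draws a word from that topic.
my @topic-words = (<gene dna genome>, <evolve species organism>, <data number model>);
my @doc-topic   = 0.5, 0.3, 0.2;  # assumed P(topic|document)
# Draw an index from a discrete probability distribution.
sub sample(@probs) {
    my $r = rand;
    my $cum = 0e0;
    for @probs.kv -> $i, $p { $cum += $p; return $i if $r < $cum; }
    @probs.end;
}
# Generate a 10-word document as a mixture of the three topics.
my @document = gather for ^10 {
    my $topic = sample(@doc-topic);   # z ~ P(topic|document)
    take @topic-words[$topic].pick;   # w ~ P(word|topic), uniform here for simplicity
};
say @document;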
OK, then I’ll demonstrate how to use Algorithm::LDA in the next section.
Modeling Quotations
In this article, we explore Wikiquote. Wikiquote is a crowd-sourced platform that provides sourced quotations.
Using the Wikiquote API, we fetch quotations to use for LDA estimation. After that, we run LDA and plot the result.
Finally, we create an information retrieval application using the resulting model.
Preliminaries
Wikiquote API
Wikiquote has an action API that provides a means of fetching Wikiquote resources.
For example, you can get the content of the Main Page as follows:
$ curl "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content&format=json"
The result of the above command is:
{"batchcomplete":"","warnings":{"main":{"*":"Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application."},"revisions":{"*":"Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."}},"query":{"pages":{"1":{"pageid":1,"ns":0,"title":"Main Page","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"\n{{Main page header}}\n{{Main Page Quote of the day}}\n</div>\n\n \n{{Main Page Selected pages}}\n{{Main categories}}\n\n\n \n{{New pages}}\n{{Main Page Community}}\n\n\n\n==Wikiquote's sister projects==\n{{otherwiki}}\n\n==Wikiquote languages==\n{{Wikiquotelang}}\n\n__NOTOC__ __NOEDITSECTION__\n{{noexternallanglinks:ang|simple}}\n[[Category:Main page]]"}]}}}}
WWW
WWW by Zoffix Znet is a library that provides an easy-to-use API for fetching and parsing JSON. For instance, as the README says, you can easily get content in jget(URL)<HASHKEY> style:
say jget('https://httpbin.org/get?foo=42&bar=x')<args><foo>;
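For example (a sketch assuming network access, and that the Main Page keeps pageid 1 as in the output above), the earlier curl query can be rewritten with jget; note that '*' is a literal JSON key here:
use WWW;
my $res = jget('https://en.wikiquote.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content&format=json');
say $res<query><pages><1><revisions>[0]{'*'};  # the raw wikitext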
To install WWW:
$ zef install WWW
Chart::Gnuplot
Chart::Gnuplot by titsuki is a binding for gnuplot.
To install Chart::Gnuplot:
$ zef install Chart::Gnuplot
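Here is a minimal sketch using only the Chart::Gnuplot calls that appear later in this article (the file name sine.png and the linespoints style are my own choices, assumed to be passed through to gnuplot):
use Chart::Gnuplot;
# Plot a few points of sin(x/2) to a PNG file.
my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("sine.png"));
$gnu.plot(:vertices((^10).map({ @($_, sin($_ / 2)) })), :style("linespoints"));
$gnu.dispose;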
In this article, we use this module; however, if you are unfamiliar with gnuplot, there are many alternatives: SVG::Plot, Graphics::PLplot, or calling matplotlib functions via Inline::Python.
Stopwords from NLTK
NLTK is a toolkit for natural language processing. Besides APIs, it also provides corpora.
You can get stopwords for English via “70. Stopwords Corpus”: http://www.nltk.org/nltk_data/
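Once downloaded and unzipped (the path stopwords/english below assumes the corpus was extracted into ./stopwords), the list can be loaded as a Set, exactly as the exercises below do:
# One stopword per line; Set membership answers "is this a stopword?"
my $stopwords = "stopwords/english".IO.lines.Set;
say $stopwords{"the"};   # True
say $stopwords{"perl"};  # False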
Exercise 1: Get Quotations and Create Cleaned Documents
At the beginning, we have to get quotations from Wikiquote and create cleaned documents.
The main goal of this section is to create documents according to the following format:
<docid> <personid> <word> <word> <word> ...
<docid> <personid> <word> <word> <word> ...
<docid> <personid> <word> <word> <word> ...
The whole source code is:
use v6.c;
use WWW;
use URI::Escape;
sub get-members-from-category(Str $category --> List) {
my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
@(jget($member-url)<query><categorymembers>.map(*<title>));
}
sub get-pages(Str @members, Int $batch = 50 --> List) {
my Int $start = 0;
my @pages;
while $start < @members {
my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
@pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
$start += $batch;
}
@pages;
}
sub create-documents-from-pages(@pages, @members --> List) {
my @documents;
for @pages -> $page {
my @quotations = $page<body>.split("\n")\
.map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
.map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
.map(*.subst("&#91;", "[", :g))\
.map(*.subst("&#93;", "]", :g))\
.map(*.subst("&amp;", "&", :g))\
.map(*.subst("&nbsp;", "", :g))\
.map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, " ", :g))\
.grep(/^\*<-[*]>/)\
.map(*.subst(/^\*\s+/, ""));
# Note: The order of the array returned by the Wikiquote API is not guaranteed.
my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
@documents.push(%(body => $_, personid => $index)) for @quotations;
}
@documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}
my Str @members = get-members-from-category("Category:1954_births");
my @pages = get-pages(@members);
my @documents = create-documents-from-pages(@pages, @members);
my $docfh = open "documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join(" ")) for @documents;
$docfh.close;
my $memfh = open "members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;
First, we get the members listed in the “Category:1954_births” page. I chose the year in which the Perl 6 designer was born:
my Str @members = get-members-from-category("Category:1954_births");
where get-members-from-category gets members via the Wikiquote API:
sub get-members-from-category(Str $category --> List) {
my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
@(jget($member-url)<query><categorymembers>.map(*<title>));
}
Next, call get-pages:
my @pages = get-pages(@members);
get-pages is a subroutine that gets the pages of the given titles (i.e., members):
sub get-pages(Str @members, Int $batch = 50 --> List) {
my Int $start = 0;
my @pages;
while $start < @members {
my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
@pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
$start += $batch;
}
@pages;
}
where @members[$start..^List($start + $batch, +@members).min] is a slice of at most $batch members; the elements of the slice are percent-encoded by uri_escape and joined with %7C (i.e., a percent-encoded pipe symbol). In this case, one of the resulting $list values is:
Mumia%20Abu-Jamal%7CRene%20Balcer%7CIain%20Banks%7CGerard%20Batten%7CChristie%20Brinkley%7CJames%20Cameron%20%28director%29%7CEugene%20Chadbourne%7CJackie%20Chan%7CChang%20Yu-hern%7CLee%20Child%7CHugo%20Ch%C3%A1vez%7CDon%20Coscarelli%7CElvis%20Costello%7CDaayiee%20Abdullah%7CThomas%20H.%20Davenport%7CGerardine%20DeSanctis%7CAl%20Di%20Meola%7CKevin%20Dockery%20%28author%29%7CJohn%20Doe%20%28musician%29%7CF.%20J.%20Duarte%7CIain%20Duncan%20Smith%7CHerm%20Edwards%7CAbdel%20Fattah%20el-Sisi%7CRob%20Enderle%7CRecep%20Tayyip%20Erdo%C4%9Fan%7CAlejandro%20Pe%C3%B1a%20Esclusa%7CHarvey%20Fierstein%7CCarly%20Fiorina%7CGary%20L.%20Francione%7CAshrita%20Furman%7CMary%20Gaitskill%7CGeorge%20Galloway%7C%C5%BDeljko%20Glasnovi%C4%87%7CGary%20Hamel%7CFran%C3%A7ois%20Hollande%7CKazuo%20Ishiguro%7CJean-Claude%20Juncker%7CAnish%20Kapoor%7CGuy%20Kawasaki%7CRobert%20Francis%20Kennedy%2C%20Jr.%7CLawrence%20M.%20Krauss%7CAnatoly%20Kudryavitsky%7CAnne%20Lamott%7CJoep%20Lange%7CAng%20Lee%7CLi%20Bin%7CRay%20Liotta%7CPeter%20Lipton%7CJames%20D.%20Macdonald%7CKen%20MacLeod
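As a small illustration of the escaping-and-joining step (picking two titles as an example):
use URI::Escape;
my @titles = "Larry Wall", "James Cameron (director)";
say @titles.map({ uri_escape($_) }).join('%7C');
# OUTPUT: Larry%20Wall%7CJames%20Cameron%20%28director%29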
Note that the get-pages subroutine uses the hash constructor %() to create a sequence of hashes:
@pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
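A tiny illustration of this idiom with toy page data (not the actual API response):
my @raw = %(title => "A", revisions => [%(content => "aaa")]),
          %(title => "B", revisions => [%(content => "bbb")]);
my @pages = @raw.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
say @pages[0]<body>;  # OUTPUT: aaa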
After that, we call create-documents-from-pages:
my @documents = create-documents-from-pages(@pages, @members);
create-documents-from-pages creates documents from each page:
sub create-documents-from-pages(@pages, @members --> List) {
my @documents;
for @pages -> $page {
my @quotations = $page<body>.split("\n")\
.map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
.map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
.map(*.subst("&#91;", "[", :g))\
.map(*.subst("&#93;", "]", :g))\
.map(*.subst("&amp;", "&", :g))\
.map(*.subst("&nbsp;", "", :g))\
.map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, " ", :g))\
.grep(/^\*<-[*]>/)\
.map(*.subst(/^\*\s+/, ""));
# Note: The order of the array returned by the Wikiquote API is not guaranteed.
my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
@documents.push(%(body => $_, personid => $index)) for @quotations;
}
@documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}
where .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) and .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) are substitutions that unwrap wiki links: they keep the text to display and drop the internal link markup. For example, [[Perl]] is reduced to Perl. For more syntax info, see: https://docs.perl6.org/language/regexes#Named_captures or https://docs.perl6.org/routine/subst
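For example, applying the two substitutions to a line that contains both link forms:
my $line = "I love [[Perl|the Perl language]] and [[Raku]].";
$line .= subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g);
$line .= subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g);
say $line;  # OUTPUT: I love Perl and Raku.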
After some cleaning operations (e.g., .map(*.subst("&#91;", "[", :g))), we extract the quotation lines.
.grep(/^\*<-[*]>/) finds lines that start with a single asterisk, because most quotations appear on such lines.
Next, .map(*.subst(/^\*\s+/, "")) deletes the leading asterisk, since the asterisk itself isn’t part of the quotation.
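A quick illustration of this filtering on toy lines:
my @lines = "* A quotation.", "** A sub-bullet, skipped.", "plain text, skipped";
say @lines.grep(/^\*<-[*]>/).map(*.subst(/^\*\s+/, ""));
# OUTPUT: (A quotation.)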
Finally, we save the documents and members (i.e., titles):
my $docfh = open "documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join(" ")) for @documents;
$docfh.close;
my $memfh = open "members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;
Exercise 2: Execute LDA and Visualize the Result
In the previous section, we saved the cleaned documents.
In this section, we use the documents for LDA estimation and visualize the result.
The goal of this section is to plot a document-topic distribution and write a topic-word table.
The whole source code is:
use v6.c;
use Algorithm::LDA;
use Algorithm::LDA::Formatter;
use Algorithm::LDA::LDAModel;
use Chart::Gnuplot;
use Chart::Gnuplot::Subset;
sub create-model(@documents --> Algorithm::LDA::LDAModel) {
my $stopwords = "stopwords/english".IO.lines.Set;
my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
$model
}
sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
.grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/}).map(*<docid>).head;
note("@documents[$docid] is selected");
my ($row-size, $col-size) = $model.document-topic-matrix.shape;
my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
$gnu.command("set boxwidth 0.5 relative");
my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
$gnu.legend(:off);
$gnu.xlabel(:label("Topic"));
$gnu.ylabel(:label("P(z|theta,d)"));
$gnu.xtics(:tics(@tics));
$gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
$gnu.dispose;
}
sub write-nbest($model) {
my $topics := $model.nbest-words-per-topic(10);
for ^(10/5) -> $part-i {
say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
for ^10 -> $rank {
say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
take @($topics)[$topic;$rank].key;
}.join("|") ~ "|";
}
"".say;
}
}
sub save-model($model) {
my @document-topic-matrix := $model.document-topic-matrix;
my ($document-size, $topic-size) = @document-topic-matrix.shape;
my $doctopicfh = open "document-topic.txt", :w;
$doctopicfh.say: ($document-size, $topic-size).join(" ");
for ^$document-size -> $doc-i {
$doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
}
$doctopicfh.close;
my @topic-word-matrix := $model.topic-word-matrix;
my ($, $word-size) = @topic-word-matrix.shape;
my $topicwordfh = open "topic-word.txt", :w;
$topicwordfh.say: ($topic-size, $word-size).join(" ");
for ^$topic-size -> $topic-i {
$topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
}
$topicwordfh.close;
my @vocabulary := $model.vocabulary;
my $vocabfh = open "vocabulary.txt", :w;
$vocabfh.say($_) for @vocabulary;
$vocabfh.close;
}
my @documents = "documents.txt".IO.lines;
my $model = create-model(@documents);
my @members = "members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);
write-nbest($model);
save-model($model);
First, we load the cleaned documents and call create-model:
my @documents = "documents.txt".IO.lines;
my $model = create-model(@documents);
create-model creates an LDA model from the given documents:
sub create-model(@documents --> Algorithm::LDA::LDAModel) {
my $stopwords = "stopwords/english".IO.lines.Set;
my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
$model
}
where $stopwords is a set of English stopwords from NLTK (mentioned in the Preliminaries section), and &tokenizer is a custom tokenizer for Algorithm::LDA::Formatter.from-plain. The tokenizer converts a given sentence as follows:
- Splits the given sentence on whitespace, producing a list of tokens.
- Lowercases each token.
- Deletes tokens that appear in the stopwords list, as well as tokens consisting solely of Symbol or Punctuation characters.
Algorithm::LDA::Formatter.from-plain creates numerical native documents (i.e., each word in a document is mapped to its corresponding vocabulary id, and each id is represented as a C int32) and a vocabulary from a list of texts.
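Conceptually, the word-to-id mapping looks like this toy illustration (not the module’s actual internals):
# Assign each previously unseen word the next free id, then encode the text as ids.
my %vocab;
my @ids = "more than one way one way".words.map(-> $w {
    %vocab{$w} = %vocab.elems unless %vocab{$w}:exists;
    %vocab{$w};
});
say @ids;    # OUTPUT: [0 1 2 3 2 3]
say %vocab;  # e.g. {more => 0, one => 2, than => 1, way => 3}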
After creating an Algorithm::LDA instance using the above numerical documents, we can start LDA estimation with Algorithm::LDA.fit. In this example, we set the number of topics to 10, the number of iterations to 500, and the seed for srand to 2018.
Next, we plot a document-topic distribution. Before this plotting, we load the saved members:
my @members = "members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);
plot-topic-distribution plots the topic distribution with Chart::Gnuplot:
sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
.grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/}).map(*<docid>).head;
note("@documents[$docid] is selected");
my ($row-size, $col-size) = $model.document-topic-matrix.shape;
my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
$gnu.command("set boxwidth 0.5 relative");
my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
$gnu.legend(:off);
$gnu.xlabel(:label("Topic"));
$gnu.ylabel(:label("P(z|theta,d)"));
$gnu.xtics(:tics(@tics));
$gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
$gnu.dispose;
}
In this example, we plot the topic distribution of one of Larry Wall’s quotations (“Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.”):
[Figure: the resulting document-topic distribution plot (topics.png)]
After the plotting, we call write-nbest:
write-nbest($model);
In LDA, what each topic represents is expressed as a list of words. write-nbest writes a markdown-style topic-word distribution table:
sub write-nbest($model) {
my $topics := $model.nbest-words-per-topic(10);
for ^(10/5) -> $part-i {
say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
for ^10 -> $rank {
say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
take @($topics)[$topic;$rank].key;
}.join("|") ~ "|";
}
"".say;
}
}
The result is:
|topic 0|topic 1|topic 2|topic 3|topic 4|
|----|----|----|----|----|
|would|scotland|black|could|one|
|it’s|country|mr.|first|work|
|believe|one|lot|law|new|
|one|political|play|college|human|
|took|world|official|basic|process|
|much|need|new|speak|business|
|don’t|must|reacher|language|becomes|
|ever|national|five|every|good|
|far|many|car|matter|world|
|fighting|us|road|right|knowledge|

|topic 5|topic 6|topic 7|topic 8|topic 9|
|----|----|----|----|----|
|apple|united|people|like|*/|
|likely|war|would|one|die|
|company|states|i’m|something|und|
|jobs|years|know|think|quantum|
|even|would|think|way|play|
|steve|american|want|things|noble|
|life|president|get|perl|home|
|like|human|going|long|dog|
|end|must|say|always|student|
|small|us|go|really|ist|
As you can see, the quotation “Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.” contains “one”, “way”, and “perl”. This is why this quotation is mainly composed of topic 8.
For the next section, we save the model with the save-model subroutine:
sub save-model($model) {
my @document-topic-matrix := $model.document-topic-matrix;
my ($document-size, $topic-size) = @document-topic-matrix.shape;
my $doctopicfh = open "document-topic.txt", :w;
$doctopicfh.say: ($document-size, $topic-size).join(" ");
for ^$document-size -> $doc-i {
$doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
}
$doctopicfh.close;
my @topic-word-matrix := $model.topic-word-matrix;
my ($, $word-size) = @topic-word-matrix.shape;
my $topicwordfh = open "topic-word.txt", :w;
$topicwordfh.say: ($topic-size, $word-size).join(" ");
for ^$topic-size -> $topic-i {
$topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
}
$topicwordfh.close;
my @vocabulary := $model.vocabulary;
my $vocabfh = open "vocabulary.txt", :w;
$vocabfh.say($_) for @vocabulary;
$vocabfh.close;
}
Exercise 3: Create a Quotation Search Engine
In this section, we create a quotation search engine that uses the model created in the previous section.
More specifically, we create an LDA-based document model (Xing Wei and W. Bruce Croft 2006) and make a CLI tool that searches quotations. (Note that the words “token” and “word” are interchangeable in this section.)
The whole source code is:
use v6.c;
sub MAIN(Str :$query!) {
my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
my ($, $word-size) = topic-word-iter.pull-one.words;
my Num @document-topic[$document-size;$topic-size];
my Num @topic-word[$topic-size;$word-size];
for ^$document-size -> $doc-i {
my \maybe-line := doc-topic-iter.pull-one;
die "Error: Something went wrong" if maybe-line =:= IterationEnd;
my Num @line = @(maybe-line).words>>.Num;
for ^@line {
@document-topic[$doc-i;$_] = @line[$_];
}
}
for ^$topic-size -> $topic-i {
my \maybe-line := topic-word-iter.pull-one;
die "Error: Something went wrong" if maybe-line =:= IterationEnd;
my Num @line = @(maybe-line).words>>.Num;
for ^@line {
@topic-word[$topic-i;$_] = @line[$_];
}
}
my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
my @members = "members.txt".IO.lines;
my @documents = "documents.txt".IO.lines;
my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;
my @query = $query.words.map(*.lc);
my @sorted-list = gather for ^$document-size -> $doc-i {
my Num $log-prob = gather for @query -> $token {
my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
}.sum;
take %(doc-i => $doc-i, log-prob => $log-prob);
}.sort({ $^b<log-prob> <=> $^a<log-prob> });
for ^10 {
my $docid = @sorted-list[$_]<doc-i>;
sprintf("\"%s\" by %s %f", @docbodies[$docid], @members[%doc-to-person{$docid}], @sorted-list[$_]<log-prob>).say;
}
}
sub Pml(@docbodies, $doc-i, $token --> Num) {
my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i^ $token $/ }).elems;
my Int $total-tokens = @docbodies[$doc-i].words.elems;
return -100e0 if $total-tokens == 0 or $num-tokens == 0;
log($num-tokens) - log($total-tokens);
}
sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
gather for ^$topic-size -> $topic {
if %vocabulary{$token}:exists {
take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
} else {
take -100e0;
}
}.reduce(&log-sum);
}
sub log-sum(Num $log-a, Num $log-b --> Num) {
if $log-a < $log-b {
return $log-b + log(1 + exp($log-a - $log-b))
} else {
return $log-a + log(1 + exp($log-b - $log-a))
}
}
At the beginning, we load the saved model and prepare @document-topic, @topic-word, %vocabulary, @documents, @docbodies, %doc-to-person, and @members:
my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
my ($, $word-size) = topic-word-iter.pull-one.words;
my Num @document-topic[$document-size;$topic-size];
my Num @topic-word[$topic-size;$word-size];
for ^$document-size -> $doc-i {
my \maybe-line = doc-topic-iter.pull-one;
die "Error: Something went wrong" if maybe-line =:= IterationEnd;
my Num @line = @(maybe-line).words>>.Num;
for ^@line {
@document-topic[$doc-i;$_] = @line[$_];
}
}
for ^$topic-size -> $topic-i {
my \maybe-line = topic-word-iter.pull-one;
die "Error: Something went wrong" if maybe-line =:= IterationEnd;
my Num @line = @(maybe-line).words>>.Num;
for ^@line {
@topic-word[$topic-i;$_] = @line[$_];
}
}
my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
my @members = "members.txt".IO.lines;
my @documents = "documents.txt".IO.lines;
my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;
Next, we set @query using the :$query option:
my @query = $query.words.map(*.lc);
After that, we compute the probability P(query|document) based on Eq. 9 of the aforementioned paper (note that we use logarithms to avoid underflow, and that we set the parameter mu to zero), and sort the documents by it.
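Concretely, the score computed below is the following mixture (with lambda = 0.2, mirroring the log(0.2) and log(0.8) constants in the code; P_lda marginalizes over topics):
P(q \mid d) = \prod_{w \in q} \Big( \lambda \, P_{ml}(w \mid d) + (1 - \lambda) \, P_{lda}(w \mid d) \Big), \qquad P_{lda}(w \mid d) = \sum_{z} P(z \mid \theta, d) \, P(w \mid \phi, z)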
my @sorted-list = gather for ^$document-size -> $doc-i {
my Num $log-prob = gather for @query -> $token {
my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
}.sum;
take %(doc-i => $doc-i, log-prob => $log-prob);
}.sort({ $^b<log-prob> <=> $^a<log-prob> });
Plda adds, for each topic, the log probability of the topic given the document (i.e., ln P(topic|theta,document)) and the log probability of the word given the topic (i.e., ln P(word|phi,topic)), and sums the per-topic terms with .reduce(&log-sum):
sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
gather for ^$topic-size -> $topic {
if %vocabulary{$token}:exists {
take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
} else {
take -100e0;
}
}.reduce(&log-sum);
}
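Here .reduce(&log-sum) is the standard log-sum-exp trick: it computes log(a + b) stably from log a and log b by keeping the larger term outside the exponential:
\log(a + b) = \log a + \log\!\left(1 + e^{\log b - \log a}\right) \quad \text{(for } \log a \ge \log b\text{)}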
Pml (ml stands for maximum likelihood) counts the occurrences of $token and normalizes the count by the total number of tokens in the document (note that this computation is also conducted in log space):
sub Pml(@docbodies, $doc-i, $token --> Num) {
my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i^ $token $/ }).elems;
my Int $total-tokens = @docbodies[$doc-i].words.elems;
return -100e0 if $total-tokens == 0 or $num-tokens == 0;
log($num-tokens) - log($total-tokens);
}
OK, then let’s execute!
query “perl”:
$ perl6 search-quotation.p6 --query="perl"
"Perl will always provide the null." by Larry Wall -3.301156
"Perl programming is an *empirical* science!" by Larry Wall -3.345189
"The whole intent of Perl 5's module system was to encourage the growth of Perl culture rather than the Perl core." by Larry Wall -3.490238
"I dunno, I dream in Perl sometimes..." by Larry Wall -3.491790
"At many levels, Perl is a 'diagonal' language." by Larry Wall -3.575779
"Almost nothing in Perl serves a single purpose." by Larry Wall -3.589218
"Perl has a long tradition of working around compilers." by Larry Wall -3.674111
"As for whether Perl 6 will replace Perl 5, yeah, probably, in about 40 years or so." by Larry Wall -3.684454
"Well, I think Perl should run faster than C." by Larry Wall -3.771155
"It's certainly easy to calculate the average attendance for Perl conferences." by Larry Wall -3.864075
query “apple”:
$ perl6 search-quotation.p6 --query="apple"
"Steve Jobs is the Ronald McDonald of Apple, he is the face." by Rob Enderle -3.822949
"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.849055
"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." by Carly Fiorina -3.895163
"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -4.052616
"*** The previous line contains the naughty word '$&'.\n if /(ibm|apple|awk)/; # :-)" by Larry Wall -4.088445
"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.169533
"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.197869
"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.204618
"Today's tech companies aren't built to last, as Apple's recent earnings report shows all too well." by Rob Enderle -4.209901
"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.238582
Conclusions
In this article, we explored Wikiquote and created an LDA model using Algorithm::LDA. After that, we built an information retrieval application.
Thanks for reading my article! See you next time!
Citations
- Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.
- Wei, Xing, and W. Bruce Croft. “LDA-based document models for ad-hoc retrieval.” Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006.
License
- Wikiquote is licensed under the CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/
Comments
I encourage you to publish a paper in Figshare, arXiv, or one of the other open platforms, to enrich the pool of papers using Perl 6 that have been published, which is very small for now.
Thanks for your comment.
> publish a paper in Figshare, ArXiv or one of the other open platforms
Sounds interesting! I hadn’t thought of such a plan.
If I have enough time to implement baselines and design evaluation metrics, maybe I’ll try it :)