This school semester I took my first proof-based class, titled “Intro to Mathematical Proof Workshop”. After having taken other math classes (Calculus, Matrix Algebra, etc.), I felt that I didn’t have much of a mathematical foundation; up to this point, all I had been doing was purely computational mathematics sprinkled with some proofs here and there. Looking back, I found the class quite enjoyable, and learning about different theorems and their proofs, mostly from number theory, has given me a new perspective on mathematics.
“How is this related to Perl 6?”, you might be asking. As I mentioned, most of the proofs that were discussed either in class or left for homework were related to number theory. If there’s one thing Perl 6 and number theory have in common, it’s their accessibility. Similar to how the content of the elementary theory of numbers can be tangible and familiar, Perl 6 can be quite approachable to beginners. In fact, beginners are encouraged to write what’s known as “baby Perl”.
Today, let me introduce Algorithm::LDA.
This module is a Latent Dirichlet Allocation (i.e., LDA) implementation for topic modeling.
Introduction
What’s LDA? LDA is a popular unsupervised machine-learning method.
It models the document generation process and represents each document as a mixture of topics.
So, what does “a mixture of topics” mean? Fig. 1 shows an article in which some of the words are highlighted in three colors: yellow, pink, and blue. Words about genetics are marked in yellow; words about evolutionary biology are marked in pink; words about data analysis are marked in blue. If we imagine that all of the words in this article are colored, then we can represent the article as a mixture of topics (i.e., colors).
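For instance, such a mixture can be written down directly as topic proportions. This is a hypothetical illustration only; the topic names come from Fig. 1 and the numbers are made up:

# Illustrative only: a document as topic proportions that sum to 1.
my %topic-mixture = genetics => 0.4, evolution => 0.35, data-analysis => 0.25;
say %topic-mixture<genetics>; # 0.4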
Fig. 1: an article with its words highlighted by topic (genetics, evolutionary biology, data analysis).
(This image is from “Probabilistic topic models.” (David Blei 2012))
OK, then I’ll demonstrate how to use Algorithm::LDA in the next section.
Modeling Quotations
In this article, we explore Wikiquote. Wikiquote is a crowd-sourced platform that provides sourced quotations.
Using the Wikiquote API, we fetch quotations to use for LDA estimation. After that, we run LDA and plot the result.
Finally, we create an information retrieval application using the resulting model.
Preliminary
Wikiquote API
Wikiquote has an action API that provides a means of getting Wikiquote resources.
For example, you can get the content of the Main Page as follows:
{"batchcomplete":"","warnings":{"main":{"*":"Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application."},"revisions":{"*":"Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."}},"query":{"pages":{"1":{"pageid":1,"ns":0,"title":"Main Page","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"
WWW
WWW by Zoffix Znet is a library that provides an easy-to-use API for fetching and parsing JSON.
For instance, as the README says, you can easily get content in jget(URL)<HASHKEY> style:
say jget('https://httpbin.org/get?foo=42&bar=x')<args><foo>;
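Since the Wikiquote action API from the previous section also returns JSON, jget applies there too. Here is a sketch; the URL is reconstructed from the API parameters used later in this article, so treat it as an assumption:

use WWW;
# Fetch the Main Page revision shown earlier; <query><pages> drills into the JSON.
say jget('https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Main%20Page')<query><pages>;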
NLTK
NLTK is a toolkit for natural language processing.
It provides not only APIs but also corpora.
You can get English stopwords via “70. Stopwords Corpus” at http://www.nltk.org/nltk_data/
Exercise 1: Get Quotations and Create Cleaned Documents
First, we have to get quotations from Wikiquote and create cleaned documents.
The main goal of this section is to create documents in the following format: one document per line, each line holding a document id, a person id, and the quotation body, separated by single spaces.
We start by fetching the members (i.e., page titles) of a category, where get-members-from-category gets members via the Wikiquote API:
sub get-members-from-category(Str $category --> List) {
    my $member-url = "https://en.wikiquote.org/w/api.php?action=query&list=categorymembers&cmtitle={$category}&cmlimit=100&format=json";
    @(jget($member-url)<query><categorymembers>.map(*<title>));
}
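For example, we could call it like this. The category name here is a hypothetical choice, not from the original setup; any Wikiquote category of people will do:

# Hypothetical category; substitute whichever category interests you.
my @members = get-members-from-category("Category:Computer_scientists");
say @members.elems;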
Next, call get-pages:
my @pages = get-pages(@members);
get-pages is a subroutine that gets pages of the given titles (i.e., members):
sub get-pages(Str @members, Int $batch = 50 --> List) {
    my Int $start = 0;
    my @pages;
    while $start < @members {
        my $list = @members[$start..^List($start + $batch, +@members).min].map({ uri_escape($_) }).join('%7C');
        my $url = "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&formatversion=2&titles={$list}";
        @pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
        $start += $batch;
    }
    @pages;
}
where @members[$start..^List($start + $batch, +@members).min] is a slice of length $batch (at most), and the elements of the slice are percent-encoded by uri_escape and joined by %7C (i.e., a percent-encoded pipe symbol).
In this case, one of the resulting $list values looks like the hypothetical example below (made-up titles, percent-encoded and joined by %7C):
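Larry%20Wall%7CAlan%20Kay%7CDonald%20Knuth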
Note that the get-pages subroutine uses the hash contextualizer %() to create a sequence of hashes:
@pages.push($_) for jget($url)<query><pages>.map({ %(body => .<revisions>[0]<content>, title => .<title>) });
After that, we call create-documents-from-pages:
my @documents = create-documents-from-pages(@pages, @members);
create-documents-from-pages creates documents from each page:
sub create-documents-from-pages(@pages, @members --> List) {
    my @documents;
    for @pages -> $page {
        my @quotations = $page<body>.split("\n")\
            .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
            .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g))\
            .map(*.subst("&#91;", "[", :g))\
            .map(*.subst("&#93;", "]", :g))\
            .map(*.subst("&amp;", "&", :g))\
            .map(*.subst("&nbsp;", "", :g))\
            .map(*.subst(/:i [ \<\/?\s?br\> | \<br\s?\/?\> ]/, "", :g))\
            .grep(/^\*<-[*]>/)\
            .map(*.subst(/^\*\s+/, ""));
        # Note: the order of the array the Wikiquote API returns is not guaranteed,
        # so we look up each page's index in @members by its title.
        my Int $index = @members.pairs.grep({ .value eq $page<title> }).map(*.key).head;
        @documents.push(%(body => $_, personid => $index)) for @quotations;
    }
    @documents.sort({ $^a<personid> <=> $^b<personid> }).pairs.map({ %(docid => .key, personid => .value<personid>, body => .value<body>) }).list
}
where .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) and .map(*.subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)) are converting operations that keep the text to display and drop the wiki link markup from anchor texts. For example, [[Perl]] is reduced to Perl. For more syntax info, see: https://docs.perl6.org/language/regexes#Named_captures or https://docs.perl6.org/routine/subst
After some cleaning operations (e.g., .map(*.subst("&#91;", "[", :g)), which replaces HTML character entities with their literal characters), we extract the quotation lines. .grep(/^\*<-[*]>/) finds lines starting with a single asterisk, because most of the quotations appear in lines of that form.
Next, .map(*.subst(/^\*\s+/, "")) deletes the leading asterisk, since the asterisk itself isn’t part of the quotation.
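To see these substitutions in action, here is a small self-contained example on a made-up quotation line (only a few of the cleaning steps are shown):

say "* [[Perl]] is fun &amp; so is [[Perl 6|Raku]]."\
    .subst(/\[\[$<text>=(<-[\[\]|]>+?)\|$<link>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)\
    .subst(/\[\[$<text>=(<-[\[\]|]>+?)\]\]/, { $<text> }, :g)\
    .subst("&amp;", "&", :g)\
    .subst(/^\*\s+/, "");
# OUTPUT: Perl is fun & so is Perl 6.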
Finally, we save the documents and members (i.e., titles):
my $docfh =open"documents.txt", :w;
$docfh.say((.<docid>, .<personid>, .<body>).join("")) for @documents;
$docfh.close;
my $memfh =open"members.txt", :w;
$memfh.say($_) for @members;
$memfh.close;
Exercise 2: Execute LDA and Visualize the Result
In the previous section, we saved the cleaned documents.
In this section, we use the documents for LDA estimation and visualize the result.
The goal of this section is to plot a document-topic distribution and write a topic-word table.
The whole source code is:
use v6.c;
use Algorithm::LDA;
use Algorithm::LDA::Formatter;
use Algorithm::LDA::LDAModel;
use Chart::Gnuplot;
use Chart::Gnuplot::Subset;

sub create-model(@documents --> Algorithm::LDA::LDAModel) {
    my $stopwords = "stopwords/english".IO.lines.Set;
    my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
    my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
    my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
    my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
    $model
}
sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
    my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
    my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
        .grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/ }).map(*<docid>).head;
    note("@documents[$docid] is selected");
    my ($row-size, $col-size) = $model.document-topic-matrix.shape;
    my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
    my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
    $gnu.command("set boxwidth 0.5 relative");
    my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
    $gnu.legend(:off);
    $gnu.xlabel(:label("Topic"));
    $gnu.ylabel(:label("P(z|theta,d)"));
    $gnu.xtics(:tics(@tics));
    $gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
    $gnu.dispose;
}
sub write-nbest($model) {
    my $topics := $model.nbest-words-per-topic(10);
    for ^(10/5) -> $part-i {
        say "|" ~ (^5).map(-> $t { "topic { $part-i * 5 + $t }" }).join("|") ~ "|";
        say "|" ~ (^5).map({ "----" }).join("|") ~ "|";
        for ^10 -> $rank {
            say "|" ~ gather for ($part-i * 5)..^($part-i * 5 + 5) -> $topic {
                take @($topics)[$topic;$rank].key;
            }.join("|") ~ "|";
        }
        "".say;
    }
}
sub save-model($model) {
    my @document-topic-matrix := $model.document-topic-matrix;
    my ($document-size, $topic-size) = @document-topic-matrix.shape;
    my $doctopicfh = open "document-topic.txt", :w;
    $doctopicfh.say: ($document-size, $topic-size).join(" ");
    for ^$document-size -> $doc-i {
        $doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
    }
    $doctopicfh.close;

    my @topic-word-matrix := $model.topic-word-matrix;
    my ($, $word-size) = @topic-word-matrix.shape;
    my $topicwordfh = open "topic-word.txt", :w;
    $topicwordfh.say: ($topic-size, $word-size).join(" ");
    for ^$topic-size -> $topic-i {
        $topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
    }
    $topicwordfh.close;

    my @vocabulary := $model.vocabulary;
    my $vocabfh = open "vocabulary.txt", :w;
    $vocabfh.say($_) for @vocabulary;
    $vocabfh.close;
}
my @documents ="documents.txt".IO.lines;
my $model = create-model(@documents);
my @members ="members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);
write-nbest($model);
save-model($model);
First, we load the cleaned documents and call create-model:
my @documents ="documents.txt".IO.lines;
my $model = create-model(@documents);
create-model creates an LDA model by loading the given documents:
sub create-model(@documents --> Algorithm::LDA::LDAModel) {
    my $stopwords = "stopwords/english".IO.lines.Set;
    my &tokenizer = -> $line { $line.words.map(*.lc).grep(-> $w { ($stopwords !(cont) $w) and $w !~~ /^[ <:S> | <:P> ]+$/ }) };
    my ($documents, $vocabs) = Algorithm::LDA::Formatter.from-plain(@documents.map({ my ($, $, *@body) = .words; @body.join(" ") }), &tokenizer);
    my Algorithm::LDA $lda .= new(:$documents, :$vocabs);
    my Algorithm::LDA::LDAModel $model = $lda.fit(:num-topics(10), :num-iterations(500), :seed(2018));
    $model
}
where $stopwords is a set of English stopwords from NLTK (mentioned in the Preliminary section), and &tokenizer is a custom tokenizer for Algorithm::LDA::Formatter.from-plain. The tokenizer processes a given sentence as follows:
Splits the sentence on whitespace, producing a list of tokens.
Lowercases each token.
Deletes tokens that appear in the stopwords list, as well as one-character tokens categorized as Symbol or Punctuation.
Algorithm::LDA::Formatter.from-plain creates numerical native documents (i.e., each word in a document is mapped to its corresponding vocabulary id, and the ids are represented as C int32) and a vocabulary from a list of texts.
After creating an Algorithm::LDA instance using the above numerical documents, we can start LDA estimation with Algorithm::LDA.fit. In this example, we set the number of topics to 10, the number of iterations to 500, and the seed for srand to 2018.
Next, we plot a document-topic distribution. Before this plotting, we load the saved members:
my @members ="members.txt".IO.lines;
plot-topic-distribution($model, @members, @documents);
plot-topic-distribution plots topic distribution with Chart::Gnuplot:
sub plot-topic-distribution($model, @members, @documents, $search-regex = rx/Larry/) {
    my $target-personid = @members.pairs.grep({ .value ~~ $search-regex }).map(*.key).head;
    my $docid = @documents.map({ my ($docid, $personid, *@body) = .words; %(docid => $docid, personid => $personid, body => @body.join(" ")) })\
        .grep({ .<personid> == $target-personid and .<body> ~~ /:i << perl >>/ }).map(*<docid>).head;
    note("@documents[$docid] is selected");
    my ($row-size, $col-size) = $model.document-topic-matrix.shape;
    my @doc-topic = gather for ($docid X ^$col-size) -> ($i, $j) { take $model.document-topic-matrix[$i;$j]; }
    my Chart::Gnuplot $gnu .= new(:terminal("png"), :filename("topics.png"));
    $gnu.command("set boxwidth 0.5 relative");
    my AnyTicsTic @tics = @doc-topic.pairs.map({ %(:label(.key), :pos(.key)) });
    $gnu.legend(:off);
    $gnu.xlabel(:label("Topic"));
    $gnu.ylabel(:label("P(z|theta,d)"));
    $gnu.xtics(:tics(@tics));
    $gnu.plot(:vertices(@doc-topic.pairs.map({ @(.key, .value.exp) })), :style("boxes"), :fill("solid"));
    $gnu.dispose;
}
In this example, we plot the topic distribution of one of Larry Wall’s quotations (“Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.”). The resulting plot, topics.png, shows most of the probability mass on topic 8.
After the plotting, we call write-nbest:
write-nbest($model);
In LDA, what a topic represents is expressed as a list of words. write-nbest writes a markdown-style topic-word distribution table:
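The table contents depend on the trained model, so here is only the shape of the output, with placeholder words rather than actual results:

|topic 0|topic 1|topic 2|topic 3|topic 4|
|----|----|----|----|----|
|word|word|word|word|word|
|word|word|word|word|word|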
The quotation “Although the Perl Slogan is There’s More Than One Way to Do It, I hesitate to make 10 ways to do something.” contains “one”, “way”, and “perl”, and in our run those words ranked high in topic 8. This is why the quotation is mainly composed of topic 8.
For the next section, we save the model with the save-model subroutine:
sub save-model($model) {
    my @document-topic-matrix := $model.document-topic-matrix;
    my ($document-size, $topic-size) = @document-topic-matrix.shape;
    my $doctopicfh = open "document-topic.txt", :w;
    $doctopicfh.say: ($document-size, $topic-size).join(" ");
    for ^$document-size -> $doc-i {
        $doctopicfh.say: gather for ^$topic-size -> $topic { take @document-topic-matrix[$doc-i;$topic] }.join(" ");
    }
    $doctopicfh.close;

    my @topic-word-matrix := $model.topic-word-matrix;
    my ($, $word-size) = @topic-word-matrix.shape;
    my $topicwordfh = open "topic-word.txt", :w;
    $topicwordfh.say: ($topic-size, $word-size).join(" ");
    for ^$topic-size -> $topic-i {
        $topicwordfh.say: gather for ^$word-size -> $word { take @topic-word-matrix[$topic-i;$word] }.join(" ");
    }
    $topicwordfh.close;

    my @vocabulary := $model.vocabulary;
    my $vocabfh = open "vocabulary.txt", :w;
    $vocabfh.say($_) for @vocabulary;
    $vocabfh.close;
}
Exercise 3: Create Quotation Search Engine
In this section, we create a quotation search engine that uses the model created in the previous section.
More specifically, we create an LDA-based document model (Xing Wei and W. Bruce Croft 2006) and make a CLI tool that can search quotations. (Note that the words “token” and “word” are used interchangeably in this section.)
The whole source code is:
use v6.c;

sub MAIN(Str :$query!) {
    my \doc-topic-iter = "document-topic.txt".IO.lines.iterator;
    my \topic-word-iter = "topic-word.txt".IO.lines.iterator;
    my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
    my ($, $word-size) = topic-word-iter.pull-one.words;
    my Num @document-topic[$document-size;$topic-size];
    my Num @topic-word[$topic-size;$word-size];
    for ^$document-size -> $doc-i {
        my \maybe-line := doc-topic-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @document-topic[$doc-i;$_] = @line[$_];
        }
    }
    for ^$topic-size -> $topic-i {
        my \maybe-line := topic-word-iter.pull-one;
        die "Error: Something went wrong" if maybe-line =:= IterationEnd;
        my Num @line = @(maybe-line).words>>.Num;
        for ^@line {
            @topic-word[$topic-i;$_] = @line[$_];
        }
    }
    my %vocabulary = "vocabulary.txt".IO.lines.pairs>>.antipair.hash;
    my @members = "members.txt".IO.lines;
    my @documents = "documents.txt".IO.lines;
    my @docbodies = @documents.map({ my ($, $, *@body) = .words; @body.join(" ") });
    my %doc-to-person = @documents.map({ my ($docid, $personid, $) = .words; %($docid => $personid) }).hash;
    my @query = $query.words.map(*.lc);

    my @sorted-list = gather for ^$document-size -> $doc-i {
        my Num $log-prob = gather for @query -> $token {
            my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
            my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
            take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
        }.sum;
        take %(doc-i => $doc-i, log-prob => $log-prob);
    }.sort({ $^b<log-prob> <=> $^a<log-prob> });

    for ^10 {
        my $docid = @sorted-list[$_]<doc-i>;
        sprintf("\"%s\" by %s %f", @docbodies[$docid], @members[%doc-to-person{$docid}], @sorted-list[$_]<log-prob>).say;
    }
}

sub Pml(@docbodies, $doc-i, $token --> Num) {
    my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i ^ $token $/ }).elems;
    my Int $total-tokens = @docbodies[$doc-i].words.elems;
    return -100e0 if $total-tokens == 0 or $num-tokens == 0;
    log($num-tokens) - log($total-tokens);
}

sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
    gather for ^$topic-size -> $topic {
        if %vocabulary{$token}:exists {
            take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
        } else {
            take -100e0;
        }
    }.reduce(&log-sum);
}

sub log-sum(Num $log-a, Num $log-b --> Num) {
    # Computes log(exp($log-a) + exp($log-b)) without underflow.
    if $log-a < $log-b {
        return $log-b + log(1 + exp($log-a - $log-b))
    } else {
        return $log-a + log(1 + exp($log-b - $log-a))
    }
}
At the beginning, we load the saved model and prepare @document-topic, @topic-word, %vocabulary, @documents, @docbodies, %doc-to-person and @members:
my \doc-topic-iter ="document-topic.txt".IO.lines.iterator;
my \topic-word-iter ="topic-word.txt".IO.lines.iterator;
my ($document-size, $topic-size) = doc-topic-iter.pull-one.words;
my ($, $word-size) = topic-word-iter.pull-one.words;
myNum @document-topic[$document-size;$topic-size];
myNum @topic-word[$topic-size;$word-size];
for^$document-size -> $doc-i {
my \maybe-line = doc-topic-iter.pull-one;
die"Error: Something went wrong"if maybe-line =:= IterationEnd;
myNum @line =@(maybe-line).words>>.Num;
for^@line {
@document-topic[$doc-i;$_] = @line[$_];
}
}
for^$topic-size -> $topic-i {
my \maybe-line = topic-word-iter.pull-one;
die"Error: Something went wrong"if maybe-line =:= IterationEnd;
myNum @line =@(maybe-line).words>>.Num;
for^@line {
@topic-word[$topic-i;$_] = @line[$_];
}
}
my%vocabulary ="vocabulary.txt".IO.lines.pairs>>.antipair.hash;
my @members ="members.txt".IO.lines;
my @documents ="documents.txt".IO.lines;
my @docbodies = @documents.map({ my ($, $, *@body) =.words; @body.join("") });
my%doc-to-person = @documents.map({ my ($docid, $personid, $) =.words; %($docid => $personid) }).hash;
Next, we set @query using the option :$query:
my @query = $query.words.map(*.lc);
After that, we compute the probability P(query|document) based on Eq. 9 of the aforementioned paper, i.e., per query token, P(w|d) = 0.2 · Pml(w|d) + 0.8 · Plda(w|d) (note that we use logarithms to avoid underflow and set the parameter mu to zero), and sort the documents by it:
my @sorted-list = gather for ^$document-size -> $doc-i {
    my Num $log-prob = gather for @query -> $token {
        my Num $log-ml-prob = Pml(@docbodies, $doc-i, $token);
        my Num $log-lda-prob = Plda($token, $topic-size, $doc-i, %vocabulary, @document-topic, @topic-word);
        take log-sum(log(0.2) + $log-ml-prob, log(0.8) + $log-lda-prob);
    }.sum;
    take %(doc-i => $doc-i, log-prob => $log-prob);
}.sort({ $^b<log-prob> <=> $^a<log-prob> });
Plda adds the logarithmic topic-given-document probability (i.e., ln P(topic|theta, document)) and the word-given-topic probability (i.e., ln P(word|phi, topic)) for each topic, and then sums over topics with .reduce(&log-sum):
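sub Plda($token, $topic-size, $doc-i, %vocabulary is raw, @document-topic is raw, @topic-word is raw --> Num) {
    gather for ^$topic-size -> $topic {
        if %vocabulary{$token}:exists {
            take @document-topic[$doc-i;$topic] + @topic-word[$topic;%vocabulary{$token}];
        } else {
            take -100e0;
        }
    }.reduce(&log-sum);
}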
and Pml (ML stands for maximum likelihood) counts occurrences of $token and normalizes by the total number of tokens in the document (note that this computation is also done in log space):
sub Pml(@docbodies, $doc-i, $token --> Num) {
    my Int $num-tokens = @docbodies[$doc-i].words.grep({ /:i ^ $token $/ }).elems;
    my Int $total-tokens = @docbodies[$doc-i].words.elems;
    return -100e0 if $total-tokens == 0 or $num-tokens == 0;
    log($num-tokens) - log($total-tokens);
}
OK, then let’s execute!
query “perl”:
$ perl6 search-quotation.p6 --query="perl""Perl will always provide the null." by Larry Wall -3.301156
"Perl programming is an *empirical* science!" by Larry Wall -3.345189
"The whole intent of Perl 5's module system was to encourage the growth of Perl culture rather than the Perl core." by Larry Wall -3.490238
"I dunno, I dream in Perl sometimes..." by Larry Wall -3.491790
"At many levels, Perl is a 'diagonal' language." by Larry Wall -3.575779
"Almost nothing in Perl serves a single purpose." by Larry Wall -3.589218
"Perl has a long tradition of working around compilers." by Larry Wall -3.674111
"As for whether Perl 6 will replace Perl 5, yeah, probably, in about 40 years or so." by Larry Wall -3.684454
"Well, I think Perl should run faster than C." by Larry Wall -3.771155
"It's certainly easy to calculate the average attendance for Perl conferences." by Larry Wall -3.864075
query “apple”:
$ perl6 search-quotation.p6 --query="apple""Steve Jobs is the"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.841538"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." by Carly Fiorina -3.904489"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -3.940359"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.152145"[W]hen Apple wanted the name "iPhone" and it was owned by Cisco, Steve Jobs just took it, and his legal team executed so he could keep it. It turned out that doing this was surprisingly inexpensive. And, as the Apple Watch showcased, the Apple Phone likely would not have sold anywhere near as well as the iPhone." by Rob Enderle -4.187223"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.318575"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.380863"Apple no longer owns the tablet market, and will likely lose dominance this year or next. ... this level of sustained dominance doesn't appear to recur with the same vendor even if it launched the category." by Rob Enderle -4.397954"Apple is becoming more and more like a typical tech firm — that is, long on technology and short on magic. ... Apple is drifting closer and closer to where it was back in the 1990s. It offers advancements that largely follow those made by others years earlier, product proliferation, a preference for more over simple elegance, and waning excitement." by Rob Enderle -4.448473"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.469852 Ronald McDonald of Apple, he is the face." by Rob Enderle -3.822949"With phones moving to technologies such as Apple Pay, an unwillingness to assure security could create a Target-like exposure that wipes Apple out of the market." by Rob Enderle -3.849055"*:From Joint Apple / HP press release dated 1 January 2004 available [http://www.apple.com/pr/library/2004/jan/08hp.html here]." 
by Carly Fiorina -3.895163"Samsung did to Apple what Apple did to Microsoft, skewering its devoted users and reputation, only better. ... There is a way for Apple to fight back, but the company no longer has that skill, and apparently doesn't know where to get it, either." by Rob Enderle -4.052616"*** The previous line contains the naughty word '$&'.\nif /(ibm|apple|awk)/;# :-)" by Larry Wall -4.088445"The cause of [Apple v. Qualcomm] appears to be an effort by Apple to pressure Qualcomm into providing a unique discount, largely because Apple has run into an innovation wall, is under increased competition from firms like Samsung, and has moved to a massive cost reduction strategy. (I've never known this to end well, as it causes suppliers to create unreliable components and outright fail.)" by Rob Enderle -4.169533
"[T]he litigation between Qualcomm and Apple/Intel ... is weird. What makes it weird is that Intel appears to think that by helping Apple drive down Qualcomm prices, it will gain an advantage, but since its only value is as a lower cost, lower performing, alternative to Qualcomm's modems, the result would be more aggressively priced better alternatives to Intel's offerings from Qualcomm/Broadcom, wiping Intel out of the market. On paper, this is a lose/lose for Intel and even for Apple. The lower prices would flow to Apple competitors as well, lowering the price of competing phones. So, Apple would not get a lasting benefit either." by Rob Enderle -4.197869
"Apple tends to aggressively work to not discover problems with products that are shipped and certainly not talk about them." by Rob Enderle -4.204618
"Today's tech companies aren't built to last, as Apple's recent earnings report shows all too well." by Rob Enderle -4.209901
"[W]hen it came to the iWatch, also a name that Apple didn't own, Apple walked away from it and instead launched the Apple Watch. Certainly, no risk of litigation, but the product's sales are a fraction of what they otherwise might have been with the proper name and branding." by Rob Enderle -4.238582
Conclusions
In this article, we explored Wikiquote and created an LDA model using Algorithm::LDA.
After that, we built an information retrieval application.
Thanks for reading my article! See you next time!
Citations
Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.
Wei, Xing, and W. Bruce Croft. “LDA-based document models for ad-hoc retrieval.” Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006.
I’ve already mentioned Bisectable in one of the advent posts two years ago, but since then a lot has changed, so I think it’s time to give a brief history of the bisectable bot and its friends.
First of all, let’s define the problem that is being solved. Sometimes it happens that a commit introduces an unintended change in behavior (a bug). Usually we call that a regression, and in some cases the easiest way to figure out what went wrong and fix it is to first find which commit introduced the regression.
There are exactly 9000 commits between Rakudo 2015.12 and 2018.12, and even though it’s not over 9000, that’s still a lot.
A good amount of my work time this year has been spent on building a couple of Perl 6 applications. After a decade of contributing to Perl 6 compiler and runtime development, it feels great to finally be using it to deliver production solutions solving real-world problems. I’m still not sure whether writing code in an IDE I founded, using a HTTP library I designed, compiled by a compiler I implemented large parts of, and running on a VM that I play architect for, makes me one of the world’s worst cases of “Not Invented Here”, or just really Full Stack.
Whatever I’m working on, I highly value automated testing. Each passing test is something I know works – and something that I won’t break as I evolve the software in question. Even with automated tests, bugs happen, but adding a test to cover the bug at least means I’ll make different bugs in the future, which is perhaps a bit more forgivable.
The year is ending and we have a lot to celebrate! What is a better way to celebrate the end of the year than with our family and friends? To help achieve that, here at my home, we decided to run a Secret Santa Game! So, my goal is to write a Secret Santa Program! That’s something where I can use this wonderful project called Red.
Red is an ORM (Object Relational Model) for Perl 6, still under development and not yet published as a module. But it’s growing, and it is close to a release.
So let’s create our first table: a table that will store the people participating in our Secret Santa. To the code:
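Red’s API was still settling at the time, so take this as a sketch in the style of its README; the column names for our Secret Santa participants are my own choice:

use Red;

# A participant in the Secret Santa game.
model Person {
    has UInt $.id    is serial;   # auto-incrementing primary key
    has Str  $.name  is column;
    has Str  $.email is column;
}

my $*RED-DB = database "SQLite"; # in-memory SQLite, handy for experimenting
Person.^create-table;

Person.^create: name => "Fernanda", email => "fernanda@example.com";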
Advent is an exciting time, a time of anticipation. And not only for us humans — it is the time when elves become most inventive. Today, I want to take some leisure time out of the Christmas stress to report about some pioneering work that is being done in the area of gift wrapping. Even if you didn’t anticipate any news from there, this report might still help you improve your technique, as — I don’t have to remind you — Christmas is approaching fast.
Do you know which presents small children like most? Large presents. Therefore, the Present Enlargement Research Lab at Northpole is tasked with finding practical ways to make presents larger. Now, “large” can mean multiple things. I will admit that the 6th unit is bending the meaning a bit, but their work is by far the most interesting: they increase the volume of presents, by increasing the dimension of the gift boxes.
I am a big fan of roleplaying games like Dungeons and Dragons. Most of these games have screens to help you hide what you’re doing when running the game and give you some of the charts used in the game to reduce looking stuff up in the books.
My game collection is extensive though, and I’d much rather use my laptop to not only hide behind and track information but also automate dice rolls and chart usage. Whilst I could cobble some stuff together with text editors and command-line magic, I’d much rather have some snazzy desktop apps that I can show off to people.
Enter GTK::Simple, a wrapper around the gtk3 UI library used by the Linux GNOME desktop but also available on Windows and Mac. This library gives you a simple-to-use interface, via the power of NativeCall, to let you create simple desktop applications.
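To give a flavor of it, here is a minimal sketch modeled on the GTK::Simple README; the dice-rolling behavior is my own example, not from the original post:

use GTK::Simple;
use GTK::Simple::App;

# One window with a button that "rolls" a d20 and a label showing the result.
my $app    = GTK::Simple::App.new(title => "Dice Roller");
my $label  = GTK::Simple::Label.new(text => "Press the button to roll");
my $button = GTK::Simple::Button.new(label => "Roll d20");

$app.set-content(GTK::Simple::VBox.new($button, $label));
$app.border-width = 20;

$button.clicked.tap({ $label.text = "You rolled: { (1..20).pick }" });
$app.run;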
Christmas trees are a traditional symbol that date back more than four hundred years in Europe, so what could be better for an advent article than something about creating Christmas tree images.
The typical, simplified representation of the tree is several triangles of decreasing size stacked on top of each other with a small overlap, so it is fairly easy to create with a computer program.
Here I’ll use Scalable Vector Graphics (SVG) to draw the image as, given the description above, it seems perfectly suited to the task.
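As a sketch of that idea (the dimensions and colors are arbitrary choices of mine), the triangles can be emitted as SVG polygons without any external module:

my $width  = 200;
my $height = 260;

# Three stacked triangles: bases move up and widths shrink as we go.
my @polys = gather for ^3 -> $i {
    my $base-y = 220 - $i * 50;
    my $apex-y = $base-y - 80;
    my $half   = 90 - $i * 20;
    my $cx     = $width div 2;
    take qq[<polygon points="{$cx - $half},$base-y {$cx + $half},$base-y $cx,$apex-y" fill="green"/>];
}
my $trunk = qq[<rect x="{$width div 2 - 10}" y="220" width="20" height="30" fill="brown"/>];

spurt "tree.svg", qq[<svg xmlns="http://www.w3.org/2000/svg" width="$width" height="$height">\n{@polys.join("\n")}\n$trunk\n</svg>\n];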
Our mission, should we choose to accept it, is to solve the SEND + MORE = MONEY problem in code. No, hold on, let me put it like this instead:
S E N D
+ M O R E
-----------
M O N E Y
It means the same, but putting it up like this is more visually evocative, especially since many of us did it this way in school.
The ground rules are simple.
Each letter represents a digit between 0 and 9.
The letters represent distinct digits; two letters may not share the same digit.
Leading digits (in our puzzle, S and M) can’t be zero. Then they wouldn’t be leading digits!
Given these constraints, there’s a unique solution to the puzzle above.
I encourage you to find the solution. Write a bit of code, live a little! In this post, we’ll do that, but then (crucially) not be satisfied with that, and end up in a nested-doll situation where code writes code until something really neat emerges. The conclusion will spell out the ultimate vision — hold on, I’m being informed in real-time by the Plurality Committee that the correct term is “an ultimate vision” — for Perl 6.
Let’s do this.
Marcus Junius Brute Force (The Younger)
Our first language of the day, with its corresponding solution, is Perl 6 itself. There’s no finesse here; we just charge right through the solution space like an enraged bull, trying everything. In fact, we make sure not to attempt any cleverness with this one, just try to express the solution as straightforwardly as possible.
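A brute-force search in that spirit looks like this (a sketch; the loop order and formatting are my own):

for 1..9 -> int $s {                                  # leading digit, can't be 0
    for 0..9 -> int $e {
        next if $e == $s;
        for 0..9 -> int $n {
            next if $n == any($s, $e);
            for 0..9 -> int $d {
                next if $d == any($s, $e, $n);
                for 1..9 -> int $m {                  # leading digit, can't be 0
                    next if $m == any($s, $e, $n, $d);
                    for 0..9 -> int $o {
                        next if $o == any($s, $e, $n, $d, $m);
                        for 0..9 -> int $r {
                            next if $r == any($s, $e, $n, $d, $m, $o);
                            my int $send  = $s*1000 + $e*100 + $n*10 + $d;
                            my int $more  = $m*1000 + $o*100 + $r*10 + $e;
                            my int $money = $send + $more;
                            my int $y     = $money % 10;   # the shortcut: Y is determined
                            next if $y == any($s, $e, $n, $d, $m, $o, $r);
                            next unless $money == $m*10000 + $o*1000 + $n*100 + $e*10 + $y;
                            say "$send + $more == $money";
                        }
                    }
                }
            }
        }
    }
}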
Again, it’s not pretty, but it works. This is the kind of indentation level your mother warned you about. If you ask me, though, I’m more annoyed about the indentation being there at all. We have one for every variable whose search space we need to scan through. (Only with Y do we get to take a shortcut.)
Though it’s a detour for today’s buffet, MJD once blogged about this and then I blogged about it too. Those blog posts were very much about “removing the indentation”, in a sense. Today’s post is where my thinking has taken me, three years later.
I took the path less traveled (and all the other paths, too)
Our second language is still mostly Perl 6, but with a neat hypothetical extension called amb, but spelled (evocatively) <-. It gets rid of all the explicit for loops and levels of indentation.
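Shaped to match the translator script below, the amb version of the search might read like this (a sketch in the hypothetical <- notation):

my $s <- 1..9;
my $e <- 0..9;
guard $e != any($s);
my $n <- 0..9;
guard $n != any($s, $e);
my $d <- 0..9;
guard $d != any($s, $e, $n);
my $m <- 1..9;
guard $m != any($s, $e, $n, $d);
my $o <- 0..9;
guard $o != any($s, $e, $n, $d, $m);
my $r <- 0..9;
guard $r != any($s, $e, $n, $d, $m, $o);

my $y = ($s*1000 + $e*100 + $n*10 + $d + $m*1000 + $o*100 + $r*10 + $e) % 10;
guard $y != any($s, $e, $n, $d, $m, $o, $r);
guard ($s*1000 + $e*100 + $n*10 + $d) + ($m*1000 + $o*100 + $r*10 + $e) == $m*10000 + $o*1000 + $n*100 + $e*10 + $y;

say "$s$e$n$d + $m$o$r$e == $m$o$n$e$y";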
This solution is shorter, more compact, and feels less “noisy” and aggravating just by ridding us of the for loops. (I suspect this has something to do with that imperative↔declarative spectrum people mention sometimes. We’re not so interested in looping as such, only seeing it get done.)
I know it won’t completely make up for the fact that Perl 6 doesn’t have the amb operator and guard implemented in core (or even in module space), but here’s a short script that will convert the above program to today’s first version:
my $indent = 0;
constant SPACE = chr(0x20);
sub indent { SPACE x 4 * $indent }
for lines() {
    when /^ my \h+ ('$'\w) \h* '<-' \h* (\d+ \h* '..' \h* \d+) ';' $/ {
        say indent, "for $1 -> int $0 \{";
        $indent++;
    }
    when /^ guard \h+ ('$'\w) \h* '!=' \h* 'any(' ('$'\w)+ % [\h* ',' \h*] ')' \h* ';' $/ {
        say indent, "next if $0 == $_;" for $1;
        say "";
    }
    when /^ guard \h+ ([<!before '=='> .]+ '==' <-[;]>+) ';' $/ {
        say indent, "next unless $0;";
    }
    when /^ my \h+ ('$'\w+) \h* '=' \h* (<-[;]>+) ';' $/ {
        say indent, "my int $0 = $1;";
    }
    when /^ \h* $/ {
        say "";
    }
    when /^ say \h+ (<-[;]>+) ';' $/ {
        say indent, $_;
    }
    default {
        die "Couldn't match $_";
    }
}
while $indent-- {
    say indent, "\}";
}
But we’ll not be satisfied here either. Oh no.
Thinking in equations
The third language takes us even further into the declarative, getting rid of all the guard clauses that simply state that the variables should be distinct.
ALL_DISTINCT
$d in 0..9
$e in 0..9
$n in 0..9
$r in 0..9
$o in 0..9
$s in 1..9
$m in 1..9
$y = ($d + $e) % 10
$_c1 = ($d + $e) div 10
($_c1 + $n + $r) % 10 == $e
$_c2 = ($_c1 + $n + $r) div 10
($_c2 + $e + $o) % 10 == $n
$_c3 = ($_c2 + $e + $o) div 10
($_c3 + $s + $m) % 10 == $o
$_c4 = ($_c3 + $s + $m) div 10
$_c4 % 10 == $m
We’re completely in the domain of constraint programming now, and it would be disingenuous not to mention this. We’ve left the imperative aspects of Perl 6 behind, and we’re focusing solely on describing the constraints of the problem we’re solving.
The most imperative aspect of the above program is when we do an assignment. Even this is mostly an optimization, in the cases when we know we can compute the value of a variable directly instead of searching for it.
Even in this case, we could translate back to the previous solution. I’ll leave out such a translator for now, though.
I’m going to come back to this language in the conclusion, because it turns out in many ways, it’s the most interesting one.
The fourth language
Having gotten this far, what more imperative complexity can we peel off? Specifically, where do those equations come from that are specified in the previous solution? How can we express them more succinctly?
You’ll like this, I think. The fourth language just expresses the search like this:
S E N D
+ M O R E
-----------
M O N E Y
Hang on, what again? Yes, you read that right. The most declarative solution to this problem is just an ASCII layout of the problem specification itself! Don’t you just love it when the problem space and the solution space meet up like that?
From this layout, we can again translate back to the constraint programming solution, weaving equations out of the manual algorithm for addition that we learn in school.
So, not only don’t we have to write those aggravating for loops; if we’re tenacious enough, we can have code generation all the way from the problem to the solution. We just need to find the appropriate languages to land on in-between.
Conclusion
My exploration with 007 has led me to think about things like the above: translating programs. Perl 6 already exposes one part of the compilation process very well: parsing. We can use grammars both in userland and within the Perl 6 toolchain itself.
I’ve come to believe we need to do that to all aspects of the compilation pipeline. Here, let me put it as a slogan or a declaration of sorts:
Perl 6 will have reached its full potential when all features we bring to bear manipulating text/data can also be turned inwards, to the compilation process itself.
Those translators I wrote (or imagined) between my different languages, they work in a pinch but they’re also fragile and a bit of a waste. The problem is to a large extent that we drop down to text all the time. We should be doing this at the AST level, where all the structure is readily available.
The gains from such a mind shift cannot be overstated. This is where we will find Lispy enlightenment in Perl 6.
For example, the third language with the equations doesn’t have to be blindly translated into code. It can be optimized, the equations massaged into narrower and more precise ones. As can be seen on Wikipedia, it’s possible to do such a good job of optimizing that there’s no searching left once the program runs.
My dream: to be able to do the above transformations, not between text files but between slangs within Perl 6. And to be able to do the optimization step as well. All without leaving the comfort of the language.
However, as an elf I know once quoted, ‘It’s hard to tell the difference between mastered technique and magic’. So the mystery can be resolved? Let’s glance over the way it all works.