
Tutorial: Online LDA with Vowpal Wabbit

A lesser-known feature of Vowpal Wabbit is Online Latent Dirichlet Allocation (LDA), which lets you do topic modelling on millions of documents in under an hour.

Under construction. Come back soon.

Latent Dirichlet Allocation

Topic modeling

Probabilistic topic models are useful for uncovering the underlying semantic structure of a collection of documents.

Latent Dirichlet Allocation (LDA) is a hierarchical Bayesian model that explains the variation in a set of documents in terms of a set of K latent “topics”.

Now if the description above instantly makes you go glassy-eyed, I will not blame you. Explaining Latent Dirichlet Allocation in layman's terms is a pretty daunting task and not within the scope of this article (and admittedly not fully within the scope of my own skills).

Unsupervised

LDA is unsupervised: you do not need labelled samples. Just a corpus of text documents will do.

After running LDA we end up with a number of unnamed topics, each containing tokens related to that topic.

Corpus

The corpus can be general (web) text, or specific, like the books from a single author, logs from an IRC channel, or all the posts on some forum.

Sample output

Sample output from LDA may look like:

[...]
Topic 20:
game
games
play
ball
[...]
played
goal
card
minutes

Topic 21:
university
college
research
professor
[...]
department
study
academy
sciences
[...]

Online learning with LDA

Matt Hoffman first described a method for online (mini-batch) learning of LDA in a 2010 NIPS paper, "Online Learning for Latent Dirichlet Allocation".

Benefits

In-memory LDA calculations are very slow and memory-hungry. Training LDA on any corpus over ~200,000 documents becomes (technically) very hard.

Matt Hoffman's approach (gradient updates on mini-batches) is fast, memory-friendly and performs nearly as well as in-memory LDA fitting.

Implementations

Matt released a Python version of his online LDA algorithm, and luckily for us, wrote a C++ implementation for Vowpal Wabbit too. The way I see it: Online LDA and VW are a match made in heaven.

Algorithm description

In pseudo-code:

Until converged:
  Choose a mini-batch of documents randomly
  For each document in that mini-batch:
    Estimate approximate posterior over what 
    topics each word in each document came from
  (Partially) update approximate posterior over topic 
  distributions based on which words are believed to 
  have come from which topics
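
To make the loop above more concrete, here is a heavily simplified NumPy sketch of Hoffman-style online variational Bayes for LDA. It is only an illustration of the algorithm, not VW's actual C++ implementation, and all names and values in it (e_step, train, the priors and the schedule constants) are assumptions for the example:

import numpy as np
from scipy.special import digamma

#Illustrative Hoffman-style online variational Bayes for LDA (not VW's code)
K, V, D = 20, 10000, 2013336     #topics, vocabulary size, (estimated) corpus size
alpha, eta = 0.1, 0.01           #priors on theta (doc-topic) and beta (topic-word)
tau0, kappa = 1.0, 0.5           #learning-rate schedule: rho_t = (tau0 + t) ** -kappa
rng = np.random.default_rng(0)
lam = rng.gamma(100.0, 1.0 / 100.0, (K, V))  #variational parameters of the topics

def e_step(word_ids, counts, lam):
  #Estimate the approximate posterior over topics for one document
  Elogbeta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
  expElogbeta = np.exp(Elogbeta[:, word_ids])            #K x n_words
  gamma = np.ones(K)                                     #doc-topic posterior params
  for _ in range(50):                                    #fixed-point iterations
    expElogtheta = np.exp(digamma(gamma) - digamma(gamma.sum()))
    phinorm = expElogtheta @ expElogbeta + 1e-100        #length n_words
    gamma = alpha + expElogtheta * (expElogbeta @ (counts / phinorm))
  #sufficient statistics: how much of each word is attributed to each topic
  sstats = np.outer(expElogtheta, counts / phinorm) * expElogbeta
  return gamma, sstats

def train(minibatches):
  #minibatches: iterable of lists of (word_ids, counts) document pairs
  global lam
  for t, batch in enumerate(minibatches):
    sstats = np.zeros_like(lam)
    for word_ids, counts in batch:
      _, doc_sstats = e_step(np.asarray(word_ids), np.asarray(counts, float), lam)
      sstats[:, word_ids] += doc_sstats
    rho = (tau0 + t) ** -kappa                 #step size decays over time
    lam_hat = eta + (D / len(batch)) * sstats  #noisy full-corpus estimate from this batch
    lam = (1.0 - rho) * lam + rho * lam_hat    #partial (online) update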

Vowpal Wabbit & LDA

Input format

The input format for LDA in Vowpal Wabbit is different from regular labelled train and test sets. With LDA:

  • No namespaces
  • No labels
  • No IDs

Basically it is just a pipe (|) followed by the document's words, each optionally followed by a count.

| this is a document about sports
| here:1 another:2 two:2 documents:2 about:2 sports:2
| the:17 epic:5 of:18 gilgamesh:6 [...]
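
If your documents start out as plain strings, a few lines of Python (a sketch using only the standard library; the function name is just for illustration) will produce this format with counts:

from collections import Counter
import re

def to_vw_lda_line(text):
  #Turn one raw document into a "| word:count ..." VW-LDA line
  tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
  return "| " + " ".join("%s:%d" % (tok, n) for tok, n in Counter(tokens).items())

print(to_vw_lda_line("Here, another two documents about sports. Two documents!"))
#prints (on Python 3.7+, where Counter keeps insertion order):
#| here:1 another:1 two:2 documents:2 about:1 sports:1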

Corpus: StackOverflow posts

For the corpus we are going to use the StackOverflow corpus as used in the Facebook Recruiting Challenge III hosted by Kaggle.

We take the test set (~2.3GB of text) and munge it to Vowpal Wabbit format. We use the Python script below to clean the text:

from csv import DictReader
import re

loc_csv = "d:\\downloads\\test\\test.csv"
#will be created:
loc_vw_train = "d:\\stackoverflow.lda.vw"

def clean(s):
  #Text cleaning function: keep word characters only, lowercased
  try:
    return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()
  except TypeError:
    return "not_a_valid_value"

def to_vwlda(loc_csv, loc_vw_train):
  #Open outfile to write VW format
  with open(loc_vw_train, "w", encoding="utf-8") as outfile:
    #For every enumerated row {} in csv file
    for e, row in enumerate(DictReader(open(loc_csv, encoding="utf-8", newline=""))):
      #Write the document title to VW-LDA format (one document per line)
      outfile.write("| %s\n" % clean(row['Title']))
      #Report progress
      if e % 10000 == 0:
        print(e)
    print(e)

#to_vwlda(loc_csv, loc_vw_train)

We now have a VW-LDA formatted file stackoverflow.lda.vw. It contains 2,013,336 question titles, each on a single line:

| getting rid of site specific hotkeys [...]

Now let’s fit a topic model on this dataset.

Note: We could add the body text, but then you would need to generate and inspect far more topics. With the title text only, some topics may still appear to be a bit unclear or incoherent, but overall solid topics are emerging. See also the second video below this post: “Use sentences for mini-documents”.

Creating topics

Fitting and outputting result

We train Vowpal Wabbit 7.6.1 on Windows 64-bit in Cygwin.

./vw -d stackoverflow.lda.vw --lda 20 --lda_D 2013336 --readable_model lda.model.vw

Where:

  • ./vw is our executable
  • -d stackoverflow.lda.vw specifies our dataset
  • --lda 20 says to generate 20 topics
  • --lda_D 2013336 specifies the number of documents in our corpus. In a truly online setting, this can be an estimate of the maximum number of documents that could ever be seen.
  • --readable_model lda.model.vw stores our topic model in a readable format. Unfortunately --invert_hash does not give us the original tokens, but with the readable model and a little script, we can still get back the tokens.

Note: It's important that you explicitly specify -d for your dataset, otherwise the other LDA options may throw an error.

One could also modify the following LDA-related parameters (an example command that combines several of them follows the list):

  • --lda_alpha. A hyperparameter for the prior on weight vectors theta.
  • --lda_rho. A hyperparameter for the prior on the topics beta (the paper calls this eta; in VW the flag is --lda_rho).
  • --power_t (kappa) and --initial_t (tau0). For scheduling learning stepsize.
  • --minibatch. The size of the mini-batch (the number of documents processed per update).
  • -b. The bit size determines how many distinct token hashes the model can store. The default VW bit size is 18, so 2^18 = 262,144 tokens.
  • -c -k --passes 10. To use a cache, kill previous cache, and run 10 passes.
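
Putting several of these together, a fuller invocation could look like the line below. The parameter values here are purely illustrative, not tuned for this corpus:

./vw -d stackoverflow.lda.vw --lda 20 --lda_D 2013336 --lda_alpha 0.1 --lda_rho 0.1 --minibatch 256 --power_t 0.5 --initial_t 1 -b 22 -c -k --passes 10 --readable_model lda.model.vw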

Turning output into readable model

A readable model now looks like this (here shown for a run with 5 topics):

Version 7.6.1
Min label:0.000000
Max label:1.000000
bits:18
0 pairs: 
0 triples: 
rank:0
lda:10
0 ngram: 
0 skip: 
options:
0 0.100328 33306.531250 0.101609 0.100231 0.100696
1 0.100302 0.100571 0.102158 0.100313 164633.640625
...
262142 1168.760742 0.100535 0.106653 0.100212 0.107155
262143 0.100223 990.998352 0.100993 0.100040 0.101263

We are interested in everything after options:

The first column is the token id, here 0: the first unique token (hash bucket) in the dataset.

The columns that follow are the weights for each of the 5 topics. Token id 0 has by far its largest weight in the second topic (33306.531250).

We use a small Python script to transform the readable_model file into topics; a sketch is shown below. For more attractive visualizations, one could output a .json file for use in d3.js.
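
The following is a minimal sketch of such a script, not the exact one used here. It assumes 20 topics (matching --lda 20 above), scores each token by its row-normalized weight (which is what the topic listings in the next section show; see the comments at the end of this post for why column normalization may be the theoretically sounder choice), and assumes you have built an id_to_token dictionary yourself, for example via the --audit trick described in the comments.

import heapq

NUM_TOPICS = 20   #matches --lda 20 above
TOP_N = 20

def load_readable_model(path):
  #Parse the part of a --readable_model file that follows "options:"
  weights = {}
  with open(path) as f:
    past_header = False
    for line in f:
      if not past_header:
        past_header = line.startswith("options:")
        continue
      parts = line.split()
      if len(parts) != NUM_TOPICS + 1:
        continue
      weights[int(parts[0])] = [float(w) for w in parts[1:]]
  return weights

def top_tokens(weights, id_to_token, topic, n=TOP_N):
  #Top-n tokens for one topic, scored by the row-normalized weight
  #(note: this gives p(topic|token), not p(token|topic); see the comments at the end)
  scored = []
  for token_id, row in weights.items():
    score = row[topic] / sum(row)
    scored.append((score, id_to_token.get(token_id, str(token_id))))
  return heapq.nlargest(n, scored)

#Example usage, with a hypothetical id_to_token dictionary:
#weights = load_readable_model("lda.model.vw")
#for k in range(NUM_TOPICS):
#  print("Topic %d" % (k + 1))
#  for score, token in top_tokens(weights, id_to_token, k):
#    print("%.3f %s" % (score, token))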

Results

Some topics that formed:

Topic 1
0.997 printf
0.997 sizeof
0.996 characters
0.996 character
0.995 endl
0.995 stdio
0.994 iostream
0.993 cout
0.992 unsigned
0.991 malloc
0.991 typedef
0.991 cin
0.991 argc
0.989 size_t
0.988 len
0.988 std
0.986 unicode
0.986 ascii
0.986 fprintf
0.986 scanf

Topic 2
0.999 img
0.999 div
0.999 width
0.999 height
0.999 png
0.999 jquery
0.999 alt
0.999 imgur
0.999 css
0.999 border
0.999 margin
0.998 1px
0.998 color
0.998 jsfiddle
0.998 0px
0.998 getelementbyid
0.998 addsubview
0.998 jpg
0.998 alloc
0.998 cgrectmake

Topic 3
1.0 about
1.0 question
1.0 we
1.0 looking
1.0 best
0.999 good
0.999 since
0.999 better
0.999 say
0.999 their
0.999 wondering
0.999 most
0.999 computer
0.999 such
0.999 our
0.999 were
0.999 own
0.999 really
0.999 might
0.999 think

Topic 4
0.997 eventargs
0.996 mysql_query
0.996 linq
0.996 varchar
0.995 actionresult
0.995 ienumerable
0.995 lastname
0.995 firstname
0.994 tolist
0.994 entity
0.994 writeline
0.993 sqlcommand
0.993 dbo
0.993 user_id
0.993 binding
0.992 userid
0.992 datatable
0.992 databind
0.991 byval
0.991 connectionstring

Further reading, notes, todo’s

Github LDA wiki
In-depth LDA presentation with interesting grouping of topics into topic chains

14 thoughts on “Tutorial: Online LDA with Vowpal Wabbit”

  1. Thank you for the feedback! I will finish this post next. Give me a day or two and I’ll finish it (I will post it as a new article). Oh Yeah!

  2. Sorry if this is a dupe; can’t remember if I clicked “Post Comment” or not. Anyway, I’m interested in this; in particular, I’d like to see what parameters you fed to vw.

    1. Hi, thanks for this, very helpful.
      I think there is an error: when you show the topics with the words and the probabilities, the sum of all probabilities of a given topic should be 1, and it is not. I am sure about this, because LDA outputs a p(word|topic) distribution for each topic.

      Now, I am not sure about how vw computes this, but the outputs are not probabilities, so why did you normalize by row? Maybe you need to normalize by column, but about this I don't know. Probably both computations are correct, but normalizing by columns would give you p(w|t), whilst normalizing by row would give you p(t|w). Topics should be built using p(w|t).

      Also, watch out for the hashing trick: I noticed vw will generate the probabilities for the whole hashing space, so if you have only 2 terms and use 2 bits, you will find four lines in the readable model output, with 3 bits, 8 lines, and so on. So, if you want to match the appearance order of terms to lines in the output file, make sure you use a bigger hashing space than the number of terms (to avoid collisions), and also pass --noconstant, so that the bias term is not used, thereby avoiding a collision between the bias term and one of the words.

      1. Thank you for the feedback, very helpful here too, since I am also just figuring all this out.

        I think you are correct in the “mistake” of normalizing per row. It did generate pleasing topics, so at least it is working to some degree, but I am not known for theoretically sound solutions :).

        I am going to finish this post very soon, and perhaps show the difference in topics that emerge when normalizing by column or by row.

  3. Hi,
    thanks very much for this helpful tutorial. Keep up this type of posts!

    In order to specify the hyper-parameter for the prior on the topics beta, I think you should use --lda_rho instead of --lda_eta.

    If you finished it, could you please post the python script that converts the word ids to real words?

  4. One more person asking for the python script… Or I am gladly willing to write a similar one, provided I understand how to look back the words in the topics? Have to admit I don’t see how to look back into the document for them.

    Thank you.

  5. You can score the training data with the --audit option to get the mapping from hash values to tokens.
    vw -i model.dat -t /dev/null | tr '^' '\n' | grep : | sort -t: -k1,1 -u > output.txt
    You can load the output into MySQL, for example, and compute various probabilities there. (The above command filters duplicate words but it's more correct to filter duplicate hashes).
