How to produce and use datasets: lessons learned

Various studies have focused on the complexities of publishing and using (open) data. A number of lessons can be learned from the experiences of (governmental) data providers, policy-makers, data users, entrepreneurs, competitors and researchers.

Data can be provided by the government, crawled from the web, or generated by sensors. Here are 50 lessons learned in the form of tips and guidelines on creating and using high-quality open datasets.


Predict visual stimuli from human brain activity

Kaggle is hosting a contest where the task is to predict visual stimuli from magnetoencephalography (MEG) recordings of human brain activity. A subject is presented with a stimulus (a human face or a distorted face) and the concurrent brain activity is recorded. The relation between the recorded signal and the stimulus may provide insights into the underlying mental process. We use Vowpal Wabbit to beat the benchmark.


Go to the Kaggle competition page to read the full description.

We have the data for 23 participants in the study. All participants have completed around 580 trials. There are time series of brain activity (starting 0.5 seconds before the stimulus is presented, a total of 375 bins). There are 306 channels (from the MEG sensors).

Labels are either 1 (a human face) or 0 (a distorted face). We have the labels for the trials of 16 participants (the train set). We have to predict the labels for the trials of 7 participants (the test set).


Predicting repeat buyers using purchase history

Another Kaggle contest means another chance to try out Vowpal Wabbit. This time on a data set of nearly 350 million rows. We will discuss feature engineering for the latest Kaggle contest and how to get a top 3 public leaderboard score (~0.59347 AUC).

A short competition description

The competition is to predict repeat buyers (those who redeem a coupon and purchase that product afterwards). For this we have the labelled data (did become repeat buyer, did not become repeat buyer) for about 150,000 shoppers (the train set).

Our task is to predict the labels for about 150,000 other shoppers (the test set). For this we can use a file called transactions.csv. It’s a huge file (about 22GB unzipped) containing nearly 350 million rows. The total amount spent in the transaction data is nearly 1.5 billion.


Install Vowpal Wabbit on Windows and Cygwin

There are already instructions on how to install Vowpal Wabbit on other operating systems, but we could not find a clear one for Windows. We will use Cygwin to install the latest version of Vowpal Wabbit.

Thanks to reader Dominic for providing useful feedback! The guide is now updated and suitable for both 32-bit and 64-bit systems.

Thanks to readers Ray, Phil Culliton, dou, Christophe, Fred and Brian, the guide is updated for the latest version of Vowpal Wabbit.

Update (2017): This guide has not been updated for a while, so it may be missing dependencies. I suggest you look at Vowpal Wabbit Releases for Windows MSI installers of the latest versions. Thanks to Markus Cozowicz!

Install Cygwin

Download the version of Cygwin for your operating system from:


k-Nearest Neighbors and Clustering on Compressed Binary Files

Normalized Compression Distance (Cilibrasi & Vitanyi) returns a similarity measure between binary files. This similarity measure allows for nearest neighbors search, clustering and classification. We are going to try some of these methods and review the results.

This is a draft version.

Normalized Compression Distance

In the post about normalized compression distance on chess games we already touched on the subject of NCD. The paper “Clustering by Compression [pdf]” contains a formal introduction.


A Clustered Google Maps of 10k Dutch Traffic Accidents

The Open Data Portal is a website made by the Dutch Government. It includes a data set of all registered traffic accidents in the province of North-Holland from 2005 to 2009. We munge the data and place it all on a Google Map. We use MarkerClusterer to deal with the 10k+ markers.


The Open Data Portal

The Dutch have an Open Data Portal with data sets. These data sets come from a variety of government institutions.

Movie Review Sentiment Analysis with Vowpal Wabbit

Kaggle is hosting another cool knowledge contest. This time it is sentiment analysis on the Rotten Tomatoes Movie Reviews data set. We are going to use Vowpal Wabbit to test the waters and get our first top 10 leaderboard score.

Contest Description


The Rotten Tomatoes movie review data set is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [pdf]. In their work on sentiment treebanks, Socher et al. [pdf] used Amazon’s Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus.

The train and test data sets are tab-separated files with phrases from the Rotten Tomatoes data set.

Each sentence has been parsed into many phrases by the Stanford parser.



Free Course: Data Mining With Weka

Everybody talks about Data Mining and Big Data nowadays. Weka is a powerful, yet easy-to-use tool for machine learning and data mining. This free course by the University of Waikato introduces you to practical data mining with Weka.

Course details

The course features:

The 5-week course starts on 3rd March 2014.



Multi-Armed Bandit Algorithms Made Easy

Multi-Armed Bandit problems have always been viewed as devilishly hard problems. It is rumored that the allies dropped this problem over WWII Germany to occupy their analysts.

Multi-armed Bandit problems

The maths behind Multi-Armed Bandit (MAB) problems is incredibly complex. Peter Whittle had this to say about them in 1979:

[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.

Due to the complex maths, many Multi-Armed Bandit algorithms use heuristics to approach a good solution.



Tutorial: Online LDA with Vowpal Wabbit

A lesser-known feature of Vowpal Wabbit is Online Latent Dirichlet Allocation. This allows you to do topic modelling on millions of documents in under an hour.

Under construction. Come back soon.

Latent Dirichlet Allocation

Topic modeling

Probabilistic topic models are useful for uncovering the underlying semantic structure of a collection of documents.

Latent Dirichlet Allocation (LDA) is a hierarchical Bayesian model that explains the variation in a set of documents in terms of a set of K latent “topics”.

Now if the description above instantly makes you go glassy-eyed, I will not blame you. Explaining Latent Dirichlet Allocation in layman’s terms is a pretty daunting task and not within the scope of this article (and admittedly not fully within the scope of my own skills).


LDA is unsupervised: you do not need labelled samples. Just a corpus of text documents will do.

After running LDA we end up with a number of unnamed topics, each containing tokens related to that topic.


The corpus can be general (web) text, or specific, like the books from a single author, logs from an IRC channel, or all the posts on some forum.

Sample output

Sample output from LDA may look like:

Topic 20:

Topic 21:

Online learning with LDA

Matt Hoffman first described a method to do batch/online learning with LDA in a 2010 NIPS paper.


In-memory LDA calculations are very slow and memory-hungry. In practice it is very hard to train LDA on any corpus of more than ~200,000 documents.

Matt Hoffman’s approach (gradient fitting on batches) is fast, memory-friendly and performs nearly as well as in-memory LDA fitting.


Matt released a Python version of his online LDA algorithm, and luckily for us, wrote a C++ implementation for Vowpal Wabbit too. The way I see it: Online LDA and VW are a match made in heaven.

Algorithm description

In pseudo-code:

Until converged:
  Choose a mini-batch of documents randomly
  For each document in that mini-batch:
    Estimate approximate posterior over what 
    topics each word in each document came from
  (Partially) update approximate posterior over topic 
  distributions based on which words are believed to 
  have come from which topics

Vowpal Wabbit & LDA

Input format

The input format for LDA in Vowpal Wabbit is different from regular labelled train and test sets. With LDA:

  • No namespaces
  • No labels
  • No ids

Basically it is just a pipe (|) followed by the document words, each optionally followed by a colon and a count.

| this is a document about sports
| here:1 another:2 two:2 documents:2 about:2 sports:2
| the:17 epic:5 of:18 gilgamesh:6 [...]
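As an illustration of this format, here is a minimal sketch that turns raw text into a single line of word:count pairs. The function name to_vw_lda_line is made up for this example, and the tokenization (lowercased \w+ matches) is an assumption, chosen because it keeps the special characters | and : out of the tokens.

```python
from collections import Counter
import re

def to_vw_lda_line(text):
    """Turn raw text into one VW LDA input line of token:count pairs."""
    # \w+ keeps word characters only, so '|' and ':' (both special in the
    # VW input format) can never end up inside a token.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)  # keeps first-seen order on Python 3.7+
    return "| " + " ".join("%s:%d" % (tok, n) for tok, n in counts.items())

print(to_vw_lda_line("The Epic of Gilgamesh: the oldest epic"))
```

Writing counts keeps long documents compact, since every distinct word appears only once per line.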

Corpus: StackOverflow posts

For the corpus we are going to use the StackOverflow corpus as used in the Facebook Recruiting Challenge III hosted by Kaggle.

We take the test set (~2.3GB of text) and munge it to Vowpal Wabbit format. We use the Python script below to clean the text:

from csv import DictReader
import re

loc_csv = "d:\\downloads\\test\\test.csv"
# will be created:
loc_vw_train = "d:\\stackoverflow.lda.vw"

def clean(s):
    # Text cleaning function: keep lowercased word tokens only
    try:
        return " ".join(re.findall(r"\w+", s, flags=re.UNICODE)).lower()
    except TypeError:
        # Missing field (None) in the csv row
        return "not_a_valid_value"

def to_vwlda(loc_csv, loc_vw_train):
    # Open outfile to write VW format (UTF-8 is assumed for both files)
    with open(loc_vw_train, "w", encoding="utf-8") as outfile, \
         open(loc_csv, newline="", encoding="utf-8") as infile:
        # For every enumerated row {} in csv file
        for e, row in enumerate(DictReader(infile)):
            # Write document title to VW format
            outfile.write("| %s\n" % clean(row["Title"]))
            if e % 10000 == 0:
                print(e)  # progress report

to_vwlda(loc_csv, loc_vw_train)

We now have a VW-LDA formatted file stackoverflow.lda.vw. It contains 2,013,336 question titles, each on a single line:

| getting rid of site specific hotkeys [...]

Now let’s fit a topic model on this dataset.

Note: We could add the body text, but then you would need to generate and inspect far more topics. With the title text only, some topics may still appear to be a bit unclear or incoherent, but overall solid topics are emerging. See also the second video below this post: “Use sentences for mini-documents”.

Creating topics

Fitting and outputting result

We train Vowpal Wabbit 7.6.1 on Windows 64-bit in Cygwin.

./vw -d stackoverflow.lda.vw --lda 20 --lda_D 2013336 --readable_model lda.model.vw


  • ./vw is our executable
  • -d stackoverflow.lda.vw specifies our dataset
  • --lda 20 says to generate 20 topics
  • --lda_D 2013336 specifies the number of documents in our corpus. In a truly online setting, this can be an estimate of the maximum number of documents that could ever be seen.
  • --readable_model lda.model.vw stores our topic model in a readable format. Unfortunately --invert_hash does not give us the original tokens, but with the readable model and a little script, we can still get back the tokens.

Note: It’s important that you explicitly specify -d for your dataset, else other LDA commands may throw an error.
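The value for --lda_D does not have to be typed in by hand. The sketch below counts the documents in the input file and assembles the same argument list as the command above. The helper names count_documents and lda_command are hypothetical, and the file is assumed to be UTF-8 with one document per non-empty line.

```python
def count_documents(path):
    """Count the non-empty lines (one document each) in a VW LDA input file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())

def lda_command(path, topics=20):
    """Build the argument list for the vw call shown above."""
    return [
        "./vw",
        "-d", path,                              # dataset, given explicitly
        "--lda", str(topics),                    # number of topics
        "--lda_D", str(count_documents(path)),   # number of documents
        "--readable_model", "lda.model.vw",
    ]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(lda_command("stackoverflow.lda.vw")).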

One could also modify the following LDA-related parameters:

  • --lda_alpha. A hyperparameter for the prior on weight vectors theta.
  • --lda_eta. A hyperparameter for the prior on the topics beta.
  • --power_t (kappa) and --initial_t (tau0). For scheduling learning stepsize.
  • --minibatch. The size of the batch (number of documents processed in chunks)
  • -b. The bit size determines how many tokens are stored in the model. The standard VW bit size is 18, so 2^18 = 262,144 tokens.
  • -c -k --passes 10. To use a cache, kill previous cache, and run 10 passes.
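To give a feel for what kappa and tau0 do, here is a small sketch of the step-size schedule from Hoffman's online LDA paper, rho_t = (tau0 + t)^(-kappa), together with the partial update it drives. The default values below are purely illustrative and are not claimed to be VW's defaults; how VW applies these flags internally is not shown here.

```python
def step_size(t, tau0=1.0, kappa=0.5):
    """Weight given to mini-batch number t: rho_t = (tau0 + t) ** -kappa."""
    return (tau0 + t) ** (-kappa)

def blend(old, new, rho):
    """Partial update: move the old estimate a fraction rho toward the new one."""
    return (1.0 - rho) * old + rho * new

# Later mini-batches get a smaller weight, so the topics settle down over time.
for t in (0, 1, 10, 100, 1000):
    print(t, round(step_size(t), 4))
```

A larger tau0 dampens the influence of the first mini-batches, and a larger kappa makes the model forget old mini-batches more slowly.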

Turning output into readable model

A readable model for 5 topics now looks like:

Version 7.6.1
Min label:0.000000
Max label:1.000000
0 pairs: 
0 triples: 
0 ngram: 
0 skip: 
0 0.100328 33306.531250 0.101609 0.100231 0.100696
1 0.100302 0.100571 0.102158 0.100313 164633.640625
262142 1168.760742 0.100535 0.106653 0.100212 0.107155
262143 0.100223 990.998352 0.100993 0.100040 0.101263

We are interested in everything after the header, which in the full file ends with an options: line (not shown in the excerpt above):

First is the token id, so 0. This is the first unique token in the dataset.

Then follow the weights for the 5 topics, where a larger weight means a stronger association. “Token id 0” is most strongly associated with the second topic (33306.531250).

We use a Python script to transform a readable_model file into topics. For more attractive visualizations, one could output a .json file for use in d3.js.
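The following is a minimal sketch of such a transformation, not the original script. It assumes that the weight rows start after a line beginning with options:, and that a token's score for a topic is its weight for that topic divided by the sum of its weights over all topics. The function names and the optional id_to_token mapping are hypothetical; without a mapping the token ids are reported as-is.

```python
def read_topic_weights(lines, header_end="options:"):
    """Map token id -> list of per-topic weights from readable-model lines."""
    weights, in_body = {}, False
    for line in lines:
        if not in_body:
            # Skip the header; the weight rows start after the options: line.
            in_body = line.startswith(header_end)
            continue
        parts = line.split()
        if len(parts) > 1:
            weights[int(parts[0])] = [float(x) for x in parts[1:]]
    return weights

def top_tokens(weights, topic, n=20, id_to_token=None):
    """Rank tokens by the share of their total weight that falls on `topic`."""
    scored = []
    for token_id, row in weights.items():
        share = row[topic] / sum(row)
        name = id_to_token.get(token_id, token_id) if id_to_token else token_id
        scored.append((round(share, 3), name))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]

sample = """options:
0 0.100328 33306.531250 0.101609 0.100231 0.100696
1 0.100302 0.100571 0.102158 0.100313 164633.640625
""".splitlines()
weights = read_topic_weights(sample)
print(top_tokens(weights, topic=1, n=1))
```

Normalizing per token rather than per topic favours tokens that belong almost exclusively to one topic, which is one plausible reading of the scores close to 1.0 in the lists below.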


Some topics that formed:

Topic 1
0.997 printf
0.997 sizeof
0.996 characters
0.996 character
0.995 endl
0.995 stdio
0.994 iostream
0.993 cout
0.992 unsigned
0.991 malloc
0.991 typedef
0.991 cin
0.991 argc
0.989 size_t
0.988 len
0.988 std
0.986 unicode
0.986 ascii
0.986 fprintf
0.986 scanf

Topic 2
0.999 img
0.999 div
0.999 width
0.999 height
0.999 png
0.999 jquery
0.999 alt
0.999 imgur
0.999 css
0.999 border
0.999 margin
0.998 1px
0.998 color
0.998 jsfiddle
0.998 0px
0.998 getelementbyid
0.998 addsubview
0.998 jpg
0.998 alloc
0.998 cgrectmake

Topic 3
1.0 about
1.0 question
1.0 we
1.0 looking
1.0 best
0.999 good
0.999 since
0.999 better
0.999 say
0.999 their
0.999 wondering
0.999 most
0.999 computer
0.999 such
0.999 our
0.999 were
0.999 own
0.999 really
0.999 might
0.999 think

Topic 4
0.997 eventargs
0.996 mysql_query
0.996 linq
0.996 varchar
0.995 actionresult
0.995 ienumerable
0.995 lastname
0.995 firstname
0.994 tolist
0.994 entity
0.994 writeline
0.993 sqlcommand
0.993 dbo
0.993 user_id
0.993 binding
0.992 userid
0.992 datatable
0.992 databind
0.991 byval
0.991 connectionstring

Further reading, notes, todos

Github LDA wiki
In-depth LDA presentation with interesting grouping of topics into topic chains: