
Movie Review Sentiment Analysis with Vowpal Wabbit

Kaggle is hosting another cool knowledge contest. This time it is sentiment analysis on the Rotten Tomatoes Movie Reviews data set. We are going to use Vowpal Wabbit to test the waters and get our first top 10 leaderboard score.

Contest Description


The Rotten Tomatoes movie review data set is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [pdf]. In their work on sentiment treebanks, Socher et al. [pdf] used Amazon’s Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus.

The train and test data sets are tab-separated files with phrases from the Rotten Tomatoes data set.

Each sentence has been parsed into many phrases by the Stanford parser.

A quick glance at the raw training data set shows us:

3592  134  It 's hard to say who might enjoy this  0
3593  134  's hard to say who might enjoy this  1
3594  134  hard to say who might enjoy this  0
3595  134  to say who might enjoy this  2
3596  134  say who might enjoy this  2
3597  134  who might enjoy this  3
3598  134  might enjoy this  3
3599  134  enjoy this  3

Where the header is:

PhraseId	SentenceId	Phrase	Sentiment

Task 1: Start by downloading the data for yourself. The files are small enough for this competition to open them up in your favorite text editor. Skim the data sets to get a feel for the data.
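
Beyond skimming the files by eye, a few lines of Python can summarize how the labels are distributed. The following is a minimal sketch: the function name label_distribution is made up for illustration, and the in-memory sample only reuses rows shown above so that the snippet runs without the real files.

```python
import csv
import io
from collections import Counter

def label_distribution(tsv_file):
    """Count how often each Sentiment label occurs in a train.tsv-style file."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(int(row["Sentiment"]) for row in reader)

# Tiny in-memory sample built from rows shown above. For the real data,
# pass an opened train.tsv file object instead of this StringIO.
sample = io.StringIO(
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "3596\t134\tsay who might enjoy this\t2\n"
    "3597\t134\twho might enjoy this\t3\n"
    "3598\t134\tmight enjoy this\t3\n"
)
counts = label_distribution(sample)
for label, n in sorted(counts.items()):
    print(label, n)
```

Running the same function on the full training file gives a quick impression of how balanced the five labels are.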


Every phrase has a label describing the sentiment. The labels are:

  • 0 – negative
  • 1 – somewhat negative
  • 2 – neutral
  • 3 – somewhat positive
  • 4 – positive

To build a model that can classify multiple labels we need a multiclass approach (PDF lecture). Vowpal Wabbit can reduce the multiclass problem to as many binary classification problems as there are classes.


Your submission to Kaggle is evaluated on plain accuracy (the percentage of correctly predicted labels). Submissions with predicted labels should follow the standard Kaggle format: a PhraseId,Sentiment header line followed by one PhraseId,Sentiment row per test phrase.
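
Plain accuracy is simple enough to compute yourself, for example on a held-out part of the training set. A minimal sketch (the function name accuracy and the toy labels are illustrative, not taken from the competition):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy example: 4 of the 5 predicted labels match the true labels.
score = accuracy([0, 1, 2, 3, 4], [0, 1, 2, 2, 4])
print(score)
```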


Vowpal Wabbit

Vowpal Wabbit is an incredibly powerful multi-purpose tool. Originally created by John Langford at Yahoo, the project is currently sponsored by Microsoft Research.

Introducing Vowpal Wabbit deserves an entire article or even a book. In this article we only show how to run Vowpal Wabbit (maybe for the first time) and a short explanation of the settings/parameters used.

Input Format

Vowpal Wabbit has a very flexible and human-readable input format. It can handle raw text (no need to vectorize it).

An example training set to classify animals (label 1) from non-animals (label -1) could look like:

1 'horse |f color_brown avg_age:6.5 has_legs:4 a creature on the farm |a wikipedia_mentions:15
-1 'oak |f color_brown avg_age:75 prospers near ponds and lakes |a wikipedia_mentions:5

Here 1 is the label, 'horse is the identifier, and |f and |a are feature spaces (useful for creating feature pairs with -q fa or for ignoring certain features with --ignore a).

Features themselves can be raw or pre-processed text (for example, ...creature on the...). You can add a weight by appending a : followed by a float or int. If absent, the weight is assumed to be 1.
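
To make the format concrete, here is a rough sketch of a parser for lines shaped like the horse example above. It is deliberately simplified and is not how Vowpal Wabbit itself reads input: the function name parse_vw_line is hypothetical, it assumes every | is immediately followed by a feature space name, and a feature that repeats within one feature space simply overwrites the earlier entry.

```python
def parse_vw_line(line):
    """Split a simple VW example line into label, identifier and features.

    Handles only the layout used in this article: a label, an optional
    'identifier, then one or more |name blocks of features.
    """
    head, *blocks = line.strip().split("|")
    head_tokens = head.split()
    label = head_tokens[0]
    tag = None
    for token in head_tokens[1:]:
        if token.startswith("'"):
            tag = token[1:]
    namespaces = {}
    for block in blocks:
        tokens = block.split()
        name, feature_tokens = tokens[0], tokens[1:]
        features = {}
        for token in feature_tokens:
            # "avg_age:6.5" carries an explicit weight, "creature" defaults to 1.
            feature, sep, value = token.partition(":")
            features[feature] = float(value) if sep else 1.0
        namespaces[name] = features
    return label, tag, namespaces

example = ("1 'horse |f color_brown avg_age:6.5 has_legs:4 "
           "a creature on the farm |a wikipedia_mentions:15")
label, tag, namespaces = parse_vw_line(example)
print(label, tag)
print(namespaces["f"])
print(namespaces["a"])
```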

If there are multiple labels, like in this contest, Vowpal Wabbit expects labels to be positive integers, starting from 1. 5 labels (very negative, negative, neutral, positive, very positive) could be identified as [1,2,3,4,5].

Task 2: Get Vowpal Wabbit up and running on your machine. You should download the latest version (7.4 at the time of writing) and build it. You can build Vowpal Wabbit on Windows machines with Cygwin.

For convenience's sake we’ve included a stand-alone Windows executable of Vowpal Wabbit 7.1 in the GitHub repo accompanying this article. Download the build, download the executable, place it in the /vowpalwabbit/ directory and you should be good to go.

Try running Vowpal Wabbit from the command prompt and you should see all available command line options.

Data Munging & Feature Generation

We can’t feed Vowpal Wabbit tab-separated files without any context. We need to turn the raw data sets into a Vowpal Wabbit-friendly format first.

In this data munging step we also decide on what our features will be. For this article our features will be the words of a phrase and the length of the phrase.

Python is perfect for data munging. You can quickly whip up a script to transform raw data any way you want. To turn train.tsv into train.vw:

import csv
import re

location_train = "kaggle_rotten\\train.tsv"
location_test = "kaggle_rotten\\test.tsv"

location_train_vw = "rotten.train.vw" #will be created
location_test_vw = "rotten.test.vw" #will be created

#cleans a string "I'm a string!?" returns as "i m a string"
def clean(s):
  return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()

#creates Vowpal Wabbit-formatted file from tsv file
def to_vw(location_input_file, location_output_file, test = False):
  print("\nReading: %s\nWriting: %s" % (location_input_file, location_output_file))
  with open(location_input_file) as infile, open(location_output_file, "w") as outfile:
    #create a reader to read train file
    reader = csv.DictReader(infile, delimiter="\t")
    #for every line
    for row in reader:
      #if test set label doesnt matter/or isnt available
      if test:
        label = "1"
      else:
        #shift labels 0-4 to 1-5, Vowpal Wabbit expects labels starting from 1
        label = str(int(row['Sentiment'])+1)
      phrase = clean(row['Phrase'])
      outfile.write(   label +
          " '"+row['PhraseId'] +
          " |f " +
          phrase +
          " |a " +
          "word_count:"+str(phrase.count(" ")+1)
          + "\n" )

to_vw(location_train, location_train_vw)
to_vw(location_test, location_test_vw, test=True)

Task 3: Run the above script or run your own command-line magic to transform the .tsv data sets into .vw data sets. You should create a training set and a test set in Vowpal Wabbit format.

The train set should now look like:

4 '22 |f good for the goose |a word_count:4
4 '23 |f good |a word_count:1
3 '24 |f for the goose |a word_count:3

and the test set should look like:

1 '156071 |f mostly routine |a word_count:2
1 '156072 |f mostly |a word_count:1
1 '156073 |f routine |a word_count:1

Creating a Model

Now that we have our data sets in the correct format, we can create our model with Vowpal Wabbit.

The command we will use to create a model:

vw rotten.train.vw -c -k --passes 300 --ngram 7 -b 24 --ect 5 -f rotten.model.vw


  • vw is the Vowpal Wabbit executable
  • rotten.train.vw is our train set
  • -c -k means to use a cache for multiple passes, and kill any existing cache
  • --passes 300 means to make 300 passes over our data set
  • --ngram 7 tells Vowpal Wabbit to create n-grams (word sequences of up to 7 words in this case)
  • -b 24 tells Vowpal Wabbit to use 24-bit hashes (the default is 18 bits)
  • -f rotten.model.vw means “save model as ‘rotten.model.vw'”.
  • --ect (error correcting tournament [pdf]) in very simple terms tells Vowpal Wabbit that there are 5 possible labels and we want it to pick one.


Multiple passes over the data allow Vowpal Wabbit to better fit its model.

n-grams increase performance because a phrase like “this movie was not good” would score positive sentiment for the token “good”. If 2-grams were used the model could detect negative sentiment in the token “not good”.
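
To see what n-grams look like for that example phrase, here is a small illustrative sketch. The helper ngrams and the underscore separator are choices made for this illustration only and say nothing about how Vowpal Wabbit builds its n-gram features internally.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

The bigram list contains not_good as a single token, which a linear model can give its own (negative) weight, independent of the weight for good.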

Vowpal Wabbit is so incredibly fast in part due to the hashing trick: every feature name is hashed to an index in a fixed-size weight table. With many features and a small hash size, collisions start occurring, meaning that different features share the same index. These collisions may influence the results, often for the worse, but not necessarily: multiple features sharing the same hash can have a PCA-like effect of dimensionality reduction.
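
The effect of the hash size on collisions can be illustrated with a toy version of the hashing trick. This is only a sketch under loud assumptions: CRC32 from Python's zlib module stands in for whatever hash function Vowpal Wabbit uses, and the feature names are made up.

```python
import zlib

def hashed_index(feature, bits):
    """Map a feature name to a slot in a weight table with 2**bits entries."""
    # CRC32 is an assumption for illustration, not Vowpal Wabbit's own hash.
    return zlib.crc32(feature.encode("utf-8")) & ((1 << bits) - 1)

def count_collisions(features, bits):
    """Count features that land in a slot already taken by an earlier feature."""
    seen = set()
    collisions = 0
    for feature in features:
        index = hashed_index(feature, bits)
        if index in seen:
            collisions += 1
        seen.add(index)
    return collisions

features = ["word_%d" % i for i in range(1000)]  # made-up feature names
small_table = count_collisions(features, 4)    # only 16 slots
large_table = count_collisions(features, 24)   # about 16.7 million slots
print(small_table, large_table)
```

With only 16 slots nearly every feature collides with another one, while a 24-bit table leaves plenty of room for 1000 features.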

One could also use --oaa (one against all) instead of --ect (error correcting tournament) but “ect” at times outperforms “oaa”. In this case it produces a lower average loss.

Task 4: Create a model from the training set. Play around with the different settings. Can you make a model with a lower average loss?
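
For Task 4, one way to try several settings without retyping the command is to generate the command lines from Python. The helper below is a hypothetical convenience (build_vw_command is not part of Vowpal Wabbit) and only uses the options explained above. It prints the commands rather than running them, so it works without Vowpal Wabbit installed.

```python
def build_vw_command(train_file, model_file, passes=300, ngram=7, bits=24,
                     reduction="ect", classes=5):
    """Assemble a vw training command from the options described above."""
    if reduction not in ("ect", "oaa"):
        raise ValueError("reduction must be 'ect' or 'oaa'")
    return [
        "vw", train_file, "-c", "-k",
        "--passes", str(passes),
        "--ngram", str(ngram),
        "-b", str(bits),
        "--" + reduction, str(classes),
        "-f", model_file,
    ]

# Print one command per combination so they can be pasted into a shell.
commands = []
for ngram in (2, 3, 7):
    for reduction in ("ect", "oaa"):
        command = build_vw_command("rotten.train.vw", "rotten.model.vw",
                                   ngram=ngram, reduction=reduction)
        commands.append(" ".join(command))
        print(commands[-1])
```

Every command here writes to the same model file, so use a different file name per run if you want to keep the models. Compare the average loss that Vowpal Wabbit reports for each run.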

Making Predictions on the Test set

Now that we have our model, we can tell Vowpal Wabbit to predict the labels for our test set.

To run Vowpal Wabbit in test mode and create predictions:

vw rotten.test.vw -t -i rotten.model.vw -p rotten.preds.txt


  • vw is the Vowpal Wabbit executable
  • rotten.test.vw is the location of our test set
  • -t tells Vowpal Wabbit to test only (no learning)
  • -i rotten.model.vw says to use rotten.model.vw as the model
  • -p rotten.preds.txt means “save predictions as ‘rotten.preds.txt'”

We now have a file rotten.preds.txt with the predictions:

4.000000 156062
3.000000 156063
4.000000 156064

We just need to transform these to the Kaggle submission format (a PhraseId,Sentiment header followed by one PhraseId,Sentiment row per phrase).


Note that we have to subtract 1, because we changed label 0 to label 1 when changing to Vowpal Wabbit format.

The following script could make the Kaggle submission file:

import csv
def to_kaggle(location_input_file, header="", location_output_file="kaggle.submission.csv"):
  print("\nReading: %s\nWriting: %s" % (location_input_file, location_output_file))
  with open(location_input_file) as infile, open(location_output_file, "w") as outfile:
    if len(header) > 0:
      outfile.write( header + "\n" )
    reader = csv.reader(infile, delimiter=" ")
    for row in reader:
      #row[0] is the predicted label ("4.000000"), row[1] is the PhraseId
      #subtract 1 to undo the label shift made for Vowpal Wabbit
      outfile.write( row[1] + "," + str(int(float(row[0]))-1) + "\n" )

to_kaggle("rotten.preds.txt", "PhraseId,Sentiment")

Task 5: Run the test set on your model and create a predictions file. Then transform the predictions file to Kaggle Submission format. Then hurry and make a top 10 submission on Kaggle!

And there you have it: simple, fast sentiment analysis with a linear solver. You can find the code accompanying this article on GitHub. If we find any improvements we will update the code. If you find an improvement, be sure to let us know in the comments or on the competition forums.


Vowpal Wabbit can give a quick overview of feature relevance. You can do this with vw-varinfo (a Perl script).

Running vw-varinfo gives output like this:

FeatureName        HashVal      Weight   RelScore
f^remarkable       176330     +1.1790     94.01%
f^brilliant        158704     +1.1363     90.60%
f^terrific         232040     +1.1352     90.51%
f^lacks            145204     -1.1340    -90.42%
f^failure          157645     -1.2183    -97.14%
f^worst            259228     -1.2542   -100.00%

We are interested in the RelScore and the FeatureName. With this and pylab we can plot the feature relevance histograms below:

Feature relevance top 100
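
To get from the vw-varinfo table to something that can be plotted, the rows first have to be parsed. The sketch below assumes the output has been saved as plain text in the four-column layout shown above. The function name parse_varinfo is hypothetical, and the sample reuses two of the rows from the table.

```python
def parse_varinfo(lines):
    """Parse vw-varinfo style rows into (feature_name, rel_score) pairs."""
    rows = []
    for line in lines:
        parts = line.split()
        # Skip blank lines, the header row and anything not in 4 columns.
        if len(parts) != 4 or parts[0] == "FeatureName":
            continue
        name, _hash_val, _weight, rel_score = parts
        rows.append((name, float(rel_score.rstrip("%"))))
    return rows

sample = """FeatureName        HashVal      Weight   RelScore
f^remarkable       176330     +1.1790     94.01%
f^worst            259228     -1.2542   -100.00%
""".splitlines()

scores = parse_varinfo(sample)
print(scores)
```

The resulting list of names and relative scores can then be sorted and handed to a plotting library such as pylab.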


ML Wave would like to thank you, Kaggle, Rotten Tomatoes, the contestants and anyone who contributes to machine learning research theory and tools. If you are from dt and you’ve read this far, great!


  • Cross-validation.
  • Feature selection.
  • Grid search.
  • Implement Stanford NLP Sentiment package

4 thoughts on “Movie Review Sentiment Analysis with Vowpal Wabbit”

  1. Hi – I have been going through some of your posts regarding Vowpal Wabbit. Very interesting reading, thanks for sharing!

    One thing you mention in this post is Cross-Validation in your todo list. For the moment I have seen some wrappers (vowpal_porpoise in python, RVowpalWabbit in R), but I wanted to ask you if you found since then any other efficient way of doing Cross-Validation with Vowpal-Wabbit?

    I have also briefly tried Vowpal-Wabbit on a Kaggle competition, and at the time I only relied on its holdout mechanism.


  2. Hi,

    what command line are you using to save the vw-varinfo output to transfer into python in order to create the visual?

    Thank you,
