Kaggle is hosting another cool knowledge contest, this time it is sentiment analysis on the Rotten Tomatoes Movie Reviews data set. We are going to use Vowpal Wabbit to test the waters and get our first top 10 leaderboard score.
The Rotten Tomatoes movie review data set is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [pdf]. In their work on sentiment treebanks, Socher et al. [pdf] used Amazon’s Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus.
The train and test data sets are tab-separated files with phrases from the Rotten Tomatoes data set.
Each sentence has been parsed into many phrases by the Stanford parser.
A quick glance at the raw training data set shows us:
...
3592    134    It 's hard to say who might enjoy this    0
3593    134    's hard to say who might enjoy this       1
3594    134    hard to say who might enjoy this          0
3595    134    to say who might enjoy this               2
3596    134    say who might enjoy this                  2
3597    134    who might enjoy this                      3
3598    134    might enjoy this                          3
3599    134    enjoy this                                3
...
Where the header is:
PhraseId SentenceId Phrase Sentiment
Task 1: Start by downloading the data for yourself. The files for this competition are small enough to open in your favorite text editor. Skim the data sets to get a feel for the data.
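If you would rather skim the data programmatically, Python's csv module handles the tab-separated layout directly. A minimal sketch, run here on an inline sample mirroring the rows shown above rather than on the actual train.tsv:

```python
import csv
import io

# a few raw rows in the same layout as train.tsv (header + tab-separated fields)
sample = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "3597\t134\twho might enjoy this\t3\n"
    "3598\t134\tmight enjoy this\t3\n"
)

# csv.DictReader maps every row onto the header columns for us
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    print(row["PhraseId"], row["Sentiment"], row["Phrase"])
```

Swap the io.StringIO(sample) for open("train.tsv") to inspect the real file the same way.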
Every phrase has a label describing the sentiment. The labels are:
- 0 – negative
- 1 – somewhat negative
- 2 – neutral
- 3 – somewhat positive
- 4 – positive
To build a model that can classify multiple labels we need a multiclass approach (PDF lecture). Vowpal Wabbit can reduce the multiclass problem to as many binary classification problems as there are classes.
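As a toy illustration of that reduction: train one binary scorer per class and predict the class whose scorer is most confident. The keyword "classifiers" below are invented for the example; Vowpal Wabbit learns real binary classifiers internally:

```python
# One-against-all sketch: one binary scorer per label (1-5),
# predict the label whose scorer fires strongest.
# The keyword sets are hypothetical stand-ins for learned classifiers.
keywords = {
    1: {"worst", "failure"},        # negative
    2: {"lacks", "routine"},        # somewhat negative
    3: {"movie", "film"},           # neutral
    4: {"enjoy", "good"},           # somewhat positive
    5: {"brilliant", "terrific"},   # positive
}

def predict(phrase):
    tokens = set(phrase.lower().split())
    # score each class with its own "binary classifier" (keyword overlap here)
    scores = {label: len(tokens & words) for label, words in keywords.items()}
    # the reduction picks the class whose binary problem scored highest
    return max(scores, key=scores.get)

print(predict("a brilliant and terrific film"))  # → 5
print(predict("the worst failure of the year"))  # → 1
```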
Your submission to Kaggle is evaluated on plain accuracy (the percentage of correctly predicted labels). Submissions with predicted labels should follow the standard Kaggle format:
PhraseId,Sentiment
156061,2
156062,3
156063,1
...
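Since the metric is plain accuracy, it is easy to compute locally as well, which is handy if you hold out part of the training set for validation. A minimal sketch:

```python
# Accuracy as Kaggle computes it: the fraction of predicted labels
# that exactly match the true labels.
def accuracy(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([2, 3, 1, 4], [2, 3, 2, 4]))  # → 0.75
```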
Introducing Vowpal Wabbit properly deserves an entire article, or even a book. In this article we only show how to run Vowpal Wabbit (maybe for the first time) and briefly explain the settings and parameters used.
Vowpal Wabbit has a very flexible and human-readable input format. It can handle raw text (no need to vectorize it).
An example training set to classify animals (label 1) from non-animals (label -1) could look like:
...
1 'horse |f color_brown avg_age:6.5 has_legs:4 a creature on the farm |a wikipedia_mentions:15
-1 'oak |f color_brown avg_age:75 prospers near ponds and lakes |a wikipedia_mentions:5
...
- 1 is the label
- 'horse is the identifier (tag)
- |f and |a mark feature namespaces (useful to create feature pairs with -q fa, or to ignore certain features)
- Features themselves can be raw or pre-processed text ("a creature on the farm"). You can add a weight by appending a : followed by a float or int. If absent, the weight is assumed to be 1.
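Generating such lines from Python is a one-liner per example. A minimal sketch of a formatter for the layout above (the namespace names "f" and "a" are just the ones this article uses):

```python
# Build a Vowpal Wabbit input line: label, tag, a raw-text namespace |f,
# and a namespace |a of named features with explicit weights.
def vw_line(label, tag, text_features, named_features):
    # named features get a weight via "name:value"; raw text defaults to weight 1
    named = " ".join("%s:%s" % (k, v) for k, v in named_features)
    return "%s '%s |f %s |a %s" % (label, tag, text_features, named)

line = vw_line(1, "horse", "a creature on the farm",
               [("avg_age", 6.5), ("has_legs", 4)])
print(line)
# → 1 'horse |f a creature on the farm |a avg_age:6.5 has_legs:4
```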
If there are multiple labels, like in this contest, Vowpal Wabbit expects labels to be positive integers, starting from 1. The 5 labels (negative, somewhat negative, neutral, somewhat positive, positive) could be identified as 1, 2, 3, 4 and 5.
Task 2: Get Vowpal Wabbit up and running on your machine. You should download the latest version (7.4 as of writing) and build it. On Windows machines you can build Vowpal Wabbit with Cygwin.
For convenience's sake we've included a stand-alone Windows executable of Vowpal Wabbit 7.1 in the GitHub repo accompanying this article. Download the executable, place it in the /vowpalwabbit/ directory, and you should be good to go.
Try running Vowpal Wabbit from the command prompt and you should see all available command line options.
Data Munging & Feature Generation
We can't feed Vowpal Wabbit tab-separated files without any context. We need to turn the raw data sets into a Vowpal Wabbit-friendly format first.
In this data munging step we also decide on what our features will be. For this article our features will be the words of a phrase and the length of the phrase.
Python is perfect for data munging. You can quickly whip up a script to transform raw data any way you want. To turn train.tsv into train.vw:
import csv
import re

location_train = "kaggle_rotten\\train.tsv"
location_test = "kaggle_rotten\\test.tsv"
location_train_vw = "rotten.train.vw"  # will be created
location_test_vw = "rotten.test.vw"  # will be created

# cleans a string: "I'm a string!?" returns as "i m a string"
def clean(s):
    return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()

# creates a Vowpal Wabbit-formatted file from a tsv file
def to_vw(location_input_file, location_output_file, test=False):
    print("\nReading:", location_input_file, "\nWriting:", location_output_file)
    with open(location_input_file) as infile, open(location_output_file, "w") as outfile:
        # create a reader to read the train file
        reader = csv.DictReader(infile, delimiter="\t")
        # for every line
        for row in reader:
            # for the test set the label doesn't matter / isn't available
            if test:
                label = "1"
            else:
                label = str(int(row['Sentiment']) + 1)
            phrase = clean(row['Phrase'])
            outfile.write(label + " '" + row['PhraseId'] +
                          " |f " + phrase +
                          " |a word_count:" + str(phrase.count(" ") + 1) + "\n")

to_vw(location_train, location_train_vw)
to_vw(location_test, location_test_vw, test=True)
Task 3: Run the above script or run your own command-line magic to transform the .tsv data sets into .vw data sets. You should create a training set and a test set in Vowpal Wabbit format.
The train set should now look like:
...
4 '22 |f good for the goose |a word_count:4
4 '23 |f good |a word_count:1
3 '24 |f for the goose |a word_count:3
...
and the test set should look like:
...
1 '156071 |f mostly routine |a word_count:2
1 '156072 |f mostly |a word_count:1
1 '156073 |f routine |a word_count:1
...
Creating a Model
Now that we have our data sets in the correct format, we can create our model with Vowpal Wabbit.
The command we will use to create a model:
vw rotten.train.vw -c -k --passes 300 --ngram 7 -b 24 --ect 5 -f rotten.model.vw
- vw is the Vowpal Wabbit executable
- rotten.train.vw is our train set
- -c -k means use a cache for multiple passes, and kill any existing cache
- --passes 300 means make 300 passes over our data set
- --ngram 7 tells Vowpal Wabbit to create n-grams (7-grams in this case)
- -b 24 tells Vowpal Wabbit to use 24-bit hashes (18-bit hashes are the default)
- -f rotten.model.vw means "save the model as 'rotten.model.vw'"
- --ect 5 (error correcting tournament [pdf]) in very simple terms tells Vowpal Wabbit that there are 5 possible labels and we want it to pick one
Multiple passes over the data allow Vowpal Wabbit to better fit its model.
n-grams increase performance because a phrase like "this movie was not good" would otherwise score positive sentiment for the token "good". With 2-grams the model can detect the negative sentiment in the token "not good".
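Conceptually, --ngram adds joined token windows as extra features alongside the single tokens. A rough sketch (Vowpal Wabbit's internal n-gram generation differs in detail, e.g. how it joins tokens):

```python
# Emit all token windows of length 1..n as features,
# joining multi-token windows with "_".
def ngrams(tokens, n):
    feats = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            feats.append("_".join(tokens[i:i + size]))
    return feats

print(ngrams("this movie was not good".split(), 2))
```

With n=2 the output contains "not_good" as its own feature, which is exactly the signal the unigram model misses.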
Vowpal Wabbit is so incredibly fast in part due to the hashing trick. With many features and a small hash space, collisions start occurring. These collisions may influence the results. Often for the worse, but not necessarily: multiple features sharing the same hash can have a PCA-like effect of dimensionality reduction.
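A toy illustration of the mechanics: feature names are hashed into a fixed number of buckets, so two names can land in the same bucket. Vowpal Wabbit's real hash is murmurhash, not the zlib.crc32 used here for simplicity:

```python
import zlib

# Map a feature name into one of 2**bits buckets.
# This is the hashing trick: no dictionary is kept, just the hash.
def bucket(feature, bits):
    return zlib.crc32(feature.encode()) % (2 ** bits)

features = ["brilliant", "terrific", "worst", "lacks", "routine"]
# with a tiny 2-bit hash space (4 buckets) collisions are nearly guaranteed;
# with -b 24 (16.7M buckets) they become rare
for f in features:
    print(f, "->", bucket(f, 2), bucket(f, 24))
```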
One could also use --oaa (one against all) instead of --ect (error correcting tournament), but "ect" at times outperforms "oaa". In this case it produces a lower average loss.
Task 4: Create a model from the training set. Play around with the different settings. Can you make a model with a lower average loss?
Making Predictions on the Test set
Now that we have our model, we can tell Vowpal Wabbit to predict the labels for our test set.
To run Vowpal Wabbit in test mode and create predictions:
vw rotten.test.vw -t -i rotten.model.vw -p rotten.preds.txt
- vw is the Vowpal Wabbit executable
- rotten.test.vw is the location of our test set
- -t tells Vowpal Wabbit to test only (no learning)
- -i rotten.model.vw says to use rotten.model.vw as the model
- -p rotten.preds.txt means "save predictions as 'rotten.preds.txt'"
We now have a file rotten.preds.txt with the predictions:
...
4.000000 156062
3.000000 156063
4.000000 156064
...
We just need to transform these to the Kaggle submission format:
...
156062,3
156063,2
156064,3
...
Note that we have to subtract 1, because we changed label 0 to label 1 when converting to the Vowpal Wabbit format.
The following script could make the Kaggle submission file:
import csv

def to_kaggle(location_input_file, header="", location_output_file="kaggle.submission.csv"):
    print("\nReading:", location_input_file, "\nWriting:", location_output_file)
    with open(location_input_file) as infile, open(location_output_file, "w") as outfile:
        if len(header) > 0:
            outfile.write(header + "\n")
        reader = csv.reader(infile, delimiter=" ")
        # every row is [prediction, PhraseId]
        for row in reader:
            # subtract 1 to undo the label shift we applied earlier
            outfile.write(row[1] + "," + str(int(float(row[0])) - 1) + "\n")

to_kaggle("rotten.preds.txt", "PhraseId,Sentiment")
Task 5: Run the test set on your model and create a predictions file. Then transform the predictions file to Kaggle Submission format. Then hurry and make a top 10 submission on Kaggle!
And there you have it: Simple, fast sentiment analysis with a linear solver. You can find the code accompanying this article on GitHub. If we find any improvements we will update the code. If you found an improvement be sure to let us know in the comments or on the competition forums.
Vowpal Wabbit can give a quick overview of feature relevance. You can do this with vw-varinfo (a perl script).
Running vw-varinfo gives output like this:
FeatureName   HashVal  Weight     RelScore
f^remarkable  176330   +1.1790     94.01%
f^brilliant   158704   +1.1363     90.60%
f^terrific    232040   +1.1352     90.51%
...
f^lacks       145204   -1.1340    -90.42%
f^failure     157645   -1.2183    -97.14%
f^worst       259228   -1.2542   -100.00%
We are interested in the RelScore and the FeatureName. With these and pylab we can plot the feature relevance histograms below:
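As a sketch, the vw-varinfo lines can be parsed like this (the column layout is assumed from the sample output above):

```python
# Parse vw-varinfo lines (FeatureName HashVal Weight RelScore)
# into (feature, relevance) pairs for plotting.
sample_output = """\
f^remarkable 176330 +1.1790 94.01%
f^brilliant 158704 +1.1363 90.60%
f^worst 259228 -1.2542 -100.00%
"""

rel_scores = []
for line in sample_output.splitlines():
    name, hashval, weight, relscore = line.split()
    # strip the trailing "%" so the score becomes a plain float
    rel_scores.append((name, float(relscore.rstrip("%"))))

print(rel_scores)
```

Feeding the scores to pylab's hist() (e.g. pylab.hist([s for _, s in rel_scores])) then gives the histogram.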
ML Wave would like to thank you, Kaggle, Rotten Tomatoes, the contestants and anyone who contributes to machine learning research theory and tools. If you are from datatau.com and you’ve read this far, great!
Ideas for further improvement:
- Feature selection.
- Grid search.
- Implement the Stanford NLP Sentiment package.