Kaggle is hosting a contest where the task is to predict survival rates of people aboard the Titanic. A training set is given with a label 1 or 0, denoting ‘survived’ or ‘died’. We are going to use Vowpal Wabbit to get a score of about 0.79426 (top 10%).
The contest
In this Kaggle contest, they ask you to complete the analysis of what sorts of people were likely to survive. In particular, they ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Tutorials
This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning. The data is highly structured and they provide 4 tutorials of increasing complexity.
- Getting started with Excel
- Getting started with Python
- Getting started with Scikit random forests
- Getting started with R
Automated ML
We take the approach we dubbed “Automated ML”. This means we are going to create a submission in under 1 hour and make heavy use of tools and multi-purpose scripts to keep this process as automatic and streamlined as possible.
First we need to convert the .csv train and test file to a format that Vowpal Wabbit can deal with. Vowpal Wabbit allows for very human-readable data sets.
Download the data sets for this competition and run the following script (perhaps changing the location/name of your train and test sets to match the script):
import csv
import re

def clean(s):
    # Keep only word characters, join them with spaces and lowercase
    return " ".join(re.findall(r"\w+", s, flags=re.UNICODE)).lower()

i = 0
with open("train_titanic.csv", "r") as infile, open("train_titanic.vw", "w") as outfile:
    reader = csv.reader(infile)
    for line in reader:
        i += 1
        if i > 1:  # skip the header row
            vw_line = ""
            if str(line[1]) == "1":
                vw_line += "1 '"
            else:
                vw_line += "-1 '"
            vw_line += str(line[0]) + " |f "
            vw_line += "passenger_class_" + str(line[2]) + " "
            vw_line += "last_name_" + clean(line[3].split(",")[0]).replace(" ", "_") + " "
            vw_line += "title_" + clean(line[3].split(",")[1]).split()[0] + " "
            vw_line += "sex_" + clean(line[4]) + " "
            if len(str(line[5])) > 0:
                vw_line += "age:" + str(line[5]) + " "
            vw_line += "siblings_onboard:" + str(line[6]) + " "
            vw_line += "family_members_onboard:" + str(line[7]) + " "
            vw_line += "embarked_" + str(line[11]) + " "
            outfile.write(vw_line[:-1] + "\n")

i = 0
with open("test_titanic.csv", "r") as infile, open("test_titanic.vw", "w") as outfile:
    reader = csv.reader(infile)
    for line in reader:
        i += 1
        if i > 1:  # skip the header row; the test set has no label column
            vw_line = "1 '"
            vw_line += str(line[0]) + " |f "
            vw_line += "passenger_class_" + str(line[1]) + " "
            vw_line += "last_name_" + clean(line[2].split(",")[0]).replace(" ", "_") + " "
            vw_line += "title_" + clean(line[2].split(",")[1]).split()[0] + " "
            vw_line += "sex_" + clean(line[3]) + " "
            if len(str(line[4])) > 0:
                vw_line += "age:" + str(line[4]) + " "
            vw_line += "siblings_onboard:" + str(line[5]) + " "
            vw_line += "family_members_onboard:" + str(line[6]) + " "
            vw_line += "embarked_" + str(line[10]) + " "
            outfile.write(vw_line[:-1] + "\n")
This Python script turns .csv lines like:
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
Into .vw (vowpal wabbit) lines like:
-1 '1 |f passenger_class_3 last_name_braund title_mr sex_male age:22 siblings_onboard:1 family_members_onboard:0 embarked_S
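The name-handling step is the least obvious part of the conversion, so here it is in isolation. This is a minimal sketch using the same clean() helper as the script above, applied to the example row (io.StringIO stands in for the real CSV file):

```python
import csv
import io
import re

def clean(s):
    # Keep only word characters, join them with spaces and lowercase
    return " ".join(re.findall(r"\w+", s, flags=re.UNICODE)).lower()

row = next(csv.reader(io.StringIO(
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S'
)))
# The Name column is "Last, Title. First ..." — split on the comma
last_name = clean(row[3].split(",")[0]).replace(" ", "_")
title = clean(row[3].split(",")[1]).split()[0]
print(last_name)  # braund
print(title)      # mr
```

Note that csv.reader handles the quoted comma inside the Name field, which a naive line.split(",") would break on.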
Vowpal Wabbit
Download and install Vowpal Wabbit from the Github Repository.
Then run Vowpal Wabbit with the following command:
vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --adaptive --normalized --l1 0.00000001 --l2 0.0000001 -b 24
This says to train a model (model.vw) on train_titanic.vw. As this is binary classification, the --binary parameter is recommended. We do 20 passes over our dataset and create a cache file (-c) to speed them up. We use quadratic features with -q ff, meaning we create feature pairs within the f namespace. We use adaptive and normalized update rules, some l1 and l2 regularization to prevent over-fitting, and 24 bits to store our feature hashes.
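To make the -q ff and -b 24 options concrete, here is a rough illustration of what they imply. This is only a sketch: the pairing and the hash shown below are illustrative, not VW's actual internals (VW uses murmurhash and its own pairing scheme):

```python
from itertools import combinations

features = ["passenger_class_3", "sex_male", "title_mr"]

# -q ff: quadratic features cross the features of namespace f with
# themselves, producing interaction terms (shown as unordered pairs here)
quadratic = [a + "^" + b for a, b in combinations(features, 2)]
print(quadratic)

# -b 24: every feature name is hashed into one of 2**24 weight slots, so
# no feature dictionary is needed (illustrative hash, not murmurhash)
bits = 24
indices = {f: hash(f) % (1 << bits) for f in features + quadratic}
print(len(indices))  # 6 features (3 original + 3 quadratic) before collisions
```

Interaction terms like sex_male^title_mr are what let a linear model pick up combinations (e.g. adult males faring worse) that the individual features cannot express on their own.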
Now we use our model and test set to create predictions.
vw -d test_titanic.vw -t -i model.vw -p preds_titanic.txt
We can turn our predictions to Kaggle format with the following Python script:
with open("preds_titanic.txt", "r") as infile, open("kaggle_preds.csv", "w") as outfile:
    outfile.write("PassengerId,Survived\n")
    for line in infile:
        # Each VW prediction line is "<prediction> <tag>"
        prediction, passenger_id = line.split(" ")
        # Threshold at 0, so this works both for --binary output (-1/1)
        # and for raw margin predictions between -1 and +1
        label = "1" if float(prediction) > 0 else "0"
        outfile.write(passenger_id.strip() + "," + label + "\n")
Submit kaggle_preds.csv to receive your top 10% public leaderboard score!
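Before submitting, it can save a rejected upload to sanity-check the file format. This is a hypothetical helper (not part of the tutorial scripts), shown here on a tiny two-row submission:

```python
import csv

def validate_submission(path):
    """Check the header is exactly PassengerId,Survived and labels are 0/1."""
    with open(path) as f:
        rows = list(csv.reader(f))
    assert rows[0] == ["PassengerId", "Survived"], "bad header"
    assert all(r[1] in ("0", "1") for r in rows[1:]), "labels must be 0 or 1"
    return len(rows) - 1  # number of predictions

# Hypothetical two-row submission for demonstration
with open("kaggle_preds.csv", "w") as f:
    f.write("PassengerId,Survived\n892,0\n893,1\n")
print(validate_submission("kaggle_preds.csv"))  # 2
```

The real file should have 418 prediction rows, one per test-set passenger.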
Strange. I followed all the steps outlined in this page and downloaded the latest version of vw from git as of today May 13, 2015 and got a “Your submission scored 0.73206” instead of the 0.79426. Anyone else seeing this and any clues of why my results are way off ?
Hi,
I tried running the script, and it’s running successfully. But I’m getting output between -1 and +1, and because of this I get an error while converting the Vowpal Wabbit output to the Kaggle-required .csv format.
Any idea why it’s happening ?
Thanks
Raj