
Titanic – Machine Learning From Disaster with Vowpal Wabbit

Kaggle is hosting a contest where the task is to predict the survival of passengers aboard the Titanic. A training set is given with a label of 1 or 0, denoting ‘survived’ or ‘died’. We are going to use Vowpal Wabbit to get an accuracy score of about 0.79426 (top 10%).

The contest

In this Kaggle contest, they ask you to complete the analysis of what sorts of people were likely to survive.  In particular, they ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


Tutorials

This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning. The data is highly structured, and Kaggle provides four tutorials of increasing complexity.

Automated ML

We take an approach we have dubbed “Automated ML”: we are going to create a submission in under an hour, making heavy use of tools and multi-purpose scripts to keep the process as automatic and streamlined as possible.

First we need to convert the .csv train and test files to a format that Vowpal Wabbit can deal with. Vowpal Wabbit allows for very human-readable data sets.

Download the data sets for this competition and run the following script (changing the location/name of your train and test sets to match the script if necessary):

import csv
import re

def clean(s):
  # keep only word characters and lower-case the result
  return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()

with open("train_titanic.csv", "r") as infile, open("train_titanic.vw", "w") as outfile:
  reader = csv.reader(infile)
  next(reader)  # skip the header row
  for line in reader:
    if line[1] == "1":
      vw_line = "1 '"
    else:
      vw_line = "-1 '"
    vw_line += line[0] + " |f "
    vw_line += "passenger_class_" + line[2] + " "
    vw_line += "last_name_" + clean(line[3].split(",")[0]).replace(" ", "_") + " "
    vw_line += "title_" + clean(line[3].split(",")[1]).split()[0] + " "
    vw_line += "sex_" + clean(line[4]) + " "
    if len(line[5]) > 0:
      vw_line += "age:" + line[5] + " "
    vw_line += "siblings_onboard:" + line[6] + " "
    vw_line += "family_members_onboard:" + line[7] + " "
    vw_line += "embarked_" + line[11] + " "
    outfile.write(vw_line[:-1] + "\n")

with open("test_titanic.csv", "r") as infile, open("test_titanic.vw", "w") as outfile:
  reader = csv.reader(infile)
  next(reader)  # skip the header row
  for line in reader:
    vw_line = "1 '"  # dummy label; the test set has no Survived column
    vw_line += line[0] + " |f "
    vw_line += "passenger_class_" + line[1] + " "
    vw_line += "last_name_" + clean(line[2].split(",")[0]).replace(" ", "_") + " "
    vw_line += "title_" + clean(line[2].split(",")[1]).split()[0] + " "
    vw_line += "sex_" + clean(line[3]) + " "
    if len(line[4]) > 0:
      vw_line += "age:" + line[4] + " "
    vw_line += "siblings_onboard:" + line[5] + " "
    vw_line += "family_members_onboard:" + line[6] + " "
    vw_line += "embarked_" + line[10] + " "
    outfile.write(vw_line[:-1] + "\n")

This Python script turns .csv lines like:

1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S

into .vw (Vowpal Wabbit) lines like:

-1 '1 |f passenger_class_3 last_name_braund title_mr sex_male age:22 siblings_onboard:1 family_members_onboard:0 embarked_S
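Note that the Name field contains a comma inside its quotes, which is why the script relies on Python’s csv module rather than a naive line.split(","). A quick check with the sample line above:

```python
import csv
import io

# The raw .csv line from above; the quoted Name field contains a comma.
raw = '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S'

# csv.reader respects the quoting, so the name stays a single field.
row = next(csv.reader(io.StringIO(raw)))
print(len(row))  # 12 fields
print(row[3])    # Braund, Mr. Owen Harris
```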

Vowpal Wabbit

Download and install Vowpal Wabbit from the GitHub repository.

Then run Vowpal Wabbit with the following command:

vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --adaptive --normalized --l1 0.00000001 --l2 0.0000001 -b 24

This trains a model (model.vw) on train_titanic.vw. Since this is binary classification, we use the --binary flag. We make 20 passes over the dataset, which requires a cache file (-c). With -q ff we create quadratic features, meaning every pair of features in the f namespace is combined into a new feature. We use the adaptive and normalized update rules, add a little l1 and l2 regularization to prevent over-fitting, and use 24 bits to store our feature hashes.
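To make -q ff and -b 24 concrete, here is a rough sketch of the idea in Python. VW actually uses murmurhash rather than Python’s built-in hash(), so this is only an illustration, not VW’s real hashing:

```python
from itertools import combinations

features = ["passenger_class_3", "last_name_braund", "title_mr", "sex_male"]

# -q ff: every unordered pair of features in namespace f becomes an extra feature
quadratic = [a + "^" + b for a, b in combinations(features, 2)]
print(len(quadratic))  # 6 pairs from 4 base features

# -b 24: every feature is hashed into one of 2**24 weight slots
# (Python's hash() stands in for VW's murmurhash here)
buckets = 1 << 24
slots = {f: hash(f) % buckets for f in features + quadratic}
print(all(0 <= s < buckets for s in slots.values()))  # True
```

More base features mean quadratically more pairs, which is why a generous hash space (-b 24) helps avoid collisions.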

Now we use our model and test set to create predictions.

vw -d test_titanic.vw -t -i model.vw -p preds_titanic.txt

We can turn our predictions to Kaggle format with the following Python script:

with open("preds_titanic.txt", "r") as infile, open("kaggle_preds.csv", "w") as outfile:
  outfile.write("PassengerId,Survived\n")
  for line in infile:
    # each prediction line looks like "-1.000000 892": score, then tag
    pred, passenger_id = line.split()
    if float(pred) > 0:
      outfile.write(passenger_id + ",1\n")
    else:
      outfile.write(passenger_id + ",0\n")
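One commenter below reports raw scores between -1 and +1 instead of hard labels; thresholding the score at zero handles both cases. A minimal helper sketching that, assuming the "score passenger_id" line format of preds_titanic.txt:

```python
def to_kaggle(pred_line):
  # Map one VW prediction line, e.g. "-1.000000 892", to "892,0".
  # Thresholding at zero also works for raw scores in (-1, 1).
  score, passenger_id = pred_line.split()
  return "%s,%d" % (passenger_id, 1 if float(score) > 0 else 0)

print(to_kaggle("-1.000000 892"))  # 892,0
print(to_kaggle("1.000000 893"))   # 893,1
print(to_kaggle("0.4213 894"))     # 894,1
```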

Submit kaggle_preds.csv to receive your top 10% public leaderboard score!

3 thoughts on “Titanic – Machine Learning From Disaster with Vowpal Wabbit”

  1. Strange. I followed all the steps outlined in this page and downloaded the latest version of vw from git as of today May 13, 2015 and got a “Your submission scored 0.73206” instead of the 0.79426. Anyone else seeing this and any clues of why my results are way off ?

  2. Hi,

    I tried running the script, and its running successfully. But I’m getting output in between -1 and +1. And because of this getting error while converting vowpal wabbit output to kaggle required .csv format.

    Any idea why it’s happening ?

    Thanks
    Raj
