Another Kaggle contest means another chance to try out Vowpal Wabbit, this time on a data set of nearly 350 million rows. We will discuss feature engineering for the latest Kaggle contest and how to get a top-3 public leaderboard score (~0.59347 AUC).
A short competition description
The competition is to predict repeat buyers: shoppers who redeem a coupon and then purchase that product again afterwards. For this we have labelled data (did become a repeat buyer, did not become a repeat buyer) for about 150,000 shoppers (the train set).
Our task is to predict the labels for about 150,000 other shoppers (the test set). For this we can use a file called transactions.csv. It's a huge file (about 22GB unzipped) containing nearly 350 million rows. The total amount spent in the transaction data nears $1.5 billion.
Open the files in your favorite text editor or spreadsheet program. Check out the leaderboard for a short description of the four benchmarks.
Master Kaggle user BreakfastPirate (Steve Donoho) posted a way to reduce the dataset. If you check out the offers.csv file you'll see all the categories and companies a coupon offer can have. We can discard the rows from the transaction data that don't have a category id or company id that is on offer.
The function reduce_data() in the messy code accompanying this post can do this for you. It runs in about 5-10 minutes and will reduce the ~350 million rows to ~27 million rows. This makes our future model code more manageable (the reduced file is around 1.6GB).
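A minimal sketch of what such a reduce step could look like (this is not the repo's exact code; the column names `category` and `company` are assumptions based on the competition files):

```python
import csv

def reduce_data(offers_path, transactions_path, output_path):
    """Keep only transactions whose category or company appears in offers.csv."""
    categories, companies = set(), set()
    with open(offers_path, newline="") as f:
        for row in csv.DictReader(f):
            categories.add(row["category"])
            companies.add(row["company"])

    kept = 0
    with open(transactions_path, newline="") as f_in, \
         open(output_path, "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        # Stream row by row, so the 22GB file is never loaded into memory
        for row in reader:
            if row["category"] in categories or row["company"] in companies:
                writer.writerow(row)
                kept += 1
    return kept
```

Because it streams line by line, memory use stays flat no matter how large transactions.csv is.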
A large part of this competition is feature engineering: creating good, indicative features from the purchase history. From the benchmark we already have four features: has bought in the coupon offer's category before, has bought the brand before, has bought from the company before, and has bought the company + category + brand on offer.
We could encode these as binary features (1 or 0), but since we have all the transaction data, we can instead count how many times someone has bought inside a category.
Has bought from company on offer
We generate a feature has_bought_company that counts how many times the shopper has bought a product from the company on offer. We generate a related feature has_bought_company_q which holds the quantity bought (shoppers sometimes buy multiple items at once), and another feature, has_bought_company_a, that sums the total amount spent on the company on offer.
We also generate features that look at the days between previous purchases and the date of the coupon offer. If, for instance, the shopper spent $50 on the company in the last 90 days, we set has_bought_company_a_90 to 50. We generate these features for the last 30, 60, 90 and 180 days.
If the shopper has never bought a product from a company on the coupon offer, we instead generate a negative feature: never_bought_company.
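The per-company counters can be sketched like this (a simplified version, not the repo's code; the transaction keys and the YYYY-MM-DD date format are assumptions, while the feature names follow the example train line shown later in the post):

```python
from collections import defaultdict
from datetime import datetime

def company_features(transactions, offer_company, offer_date):
    """Count a shopper's purchases from the offer's company, overall and
    within 30/60/90/180-day windows before the offer date."""
    feats = defaultdict(float)
    offer_day = datetime.strptime(offer_date, "%Y-%m-%d")
    for t in transactions:
        if t["company"] != offer_company:
            continue
        days_ago = (offer_day - datetime.strptime(t["date"], "%Y-%m-%d")).days
        # overall counters: times bought, quantity bought, amount spent
        feats["has_bought_company"] += 1
        feats["has_bought_company_q"] += float(t["purchasequantity"])
        feats["has_bought_company_a"] += float(t["purchaseamount"])
        # the same counters restricted to the recent date windows
        for window in (30, 60, 90, 180):
            if 0 <= days_ago < window:
                feats["has_bought_company_%d" % window] += 1
                feats["has_bought_company_q_%d" % window] += float(t["purchasequantity"])
                feats["has_bought_company_a_%d" % window] += float(t["purchaseamount"])
    if not feats:
        feats["never_bought_company"] = 1.0  # negative feature
    return dict(feats)
```

The category and brand versions are the same loop with a different key to match on.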
Has bought from category on the coupon offer
This is basically the same as above, only for the category. We also generate features for date ranges and generate negative features if the shopper has never bought from the category on offer.
Has bought brand on the coupon offer
We check if the user has bought the brand before that is on the coupon offer. We then generate the same features as above.
Combinations of brand, category and company on offer
If the shopper has bought from the brand, category and company before, we generate a specific feature for that, as well as for individual combinations like brand + company. And we again generate negative features when the shopper never bought a combination.
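The combination flags can be sketched as follows (a simplified sketch, not the repo's code; the offer and transaction keys are assumptions, while the has_bought_/never_bought_ naming follows the example train line shown later in the post):

```python
def combination_features(transactions, offer):
    """Flag which combinations of the offer's brand, company and category
    the shopper has bought before, plus negative flags for the rest."""
    bought = {
        "brand_company": False,
        "brand_category": False,
        "company_category": False,
        "brand_company_category": False,
    }
    for t in transactions:
        b = t["brand"] == offer["brand"]
        co = t["company"] == offer["company"]
        ca = t["category"] == offer["category"]
        bought["brand_company"] |= b and co
        bought["brand_category"] |= b and ca
        bought["company_category"] |= co and ca
        bought["brand_company_category"] |= b and co and ca
    feats = {}
    for name, hit in bought.items():
        if hit:
            feats["has_bought_" + name] = 1   # positive feature
        else:
            feats["never_bought_" + name] = 1  # negative feature
    return feats
```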
The offer value and offer quantity
This is a constant for every offer. The offer value might influence the number of repeat buyers; it ranges between about 0.75 and 5.
We also get the offer quantity (how many items can be redeemed with the coupon). We think this may influence the number of repeat buyers. UPDATE: Kaggle user Mathieu Cliche has posted that offer_quantity is always 1 in the train data, rendering this feature useless.
Total shopper spend
We are interested in how much the shopper spent in total. We hope this doesn't change too much from the original data set when we only count the amounts from the reduced data set. For every transaction still in the reduced data set we take the amount and add it all up. We think that total shopper spend will influence the chance of future repeat buys.
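The accumulation itself is only a few lines. A sketch (the `id` and `purchaseamount` column names are assumptions based on the competition's transactions.csv):

```python
from collections import defaultdict

def total_spend(reduced_transactions):
    """Sum the purchase amounts per shopper over the reduced transaction set."""
    totals = defaultdict(float)
    for t in reduced_transactions:
        totals[t["id"]] += float(t["purchaseamount"])
    return dict(totals)
```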
We have now generated a test set and a train set with our features. A line from the train set could look like:
1 '86246 |f offer_quantity:1 has_bought_company_a:243.63 has_bought_brand_180:7.0 has_bought_brand_a_180:23.13 has_bought_brand_q_180:7.0 offer_value:2 has_bought_brand_a_60:14.95 has_bought_company_q:37.0 has_bought_brand_q_30:1.0 has_bought_brand:8.0 has_bought_company_q_30:6.0 has_bought_brand_30:1.0 has_bought_company_q_60:16.0 has_bought_brand_company:1 has_bought_brand_90:6.0 has_bought_company_q_180:19.0 has_bought_company_30:6.0 has_bought_brand_a:28.71 has_bought_company_a_90:106.13 has_bought_brand_q_90:6.0 never_bought_category:1 has_bought_company_180:19.0 has_bought_brand_q:9.0 has_bought_company_a_30:46.74 has_bought_company_q_90:17.0 has_bought_brand_a_30:4.59 total_spend:4140.41 has_bought_company_a_60:100.44 has_bought_brand_q_60:5.0 has_bought_company_a_180:113.21 has_bought_company_60:16.0 has_bought_brand_60:5.0 has_bought_company_90:17.0 has_bought_brand_a_90:20.64 has_bought_company:36.0
The test set should look similar.
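Producing a line in this format is mostly string joining. A minimal sketch (the label, `'tag` and `|f` namespace layout follow the example line above; the function name is ours):

```python
def to_vw_line(label, shopper_id, features, namespace="f"):
    """Format one example in Vowpal Wabbit's input format:
    label 'tag |namespace name:value name:value ..."""
    pairs = " ".join("%s:%s" % (name, value) for name, value in features.items())
    return "%s '%s |%s %s" % (label, shopper_id, namespace, pairs)
```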
Now we run Vowpal Wabbit (we use version 7.1; your results may vary with another version) and train a model with our train set.
vw shop.train.vw -c -k --passes 40 -l 0.85 -f shop.model.vw --loss_function quantile --quantile_tau 0.6
-c -k --passes 40 says to use a cache, kill any previous cache and run 40 passes
-l 0.85 sets the learning rate to 0.85
-f shop.model.vw saves the model
--loss_function quantile says to use quantile regression
--quantile_tau 0.6 is a parameter to tweak when using the quantile loss function.
We get an average loss of 0.1562.
Now we use the model and the test set to get predictions:
vw shop.test.vw -t -i shop.model.vw -p shop.preds.txt
-t says to test only
-i says to load a certain model
-p says to store predictions
vw-varinfo is a small wrapper around Vowpal Wabbit that exposes all variables of a model in human-readable form. We can use it to check how relevant our features are. If we run the output of vw-varinfo through our plotfeatures.py script from the Movie review sentiment analysis post, we get the image below:
We are almost there. We have a file with predictions (our output from Vowpal Wabbit) and we need to turn this into the Kaggle submission format. You can do this with generate_submission() or write your own script for it.
We lack predictions for about 200 shoppers, because their transaction data did not include any product from a category, brand or company on offer. We predict 0 for these cases.
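A sketch of what generate_submission() might look like (not the repo's exact code; it assumes each VW prediction line reads `<score> <tag>`, which is what `-p` produces when the examples carry 'tags, and the `id,repeatProbability` header is an assumption based on the competition's submission format):

```python
import csv

def generate_submission(preds_path, out_path, missing_ids=(), missing_score=0.0):
    """Turn a VW predictions file into a Kaggle submission CSV."""
    with open(preds_path) as f_in, open(out_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["id", "repeatProbability"])
        for line in f_in:
            score, shopper_id = line.split()
            writer.writerow([shopper_id, score])
        # shoppers with no matching transactions get a default prediction
        for shopper_id in missing_ids:
            writer.writerow([shopper_id, missing_score])
```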
Our first submission scored position 3 on the public leaderboard. After some tweaking the score improved, and it again sits at position 3. Using Python, a just-has-to-work mentality and the magnificent tool Vowpal Wabbit, we were able to create a competitive submission in a few hours.
The scripts take about 15 minutes to produce this submission from the raw data, use under 1GB of memory, and will thus run even on budget laptops.
You can find all code in the MLWave GitHub repo. If you use this code and find better parameters for Vowpal Wabbit, or a better feature to use, we'd really appreciate it if you posted it here, on the DataTau post, or on the competition forums. Happy competition!