Shopping cart

Predicting repeat buyers using purchase history

Another Kaggle contest means another chance to try out Vowpal Wabbit. This time on a data set of nearly 350 million rows. We will discuss feature engineering for the latest Kaggle contest and how to get a top 3 public leaderboard score (~0.59347 AUC).

A short competition description

The competition is to predict repeat buyers (those who redeem a coupon and purchase that product afterwards). For this we have the labelled data (did become repeat buyer, did not become repeat buyer) for about 150.000 shoppers (the train set).

Our task is to predict the labels for about 150.000 other shoppers (the test set). For this we can use a file called transactions.csv. It’s a huge file (unzipped about 22GB) containing nearly 350 million rows. The total amount spend in the transaction data is nearing 1.5 billion.

Data exploration

You can download all data and explore the structure of the data while transactions.csv is downloading. User Vadim Kyssa has created a sample file of transactions.csv with only a few shoppers.

Open the files in your favorite text editor or excel program. Check out the leaderboard for a short description of the 4 benchmarks.

Data reduction

Master Kaggle user BreakfastPirate (Steve Donoho) posted a way to reduce the dataset. If you check out the offers.csv file you’ll see all the categories and companies a coupon offer can have. We can discard the rows from the transactions data which don’t have a category id or a company id which is on offer.

The function reduce_data() in the messy code accompanying this post can do this for you. It runs in about 5-10 minutes and will reduce the ~350 million rows to ~27 million rows. This makes our future model code more manageble (around 1.6GB).

Feature engineering

A large part of this competition is feature engineering: Creating good indicative features from the purchase history. From the benchmark we already have 4 features: Has bought in the coupon offer category before, has bought the brand before, has bought from the company before, has bought company + category + brand on offer.

We could put these into a binary feature (1 or 0), but we have all the transaction data, we can count how many times someone has bought inside a category.

Has bought from company on offer

We generate a feature: has_bought_company where we count how many times the shopper has bought a product from the company on offer. We generate a related feature: has_bought_company_q which holds the quantity bought (sometimes shoppers buy multiple items at once). And another feature that counts the total amount spend on a company on offer: has_bought_company_a.

We also generate features that count the days between the previous purchases and the date of the coupon offer. So if for instance the shopper spend 50$ on a company in the last 3 months we would set has_bought_company_a_90 to 50. We generate these features for last 30 days, last 60, last 90 and last 180 days.

If the shopper has never bought a product from a company on the coupon offer then we generate a negative feature: has_not_bought_company.

Has bought from category on the coupon offer

This is basically the same as above, only for the category. We also generate features for date ranges and generate negative features if the shopper has never bought from the category on offer.

Has bought brand on the coupon offer

We check if the user has bought the brand before that is on the coupon offer. We then generate the same features as above.

Combinations of brand, category and company on offer

If the shopper has bought from the brand, category and company before we generate a specific feature for that. Also for individual combinations like brand + company. And we again generate negative features, like: has_not_bought_brand_company.

The offer value and offer quantity

This is a constant for every offer. The offer value might influence the number of repeat buyers. The offer value fluctuates between about 5 and 0.75.

We also get the offer quantity (how many items can be redeemed with the coupon). We think this may influence number of repeat buyers. UPDATE: Kaggle user Mathieu Cliche has posted that the offer_quantity is always 1 in the train data, rendering this feature useless.

Total shopper spend

We are interested to see how much the shopper spend. We hope this won’t change too much from the original data set, if we only count the amounts from the reduced data set. For every transaction still in the reduced data set we take the amount and add it all up. We think that total shopper spend will influence future chance of repeat buys.

Vowpal Wabbit

We have now generated a test set and a train set with our features. A line from the train set could look like:

1 '86246 |f offer_quantity:1 has_bought_company_a:243.63 has_bought_brand_180:7.0 has_bought_brand_a_180:23.13 has_bought_brand_q_180:7.0 offer_value:2 has_bought_brand_a_60:14.95 has_bought_company_q:37.0 has_bought_brand_q_30:1.0 has_bought_brand:8.0 has_bought_company_q_30:6.0 has_bought_brand_30:1.0 has_bought_company_q_60:16.0 has_bought_brand_company:1 has_bought_brand_90:6.0 has_bought_company_q_180:19.0 has_bought_company_30:6.0 has_bought_brand_a:28.71 has_bought_company_a_90:106.13 has_bought_brand_q_90:6.0 never_bought_category:1 has_bought_company_180:19.0 has_bought_brand_q:9.0 has_bought_company_a_30:46.74 has_bought_company_q_90:17.0 has_bought_brand_a_30:4.59 total_spend:4140.41 has_bought_company_a_60:100.44 has_bought_brand_q_60:5.0 has_bought_company_a_180:113.21 has_bought_company_60:16.0 has_bought_brand_60:5.0 has_bought_company_90:17.0 has_bought_brand_a_90:20.64 has_bought_company:36.0

The test set should look similar.


Now we run Vowpal Wabbit (we use version 7.1, your results may vary if you pick another version) and train a model with our train set.

vw shop.train.vw -c -k --passes 40 -l 0.85 -f shop.model.vw --loss_function quantile --quantile_tau 0.6


  • -c -k --passes 40 says to use a cache, kill any previous cache and run 40 passes
  • -l 0.85 sets the learning rate to 0.85
  • -f shop.model.vw saves the model
  • --loss_function quantile says to use quantile regression
  • --quantile_tau 0.6 is a parameter to tweak when using the quantile loss function.

We get an average loss of 0.1562.


Now we use the model and the train set to get us predictions:

vw shop.test.vw -t -i shop.model.vw -p shop.preds.txt


  • -t says to test only
  • -i says to load a certain model
  • -p says to store predictions

Feature relevance

vw-varinfo is a small wrapper around Vowpal Wabbit which exposes all variables of a model in human readable form. We can use this to check how relevant our features are. If we run the output of vw-varinfo through our script from the Movie review sentiment analysis post we get the image below:

a feature plot

Kaggle submission

We are almost there. We have a file with predictions (our output from Vowpal Wabbit) and we need to turn this into the Kaggle submission format. You can do this with generate_submission() or write your own script for it.

We lack predictions for about 200 shoppers as their transaction data did not include any product from a category, brand or company on offer. We predict a 0 for these cases.

Our first submission scored position 3 on the public leaderboard. After some tweaking it increased the score and is again at position 3. Using Python, a just-has-to-work mentality and the magnificent tool Vowpal Wabbit we were able to create a competing submission in a few hours.

The scripts takes about 15 minutes to produce this submission from the raw data. It takes under 1 GB of memory and will thus run even on budget laptops.


You can find all code in the MLWave Github repo. If you use this code and find better parameters for Vowpal Wabbit, or find a better feature to use, we’d really appreciate it if you posted it here, on the
dt DataTau post or on the competition forums. Happy competition!

11 thoughts on “Predicting repeat buyers using purchase history”

  1. When i run ./vw, i got
    ./vw: line 6: syntax error near unexpected token `newline’
    ./vw: line 6: `’

    Do you know why? Thanks.

    1. Hi! Here is a validator for VW-formatted files:

      Check that you have no empty lines, that your lines are trimmed from spaces, and that your sample ID starts with ‘ (a single quote) and not `(backtick).

      If you still have problems paste a sample of your .vw dataset and I’ll a deeper look.

      1. Hi, Triskelion, thanks for your reply.
        I got another problem….
        I followed your guide
        and downloaded executable from this link:
        and when i run vw, i got :
        vowpalwabbit\vw.exe is not compatible with the version of Windows you’re running. Check your computer’s system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher.

        My computer is win7 32bit. Maybe i should try this under Unbuntu….
        Thanks for your time.


        1. Heya douhuang! Another reader had found a problem with that tutorial too. I am updating that post now to include simpler building instructions (for any bit system), so you won’t need to download the executable. Hang on!

  2. I’ve been working on this problem in R and outside of VW. I’ve been using regularized regression through glmnet package (with 10-fold cross validation) and have come up short every time of the .59 benchmark (with a best around .578. I’ve run it with the exact features mentioned in this analysis, so I’m not sure why I would get something different. I had understood quantile regression to be performed on continuous variables and not logistic models. Is there something that is happening during the quantile regression?

    I realize this post isn’t about R, but maybe my question could give some insight into what is going on in VW.

    1. Hey, ddunder! I’ve heard more people with R having troubles getting a good score (higher than 59).

      About the quantile regression: It seems that I picked the ‘wrong’ loss function to get a higher score. I think, since we are predicting probabilities of repeat trips, that logistic loss should be optimized (if I understand it correctly).

      Quantile loss: House prices (minimizer = absolute value)
      Hinge loss: 0 or 1 (Is this email a spam mail? Closest 0/1 approximator)
      Squared loss: What is the expected return on a stock? (Minimizer: expectation)
      Logistic loss: Probability of click on advertisement (minimizer: probability)

      Do note that VW does a little more than just regression. Its default is a normalized adaptive importance-invariant update rule. Also quantile regression is good for ranking, which I think helps with AUC evaluation.

      About R, glmnet and the features mentioned I can’t say too much, without updating the forum post on the Kaggle forums. I will soon, once I find the time to optimize my current score (basically this benchmark with a few tweaks).

  3. I have a pretty stupid question but from this whole tutorial there was one thing that I could not figure out.

    “We lack predictions for about 200 shoppers as their transaction data did not include any product from a category, brand or company on offer. We predict a 0 for these cases.”

    How do you find the customerID’s of those ~200 people?
    I spent many hours trying to find them while I know it must be something really simple…

  4. Hi Triskelion,
    Thanks a lot for this informative blog. I have used logistic regression to build my model and I am getting a score of 60.4. I am now trying quantile regression using vowpal wabbit. I had a couple of queries regarding quantile regression:
    a. Do the variables that I use need to be relatively free of correlation as is the case with logistic regression?Or is it fine if the variables have large correlation between them and vowpal wabbit takes care of it?
    b. Do the variables need to have monoticcally increasing or decreasing trend w.r.t. the response rates? Lets take an exmaple of variable like number of transactions. Do the response rates have to increase with increasing number of transactions for quantile regression to work well? Because I have some variables where the response rate initially increases and then decreases as the variable’s value increases. Does quantile regression give out probabilities which match this trend?

    Thank You

  5. Hello Triskelion

    I ve tried to run your features in VW. Your python script worked really well for me. But I ve problems running it in VW.

    I start VW with ./vw, but when I paste your code, I get this:
    vw train.vw -c -k –passes 40 -l 0.85 -f model.vw –loss_function quantile –quantile_tau 0.6
    malformed example!
    words.size() = 13
    vw test.vw -t -i model.vw -p shop.preds.txt
    malformed example!
    words.size() = 6

    0.000000 0.000000 4 4.0 unknown 0.0000 1

    What am I missing? Where I have to save my VW files? I ve tried my admin folder, as well as in the VW folder I run the shell script vw.

    And I asked myself another question. I want to transform your python script into LIBSVM format to run in in Apache Spark. Therefore I need to change input format:

    1 ’86246 |f offer_quantity:1 ….

    : : …

    I can not figure out where set the labels in the python code (1 ’86246 |f ) . Is this a weight after the upper semicolon or something else?

    Kind regards from Zurich,

  6. Hi Triskellon,

    I couldn’t run your python code. ( I am new to python) Whenever i try to run it, i am receiving an syntax error on line 197( print e )
    Kindly help with this code.

Leave a Reply

Your email address will not be published. Required fields are marked *