Shopping cart

Predicting repeat buyers using purchase history

Another Kaggle contest means another chance to try out Vowpal Wabbit. This time on a data set of nearly 350 million rows. We will discuss feature engineering for the latest Kaggle contest and how to get a top 3 public leaderboard score (~0.59347 AUC).

A short competition description

The competition is to predict repeat buyers (those who redeem a coupon and purchase that product afterwards). For this we have the labelled data (did become repeat buyer, did not become repeat buyer) for about 150.000 shoppers (the train set).

Our task is to predict the labels for about 150.000 other shoppers (the test set). For this we can use a file called transactions.csv. It’s a huge file (unzipped about 22GB) containing nearly 350 million rows. The total amount spend in the transaction data is nearing 1.5 billion.

Data exploration

You can download all data and explore the structure of the data while transactions.csv is downloading. User Vadim Kyssa has created a sample file of transactions.csv with only a few shoppers.

Open the files in your favorite text editor or excel program. Check out the leaderboard for a short description of the 4 benchmarks.

Data reduction

Master Kaggle user BreakfastPirate (Steve Donoho) posted a way to reduce the dataset. If you check out the offers.csv file you’ll see all the categories and companies a coupon offer can have. We can discard the rows from the transactions data which don’t have a category id or a company id which is on offer.

The function reduce_data() in the messy code accompanying this post can do this for you. It runs in about 5-10 minutes and will reduce the ~350 million rows to ~27 million rows. This makes our future model code more manageble (around 1.6GB).

Feature engineering

A large part of this competition is feature engineering: Creating good indicative features from the purchase history. From the benchmark we already have 4 features: Has bought in the coupon offer category before, has bought the brand before, has bought from the company before, has bought company + category + brand on offer.

We could put these into a binary feature (1 or 0), but we have all the transaction data, we can count how many times someone has bought inside a category.

Has bought from company on offer

We generate a feature: has_bought_company where we count how many times the shopper has bought a product from the company on offer. We generate a related feature: has_bought_company_q which holds the quantity bought (sometimes shoppers buy multiple items at once). And another feature that counts the total amount spend on a company on offer: has_bought_company_a.

We also generate features that count the days between the previous purchases and the date of the coupon offer. So if for instance the shopper spend 50$ on a company in the last 3 months we would set has_bought_company_a_90 to 50. We generate these features for last 30 days, last 60, last 90 and last 180 days.

If the shopper has never bought a product from a company on the coupon offer then we generate a negative feature: has_not_bought_company.

Has bought from category on the coupon offer

This is basically the same as above, only for the category. We also generate features for date ranges and generate negative features if the shopper has never bought from the category on offer.

Has bought brand on the coupon offer

We check if the user has bought the brand before that is on the coupon offer. We then generate the same features as above.

Combinations of brand, category and company on offer

If the shopper has bought from the brand, category and company before we generate a specific feature for that. Also for individual combinations like brand + company. And we again generate negative features, like: has_not_bought_brand_company.

The offer value and offer quantity

This is a constant for every offer. The offer value might influence the number of repeat buyers. The offer value fluctuates between about 5 and 0.75.

We also get the offer quantity (how many items can be redeemed with the coupon). We think this may influence number of repeat buyers. UPDATE: Kaggle user Mathieu Cliche has posted that the offer_quantity is always 1 in the train data, rendering this feature useless.

Total shopper spend

We are interested to see how much the shopper spend. We hope this won’t change too much from the original data set, if we only count the amounts from the reduced data set. For every transaction still in the reduced data set we take the amount and add it all up. We think that total shopper spend will influence future chance of repeat buys.

Vowpal Wabbit

We have now generated a test set and a train set with our features. A line from the train set could look like:

1 '86246 |f offer_quantity:1 has_bought_company_a:243.63 has_bought_brand_180:7.0 has_bought_brand_a_180:23.13 has_bought_brand_q_180:7.0 offer_value:2 has_bought_brand_a_60:14.95 has_bought_company_q:37.0 has_bought_brand_q_30:1.0 has_bought_brand:8.0 has_bought_company_q_30:6.0 has_bought_brand_30:1.0 has_bought_company_q_60:16.0 has_bought_brand_company:1 has_bought_brand_90:6.0 has_bought_company_q_180:19.0 has_bought_company_30:6.0 has_bought_brand_a:28.71 has_bought_company_a_90:106.13 has_bought_brand_q_90:6.0 never_bought_category:1 has_bought_company_180:19.0 has_bought_brand_q:9.0 has_bought_company_a_30:46.74 has_bought_company_q_90:17.0 has_bought_brand_a_30:4.59 total_spend:4140.41 has_bought_company_a_60:100.44 has_bought_brand_q_60:5.0 has_bought_company_a_180:113.21 has_bought_company_60:16.0 has_bought_brand_60:5.0 has_bought_company_90:17.0 has_bought_brand_a_90:20.64 has_bought_company:36.0

The test set should look similar.


Now we run Vowpal Wabbit (we use version 7.1, your results may vary if you pick another version) and train a model with our train set.

vw shop.train.vw -c -k --passes 40 -l 0.85 -f shop.model.vw --loss_function quantile --quantile_tau 0.6


  • -c -k --passes 40 says to use a cache, kill any previous cache and run 40 passes
  • -l 0.85 sets the learning rate to 0.85
  • -f shop.model.vw saves the model
  • --loss_function quantile says to use quantile regression
  • --quantile_tau 0.6 is a parameter to tweak when using the quantile loss function.

We get an average loss of 0.1562.


Now we use the model and the train set to get us predictions:

vw shop.test.vw -t -i shop.model.vw -p shop.preds.txt


  • -t says to test only
  • -i says to load a certain model
  • -p says to store predictions

Feature relevance

vw-varinfo is a small wrapper around Vowpal Wabbit which exposes all variables of a model in human readable form. We can use this to check how relevant our features are. If we run the output of vw-varinfo through our script from the Movie review sentiment analysis post we get the image below:

a feature plot

Kaggle submission

We are almost there. We have a file with predictions (our output from Vowpal Wabbit) and we need to turn this into the Kaggle submission format. You can do this with generate_submission() or write your own script for it.

We lack predictions for about 200 shoppers as their transaction data did not include any product from a category, brand or company on offer. We predict a 0 for these cases.

Our first submission scored position 3 on the public leaderboard. After some tweaking it increased the score and is again at position 3. Using Python, a just-has-to-work mentality and the magnificent tool Vowpal Wabbit we were able to create a competing submission in a few hours.

The scripts takes about 15 minutes to produce this submission from the raw data. It takes under 1 GB of memory and will thus run even on budget laptops.


You can find all code in the MLWave Github repo. If you use this code and find better parameters for Vowpal Wabbit, or find a better feature to use, we’d really appreciate it if you posted it here, on the
dt DataTau post or on the competition forums. Happy competition!

13 thoughts on “Predicting repeat buyers using purchase history”

  1. When i run ./vw, i got
    ./vw: line 6: syntax error near unexpected token `newline’
    ./vw: line 6: `’

    Do you know why? Thanks.

    1. Hi! Here is a validator for VW-formatted files:

      Check that you have no empty lines, that your lines are trimmed from spaces, and that your sample ID starts with ‘ (a single quote) and not `(backtick).

      If you still have problems paste a sample of your .vw dataset and I’ll a deeper look.

      1. Hi, Triskelion, thanks for your reply.
        I got another problem….
        I followed your guide
        and downloaded executable from this link:
        and when i run vw, i got :
        vowpalwabbit\vw.exe is not compatible with the version of Windows you’re running. Check your computer’s system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher.

        My computer is win7 32bit. Maybe i should try this under Unbuntu….
        Thanks for your time.


        1. Heya douhuang! Another reader had found a problem with that tutorial too. I am updating that post now to include simpler building instructions (for any bit system), so you won’t need to download the executable. Hang on!

  2. I’ve been working on this problem in R and outside of VW. I’ve been using regularized regression through glmnet package (with 10-fold cross validation) and have come up short every time of the .59 benchmark (with a best around .578. I’ve run it with the exact features mentioned in this analysis, so I’m not sure why I would get something different. I had understood quantile regression to be performed on continuous variables and not logistic models. Is there something that is happening during the quantile regression?

    I realize this post isn’t about R, but maybe my question could give some insight into what is going on in VW.

    1. Hey, ddunder! I’ve heard more people with R having troubles getting a good score (higher than 59).

      About the quantile regression: It seems that I picked the ‘wrong’ loss function to get a higher score. I think, since we are predicting probabilities of repeat trips, that logistic loss should be optimized (if I understand it correctly).

      Quantile loss: House prices (minimizer = absolute value)
      Hinge loss: 0 or 1 (Is this email a spam mail? Closest 0/1 approximator)
      Squared loss: What is the expected return on a stock? (Minimizer: expectation)
      Logistic loss: Probability of click on advertisement (minimizer: probability)

      Do note that VW does a little more than just regression. Its default is a normalized adaptive importance-invariant update rule. Also quantile regression is good for ranking, which I think helps with AUC evaluation.

      About R, glmnet and the features mentioned I can’t say too much, without updating the forum post on the Kaggle forums. I will soon, once I find the time to optimize my current score (basically this benchmark with a few tweaks).

  3. I have a pretty stupid question but from this whole tutorial there was one thing that I could not figure out.

    “We lack predictions for about 200 shoppers as their transaction data did not include any product from a category, brand or company on offer. We predict a 0 for these cases.”

    How do you find the customerID’s of those ~200 people?
    I spent many hours trying to find them while I know it must be something really simple…

  4. Hi Triskelion,
    Thanks a lot for this informative blog. I have used logistic regression to build my model and I am getting a score of 60.4. I am now trying quantile regression using vowpal wabbit. I had a couple of queries regarding quantile regression:
    a. Do the variables that I use need to be relatively free of correlation as is the case with logistic regression?Or is it fine if the variables have large correlation between them and vowpal wabbit takes care of it?
    b. Do the variables need to have monoticcally increasing or decreasing trend w.r.t. the response rates? Lets take an exmaple of variable like number of transactions. Do the response rates have to increase with increasing number of transactions for quantile regression to work well? Because I have some variables where the response rate initially increases and then decreases as the variable’s value increases. Does quantile regression give out probabilities which match this trend?

    Thank You

  5. Hello Triskelion

    I ve tried to run your features in VW. Your python script worked really well for me. But I ve problems running it in VW.

    I start VW with ./vw, but when I paste your code, I get this:
    vw train.vw -c -k –passes 40 -l 0.85 -f model.vw –loss_function quantile –quantile_tau 0.6
    malformed example!
    words.size() = 13
    vw test.vw -t -i model.vw -p shop.preds.txt
    malformed example!
    words.size() = 6

    0.000000 0.000000 4 4.0 unknown 0.0000 1

    What am I missing? Where I have to save my VW files? I ve tried my admin folder, as well as in the VW folder I run the shell script vw.

    And I asked myself another question. I want to transform your python script into LIBSVM format to run in in Apache Spark. Therefore I need to change input format:

    1 ’86246 |f offer_quantity:1 ….

    : : …

    I can not figure out where set the labels in the python code (1 ’86246 |f ) . Is this a weight after the upper semicolon or something else?

    Kind regards from Zurich,

  6. Hi Triskellon,

    I couldn’t run your python code. ( I am new to python) Whenever i try to run it, i am receiving an syntax error on line 197( print e )
    Kindly help with this code.

  7. Hi,I have some doubt in reduce data set,when I am looking in it,I have seen there are so many transaction is done by a particular customer_id in a particular month.I am worried about how it is possible that a customer can buy more then hundred time, if I consider this is a realistic data set.
    please suggest me this data is correct or not.
    this is summary for year 2013
    Id April February January June March May
    86246 3063 353 292 0 4524 0
    86252 0 384 387 0 6331 0
    12262064 53 41 19 98 29 75
    12277270 63 122 84 36 63 96
    12332190 30 62 68 12 36 29

    1. Hi Aftab,

      First a note: real life datasets are always messy/complex. The is simply the nature of the beast. This dataset is no exception. Noise, outliers, duplication, data collection artifacts, missing values, data health issues, etc. are part of the challenge, both on Kaggle and in academics/business.

      That said: You were right in identifying this issue, good data exploration skills! If I remember correctly this issue also came up during the competition. The hypothesis on the forum was that these customer_id’s denote company credit cards (everybody uses the same card), or some account where the stores book everything on / or books corner cases on. For instance, my supermarket allows me to use their own Bonus Card, when I forget my own.

      Related examples from the real world might be the license plate “NO PLATE” ( ). One funny motorist using this plate got zillions of tickets, because agents entered data in the system with “NO PLATE” when an offending car did not have a license plate. Another real-life example is for hospitals where there are special insurance accounts for illegal immigrants without any paper work. The hospitals book all costs on the same insurance account when treating people without any paper work, resulting in a single account with a huge anomalous amount of costs.

      Another Kaggle competition had a dataset where some auctioned cars had an age of 999. Turns out that when the age of the car was not known they would be registered as the max age possible.

      These issues, however, are often not game-breakers. It is up to you to deal with it, and still get a good evaluation score (it is still very much possible to do better than random guessing on this contest). Perhaps you find that performance improves if you remove these anomalous rows from the dataset. Or you create a separate model that specializes in predicting these corner cases. Get creative! Use the same data mining skills that allowed you to identify this issue.

Leave a Reply

Your email address will not be published. Required fields are marked *