
Predicting CTR with online machine learning

Good clicklog datasets are hard to come by. Luckily CriteoLabs released a week’s worth of data — a whopping ~11GB! — for a new Kaggle contest. The task is to predict the click-through rate for ads. We will use online machine learning with Vowpal Wabbit to beat the logistic regression benchmark and grab the no. 1 position on the leaderboard.


A demo with data from this contest was added to Vowpal Wabbit. Now that the contest is over: go here if you want to download the dataset, freely made available by Criteo.

The winning team also won the following Avazu CTR prediction challenge and released Field-Aware Factorization Machines.

The winning team used a mixture of Factorization Machines and GBRT. Code here.

The highest-scoring team using Vowpal Wabbit was Guocong Song, in 3rd place. Method and code here. In short: multiple models, polynomial learning and featuremasks.

The highest-scoring team using this benchmark (and cubic features) was Silogram, in 14th place.

Our team got 29th place out of 718 competing teams.

tinrtgu posted a very cool benchmark on the forums that uses only standard Python libraries and under 200MB of memory. Now is your chance to play around with online learning, the hash trick, adaptive learning and logistic loss and get a score of ~0.46902 on the public leaderboard.

FastML wrote a blog about this competition with some tips to improve this benchmark.

Combining online advertising with machine learning is like strawberries with chocolate and vanilla: you have large amounts of data, an almost endless variety of features to engineer, and profitable patterns waiting to be discovered.

Identifying (and serving) those ads that have a higher probability of a click translates into more profit and higher quality:

  • More profit, because in most online advertisement networks, advertisers are only charged when a user clicks on their ads.
  • A higher quality, because higher Click-Through-Rates (CTR) imply more relevant advertising (giving users what they want).

Behavioral Retargeting

Behavioral retargeting is a form of online advertising where the advertisements are targeted according to previous user behavior, often in the case where a visit did not result in a sale or conversion.


While Kagglers may try to optimize AUC (Area Under the Curve), online advertisers generally aim to optimize Return on Investment (ROI).

Return on Investment = Net profit / investment

Optimizing this formula may be done by:

  • keeping the investment low, for example by looking for cheap, but effective marketing opportunities.
  • increasing net profits, for example by optimizing the conversion ratio (the amount of web users that turn into paying customers).
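As a toy illustration of these two levers (the numbers are invented, not from the competition):

```python
def roi(net_profit, investment):
    """Return on Investment = net profit / investment."""
    return net_profit / investment

# A hypothetical campaign:
baseline = roi(net_profit=500.0, investment=1000.0)  # 0.5
# Lever 1: keep the investment low (cheaper ad placements)
cheaper = roi(net_profit=500.0, investment=800.0)    # 0.625
# Lever 2: increase net profit (better conversion ratio)
better = roi(net_profit=650.0, investment=1000.0)    # 0.65
```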

Behavioral retargeting can increase ROI through:

  • Lowering investment. The networks on which to serve the retargeted ads are usually cheaper than the market average.
  • Increasing profit. Shopping cart abandonment is a real problem for online retailers: these customers were lost in the last stages of the conversion chain. Retargeting with products catered to the user wins a certain percentage of them back.



Criteo is a retargeting company working together with online retailers to serve online advertisements.

Prediction engine

One of the core products of Criteo is a prediction engine. The engine is capable of predicting effective personalized advertisements on a web scale: real-time optimization of ad campaigns, click-through rates and conversion.

For this prediction engine they are always looking for ways to engineer new informative features.

Behavioral features

A behavioral numerical feature may be the count of previous purchases. A behavioral categorical feature may be the product ID that was added to a shopping cart, but not purchased.

The engineering blog also shows how Criteo is creating web graphs (without using web crawlers) to engineer new features.



The web as seen by Criteo [from the Criteo Engineering blog]

Show us the data!

For this contest Criteo’s R&D division, CriteoLabs, has released a week’s worth of click data. CriteoLabs is one of the biggest European R&D divisions with a focus on predictive advertising.

The train set has 45,840,616 labelled ads (click-through “1” or not “0”). The test set has 6,042,135 unlabelled ads. Both sets are sampled to have a near even distribution of clicks and non-clicks.

We have 13 columns of integer features (mostly count features), and 26 columns with hashed categorical features.

Though the exact nature of the features is unknown to us, according to a competition admin (Olivier Chapelle), they fall in the following categories:

  • Publisher features, such as the domain of the url where the ad was displayed;
  • Advertiser features (advertiser id, type of products,…)
  • User features, for instance browser type;
  • Interaction of the user with the advertiser, such as the number of the times the user visited the advertiser website.

Our task now is to create a model that will predict the probability of a click.

Online Machine Learning with VW

I think that sticking with your favorite machine learning tool or algorithm for all classification and regression problems is like picking a chess opening and playing only that against all opponents. You can get lucky and pick a solid opening that will do well most of the time, or you are stuck with a quirky, outdated, or peculiar opening.

Vowpal Wabbit is how Elmer Fudd would pronounce Vorpal Rabbit. I picture the Killer Rabbit of Caerbannog brandishing a +5 vorpal sword. It would give any player an unfair advantage if they found it early on in the game.

One, two! One, two! And through and through the vorpal blade went snicker-snack! He left it dead, and with its head he went galumphing back. — Jabberwocky

If you follow my blogs, you’ll know that I’ve tried Vowpal Wabbit in many Kaggle competitions. It won’t perform the best in all competitions, though it will do well in most (Multiclass Classification, Regression, online LDA, Matrix Factorization, Structured Prediction, Neural network reduction, Feature interactions), and is a robust addition to many ensembles.

This competition is made for fast online machine learning (with Vowpal Wabbit or Sofia-ML). Models may quickly become outdated, so they will probably need constant retraining (we need fast convergence). The collected click data is huge (often far larger than fits into memory) or unbounded (you may constantly collect new categorical feature values).

For a deeper understanding of large-scale online machine learning, watch this fine video tutorial with John Langford. I can also suggest the relevant seminal paper “A Reliable Effective Terascale Linear Learning System”, of which one of the authors is a competition admin.

Picking a loss function

Online machine learning with VW learns from samples one at a time. When you train a model, it iterates through the train dataset and optimizes a loss function.

Vowpal Wabbit has five loss functions:

  • Squared loss. Useful for regression problems, when minimizing expectation. For example: Expected return on a stock.
  • Classic loss. Vanilla squared loss (without the importance weight aware update).
  • Quantile loss. Useful for regression problems, for example: predicting house pricing.
  • Hinge loss. Useful for classification problems, minimizing the yes/no question (closest 0-1 approximation). For example: Keyword_tag or not.
  • Log loss. Useful for classification problems, minimizer = probability, for example: Probability of click on ad.

Without seeing the evaluation page of the competition, our hunch should be to pick logarithmic loss. That the competition metric really is logarithmic loss means we gain a massive amount of information just by training a model with Vowpal Wabbit: with its holdout functionality, one in ten samples will be used to calculate and report the average loss.
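To make the metric concrete, here is a minimal sketch of average logarithmic loss as I understand the Kaggle metric (my own toy implementation, not VW’s code):

```python
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Average logarithmic loss over binary labels (1 = click, 0 = no click)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident correct prediction costs little, a confident mistake costs a lot:
print(logloss([1, 0], [0.9, 0.1]))  # ~0.105
print(logloss([1, 0], [0.1, 0.9]))  # ~2.303
```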

Vowpal Wabbit is fast thanks to its compiled C++ code, but also because it employs the hashing trick. Our feature space is fixed to a number of bits, so we do not need a lookup table in memory to convert text or categorical values into numerical features.
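A rough sketch of the idea (illustration only: VW uses murmurhash internally, here Python’s hashlib stands in, and the size mirrors VW’s default of -b 18):

```python
import hashlib

BITS = 18          # mirrors VW's default -b 18
SIZE = 2 ** BITS   # number of weight slots

def feature_index(namespace, value):
    """Hash a feature straight to an index in the fixed-size weight vector."""
    h = hashlib.md5(f"{namespace}={value}".encode()).hexdigest()
    return int(h, 16) % SIZE

# No lookup table in memory: even unseen categorical values map somewhere
idx = feature_index("c", "b1252a9d")
assert 0 <= idx < SIZE
```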


We are going to munge the CSV train and test set into Vowpal Wabbit files (VW files). For this I wrote a Python script.

There are no features to engineer, so we will just put the provided numerical and categorical features neatly into their own namespaces.

A single line from the resulting VW train set (~12GB) now looks like:

-1 '10000000 |i I9:181 I8:2 I1:1 I3:5 I2:1 I5:1382 I4:0 I7:15 I6:4 I11:2 I10:1 I13:2 |c 21ddcdc9 f54016b9 2824a5f6 37c9c164 b2cb9c98 a8cd5504 e5ba7672 891b62e7 8ba8b39a 1adce6ef a73ee510 1f89b562 fb936136 80e26c9b 68fd1e64 de7995b8 7e0ccccf 25c83c98 7b4723c4 3a171ecb b1252a9d 07b5194c 9727dd16 c5c50484 e8b83407

Where -1 means “no click”. '10000000 is the ID. |i is the namespace with the numerical features and |c contains the categorical features.

The test set should look similar.
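If you would rather roll your own munger, the core of the conversion can be sketched like this (a simplified stand-in for my script: the column layout is the competition’s label, I1-I13, C1-C26, and the ID tag after the label is left out for brevity):

```python
def csv_row_to_vw(row):
    """Convert one Criteo train row (label, I1..I13, C1..C26) to a VW line."""
    label = "1" if row[0] == "1" else "-1"
    # Numerical features go into the |i namespace, skipping missing values
    numerical = ["I%d:%s" % (i, v) for i, v in enumerate(row[1:14], 1) if v != ""]
    # Hashed categorical features go into the |c namespace as plain tokens
    categorical = [v for v in row[14:40] if v != ""]
    return "%s |i %s |c %s" % (label, " ".join(numerical), " ".join(categorical))

row = ["0", "1", "5"] + [""] * 11 + ["68fd1e64", "80e26c9b"] + [""] * 24
print(csv_row_to_vw(row))  # -1 |i I1:1 I2:5 |c 68fd1e64 80e26c9b
```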


We then train the model with the following command:

./vw -d click.train.vw -f click.model.vw --loss_function logistic

Where -d click.train.vw says to use our train dataset. -f click.model.vw saves the model. --loss_function logistic is setting our loss function. This runs a single pass over the train dataset. Vowpal Wabbit 7.6.1 reports back an average loss of 0.473238. Certainly not the best possible with VW, but a promising start.

Testing and submitting

We are now ready to predict CTR on ~6 million advertisements from the test set. In Vowpal Wabbit we can do this with:

./vw -d click.test.vw -t -i click.model.vw -p click.preds.txt

Where -t says to test only, -i click.model.vw specifies the input model and -p saves our predictions. (Edit: Thanks to Anuj for spotting that I forgot to specify the model when testing, code above updated.)

After running the above command we should have our predictions in a text file (~100MB). We now want to turn these into probabilities (using a sigmoid function) and submit them to Kaggle in CSV format.

You could use the provided script to do this. The submission scores 0.48059 on the public leaderboard, which is good enough to beat the benchmark and, at the time of writing, will net you second place.
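The gist of that script, in case you want to write your own (file names assumed; with tagged examples VW writes one "score tag" pair per prediction line):

```python
import csv
import math

def sigmoid(x):
    """Squash a raw VW score into a [0, 1] click probability."""
    return 1.0 / (1.0 + math.exp(-x))

def write_submission(preds_path, out_path):
    """Turn VW's raw predictions into a Kaggle CSV submission."""
    with open(preds_path) as preds, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["Id", "Predicted"])
        for line in preds:
            score, ad_id = line.split()  # "score tag" per prediction line
            writer.writerow([ad_id, sigmoid(float(score))])

# write_submission("click.preds.txt", "submission.csv")
```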


All code is available on the MLWave Github account. Improvements and feedback always welcome.


The process for creating the baseline benchmark was:

  • All the features were encoded as sparse binary (standard “one hot” encoding);
  • To reduce the dimensionality, the features that took more than 10,000 different values on the training set were discarded;
  • The learning algorithm was linear logistic regression optimized with l-bfgs;
  • The regularization coefficient was set to 1.
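The “one hot” step of that recipe can be sketched in a few lines (my own toy illustration, not Criteo’s benchmark code):

```python
def one_hot(rows):
    """Encode categorical rows as sparse binary features ("one hot")."""
    vocab = {}
    for row in rows:
        for col, val in enumerate(row):
            key = (col, val)
            if key not in vocab:
                vocab[key] = len(vocab)  # each (column, value) pair gets a slot
    # Each row becomes the set of active (binary) feature indices
    encoded = [{vocab[(c, v)] for c, v in enumerate(row)} for row in rows]
    return encoded, vocab

encoded, vocab = one_hot([["a", "x"], ["b", "x"]])
print(encoded)  # [{0, 1}, {1, 2}]
```

In the real benchmark, columns taking more than 10,000 distinct values were discarded before this step, and the resulting sparse matrix was fed to l-bfgs logistic regression.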

We beat the logistic regression benchmark with our first submission. Vowpal Wabbit truly is an industry-ready tool for machine learning on large and high dimensional datasets.

Now the fun part starts: to optimize the parameters (BTW: Vowpal Wabbit has a utility script for this). Aim for the lowest average logistic loss on your own laptop and produce real business value.

Also, as the models increase in complexity, so do their training times. For now we were able to keep the munging, training and predicting under 1 hour.


I’d like to thank Kaggle for hosting the competition and Criteo for sharing this amazing dataset. FastML for introducing Vowpal Wabbit to my game. And of course all the authors of Vowpal Wabbit, first among them its principal developer: John Langford.

In particular I’d like to thank the Kaggle competitors. When not providing tips on the forums, you are after my leaderboard scores. I know I will probably not be #1 for much longer and I predict that it will only make me work harder, better, faster.

31 thoughts on “Predicting CTR with online machine learning”

  1. Hi,
    I am trying to predict CTR from set of 18 features much similar to Criteo’s CTR Prediction contest on kaggle.

    The data given on kaggle is sampled so that the numbers of 1s (Clicked = True) and 0s (Clicked = False) are almost equal. However, the data I am using contains impression and click data for one day with around 0.4% CTR (i.e. 4 out of 1000 are 1s and the rest 0s). I wanted to ask whether such sampling is compulsory for good prediction. If sampling is necessary, how do I proceed so that I do not lose any pattern? Also, what is a good evaluation metric for such a data set?

    P.S. : I have tried logistic regression using scikit learn and got poor results. However, after sampling the results were a little better but still not satisfactory.


    1. Hi Anuj,

      Class imbalance problems are tricky problems in machine learning. Convergence to a good-enough solution can be a troublesome task for many algos, especially with online learning, where the order of labeled samples matters.

      Sub-sampling is often possible and eases the entire process, and improves the results. For VW something quite similar can be achieved using sample weights / sample importance. You can add a weight after the label; omit the weight and it is assumed to be 1:

      -1 'very_common |f happens all the time
      1 50 'rare_anomaly |f rare thus 50x importance

      This way one keeps all the data in the train set (avoiding subsampling the majority class), while adjusting the importance of learning from the minority class.

      Evaluation metric also depends on the problem context itself (for example: do you care about the negative class? do you want probability estimates, ranks or 0-1 classification? Do you care about false positives?).

      Some metrics can be deceiving when you have a huge class imbalance, like accuracy or AUC. Never predicting the minority class can give scores of 0.99+. This leaves the learning algo little room for improvement. For Kaggle contests with imbalanced data I’ve seen metrics like mean squared error and log loss (log loss is used in this competition, so if you say your problem is similar, I’d research that).
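The accuracy pitfall is easy to demonstrate with toy numbers matching that 0.4% CTR:

```python
# 1,000 impressions at a 0.4% CTR: 4 clicks, 996 non-clicks
labels = [1] * 4 + [0] * 996

# A useless model that never predicts the minority (click) class...
predictions = [0] * 1000

# ...still scores 99.6% accuracy
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.996
```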

      1. Thanks, that will be very helpful.
        However, I thought first I should play with the ‘nicely sampled’ criteo dataset and then move to my data. So, I am starting with VW itself. I am curious: how does the statement for testing:

        ./vw -d click.test.vw -t -p click.preds.txt

        know what model to use for predicting?
        I am a beginner in ML and VW, so excuse me if I have missed something very obvious.

        P.S. : I have used this with the 12GB criteo train set and I am getting all 0s while predicting with test set and corresponding probability as 0.5


  2. ./vw -d click.test.vw -t -p click.preds.txt

    for testing is incorrect. You indeed need to specify the model to use:

    ./vw -d click.test.vw -t -i click.model.vw -p click.preds.txt

    -i for input_model.

    If you are trying this Criteo challenge with VW, just follow along with the benchmark tutorial. It should give probabilities after running the output from VW through a sigmoid function.

    Good luck!

      1. Thank you! I see now I simply forgot to add that in the article under testing. I updated it. Let me know if you spot anything else like this. I will go back to testing all commands before posting :).

    1. Hey, thanks for the entire model, it works fine. How did you calculate the overall score? The overall score is different from the individual probabilities.

    1. Nope, that is not going to happen. Check the forums, maybe someone posts code for R or batch fitting in Python scikit-learn.

  3. Thanks a lot for making your code open. It’s a terrific way for people like me to appreciate the wonder that is Vowpal Wabbit.

    In your code for converting csv to vw, the categorical variables are all clumped together, i.e. you do not make a distinction between C1 and C15 but rather treat all of them as tokens. Am I correct in this interpretation?

    Why does this matter? As you can see here,
    a variable like C9, which takes only 3 values, is giving us some additional information beyond being lumped in with everything else.

    Thanks again

    1. Yes, that is the correct interpretation. I basically lump all these categorical values together as a bag of tokens.

      Note that this is not the soundest way. Kaggler Foxtrot over at FastML has written an article about encoding the categorical features in a correct manner:

      For some unknown reason my wonky way works a bit better than the “right” way as proposed in the FastML article.

      Good luck!

  4. Thanks for the introduction to Vowpal Wabbit. What do you recommend for tuning? Also, how do you call vw-varinfo or top errors? I prepended the command, and I keep getting “command not found”.

    1. I don’t think ./vw is in your PATH (which would let you call it from any directory). Probably the easiest would be to copy the script from /utl/ to the directory where the ./vw executable resides. Then run it from that directory.

      For tuning there is vw-hypersearch and the almighty hunch.

  5. What a great site! I am curious your background – have you formally studied ML?

    I had a quick question on VW if you dont mind. I am trying to wrap my head around it. Lets say that we have a (silly) data set with 3 features.
    1) State (one of the 50 American states)
    2) Weather (three possible values: cloudy, sunny, raining)
    3) Message (a text string/document)

    If you set up your input as (the first example record with a value of 1 for the target variable):

    1 |state newyork |weather cloudy |message This is my first VW attempt

    Will VW perform in effect a bag-of-words representation on ‘message’, while seeing ‘state’ and ‘weather’ as categorical variables that are self-contained?

    Finally, if we use -q stateweather will VW create an interaction between state and weather? While if we used -q messagemessage will VW create interactions between every column in the bag of words representation of message (which would be somewhat nonsensical, but I am just wondering if I understand this yet)?

    Thanks so much if you can provide any clarity!

    1. I have not formally studied ML, unless you count the university of Kaggle :).

      Will VW perform in effect a bag-of-words representation on ‘message’, while seeing ‘state’ and ‘weather’ as categorical variables that are self-contained?

      Yes. I am not 100% sure about using full strings for namespaces (ambiguity), so I’ll substitute them with the first letter.

      1 |s newyork:1 |w cloudy:1 |m This:1 is:1 my:1 first:1 VW:1 attempt:1

      You could do -q sw to create a newyork*cloudy feature interaction. You could do --ngram m3 to create 3-grams from the messages. -q mm would create q-grams on the message (this would include 2-grams, but also 2-grams with all the --nskips). --cubic mmm Would create all 3-grams, but also all other cubic feature interactions. By that point you probably want a huge -b bitsize and regularization.
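Roughly, what -q does can be sketched like this (an illustration of the crossing only, not VW’s actual hashed implementation):

```python
from itertools import product

def quadratic(ns_a, ns_b):
    """Cross every feature in one namespace with every feature in another."""
    return ["%s^%s" % (a, b) for a, b in product(ns_a, ns_b)]

state = ["newyork"]
weather = ["cloudy"]
message = ["This", "is", "my", "first", "VW", "attempt"]

print(quadratic(state, weather))  # ['newyork^cloudy']

# -q mm crosses the message namespace with itself: 6 * 6 = 36 interactions,
# ordered pairs included (both This^is and is^This show up)
print(len(quadratic(message, message)))  # 36
```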

      I think you are indeed close to understanding it (or at least, understanding it better than I do).

      1. That’s great! One question: did you mean to say “would create q-grams on the message” or 2-grams? I assume that -q is not the order of the gram.

        1. q-gram is a fancy word I just made up. -q is not the order of the gram; as I understand it, every possible feature interaction is created.

          1 'notsure |f i give up

          with -q ff would give 2-grams i*give, give*up, but also i*up (which can be seen as a 2-gram with 1 skip). I don’t know if -q also generates the reverse: give*i, up*give. I think so. If you find that out, you already understand it better than me :).

          1. I am not sure :). Since I can’t find this in the documentation it will probably be a perfect question to ask on the mailing list. Or use vw-varinfo and -q on a simple dataset to see what features are used by the model.

            Maybe I’ll collect a bunch of these questions (I have a few more) and make a single post for this. That way we won’t distract Langford (or Faigon) too much when he is busy conning multi-armed bandits :).


  6. A small test seems to confirm that VW does create both a*b and b*a in the interactions. I wonder if that is a bug? I can’t think why you would want that redundancy, or what it means for things like matrix factorization.

    1. I don’t think it’s a bug. With VW it is always a feature :).

      Quadratic or feature^2 would indeed mean all the combinations. Thank you for the test though, as it confirms my intuition about “q-grams”. If it only created half a matrix of interactions, then information about word/token order would get lost:

      “is not” and “not is” would have the same interaction and no way for the model to distinguish.

      1. If I knew more about the hashing trick then this might make more sense to me. I think of it in terms of a one-hot-encoding / bag-of-words model, where is*not and not*is are the same thing. Generally, for combinations of categories, you would not want repeated features like this. I think about factorization models (user / movie interactions) and again think user1*movie1 and movie1*user1 is redundant.
        But maybe the hashing trick makes this different?

  7. I am curious about this: how is such a model used, do you know? Would the various options (values of the variables) be run through the model to see, for a given individual, whether we should give them ad 1 or ad 2 or ad 3 or… ad N?

      1. I am not sure (keep this in mind), but I’ll hazard a guess.

        For ads, latency is important: You get a few milliseconds to start serving relevant ads. So probably it is a fast (linear) model which replies with an ad id in response to some variables.

        There are also ad auctions: companies can make a bid on a certain ad space. They get milliseconds to reply with a bid and when they win, they get to serve their ad.

        Also, with CTR models, concept drift is a big thing: Models get stale fast, and need to be retrained at least every day, maybe even every hour, or real-time.

        I also suspect some caching going on: Your computer or profile gets a unique ID. In response to this ID an ad is served. These ads could be pre-selected (so a different form of testing: it has already calculated the relevant ads for your user ID, and only needs this ID to serve you personalized ads).

        1. Great hazarded guess! 🙂 So such a model, in your thinking, is used to simulate options and essentially take the argmax = the ad choice that is predicted to offer the highest CTR?

          1. Yes, basically.

            Or maybe there is hierarchical sequential modelling going on?

            Very simple models first divide the ads: if male then [list_of_ads]; then a more complex model (still linear, or GBRT) does the testing on 100-1000 ads and you select the ad with argmax (probability of click).

            Langford does logarithmic time prediction:

            Also this Quora QA with Facebook Applied ML Director was relevant:

            For example maximizing the total amount of auction value created by our Ads ranking system, which is powered by an ML system that predicts the relevance of an ad to a person. Auction value will be positively impacted by two things:

            1) Prediction accuracy.
            2) Number of candidate Ads that can be evaluated by the most accurate predictor (typically there are strict latency constraints that motivate cascade approaches where faster but less accurate predictors refine the candidate set).

            We need to maximize accuracy while minimizing the computational effort required to make predictions at serving time.


  8. Hi, you mention featuremasks? I have tried to find out what this means exactly, but it makes very little sense to me for this case. Would you be able to explain what this is, or post a link which might help explain it?

    Many thanks for such a great resource!

    1. vw -d data.vw --l1 0.001 -f features.model
      vw -d data.vw -i features.model --feature_mask features.model -f final.model

      You use L1-regularization on a simple model to find the features that do not contribute any (or just a tiny amount of) signal. Then you use this as a mask for another model. This second model will ignore the meaningless / noisy features found by the first model.
