Good clicklog datasets are hard to come by. Luckily CriteoLabs released a week’s worth of data — a whopping ~11GB! — for a new Kaggle contest. The task is to predict the click-through rate for ads. We will use online machine learning with Vowpal Wabbit to beat the logistic regression benchmark and take the #1 position on the leaderboard.
Update: the highest-scoring team using this benchmark (and cubic features) was Silogram, finishing in 14th place. Our team finished 29th out of 718 competing teams.
tinrtgu posted a very cool benchmark on the forums that uses only standard Python libraries and under 200MB of memory. Now is your chance to play around with online learning, the hash trick, adaptive learning and logistic loss and get a score of ~0.46902 on the public leaderboard.
FastML wrote a blog about this competition with some tips to improve this benchmark.
Optimizing online advertisements with machine learning is like strawberries with chocolate and vanilla: you have large amounts of data, an almost endless variety of features to engineer, and profitable patterns waiting to be discovered.
Identifying (and serving) the ads with a higher probability of a click translates into more profit and higher quality:
- More profit, because in most online advertisement networks, advertisers are only charged when a user clicks on their ads.
- A higher quality, because higher Click-Through Rates (CTR) imply more relevant advertising (giving users what they want).
Behavioral retargeting is a form of online advertising where the advertisements are targeted according to previous user behavior, often in the case where a visit did not result in a sale or conversion.
Where Kagglers may try to optimize AUC (Area Under the Curve), online advertisers generally aim to optimize Return on Investment (ROI):
Return on Investment = Net profit / investment
Optimizing this formula may be done by:
- keeping the investment low, for example by looking for cheap, but effective marketing opportunities.
- increasing net profits, for example by optimizing the conversion ratio (the amount of web users that turn into paying customers).
Behavioral retargeting can increase ROI through:
- Lowering investment. The networks on which to serve the retargeted ads are usually cheaper than the market average.
- Increasing profit. Shopping cart abandonment is a real problem for online retailers. These customers were lost in the latest stages of the conversion chain. Retargeting with products catered to the user wins a certain percentage of them back.
Criteo is a retargeting company working together with online retailers to serve online advertisements.
One of the core products of Criteo is a prediction engine. The engine is capable of predicting effective personalized advertisements on a web scale: real-time optimization of ad campaigns, click-through rates and conversion.
For this prediction engine they are always looking for ways to engineer new informative features.
A behavioral numerical feature may be the count of previous purchases. A behavioral categorical feature may be the product ID that was added to a shopping cart, but not purchased.
The engineering blog also shows how Criteo is creating web graphs (without using web crawlers) to engineer new features.
The web as seen by Criteo [from the Criteo Engineering blog]
Show us the data!
For this contest Criteo’s R&D division, CriteoLabs, has released a week’s worth of click data. CriteoLabs is one of the biggest European R&D divisions with a focus on predictive advertising.
The train set has 45,840,616 labelled ads (click-through “1” or not “0”). The test set has 6,042,135 unlabelled ads. Both sets are sampled to have a near-even distribution of clicks and non-clicks.
We have 13 columns of integer features (mostly count features) and 26 columns of hashed categorical features, among them:
- Publisher features, such as the domain of the url where the ad was displayed;
- Advertiser features (advertiser id, type of products,…)
- User features, for instance browser type;
- Interaction of the user with the advertiser, such as the number of the times the user visited the advertiser website.
Our task now is to create a model that will predict the probability of a click.
Online Machine Learning with VW
I think that sticking with your favorite machine learning tool or algorithm for every classification and regression problem is like picking a single chess opening and playing it against all opponents. You may get lucky and pick a solid opening that does well most of the time, or you may be stuck with a quirky, outdated, or peculiar one.
Vowpal Wabbit is how Elmer Fudd would pronounce Vorpal Rabbit. I picture the Killer Rabbit of Caerbannog brandishing a +5 vorpal sword. It would give any player an unfair advantage if they found it early on in the game.
One, two! One, two! And through and through / The vorpal blade went snicker-snack! / He left it dead, and with its head / He went galumphing back. — Jabberwocky
If you follow my blogs, you’ll know that I’ve tried Vowpal Wabbit on many Kaggle competitions. It won’t be the top performer in every competition, but it can handle most tasks (Multiclass Classification, Regression, online LDA, Matrix Factorization, Structured Prediction, Neural network reduction, Feature interactions), and it is a robust addition to many ensembles.
This competition is made for fast online machine learning (with Vowpal Wabbit or Sofia-ML). Models may quickly become outdated, so they will probably need constant retraining (we need fast convergence). The collected click data is huge (often far larger than fits into memory) or unbounded (new categorical feature values keep arriving).
For a deeper understanding of large-scale online machine learning, watch this fine video tutorial with John Langford. I can also suggest the seminal paper “A Reliable Effective Terascale Linear Learning System”, one of whose authors is a competition admin.
Picking a loss function
Online machine learning with VW learns from samples one at a time: training a model means iterating through the train dataset while optimizing a loss function.
Vowpal Wabbit has five loss functions:
- Squared loss. Useful for regression problems, when minimizing expectation. For example: Expected return on a stock.
- Classic loss. Vanilla squared loss (without the importance weight aware update).
- Quantile loss. Useful for regression problems, for example: predicting house pricing.
- Hinge loss. Useful for classification problems, minimizing the yes/no question (closest 0-1 approximation). For example: Keyword_tag or not.
- Log loss. Useful for classification problems, minimizer = probability, for example: Probability of click on ad.
Without seeing the evaluation page of the competition, our hunch should be to pick logarithmic loss. That the competition metric really is logarithmic loss means we gain a lot of information just by training a model with Vowpal Wabbit: with its holdout functionality, one in ten samples is used to calculate and report the average loss.
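As a quick illustration of why this metric fits click prediction (a hypothetical sketch, not VW’s internal code): log loss stays small for confident correct probabilities and explodes for confident wrong ones.

```python
import math

def log_loss(y, p, eps=1e-15):
    """Logarithmic loss for a single example.
    y: true label (1 = click, 0 = no click)
    p: predicted click probability
    """
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction is cheap...
print(round(log_loss(1, 0.9), 4))  # → 0.1054
# ...a confident wrong prediction is punished heavily.
print(round(log_loss(0, 0.9), 4))  # → 2.3026
```

This is why a well-calibrated probability, not just a yes/no guess, is what the metric rewards.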
Vowpal Wabbit is fast thanks to its compiled C++ code, but also because it employs the hashing trick. Our feature space is fixed in size by the number of hash bits, so we do not need an in-memory lookup table to convert text or categorical values into numerical features.
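To see why no lookup table is needed, here is a minimal sketch of the hashing trick (illustrative Python only — VW itself uses murmurhash and its own bit-count flag):

```python
NUM_BITS = 18                 # feature space fixed to 2^18 slots
NUM_FEATURES = 2 ** NUM_BITS

def hash_feature(feature, value=1.0):
    """Map an arbitrary string feature to a fixed-size index.
    No dictionary is kept: the hash *is* the lookup."""
    index = hash(feature) % NUM_FEATURES
    return index, value

# Categorical values hash straight into the weight vector,
# so feature values never seen during training still get a slot.
idx, val = hash_feature("c^68fd1e64")
assert 0 <= idx < NUM_FEATURES
```

The trade-off is hash collisions, but with enough bits they rarely hurt linear models in practice.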
We are going to munge the CSV train and test set to Vowpal Wabbit files (VW files). For this I wrote a Python script csv_to_vw.py.
There are no features to engineer, so we will just put the provided numerical and categorical features neatly into their own namespaces.
A single line from the resulting VW train set (~12GB) now looks like:
-1 '10000000 |i I9:181 I8:2 I1:1 I3:5 I2:1 I5:1382 I4:0 I7:15 I6:4 I11:2 I10:1 I13:2 |c 21ddcdc9 f54016b9 2824a5f6 37c9c164 b2cb9c98 a8cd5504 e5ba7672 891b62e7 8ba8b39a 1adce6ef a73ee510 1f89b562 fb936136 80e26c9b 68fd1e64 de7995b8 7e0ccccf 25c83c98 7b4723c4 3a171ecb b1252a9d 07b5194c 9727dd16 c5c50484 e8b83407
-1 means “no click”.
'10000000 is the ID.
|i is the namespace with the numerical features and
|c contains the categorical features.
The test set should look similar.
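The munging itself is straightforward. Here is a simplified sketch of what a script like csv_to_vw.py has to do (the column names Id, Label, I1..I13, C1..C26 are assumptions based on the competition CSV, not the exact code on Github):

```python
import csv

def csv_row_to_vw(row):
    """Convert one labelled CSV row (a dict) to a VW line.
    Assumed columns: Id, Label, I1..I13, C1..C26."""
    label = "1" if row["Label"] == "1" else "-1"
    # Numerical features: skip empty values, keep name:value pairs.
    numerical = " ".join(
        "I%d:%s" % (i, row["I%d" % i])
        for i in range(1, 14) if row["I%d" % i] != "")
    # Categorical features: the hashed value itself is the feature name.
    categorical = " ".join(
        row["C%d" % i] for i in range(1, 27) if row["C%d" % i] != "")
    return "%s '%s |i %s |c %s" % (label, row["Id"], numerical, categorical)

# Usage sketch:
# with open("train.csv") as fin, open("click.train.vw", "w") as fout:
#     for row in csv.DictReader(fin):
#         fout.write(csv_row_to_vw(row) + "\n")
```

Note how empty values are simply dropped: VW handles sparse lines natively, so missing features cost nothing.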
We then train the model with the following command:
./vw -d click.train.vw -f click.model.vw --loss_function logistic
-d click.train.vw says to use our train dataset.
-f click.model.vw saves the model.
--loss_function logistic sets our loss function. This command runs a single pass over the train dataset. Vowpal Wabbit 7.6.1 reports back an average loss of 0.473238. Certainly not the best possible with VW, but a promising start.
Testing and submitting
We are now ready to predict CTR on ~6 million advertisements from the test set. In Vowpal Wabbit we can do this with:
./vw -d click.test.vw -t -i click.model.vw -p click.preds.txt
-t says to test only,
-i click.model.vw specifies the input model and
-p saves our predictions. (Edit: Thanks to Anuj for spotting that I forgot to specify the model when testing, code above updated.)
After running the above command we should have our predictions in a text file (~100MB). We now want to turn these into probabilities (using a sigmoid function) and submit them to Kaggle in CSV format.
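The conversion is one line of math per prediction. A minimal sketch (hypothetical code; the “raw_score id” per-line format of the prediction file is an assumption, as is the Id,Predicted submission header):

```python
import math

def sigmoid(x):
    """Map a raw VW margin to a click probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def preds_to_submission(preds_path, out_path):
    """Turn VW predictions into a Kaggle-style CSV submission."""
    with open(preds_path) as fin, open(out_path, "w") as fout:
        fout.write("Id,Predicted\n")
        for line in fin:
            score, row_id = line.split()
            fout.write("%s,%f\n" % (row_id, sigmoid(float(score))))
```

A raw score of 0 maps to a 50% click probability; large positive scores approach 1 and large negative scores approach 0.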
You could use the provided script vw_to_kaggle.py to do this. The submission scores 0.48059 on the public leaderboard, which is good enough to beat the benchmark and, at the time of writing, nets you second place.
All code is available on the MLWave Github account. Improvements and feedback always welcome.
For comparison, the logistic regression benchmark was built as follows:
- All the features were encoded as sparse binary (standard “one-hot” encoding);
- To reduce the dimensionality, the features that took more than 10,000 different values on the training set were discarded;
- The learning algorithm was linear logistic regression optimized with l-bfgs;
- The regularization coefficient was set to 1.
We beat the logistic regression benchmark with our first submission. Vowpal Wabbit truly is an industry-ready tool for machine learning on large and high dimensional datasets.
Now the fun part starts: to optimize the parameters (BTW: Vowpal Wabbit has a utility script for this). Aim for the lowest average logistic loss on your own laptop and produce real business value.
Also as the models increase in complexity, so do their training times. For now we were able to keep the munging, training and predicting under 1 hour.
I’d like to thank Kaggle for hosting the competition and Criteo for sharing this amazing dataset. FastML for introducing Vowpal Wabbit to my game. And of course all the authors of Vowpal Wabbit, in particular its principal developer, John Langford.
In particular I’d like to thank the Kaggle competitors. When not providing tips on the forums, you are after my leaderboard scores. I know I will probably not be #1 for much longer and I predict that it will only make me work harder, better, faster.