Good clicklog datasets are hard to come by. Luckily CriteoLabs released a week’s worth of data — a whopping ~11GB! — for a new Kaggle contest. The task is to predict the click-through rate for ads. We will use online machine learning with Vowpal Wabbit to beat the logistic regression benchmark and get a no. 1 position on the leaderboard.
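If you want to play along at home, here is a minimal sketch (my own illustration, not the exact pipeline described in this post) that converts one line of the Criteo training file into Vowpal Wabbit's input format; it assumes the released layout of a label followed by 13 integer and 26 categorical columns, all tab-separated.

```python
import sys

def to_vw(line):
    """Convert one tab-separated Criteo row to a Vowpal Wabbit example.

    Assumed layout: label, 13 integer features, 26 categorical features.
    Missing values (empty strings) are simply dropped.
    """
    cols = line.rstrip("\n").split("\t")
    label = "1" if cols[0] == "1" else "-1"  # VW's logistic loss expects -1/1 labels
    ints = ["i%d:%s" % (i, v) for i, v in enumerate(cols[1:14], 1) if v != ""]
    cats = ["c%d_%s" % (i, v) for i, v in enumerate(cols[14:40], 1) if v != ""]
    return "%s |i %s |c %s" % (label, " ".join(ints), " ".join(cats))

if __name__ == "__main__":
    for line in sys.stdin:
        print(to_vw(line))
```

The converted file can then be fed to vw with --loss_function logistic to train an online logistic regression baseline.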
Updates
A demo with data from this contest has been added to Vowpal Wabbit. Now that the contest is over, go here if you want to download the dataset freely made available by Criteo.
The winning team also won the subsequent Avazu CTR prediction challenge and released Field-Aware Factorization Machines.
The winning team used a mixture of Factorization Machines and GBRT. Code here.
The highest scoring team using Vowpal Wabbit was Guocong Song, in 3rd place. Method and code here. In short: multiple models, polynomial learning, and feature masks.
The highest scoring team using this benchmark (and cubic features) was Silogram, in 14th place.
Our team got 29th place out of 718 competing teams.
tinrtgu posted a very cool benchmark on the forums that uses only standard Python libraries and under 200MB of memory. Now is your chance to play around with online learning, the hashing trick, adaptive learning rates, and logistic loss, and get a score of ~0.46902 on the public leaderboard (a minimal sketch of these ingredients follows after this list).
FastML wrote a blog post about this competition with some tips for improving this benchmark.
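To make that last benchmark a little more concrete, here is a minimal sketch (my own, not tinrtgu's actual code) of the ingredients it combines: online logistic regression trained one example at a time, the hashing trick to keep memory bounded, and an adaptive per-feature learning rate.

```python
import math

D = 2 ** 20      # size of the hashed feature space
alpha = 0.1      # base learning rate
w = [0.0] * D    # model weights
n = [0.0] * D    # accumulated squared gradients per feature

def hash_features(row):
    # row: dict of raw column name -> raw value (strings); every pair
    # becomes one binary feature, hashed into D buckets
    return [hash(k + "_" + v) % D for k, v in row.items()]

def predict(x):
    # probability of a click, via a bounded sigmoid over the active weights
    wx = sum(w[i] for i in x)
    return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

def update(x, p, y):
    # gradient of logistic loss w.r.t. each active binary feature is (p - y)
    g = p - y
    for i in x:
        n[i] += g * g
        w[i] -= g * alpha / (math.sqrt(n[i]) + 1.0)  # AdaGrad-style adaptive step

# Per training example: x = hash_features(row); p = predict(x); update(x, p, y)
```

Because every raw feature is hashed into a fixed 2**20-dimensional space, memory stays constant no matter how many distinct categorical values the click log contains.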
This competition, optimizing online advertisements with machine learning, is like strawberries with chocolate and vanilla: you have large amounts of data, an almost endless variety of features to engineer, and profitable patterns waiting to be discovered.