It’s been a year since I joined Kaggle for my first competition. Back then I didn’t know what an Area Under the Curve was. How did I manage to predict my way to Kaggle Master?
Toying with datasets and tools
I was already downloading datasets from Kaggle purely for my own entertainment and study before I started competing. Kaggle is one of the few places on the internet where you can get quality datasets in the context of a commercial machine learning problem.
The dataset for the “Amazon.com – Employee Access Challenge” was one of the first datasets that caught my eye. This post on using Vowpal Wabbit as a classifier on the MNIST dataset with good results made me interested in studying VW:
Or maybe not; such a simple (linear) algorithm really has no right being so good for a problem like this…
The Kaggle forums
It was Miroslaw Horbal who really got me hooked on the Kaggle forums with his post: How to achieve 0.90AUC with Logistic Regression. The thread would amass many replies, all in a wonderfully sharing spirit. I’d revisit the thread long after the contest had ended, to read up on the cross-validation loop and parameter tuning approaches.
Competing on Kaggle
StumbleUpon Evergreen Classification Challenge
This was my first contest. After reading “A bag of words and a nice little neural network” on FastML I felt confident enough to try out Vowpal Wabbit and TfidfVectorizer. It got me a competitive score, but not yet top 25%.
That all changed when Abhishek Thakur shared his beat the benchmark post on the forums. Suddenly I had a solid NLP solution in hand, using Pandas, Numpy and SKlearn to put a really good score on the public leaderboard. I really dissected that script, familiarizing myself with every line. I later read that participants in Data Science bootcamps use this benchmark code too when working on this challenge.
After a few weeks I found a way to improve the score by using WordnetLemmatizer from NLTK. I shared this approach in the thread.
I finished top 25%. A few submissions would have scored top 10% had I selected these.
- The Bag-of-words approach to NLP works
- Vowpal Wabbit and Scikit-learn are practical ML libraries
- Kaggle trick: Fitting TFIDF on combined train and test set improves score
- You can learn a lot from the Kaggle competition forums
- Sharing on the Kaggle forums is fun and welcomed
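The TF-IDF trick from the list above can be sketched without any library. This is a minimal illustration, not the benchmark code from the forum: the tokenizer, the smoothed IDF formula and the toy documents are all assumptions made for the example. The point is only that document frequencies computed over train and test text together (no test labels involved) cover vocabulary that a train-only fit never sees.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would do more cleaning.
    return text.lower().split()


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def transform(doc, idf):
    """Weight raw term counts by IDF; tokens unknown to the fit are dropped."""
    counts = Counter(tokenize(doc))
    return {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}


# Hypothetical toy data standing in for the competition text.
train_docs = ["easy pasta recipe", "quick pasta dinner"]
test_docs = ["breaking news today", "pasta news"]

idf_train_only = fit_idf(train_docs)
idf_combined = fit_idf(train_docs + test_docs)

test_vec_train_only = transform("pasta news", idf_train_only)
test_vec_combined = transform("pasta news", idf_combined)

print(sorted(test_vec_train_only))  # "news" is missing from this vector
print(sorted(test_vec_combined))    # "news" now gets a weight
```

Whether this is acceptable outside a competition is a separate question, since it relies on having the test text available at fit time.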
Partly Sunny with a Chance of Hashtags
This was a fun contest, courtesy of CrowdFlower, where the task was to predict weather events from the text in tweets. Using Vowpal Wabbit and csoaa I was able to produce multi-class multi-label predictions. I was never really competitive in this competition.
Tim Dettmers’ post How to get started in Python/SKlearn did not change that. The “general outline” post seemed to give the Kaggle experts a lot of information, but was less useful to me at the time. For example, Tim did not specify the algorithm used, yet his approach would annihilate my VW score.
I finished in 86th place.
- Insights from similar Kaggle competitions stack up.
- Even when not producing competitive results one is still learning more about the tools and algorithms.
- A Kaggle competition is a game of optimization: every other decent contestant will try out the same algorithms.
Galaxy Zoo – The Galaxy Challenge
I participated in this contest to classify the morphology of distant galaxies, until the train and test datasets were updated and my submissions were removed. I’d never done an object recognition task before.
Taking inspiration from the “central pixels” Go benchmark posted by admin Joyce and foolhardily using insights from the previous NLP competitions, I tried a bag-of-words approach and Gensim’s Document Similarity class.
The Frankenstein brute-force cosine similarity search on the thousands of image blocks took multiple days to compute. The final model, with a public leaderboard rank of around 50, took me 4.5 days to generate predictions with (well, actually over a week… when an update forced a restart).
- I built something akin to what I later learned is a KNN classifier with distance weighting: calculate the closest image neighbors and weigh their labels according to distance.
- Bag-of-words can even work in image classification tasks
- I can (hope to) beat Kaggle leaderboard benchmarks in tasks I’ve never done before.
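For readers who have not met distance-weighted nearest neighbors, here is a small sketch of the idea described in the list above: find the most similar training items by cosine similarity and let each neighbor vote for its label in proportion to that similarity. It is not the original Gensim-based code; the function names, the two-dimensional vectors and the "spiral"/"elliptical" labels are made up for illustration.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_knn_predict(query, train_vectors, train_labels, k=3):
    """Return {label: share of similarity-weighted votes} over the k nearest."""
    scored = sorted(
        ((cosine_similarity(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, label in scored:
        # Negative similarities contribute nothing to the vote.
        votes[label] += max(similarity, 0.0)
    total = sum(votes.values())
    if total == 0.0:
        return {}
    return {label: weight / total for label, weight in votes.items()}


# Hypothetical feature vectors; in the contest these would be image-block features.
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["spiral", "spiral", "elliptical", "elliptical"]

prediction = weighted_knn_predict([1.0, 0.2], train_vectors, train_labels, k=3)
print(prediction)
```

The brute-force scan over every training vector is exactly what makes this slow on large image sets, which matches the multi-day runtime mentioned above.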
Facebook Recruiting III – Keyword Extraction
Here the task was to recommend tags for StackOverflow questions. With 30 days left I took up William Cukierski’s informal challenge on the forums: Create a good model and do training and testing with less than 1GB of RAM and in under 1 hour. In contrast, some winner(s) used over 80GB of disk-swapped memory and 14 hours to build the model using millions of questions. See: Share your approach?
With 200k possible tags to choose from even VW was too slow for this on a single machine. So for this challenge I had to write my own algorithm. The first algorithmic approach was very simple: remove duplicates using hash-tables, predict a tag when that tag is mentioned anywhere in the title:
How do you install PHP 5.2 on IIS 7? tags predicted: php iis
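The two steps described above (a hash-table lookup for duplicates, then matching known tags against the title) could look roughly like this. It is a sketch under assumptions: the normalization, the MD5 key, the tokenizer regex and the tiny `known_tags` and `train_rows` samples are illustrative and not the original contest code.

```python
import hashlib
import re


def title_key(title):
    """Hash a whitespace- and case-normalized title for dictionary lookup."""
    normalized = " ".join(title.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def tags_from_title(title, known_tags):
    """Predict every known tag that occurs as a token in the title, in order."""
    tokens = re.findall(r"[a-z0-9+#.\-]+", title.lower())
    predicted = []
    for token in tokens:
        token = token.strip(".")  # drop sentence-ending dots, keep "5.2"
        if token in known_tags and token not in predicted:
            predicted.append(token)
    return predicted


def predict_tags(title, train_lookup, known_tags):
    # A question already seen in train simply reuses its known tags.
    key = title_key(title)
    if key in train_lookup:
        return train_lookup[key]
    return tags_from_title(title, known_tags)


# Hypothetical sample data.
known_tags = {"php", "iis", "python", "c#"}
train_rows = [("Why is my C# loop slow?", ["c#", "performance"])]
train_lookup = {title_key(title): tags for title, tags in train_rows}

print(predict_tags("How do you install PHP 5.2 on IIS 7?", train_lookup, known_tags))
print(predict_tags("why is my c# loop  slow?", train_lookup, known_tags))
```

Storing only a hash per question keeps the lookup table small, which matters when the memory budget is the constraint.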
Using the duplicates in train and test set and this simple tag look-up method I ranked around #20 in 3 minutes. By then, not many had found these duplicates.
After the duplicates became known to all contestants this score dropped significantly against more advanced Bayesian methods.
In the end I managed to score around top 33% with a home-made solution that ran in under 1 hour and used 1.6GB of memory. Still pretty proud about that one.
- One can use batch tf-idf fitting on huge datasets
- Count all the things, just don’t count ‘em twice. Keeping just token count dictionaries (and occasional pruning) is a mighty powerful approach when coupled with probabilistic algorithms. Check for duplicates within train, within test, and between train and test.
- Have some fun with the good datasets (generate topics, look at your predictions, invent simple algorithms to run over it)
- Speed, though not often measured in these competitions, is a huge benefit and something to aim for. It allows for faster iterations and scalable good-enough solutions.
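As an illustration of the "count dictionaries plus pruning" lesson above, the sketch below keeps only token counts and token-to-tag co-occurrence counts, drops rare tokens, and scores tags by an averaged estimate of P(tag | token). The scoring rule, the `min_count` threshold and the sample rows are assumptions for the example; the original solution may well have combined the counts differently.

```python
from collections import Counter, defaultdict


def count_cooccurrences(rows):
    """rows: iterable of (tokens, tags). Each token is counted once per row."""
    token_counts = Counter()
    token_tag_counts = defaultdict(Counter)
    for tokens, tags in rows:
        for token in set(tokens):
            token_counts[token] += 1
            for tag in tags:
                token_tag_counts[token][tag] += 1
    return token_counts, token_tag_counts


def prune(token_counts, token_tag_counts, min_count=2):
    """Drop tokens seen in fewer than min_count rows to bound memory use."""
    rare = [tok for tok, count in token_counts.items() if count < min_count]
    for token in rare:
        del token_counts[token]
        token_tag_counts.pop(token, None)


def score_tags(tokens, token_counts, token_tag_counts):
    """Average the estimated P(tag | token) over the tokens still tracked."""
    scores = Counter()
    seen = [tok for tok in set(tokens) if tok in token_counts]
    for token in seen:
        for tag, count in token_tag_counts[token].items():
            scores[tag] += count / token_counts[token] / len(seen)
    return scores


# Hypothetical training rows: (title tokens, tags).
rows = [
    (["install", "php", "iis"], ["php", "iis"]),
    (["php", "array", "sort"], ["php"]),
    (["python", "array", "sort"], ["python"]),
]
token_counts, token_tag_counts = count_cooccurrences(rows)
prune(token_counts, token_tag_counts, min_count=2)

scores = score_tags(["php", "sort"], token_counts, token_tag_counts)
print(scores.most_common(3))
```

Because everything is a dictionary update, the counts can be built in a single streaming pass over the data, with pruning run every so often.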
Personalized Web Search Challenge
This contest was organized by Yandex and featured a massive SERP clicklog dataset. The task was to improve the quality of these search results through personalization.
I had a very simple idea: re-order results based on number of clicks by a searcher. If a search for “widget” would get 10 clicks on the 3rd result and 1 click on the first result, then re-order based on that.
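The re-ordering idea is small enough to sketch in a few lines. The log format here, a list of `(user_id, query, url)` click tuples, is an assumption for the example and much simpler than the actual Yandex log. Results with equal click counts keep their original order because Python's sort is stable.

```python
from collections import Counter


def rerank_by_clicks(results, click_log, user_id, query):
    """Move results this user clicked most for this query to the top."""
    clicks = Counter(
        url for uid, q, url in click_log if uid == user_id and q == query
    )
    # Stable sort: unclicked results stay in their original relative order.
    return sorted(results, key=lambda url: -clicks[url])


# Hypothetical data: 10 clicks on the 3rd result, 1 click on the 1st.
results = ["first.example", "second.example", "third.example"]
click_log = [("u1", "widget", "third.example")] * 10
click_log += [("u1", "widget", "first.example")]

print(rerank_by_clicks(results, click_log, "u1", "widget"))
print(rerank_by_clicks(results, click_log, "u2", "widget"))
```

A searcher without click history gets the original ranking back unchanged.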
Execution was a little more difficult, given the custom log parsing and the huge dataset size. Using a database and querying it I managed to produce a reasonable solution.
- There are people who do this specific ML task for a living and then you simply have little chance of beating them. When a team from Yandex joined, they blew the competition away.
- Using an extremely simple hunch, you have a decent chance of beating people who do general ML tasks for a living, provided you manage to execute.
Getting halfway decent
By now I was doing most available Kaggle contests, not top 25% material, but still managing to provide halfway decent solutions in the majority of these.
I wrote a Python benchmark for the Connectomics challenge, the Acquire Valued Shoppers Challenge, DecMeg2014 – Decoding the Human Brain, Criteo Ad Click Prediction, Forest Cover Type detection challenge and Movie Review Sentiment Analysis.
I did well in a few Kaggle InClass competitions and I got two top 25% positions in forecasting challenges. I was not familiar with forecasting challenges yet, so in both the PAKDD 2014 and Walmart Recruiting challenges I got lucky with simple hunches, instead of using the more advanced forecasting algorithms.
For the PAKDD 2014 Asus challenge the task was to predict the number of repairs each month for computer parts. I noticed when graphing the monthly repairs that they decayed roughly exponentially, so the logarithm of the counts fell on a nearly straight line. So I simply did a math.log() on the repair values and fitted a straight line to the result. Minor tweaking of slope and elevation gave a top 25% position.
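The log-then-fit-a-line hunch can be reproduced with ordinary least squares on the logged counts. The repair numbers below are synthetic and the helper names are invented for this sketch; note that months with zero repairs would need special handling, since math.log(0) is undefined.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic monthly repair counts that shrink by 30% each month.
months = [0, 1, 2, 3, 4, 5]
repairs = [100.0 * 0.7 ** month for month in months]

# Fit the straight line in log space.
slope, intercept = fit_line(months, [math.log(r) for r in repairs])


def forecast(month):
    # Undo the log transform to get back to repair counts.
    return math.exp(intercept + slope * month)


print(round(forecast(6), 3))
```

Tweaking `slope` and `intercept` by hand afterwards corresponds to the "slope and elevation" tuning mentioned above.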
With the Walmart Recruiting challenge the task was to predict sales on certain days. You could get a top 25% position by predicting the sales from last year for that day. One could further increase the score by “leaderboard-validating”:
if "storeid" == 103 and "departmentid" == 18 then sales = sales * 1.05
if "storeid" == 103 and "departmentid" == 18 then sales = sales * 0.95.
I quickly grew tired of this, but judging from the number of submissions made by the top contestants, this would have worked.
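The "sales from last year for that day" baseline is easy to write down. In this sketch the history is a plain dictionary keyed by `(store_id, dept_id, date)`, and "last year" is taken as 52 weeks earlier so that the weekday lines up; both choices are assumptions for the example rather than details given in the text.

```python
from datetime import date, timedelta


def last_year_baseline(history, store_id, dept_id, day, default=None):
    """Predict sales as the value observed 52 weeks before `day`."""
    same_weekday_last_year = day - timedelta(weeks=52)
    return history.get((store_id, dept_id, same_weekday_last_year), default)


# Hypothetical history with a single known observation.
history = {(103, 18, date(2011, 11, 25)): 1000.0}

print(last_year_baseline(history, 103, 18, date(2012, 11, 23)))
print(last_year_baseline(history, 103, 19, date(2012, 11, 23)))
```

A baseline like this is also a useful sanity check for any fancier forecasting model.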
I saw a massive drop from Public leaderboard to Private leaderboard in the MLSP Schizophrenia detection challenge (91AUC to 77AUC). My best submission, and one of the first submissions made, scored 89AUC or top 10. I had picked both my best Public model and a model with 82.5AUC, which I thought to be most robust. I was horribly wrong. With 86 samples in the train set, possibly not unforgivable.
Due to planning issues I did not have enough time to generate the test set with my own improved benchmark code in the Chalearn Connectomics challenge. This taught me the need for stricter planning.
I messed up model selection in the Allstate competition when I did too many things in the last minute. Instead of selecting a RF and rule-based model built on many hours of trial and error competing for top 10%, I mistakenly picked a fluke model and did not even get top 25% (even my early benchmark submission would have given a top 25%).
I became Kaggle Master mostly through ensemble learning, team work, sharing, powerful ML tools and the law of large numbers.
It would be very nice to win one of the competitions, and to consistently place top 10% in them. My teams are ranking well in the currently running Higgs Boson ML challenge and the Criteo Display Ad challenge. Also, with few submissions, I am ranking decently in the Avito Detect Prohibited Content challenge.
Though I am a Kaggle Master now, I am nowhere near the skill level of most Kaggle Masters I met on this short wonderful journey. There is plenty of room for improvement, as I haven’t even touched tools like PyLearn2, Torch or Theano yet and VW and Sklearn are adding new and exciting features every release.
I think I learned a lot and progressed a lot competing on Kaggle for a year. I am now starting to get a grasp on the more common machine learning problems. If you want to become a Kaggle Master too, next to reading this really insightful article “Learning from the top Kagglers”, I can give these tips:
Practice a lot. Do as many challenges as you can generate a submission for. This buckshot approach will incrementally increase your skills, while hitting the bulls-eye once or twice if you find a good optimization or even something you are good at. You’ll probably make some mistakes too, like overfitting to the leaderboard. This is ok, provided you learn from this.
Study evaluation metrics. Try to really understand AUC. What are you optimizing exactly? To produce a good evaluation you need a thorough understanding of the evaluation metric.
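One way to really understand AUC is to compute it by hand as a ranking statistic: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below does exactly that. It is quadratic in the number of samples, so it is meant for building intuition rather than for scoring large submissions, and the labels and scores are made up.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]

print(auc(labels, scores))                     # 0.75
print(auc(labels, [s * 10 for s in scores]))   # 0.75
```

The second call shows what is being optimized: only the ordering of the predictions matters, not their scale or calibration.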
Study the problem domain. Read up on a few business cases and academic papers that mention problems related to the competition problem. What is the state-of-the-art like? You can also get inspiration for feature engineering this way.
Team up. For Kaggle Master you need a top 10% and a top 10 finish. Especially those top 10 finishes are hard, when you don’t know the domain and are competing against other teams of Kaggle Masters. Team up often: to learn how to work as a team on Kaggle competitions, and to meet with others for future co-operations.
Read those forums. Especially the post-competition threads. Take careful note of solutions and approaches. Revisit these threads when similar competitions arise. For example the KDD-cup 2012 Ad click prediction challenge is mighty similar to the currently running Criteo Ad Prediction challenge.
Share on the forums. It can help with teaming up. It requires you to think about the problem and your solution for it, from many angles and user perspectives. Though sharing too much can hurt your chances of a good score, as people will just take your approach and ensemble it, I still think my abundant sharing on the forums contributed a lot to my Kaggle Master status.
Ensemble learning. Read up on this and apply it. When done right, it nearly always works. When done expertly, it creates top ~10 submissions with fairly dumb models. Practically I learned about stacking mostly from dissecting code from Emanuele Olivetti and about rank averaging from discussions with KazAnova (Marios Michailidis). Short, but sweet: The entire sklearn.ensemble module is golden.
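Rank averaging, mentioned above, is one of the simpler ensembling techniques to try first: convert each model's predictions to ranks, then average the ranks instead of the raw scores. This makes models with very different score scales comparable, which fits rank-based metrics such as AUC. The sketch below is a generic version with assumed function names and toy predictions, not code from the people credited above.

```python
def to_ranks(scores):
    """Map scores to ranks scaled into [0, 1]; tied scores share a rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    denominator = max(len(scores) - 1, 1)
    return [rank / denominator for rank in ranks]


def rank_average(predictions):
    """predictions: one list of scores per model, all in the same row order."""
    ranked = [to_ranks(model_scores) for model_scores in predictions]
    return [sum(column) / len(column) for column in zip(*ranked)]


# Two hypothetical models with very different score scales.
model_a = [0.10, 0.20, 0.90]
model_b = [0.55, 0.50, 0.60]

print(rank_average([model_a, model_b]))  # [0.25, 0.25, 1.0]
```

Stacking goes a step further by training a second model on the first-level predictions, but that needs out-of-fold predictions to avoid leaking the target.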
Experiment. Want to know if you can do Random Forests on bag of words? Find out the point where sparsity, test size and dimensionality start to hamper using RFs? Want to know if RFs are faster to train than logistic regression? Want to find out if you can train RFs on subsampled train chunks and different features and get a reasonable performance?
The answer is always experiment. Try all, discard the bad, keep the good. Then go back to exploring after a year of new knowledge.
Creativity. You can think inside the box, for example: “This is binary classification with predicted probabilities, so I use logistic regression”. There is black box thinking: “I put labels and features here, I expect good output there, do it any way you wanna!”.
Outside the box creative thinking is more intangible, but can be a huge benefit: you at least have a chance of beating the more proficient competitors using a common predictable approach.
Pick the right tools and approaches for the job. Before you start tuning individual models or go down a creative rabbit hole you want to find at least a couple of sane algorithmic approaches.
Hyperparameter tuning. Build a CV loop at least once. Tune every last bit of performance from an SVM. This will hopefully teach you where not to look and hone intuition. There are (more extreme) situations where a learning rate is set incredibly small or high. Normally it would be a waste of time to search in these ranges.
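Building a CV loop by hand once is a good exercise, and it does not need much code. The sketch below uses a one-feature ridge regression as a stand-in model so that it runs with the standard library alone; with an SVM the loop would be the same, only the fit and scoring functions would change. The data is synthetic and the grid of `alpha` values is arbitrary.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the row indices once and deal them into n_folds groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_ridge_slope(xs, ys, alpha):
    """One-feature ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mean_squared_error(xs, ys, slope):
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, alpha, n_folds=5):
    """Average held-out error over the folds for one hyperparameter value."""
    errors = []
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope = fit_ridge_slope(
            [xs[i] for i in train], [ys[i] for i in train], alpha
        )
        errors.append(mean_squared_error(
            [xs[i] for i in held_out], [ys[i] for i in held_out], slope
        ))
    return sum(errors) / len(errors)


# Synthetic data: y is roughly 2 * x plus a little noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

grid = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {alpha: cross_validate(xs, ys, alpha) for alpha in grid}
best_alpha = min(cv_scores, key=cv_scores.get)
print(best_alpha, cv_scores[best_alpha])
```

Looking at the whole `cv_scores` dictionary, rather than only the winner, is what builds the intuition about which ranges are not worth searching.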
Have some fun along the way. We machine learning lovers live in exciting times. I think we are on the cusp of a new field of technology, with data science and data engineering at the forefront. It’s a nice budding community online, with places like Kaggle and DataTau. And some very powerful tools are being released for anyone to play with.
The intro image is from Wikimedia Commons and depicts the Floriani Tower in Kraków, created by user Silar