Kaggle hosts In Class contests that are free for everyone to join. The problems can be simpler than the main competition problems, so they offer a lot of opportunity to experiment and learn. I'll walk you through two competitions that dealt with spam, and tell you how I won them.
Fascination with spam
I have an unnatural fascination with (web) spam. I am trying to familiarize myself with the spammer mindset and their signatures. At times I actually enjoy reading my spam folder.
I purposely did not install a comment spam filter on this blog. Instead I manually remove thousands of spam comments, labeling them in the process. Now if only I got some more ham comments, I could build a comment spam filter based on actual domain expertise.
ADCG SS14 Challenge 02 – Spam Mails Detection
We are to classify 1829 emails as spam or ham (the test set), using 2502 labeled emails (the train set). We have both the email headers and the email contents, in .eml format. The evaluation metric is classification accuracy.
Redefining the problem space
I have much respect for my fellow contestants. To receive course credit they had to design and implement their own algorithms. So they had to get their hands dirty writing Bayesian spam filters, while I could make use of algorithm libraries.
I did try to create my own algorithms at first: one rule-based, and one based on Paul Graham's seminal essay "A Plan for Spam". But later on I redefined the problem: just write the best spam filter you possibly can; don't reinvent the wheel, but make use of open source software and algorithm libraries.
Vowpal Wabbit as an online spam filter
Vowpal Wabbit has been used as a spam filter for Yahoo mail. (Academic) research on its performance can be found in:
- Collaborative Email-Spam Filtering with the Hashing-Trick
- Sparse Online Learning via Truncated Gradient
- Feature Hashing for Large Scale Multitask Learning
- Technical Tricks of Vowpal Wabbit
Vowpal Wabbit can “eat” raw text for its features. You just have to clean the text a little:
1 'email_spam1 |f this is a spam email
0 'email_ham44 |f this is a ham email
This, to VW, is the same as:
1 'email_spam1 |f this:1 is:1 a:1 spam:1 email:1
0 'email_ham44 |f this:1 is:1 a:1 ham:1 email:1
The LibSVM format demands numerical feature indices, which requires a dictionary lookup from token to index. VW instead uses the hashing trick, which makes this far more memory-friendly.
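Preparing the data for VW is little more than cleaning the text and prepending a label and a tag. Below is a minimal sketch of that step in Python; the helper name and the command lines in the comments are mine, not taken from my contest code.

```python
import re

def to_vw_line(label, tag, text):
    """Turn one email into a Vowpal Wabbit example: label 'tag |f tokens."""
    # Remove characters that are special in the VW format (| and :)
    # and collapse whitespace, so every remaining token becomes one text feature.
    cleaned = re.sub(r"[|:]", " ", text)
    cleaned = " ".join(cleaned.lower().split())
    return f"{label} '{tag} |f {cleaned}"

# Toy examples (1 = spam, 0 = ham, matching the lines above).
print(to_vw_line(1, "email_spam1", "This is a spam email"))
print(to_vw_line(0, "email_ham44", "This is a ham email"))

# Training and predicting are then plain VW command lines, for example:
#   vw train.vw -f spam.model
#   vw -t test.vw -i spam.model -p predictions.txt
```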
Running VW on the train set and generating predictions for the test set gave me a first-place score. Using an .eml reader I got to understand my datasets better, and I noticed that a few hand-written static rules would be beneficial:
- If the sender has been messing with the sender field, for example by blanking it out, then the mail is likely spam.
- If the date of the email is in the future (“3601-12-08”), then it is likely spam.
- If the email contains the 2-gram: “Dear Friend”, then it is likely spam.
- If the email is in all-caps then it is likely spam.
- If the email contains spam reports from other classifiers (like SpamAssassin), then these reports contain very indicative data on whether the mail is spam or not.
The last rule helped the most. It is a bit odd that the train and test sets contained these reports, since with a little parsing you can make your filter perform as well as commercial filters. Not all emails had these reports, though, or had been scanned by a filter.
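To give an idea of what such rules can look like, here is a rough sketch using Python's standard email module. The thresholds and the exact header it checks (X-Spam-Flag) are illustrative assumptions, not a copy of my contest code.

```python
import email
from email import policy
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def rule_flags(raw_eml_bytes):
    """Apply a few hand-written spam heuristics to one .eml message."""
    msg = email.message_from_bytes(raw_eml_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body else ""

    flags = {}
    # A blanked-out or missing sender field.
    flags["empty_sender"] = not (msg.get("From") or "").strip()
    # A date claimed to be in the future (like "3601-12-08").
    try:
        sent = parsedate_to_datetime(msg.get("Date", ""))
        flags["future_date"] = sent is not None and sent > datetime.now(timezone.utc)
    except (TypeError, ValueError):
        flags["future_date"] = False
    # The classic spam 2-gram.
    flags["dear_friend"] = "dear friend" in text.lower()
    # A message written (almost) entirely in capitals.
    letters = [c for c in text if c.isalpha()]
    flags["all_caps"] = bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.9
    # Reports left behind by other filters, e.g. SpamAssassin's X-Spam-Flag header.
    flags["spam_report"] = "yes" in (msg.get("X-Spam-Flag") or "").lower()
    return flags
```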
Using VW and the rule-based filter got me very close to 99%.
Since this contest had no rule against using external data, I finally got to try my hand at "data engineering". Finding datasets to add to your model is a useful skill to have, and requires creativity, much like feature engineering does. Most main Kaggle contests explicitly forbid the use of external data, though, and probably for good reason.
I exported my own Gmail spam box and inbox to add to the datasets, and downloaded spam and ham corpora wherever I could find them. I managed to increase the train set by 1000%. But when I ran my deduplication scripts I found, to my surprise, that I now had duplicate emails across the train and test sets… It turned out that the test set had been generated from a dataset (from 2010) that I had found too.
As keeping them would certainly have increased my score, and would have been unfair, I deleted those emails from my train set. This is a dangerous mistake to make, yet I am glad I made it and caught it in time. I wonder how much of a problem this is with academic benchmarks: the spam-fighting world is a small one, so many datasets may overlap, and then it matters to what degree algorithms can detect these duplicates.
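The deduplication itself can be as simple as fingerprinting a normalized version of every email and looking for collisions with the test set. A sketch, with hypothetical directory names:

```python
import hashlib
import pathlib

def fingerprint(path):
    """Hash the lowercased, whitespace-normalized text of an .eml file."""
    text = pathlib.Path(path).read_text(errors="ignore")
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hypothetical layout: the contest test set vs. the external emails I gathered.
test_hashes = {fingerprint(p) for p in pathlib.Path("test").glob("*.eml")}
leaky = [p for p in pathlib.Path("train_external").glob("*.eml")
         if fingerprint(p) in test_hashes]

# Any external training email that also appears in the test set is leakage: drop it.
print(f"Dropping {len(leaky)} duplicated emails from the external train data")
```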
But even with a train set containing no duplicates and no test emails, the extra data increased the score to 99.89%. This shows that more training data is usually better. Vowpal Wabbit isn't scared of this extra data: it is built to scale to millions of samples and thousands of features.
Hitting the plateau
I couldn’t increase the score much further. It seemed I had hit my theoretical limit of catching spam. A 100% classification accuracy would not be possible. I am not convinced the ground truth of the test set is even 100% correct.
Further small attempts to improve this score were fruitless, but it had proven to be enough to win this competition.
What I learned
- More training data is usually better
- Do not create leakage for yourself, even if it may improve the score
- Vowpal Wabbit eats spam for breakfast
- Always try a quick Vowpal Wabbit run on classification and regression contests
- It is really hard to beat a plateau at 99.9% (The Spam-Filtering Accuracy Plateau at 99.9 percent Accuracy and How to Get Past it)
Papirusy z Edhellond (Papyrus scrolls from Edhellond)
For this competition I had to rely on Google Translate, as the competition page was written in Polish. This added a little to the complexity of the otherwise basic challenge.
For this competition the features were already generated; we did not have access to the raw original emails. We get a table of precomputed features for the 3221 emails in the train set, and we have to predict 1380 emails in the test set. The evaluation metric is area under the ROC curve.
I first tried Vowpal Wabbit, but this was not enough to beat the leaders on the leaderboard. A single research team had a very high score and over 200 submissions. I suspected they were overfitting to the public leaderboard, and this was later confirmed: they dropped to rank 3 when the private leaderboard was revealed.
I then tried a number of scikit-learn classifiers, and all of these produced very high scores. ExtraTreesClassifier did well in local cross-validation, but worse on the public leaderboard, so I mistakenly settled on RandomForestClassifier.
I tuned the parameters with a grid search and found that the "entropy" split criterion outperformed "gini". I also tried AdaBoosting large RandomForests, which upped the score a little.
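Roughly, that model selection looks like the sketch below in scikit-learn. It is not my original contest code: the parameter grid, the forest sizes and the synthetic stand-in data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the contest's precomputed feature table (3221 train emails).
X, y = make_classification(n_samples=3221, n_features=50, random_state=0)

# Grid search over the split criterion (plus a few other knobs), scored by AUC.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"criterion": ["gini", "entropy"],
                "max_features": ["sqrt", "log2", None]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # in the contest, "entropy" beat "gini"

# AdaBoosting a large RandomForest can nudge the AUC up a little further.
boosted = AdaBoostClassifier(RandomForestClassifier(n_estimators=200, criterion="entropy"),
                             n_estimators=10)
print(cross_val_score(boosted, X, y, cv=5, scoring="roc_auc").mean())
```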
I then tried two ensemble methods to increase the score even more: simple averaging/bagging of my three best submissions, and a technique that was new to me, stacked ensemble learning, which performed really well using six small tree models with different settings.
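The stacking step can be reproduced with scikit-learn's StackingClassifier, which trains a meta-model on the base models' out-of-fold predictions. This is a modern stand-in for what I did, not the original code; the six small tree configurations below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same synthetic stand-in data as in the previous sketch.
X, y = make_classification(n_samples=3221, n_features=50, random_state=0)

# Six small tree models with different settings, blended by a logistic regression
# trained on their out-of-fold predictions (stacked generalization).
configs = [
    (RandomForestClassifier, 50, "gini"),
    (RandomForestClassifier, 50, "entropy"),
    (RandomForestClassifier, 100, "entropy"),
    (ExtraTreesClassifier, 50, "gini"),
    (ExtraTreesClassifier, 50, "entropy"),
    (ExtraTreesClassifier, 100, "entropy"),
]
base_models = [(f"tree_{i}", cls(n_estimators=n, criterion=crit, random_state=i))
               for i, (cls, n, crit) in enumerate(configs)]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)
print(cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean())
```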
When the private leaderboard was revealed, my best submission turned out to be an ExtraTreesClassifier; the second best was the stacked ensemble. I had picked the AdaBoosted RandomForestClassifier as my final model, which, while not my best possible score, still gave me the number one spot. See here my private and public leaderboard scores for different algorithms and different settings.
What I learned
- Adaboosting large trees can increase the score
- It is not impossible to compete in a contest that is written in another language.
- Trust local cross-validation, even when the leaderboard tells you differently
- Trust and hone your instincts: if you think a model is more robust but it ranks lower than other models on the public leaderboard, do not discard it.
- Random Forests are very powerful and you should add them to your arsenal if you want to win Kaggle competitions
- Ensemble learning can make the difference between winning a competition and doing well in a competition.
- Simple bagging works, but stacking/blending usually works better.
- 6 weak models can beat 1 strong model
I have uploaded the winning model code to GitHub. Feel free to snoop around in it or use it to improve the final score. I suspect that AdaBoosted ExtraTrees would increase the score even more.
Kaggle In Class contests are fun to do: the problems are mostly entry-level, and these contests allow for a maximum of experimentation and learning. If you are allowed to join one, try it out some time. One such competition has you predict CPU load on a server cluster (see also this related Google blog post: Better data centers through machine learning).
The intro image to this post came from Wikimedia Commons and was created by Peter Eich.