Spam filter

Winning 2 Kaggle in Class Competitions on Spam

Kaggle hosts In Class contests that are free for everyone to join. The problems can be simpler than the main competition problems, so they offer a lot of opportunity to experiment and learn. I'll walk you through two competitions that dealt with spam, and tell you how I won them.

Fascination with spam

I have an unnatural fascination with (web) spam. I am trying to familiarize myself with the spammer mindset and their signatures. At times I actually enjoy reading my spam folder.

I purposely did not install a comment spam filter on this blog. Instead I manually remove thousands of spam comments, labeling them in the process. Now if only I got some more ham comments, I could build a comment spam filter based on actual domain expertise.

ADCG SS14 Challenge 02 – Spam Mails Detection

In this anomaly detection contest you are tasked with finding spam in an email corpus. The contest is part of a Master's program at the Technical University of Munich.

We are to classify 1829 mails (the test set) as spam or ham, using 2502 labeled mails (the train set) to do so. We have both the email headers and the email contents, in .eml format. The evaluation metric is classification accuracy.

Redefining the problem space

I have much respect for my fellow contestants. To receive course credit they had to design and implement their own algorithms. So they had to get their hands dirty writing Bayesian spam filters, while I could make use of algorithm libraries.

I did try to create my own algorithms at first: one rule-based, and one based on Paul Graham's seminal paper "A Plan for Spam". But later on I redefined the problem: just write the best spam filter you possibly can; don't reinvent the wheel, but make use of open source software and algorithm libraries.

Vowpal Wabbit as an online spam filter

Vowpal Wabbit has been used as a spam filter for Yahoo Mail, and academic research on its performance has been published.

Vowpal Wabbit can “eat” raw text for its features. You just have to clean the text a little:

1 'email_spam1 |f this is a spam email
0 'email_ham44 |f this is a ham email

This, to VW, is the same as:

1 'email_spam1 |f this:1 is:1 a:1 spam:1 email:1
0 'email_ham44 |f this:1 is:1 a:1 ham:1 email:1
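
To make this concrete, here is a minimal Python sketch of how raw email text could be turned into such a VW line. The cleaning steps and the tag naming are illustrative assumptions, not necessarily the exact preprocessing I used:

# Minimal sketch: turn raw email text into a Vowpal Wabbit line.
# VW reserves '|' (namespace separator) and ':' (feature weights),
# so those characters and newlines are stripped out.
import re

def to_vw_line(label, tag, text):
    clean = re.sub(r"[|:\n\r]", " ", text).lower()
    clean = re.sub(r"\s+", " ", clean).strip()
    return "%s '%s |f %s" % (label, tag, clean)

print(to_vw_line(1, "email_spam1", "This is a SPAM email"))
# prints: 1 'email_spam1 |f this is a spam email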

The LibSVM format demands numerical features, which requires a dictionary lookup to map each token to a key. VW instead uses the hashing trick, which makes this more memory-friendly.
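
As a toy sketch of the hashing trick: each token hashes straight to a feature index, so no dictionary has to be kept in memory, and collisions simply share a slot. VW itself uses MurmurHash with a configurable number of bits (18 by default); the version below is only an illustration:

# Toy hashing trick: map tokens directly to indices in a fixed-size
# feature space, with no token-to-key dictionary required.
import hashlib

NUM_BITS = 18  # 2**18 = 262144 feature slots, mirroring VW's default

def hashed_features(tokens):
    vec = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % (2 ** NUM_BITS)
        vec[idx] = vec.get(idx, 0) + 1  # colliding tokens add up
    return vec

print(hashed_features("this is a spam email".split()))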

One could try TF*IDF weighting on these features, which would probably improve the score a bit, but computing IDF requires a full pass over the corpus first, which rather defeats the purpose of online machine learning. Vowpal Wabbit does fine with binary vectors anyway.

Results

Running VW on the train set and generating predictions for the test set produced a first-place score. Using an .eml reader I got to understand my datasets better, and I noticed that a few hand-written static rules would be beneficial:

  • If the sender has been messing with the sender field, for example by blanking it out, then the mail is likely spam.
  • If the date of the email is in the future (“3601-12-08”), then it is likely spam.
  • If the email contains the 2-gram: “Dear Friend”, then it is likely spam.
  • If the email is in all-caps then it is likely spam.
  • If the email contains spam reports from other classifiers (like SpamAssassin), then these reports contain very indicative data on whether the mail is spam or not.

The last rule helped the most. It is a bit odd that the train and test sets contained these reports: with a little parsing you can make your filter behave as well as commercial filters. But not all emails had these reports or had been scanned by a filter.
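
To sketch what such static rules could look like on top of a parsed .eml file (the header names, thresholds, and signal names below are illustrative assumptions, not the exact rules I used):

# Illustrative rule checks on a parsed .eml; a real filter would
# combine these signals with the model's prediction.
from email import message_from_string
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def spam_signals(raw_eml):
    msg = message_from_string(raw_eml)
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    signals = []
    if not msg.get("From", "").strip():          # blanked-out sender field
        signals.append("blank_sender")
    try:
        if parsedate_to_datetime(msg.get("Date", "")) > datetime.now(timezone.utc):
            signals.append("future_date")        # e.g. a year like 3601
    except (TypeError, ValueError):
        pass                                     # unparseable date
    if "dear friend" in body.lower():            # telltale 2-gram
        signals.append("dear_friend")
    letters = [c for c in body if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        signals.append("all_caps")
    if msg.get("X-Spam-Status"):                 # embedded SpamAssassin report
        signals.append("spam_report")
    return signals

print(spam_signals("From: \nDate: Fri, 08 Dec 3601 10:00:00 +0000\n\nDEAR FRIEND, SEND MONEY"))
# prints: ['blank_sender', 'future_date', 'dear_friend', 'all_caps']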

Using VW and the rule-based filter got me very close to 99%.

Data engineering

Since this contest had no rule against using external data, I finally got to try my hand at "data engineering". Finding datasets to add to your model is a useful skill to have, and it requires creativity, much like feature engineering does. Most main Kaggle contests explicitly forbid the use of external data though, and probably for good reasons.

I exported my own Gmail spam box and inbox and added them to the dataset. I downloaded spam and ham datasets wherever I could find them, and managed to increase the train set by 1000%. But when I ran my deduplication scripts, I found to my surprise that I now had duplicate emails across the train and test sets… It turned out that the test set was generated from a dataset (from 2010) that I had found too.

As this would unfairly inflate my score, I deleted those emails from my train set. This is a dangerous mistake to make, yet I am glad I made it and caught it in time. I wonder how much of a problem this is with academic benchmarks: the spam-fighting world is small, so many datasets may have some overlap. It then matters to what degree algorithms can detect these duplicates.
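
A deduplication pass along these lines is cheap insurance. The sketch below fingerprints a normalized body and drops train emails whose fingerprint also occurs in the test set; the normalization is an assumption on my part, and near-duplicates would need fuzzier matching:

# Sketch of a train/test dedup pass based on exact fingerprints
# of whitespace-normalized, lowercased bodies.
import hashlib
import re

def fingerprint(text):
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(norm.encode("utf-8")).hexdigest()

def drop_test_duplicates(train_bodies, test_bodies):
    test_prints = {fingerprint(b) for b in test_bodies}
    return [b for b in train_bodies if fingerprint(b) not in test_prints]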

But even with a train set free of test-set duplicates, I increased the score to 99.89%. This shows that more training data is usually better. Vowpal Wabbit isn't scared of this extra data: it is built to scale to millions of samples and thousands of features.

Hitting the plateau

I couldn't increase the score much further. It seemed I had hit my theoretical limit of catching spam: a 100% classification accuracy would not be possible, and I am not convinced the ground truth of the test set is even 100% correct.

Further small attempts to improve this score were fruitless, but it had proven to be enough to win this competition.

What I learned

Papirusy z Edhellond (Papyrus scrolls from Edhellond)

For this competition I had to rely on Google Translate, as the competition page was written in Polish. This added a little to the complexity of an otherwise basic challenge.

For this competition the features were already generated and we didn’t have access to the raw original emails. We have features like:

  • capital_run_length_total
  • char_freq_!
  • word_freq_credit

for 3221 emails in the train set, and we have to predict 1380 emails in the test set. The evaluation metric is area under the ROC curve.

Results

I first tried Vowpal Wabbit, but this was not enough to beat the leaders on the leaderboard. A single research team had a very high score and over 200 submissions. I suspected they were overfitting to the public leaderboard, and this was later confirmed: they dropped to rank 3 when the private leaderboard was revealed.

I switched to scikit-learn, focusing on its ensemble algorithms: RandomForestClassifier, ExtraTreesClassifier, and GradientBoostingClassifier.

All of these produced very high scores. ExtraTreesClassifier did well in local cross-validation, but worse on the public leaderboard, so I mistakenly settled on RandomForestClassifier.

I tried the different parameters using a grid search, finding that the "entropy" split criterion outperformed "gini". I also tried AdaBoosting large random forests, which upped the score a little.
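
In today's scikit-learn API, the search plus the boosting step might look roughly like this; the grid values, estimator counts, and synthetic stand-in data are illustrative, not my winning settings:

# Rough sketch: grid-search a forest's split criterion, then
# AdaBoost the best forest found. Synthetic data stands in for
# the contest features.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=57, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    param_grid={"criterion": ["gini", "entropy"],
                "max_features": ["sqrt", "log2"]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)

# Boost a handful of copies of the best large forest.
boosted = AdaBoostClassifier(estimator=grid.best_estimator_, n_estimators=10)
boosted.fit(X, y)
print(boosted.predict_proba(X[:5])[:, 1])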

I then tried two ensemble methods to increase the score even more: simple averaging/bagging of my three best submissions, and a technique that was new to me, stacked ensemble learning. The latter performed really well using six small trees with different settings.
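
Modern scikit-learn ships a StackingClassifier that handles the out-of-fold bookkeeping for you; here is a minimal sketch of the idea, with illustrative base models and meta-learner rather than my actual six trees:

# Minimal stacking sketch: a few small tree models with different
# settings feed their out-of-fold predictions to a meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=57, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf_gini", RandomForestClassifier(n_estimators=30, criterion="gini")),
        ("rf_entropy", RandomForestClassifier(n_estimators=30, criterion="entropy")),
        ("extra", ExtraTreesClassifier(n_estimators=30)),
    ],
    final_estimator=LogisticRegression(),  # learns how to weigh the base models
    cv=5,
)
stack.fit(X, y)
print(stack.predict_proba(X[:5])[:, 1])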

When the private leaderboard was revealed, my best submission turned out to be an ExtraTreesClassifier, and the second best was stacked ensemble learning. I had picked the AdaBoosted RandomForestClassifier as my final model, which, while not my best possible score, did give me the number one spot. See here my private and public leaderboard scores for different algorithms and different settings.

What I learned

  • AdaBoosting large trees can increase the score.
  • It is not impossible to compete in a contest that is written in another language.
  • Trust local cross-validation, even when the leaderboard tells you differently.
  • Trust and hone your instincts: if you think a model is more robust, but it ranks lower than other models on the public leaderboard, do not discard it.
  • Random forests are very powerful, and you should add them to your arsenal if you want to win Kaggle competitions.
  • Ensemble learning can make the difference between winning a competition and merely doing well in one.
  • Simple bagging works, but stacking/blending usually works better.
  • Six weak models can beat one strong model.

Code

I have uploaded the winning model code to GitHub. Feel free to snoop around in it or use it to improve the final score. I suspect that AdaBoosted ExtraTrees will increase the score even more.

Conclusion

Kaggle In Class contests are fun to do, the problems are mostly entry-level, and these contests allow for a maximum of experimentation and learning. If you are allowed to join one, try it out some time. This competition, for example, has you predict CPU load on a server cluster (see also this related Google blog post: Better data centers through machine learning).

The intro image to this post came from Wikimedia Commons and was created by Peter Eich.

6 thoughts on “Winning 2 Kaggle in Class Competitions on Spam”

    1. Thank you for being one of two true negatives among a thousand spam comments :). It required a trained eagle eye to spot this one!

  1. You’re not detecting web spam. You’re detecting blog comment spam, which is a different problem. Blog comments should have some statistical resemblance to the blog content and to other blog comments. A content-based method is thus effective.

    True web spam, aimed at search engines, is a tougher problem. Because there’s so much money to be made doing it, the quality of web spam is reasonably good and increasing. Content-based methods are starting to fail as the quality of spammy content improves.

    We approach web spam by finding the business behind the web site and checking out their business background. This is effective. Too effective. If Google used it, their AdSense revenue would drop substantially. Google’s anti-spam efforts are primarily cosmetic, aimed only at extreme offenders.

    There’s a fundamental conflict of interest between running a search engine and running a third-party advertising business.

    1. There is a post coming soon in which I try to target web spam and web scams. One method I explore is indeed checking the background of the company: how long have they been around, do they have negative reviews, are they registered or certified, is their hosting trustworthy, etc.

      One way to find spam sites is to simply follow the links in spam comments and spam email. Spammy comments and spammy emails link to spammy web content.

      I suspect Google is very proficient at catching spam and will do advanced checks to get more signal, like seeing whether a company name appears in a chamber of commerce registration or in offline print media.

      I think this conflict of interest is what Larry Page spoke of in the paper on what would become Google Search. I do think that Google favors user experience and index/ad quality over making a few more cents with crappy ads and cheap spammy organic links.

  2. 8 Appendix A: Advertising and Mixed Motives

    Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is “The Effect of Cellular Phone Use Upon Driver Attention”, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

    In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm. http://infolab.stanford.edu/~backrub/google.html “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
