
Detecting Counterfeit Webshops. Part 1: Feature engineering

The number of fake webshops is rising: from 2010 to 2012 the Dutch authority on internet scams received 81,000 complaints. Spammers have moved from running their own webshops to hacking existing websites or registering expired domain names, which makes classification more difficult.

In this series we will experiment with machine learning to automatically classify the trustworthiness of a webshop (and by extension, any malicious website). The focus will be on fake webshops hosted on the Dutch TLD (.nl) or catering to Dutch users.

Continue reading


Reflecting back on one year of Kaggle contests

It’s been a year since I joined Kaggle for my first competition. Back then I didn’t know what an Area Under the Curve was. How did I manage to predict my way to Kaggle Master?

Early start

Toying with datasets and tools

I was already downloading datasets from Kaggle purely for my own entertainment and study before I started competing. Kaggle is one of the few places on the internet where you can get quality datasets in the context of a commercial machine learning problem.

Continue reading


Human Ensemble Learning

Wisdom of the crowds and ensemble machine learning techniques are similar in principle. Could insights from group learning inform machine learning, and vice versa? In this article we will touch upon a variety of more (or less) related concepts and try to build an ensemble view of our own.
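The analogy can be made concrete with a small sketch (the function name and toy data are mine, not from the article): several independent, better-than-random voters — people or models — combined by majority vote can be right even where each individual is sometimes wrong.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label predictions from multiple voters by majority vote.

    predictions: a list of prediction lists, one per voter,
    all of the same length.
    """
    combined = []
    for votes in zip(*predictions):
        # The most common label among the voters wins this sample.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three weak voters, each wrong on a different sample:
voter_a = [1, 1, 0, 1]
voter_b = [1, 0, 1, 1]
voter_c = [0, 1, 1, 1]
print(majority_vote([voter_a, voter_b, voter_c]))  # [1, 1, 1, 1]
```

Each voter misclassifies one sample, yet the ensemble gets all four right — the same mechanism that makes bagging and voting classifiers work, provided the voters' errors are not strongly correlated.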

“Of all the offspring of Time, Error is the most ancient, and is so old and familiar an acquaintance, that Truth, when discovered, comes upon most of us like an intruder, and meets the intruder’s welcome.” – Charles Mackay (1841), Extraordinary Popular Delusions and the Madness of Crowds

Wisdom of the crowds

The concept of Wisdom of the crowds originated with the book ‘The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations’.
Continue reading


Predicting CTR with online machine learning

Good clicklog datasets are hard to come by. Luckily CriteoLabs released a week’s worth of data — a whopping ~11GB! — for a new Kaggle contest. The task is to predict the click-through rate for ads. We will use online machine learning with Vowpal Wabbit to beat the logistic regression benchmark and take the number 1 position on the leaderboard.

Update: tinrtgu posted a very cool benchmark on the forums that uses only standard Python libraries and under 200MB of memory. Now is your chance to play around with online learning, the hashing trick, adaptive learning rates and logistic loss, and get a score of ~0.46902 on the public leaderboard.
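To give a flavour of the ingredients such a benchmark combines — this is a minimal sketch of the general technique, not tinrtgu’s actual code, and the feature strings and constants are made up — here is online logistic regression in pure Python: features are hashed into a fixed-size weight vector, and each weight gets its own adaptive (AdaGrad-style) learning rate.

```python
import math

D = 2 ** 20          # size of the hashed feature space
w = [0.0] * D        # model weights
n = [0.0] * D        # per-feature sums of squared gradients (adaptive rate)
alpha = 0.1          # base learning rate

def hash_features(raw):
    # The hashing trick: map each "field=value" string to an index.
    return [hash(f) % D for f in raw]

def predict(x):
    wTx = sum(w[i] for i in x)
    # Bounded sigmoid to avoid overflow in exp().
    return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0)))

def update(x, p, y):
    # For binary features, the logistic-loss gradient per active
    # feature is simply (p - y).
    g = p - y
    for i in x:
        n[i] += g * g
        w[i] -= alpha / (math.sqrt(n[i]) + 1.0) * g

# Toy click stream: (raw features, clicked?) pairs, processed one at a time.
stream = [(["site=a.com", "hour=12"], 1), (["site=b.com", "hour=03"], 0)]
for raw, y in stream:
    x = hash_features(raw)
    p = predict(x)       # predict before seeing the label...
    update(x, p, y)      # ...then learn from it (progressive validation)
```

Because the model is a flat array indexed by hashes, memory stays constant no matter how many distinct feature values the 11GB of logs contain — the essential property for learning from a clicklog in one pass.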

Update: FastML wrote a blog about this competition with some tips to improve this benchmark.

The competition of optimizing online advertisements with machine learning is like strawberries with chocolate and vanilla: You have large amounts of data, an almost endless variety of features to engineer, and profitable patterns waiting to be discovered.

Continue reading


Winning 2 Kaggle in Class Competitions on Spam

Kaggle hosts in-class contests that are free for everyone to join. The problems can be simpler than those in the main competitions, which offers plenty of opportunity to experiment and learn. I’ll walk you through two competitions that dealt with spam, and tell you how I won them.

Fascination with spam

I have an unnatural fascination with (web) spam. I am trying to familiarize myself with the spammer mindset and their signatures. At times I actually enjoy reading my spam folder.

Continue reading


How to produce and use datasets: lessons learned

Various studies have focused on the complexities of publishing and using (open) data. A number of lessons can be learned from the experiences of (governmental) data providers, policy-makers, data users, entrepreneurs, competitors and researchers.

Data can be provided by the government, crawled from the web, or generated by sensors. Here are 50 lessons learned in the form of tips and guidelines on creating and using high-quality open datasets.

Continue reading


Predict visual stimuli from human brain activity

Kaggle is hosting a contest where the task is to predict visual stimuli from magnetoencephalography (MEG) recordings of human brain activity. A subject is presented a stimulus (a human face or a distorted face) and the concurrent brain activity is recorded. The relation between the recorded signal and the stimulus may provide insights into the underlying mental process. We use Vowpal Wabbit to beat the benchmark.

Description

Go to the Kaggle competition page to read the full description.

We have data for 23 participants in the study. Each participant completed around 580 trials. Every trial is a time series of brain activity across 306 channels (the MEG sensors), starting 0.5 seconds before the stimulus is presented, for a total of 375 time bins.

Labels are either 1 (a human face) or 0 (a distorted face). We have the labels for the trials of 16 participants (the train set). We have to predict the labels for the trials of 7 participants (the test set).
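Before Vowpal Wabbit can see a trial, the 306 × 375 recording has to become a flat feature line. One simple (and deliberately naive) way is to average each channel over the post-stimulus bins — a sketch of the general idea only, not the post’s actual pipeline, and the stimulus-onset bin index is my assumption, not a documented value.

```python
def trial_to_vw(trial, label, stim_bin=125):
    """Convert one MEG trial to a Vowpal Wabbit input line.

    trial: list of channels, each a list of time-bin readings.
    label: 1 (human face) or 0 (distorted face); VW's logistic
           loss expects labels -1/1.
    stim_bin: index of the first post-stimulus bin (assumed here;
              0.5 s of pre-stimulus data precedes the stimulus).
    """
    feats = []
    for c, channel in enumerate(trial):
        post = channel[stim_bin:]
        # One feature per channel: its mean post-stimulus amplitude.
        feats.append("c%d:%f" % (c, sum(post) / len(post)))
    return "%d |meg %s" % (1 if label == 1 else -1, " ".join(feats))
```

Writing one such line per trial to a file gives a train set VW can consume directly; richer features (per-bin values, channel interactions) follow the same `name:value` pattern.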

Continue reading


Predicting repeat buyers using purchase history

Another Kaggle contest means another chance to try out Vowpal Wabbit. This time on a data set of nearly 350 million rows. We will discuss feature engineering for the latest Kaggle contest and how to get a top 3 public leaderboard score (~0.59347 AUC).

A short competition description

The competition is to predict repeat buyers (those who redeem a coupon and purchase that product afterwards). For this we have the labelled data (did become a repeat buyer, did not become a repeat buyer) for about 150,000 shoppers (the train set).

Our task is to predict the labels for about 150,000 other shoppers (the test set). For this we can use a file called transactions.csv: a huge file (about 22GB unzipped) containing nearly 350 million rows. The total amount spent in the transaction data approaches 1.5 billion.
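A file that size will not fit in memory, so any feature engineering has to stream it row by row. As a minimal sketch — the column names here are my assumption about the file layout, not taken from the competition’s data dictionary — this accumulates one aggregate (total spend) per shopper in a single pass:

```python
import csv
from collections import defaultdict

def shopper_totals(path):
    """Stream a transactions CSV once, accumulating per-shopper spend.

    Assumes columns named 'id' (shopper) and 'purchaseamount';
    adjust to the actual header of transactions.csv.
    """
    totals = defaultdict(float)
    with open(path, newline="") as f:
        # csv.DictReader yields one row at a time, so memory use
        # depends on the number of shoppers, not the number of rows.
        for row in csv.DictReader(f):
            totals[row["id"]] += float(row["purchaseamount"])
    return totals
```

The same loop extends naturally to counts per category, brand, or company — the kind of per-shopper aggregates a linear model like Vowpal Wabbit can then learn from.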

Continue reading