
How we won 3rd Prize in CrowdAnalytix COPD competition

I recently competed in a CrowdAnalytix competition to predict worsening symptoms of COPD. Our team (Marios Michailidis, Phil Culliton, Vivant Shen and me) finished in the money. Here is how we managed this.


COPD (Chronic Obstructive Pulmonary Disease) is a lung disease that makes it hard to breathe. People with COPD experience exacerbations: sudden worsenings of the symptoms. Symptoms of a COPD exacerbation include:

  • Shortness of breath
  • Noisy or irregular breathing
  • Worry and muscle tension
  • Trouble getting to sleep
  • Swollen ankles

Continue reading


Kaggle Ensembling Guide

Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle Competitions.

For the first part we look at creating ensembles from submission files. The second part will look at creating ensembles through stacked generalization/blending.

I answer why ensembling reduces the generalization error. Finally I show different methods of ensembling, together with their results and code to try it out for yourself.
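As a rough illustration of why combining models can reduce error, here is a minimal sketch of majority voting over submission-style prediction lists. The function names are hypothetical, and the accuracy formula assumes the models err independently, which real models rarely do, so actual gains are usually smaller. Under that assumption, three models that are each 70% accurate give a majority vote that is about 78.4% accurate.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Combine per-model label lists (all the same length) by majority vote.

    `predictions` is a list with one list of labels per model.
    Use an odd number of models for binary labels to avoid ties.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


def majority_accuracy(p, n_models):
    """Chance that a majority of `n_models` models is correct.

    Assumes each model is correct with probability `p` and that the
    models' errors are independent (an idealisation).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p ** k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


if __name__ == "__main__":
    model_a = [1, 0, 1]
    model_b = [1, 1, 0]
    model_c = [0, 1, 1]
    print(majority_vote([model_a, model_b, model_c]))
    print(majority_accuracy(0.7, 3))
```

Each of the three toy models above disagrees with the other two on one sample, yet the vote recovers a single consistent answer for every row.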

“This is how you win ML competitions: you take other peoples’ work and ensemble them together.” – Vitaly Kuznetsov, NIPS 2014

Continue reading


Online Learning Perceptron

Let’s take a look at the perceptron: the simplest artificial neuron. This article goes from a concept devised in 1943 to a Kaggle competition in 2015. It shows that a single artificial neuron can get 0.95 AUC on an NLP sentiment analysis task (predicting if a movie review is positive or negative).
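To make the idea concrete, here is a minimal sketch of an online perceptron over bag-of-words features. It is not the code from the competition entry: the tokenisation, the toy reviews, the function names and the lack of a bias term are all simplifying assumptions made for illustration.

```python
def tokenize(text):
    """Split a review into a set of lowercase word features."""
    return set(text.lower().split())


def predict(weights, tokens):
    """Fire (1 = positive review) when the summed weights exceed 0."""
    return 1 if sum(weights.get(t, 0.0) for t in tokens) > 0 else 0


def train(samples, epochs=10, learning_rate=1.0):
    """Perceptron rule: update the weights only when a prediction is wrong.

    `samples` is an iterable of (text, label) pairs with label 0 or 1.
    Samples are processed one at a time, so memory use depends on the
    vocabulary size rather than on the number of samples.
    """
    weights = {}
    for _ in range(epochs):
        mistakes = 0
        for text, label in samples:
            tokens = tokenize(text)
            error = label - predict(weights, tokens)  # -1, 0 or +1
            if error:
                mistakes += 1
                for t in tokens:
                    weights[t] = weights.get(t, 0.0) + learning_rate * error
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights


if __name__ == "__main__":
    toy_reviews = [
        ("great wonderful film", 1),
        ("terrible boring film", 0),
        ("wonderful acting", 1),
        ("boring plot", 0),
    ]
    w = train(toy_reviews)
    print(predict(w, tokenize("wonderful film")))
    print(predict(w, tokenize("boring film")))
```

On this toy data the weight for the uninformative word "film" is pushed up by the first review and back down by the second, while the sentiment-bearing words keep their signs.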

In logic there are no morals. Everyone is at liberty to build up his own logic, i.e., his own form of language, as he wishes. – Rudolf Carnap (1934) “Logical Syntax of Language”

McCulloch-Pitts Neuron

Artificial neural nets were born with the 1943 paper “A Logical Calculus of the Ideas Immanent in Nervous Activity”. Two researchers, McCulloch, a neurologist, and Pitts, a logician, joined forces to sketch out the first artificial neurons.

Continue reading


Detecting Counterfeit Webshops. Part 1: Feature engineering

The number of fake webshops is rising. From 2010 to 2012 the Dutch authority on internet scams received 81,000 complaints. Spammers have moved from running their own webshops to hacking websites or registering expired domain names. This makes classification more difficult.

In this series we will experiment with machine learning to automatically classify the trustworthiness of a webshop (and by extension, any malicious website). The focus will be on fake webshops hosted on the Dutch TLD (.nl) or catering to Dutch users.

Update: New Google Research: The underground market fueling for-profit abuse.

Continue reading


Reflecting back on one year of Kaggle contests

It’s been a year since I joined Kaggle for my first competition. Back then I didn’t know what an Area Under the Curve was. How did I manage to predict my way to Kaggle Master?

Early start

Toying with datasets and tools

I was already downloading datasets from Kaggle purely for my own entertainment and study before I started competing. Kaggle is one of the few places on the internet where you can get quality datasets in the context of a commercial machine learning problem.

Continue reading


Human Ensemble Learning

Wisdom of the crowds and ensemble machine learning techniques are similar in principle. Could insights into group learning provide insights into machine learning and vice versa? In this article we will touch upon a variety of more (or less) related concepts and try to build an ensemble view of our own.

“Of all the offspring of Time, Error is the most ancient, and is so old and familiar an acquaintance, that Truth, when discovered, comes upon most of us like an intruder, and meets the intruder’s welcome.” – Charles Mackay (1841), Extraordinary Popular Delusions and the Madness of Crowds

Wisdom of the crowds

The concept of Wisdom of the crowds originated with the book ‘The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations’.
Continue reading


Predicting CTR with online machine learning

Good clicklog datasets are hard to come by. Luckily CriteoLabs released a week’s worth of data (a whopping ~11GB!) for a new Kaggle contest. The task is to predict the click-through rate for ads. We will use online machine learning with Vowpal Wabbit to beat the logistic regression benchmark and get a number 1 position on the leaderboard.


Demo with data from this contest added to Vowpal Wabbit. Now that this contest is over: Go here if you want to download the dataset freely made available by Criteo.

The winning team also won the following Avazu CTR prediction challenge and released Field-Aware Factorization Machines.

The winning team used a mixture of Factorization Machines and GBRT. Code here.

The highest-scoring team using Vowpal Wabbit was Guocong Song, in 3rd place. Method and code here. In short: multiple models, polynomial learning and feature masks.

The highest-scoring team using this benchmark (and cubic features) was Silogram, in 14th place.

Our team got 29th place out of 718 competing teams.

tinrtgu posted a very cool benchmark on the forums that uses only standard Python libraries and under 200MB of memory. Now is your chance to play around with online learning, the hash trick, adaptive learning and logistic loss and get a score of ~0.46902 on the public leaderboard.
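The ingredients named above (online learning, the hash trick, adaptive learning and logistic loss) fit together roughly as in the sketch below. This is not tinrtgu's benchmark code. The table size `D`, the CRC32 hash, the per-feature learning-rate schedule and all function names are assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict
from math import exp, log, sqrt

D = 2 ** 18   # size of the hashed feature space (assumed, must be a power of two)
BIAS = -1     # dedicated key for the bias weight, so it never collides


def hash_features(row):
    """Hash trick: map each 'name=value' string to one of D indices."""
    indices = [BIAS]
    for name, value in row.items():
        indices.append(zlib.crc32(f"{name}={value}".encode("utf-8")) & (D - 1))
    return indices


def predict(indices, w):
    """Logistic regression on one-hot features: sigmoid of the summed weights."""
    z = sum(w[i] for i in indices)
    z = max(min(z, 20.0), -20.0)  # clip to keep exp() well behaved
    return 1.0 / (1.0 + exp(-z))


def logloss(p, y):
    """Logistic loss of predicted probability p against label y (0 or 1)."""
    p = max(min(p, 1.0 - 1e-12), 1e-12)
    return -log(p) if y == 1 else -log(1.0 - p)


def update(indices, w, n, p, y, alpha):
    """Adaptive step: features seen more often get a smaller learning rate."""
    for i in indices:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.0)
        n[i] += 1.0


def train_stream(rows, alpha=0.1):
    """One pass over (row, label) pairs, keeping only the weights in memory.

    Returns the weights and the average loss measured on each sample
    before the model learned from it.
    """
    w, n = defaultdict(float), defaultdict(float)
    total, count = 0.0, 0
    for row, y in rows:
        x = hash_features(row)
        p = predict(x, w)
        total += logloss(p, y)
        count += 1
        update(x, w, n, p, y, alpha)
    return w, total / max(count, 1)
```

Because `rows` can be any iterable, it could be a generator that reads a large CSV line by line, so memory use is bounded by the number of distinct hashed indices rather than by the size of the dataset.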

FastML wrote a blog post about this competition with some tips to improve this benchmark.

The competition of optimizing online advertisements with machine learning is like strawberries with chocolate and vanilla: You have large amounts of data, an almost endless variety of features to engineer, and profitable patterns waiting to be discovered.

Continue reading


Winning 2 Kaggle InClass Competitions on Spam

Kaggle hosts certain InClass contests that are free for everyone to join. The problems can be simpler than the main competition problems, so this offers a lot of opportunity to experiment and learn. I’ll walk you through two competitions that dealt with spam, and tell you how I won them.

Fascination with spam

I have an unnatural fascination with (web) spam. I am trying to familiarize myself with the spammer mindset and their signatures. At times I actually enjoy reading my spam folder.

Continue reading