I recently competed in a CrowdAnalytix competition to predict worsening symptoms of COPD. Our team (Marios Michailidis, Phil Culliton, Vivant Shen and me) finished in the money. Here is how we managed this.
COPD (Chronic Obstructive Pulmonary Disease) is a lung disease that makes it hard to breathe. People with COPD experience exacerbations: a sudden worsening of the symptoms. Symptoms of COPD exacerbation include:
Let’s take a look at the perceptron: the simplest artificial neuron. This article goes from a concept devised in 1943 to a Kaggle competition in 2015. It shows that a single artificial neuron can get 0.95 AUC on an NLP sentiment analysis task (predicting whether a movie review is positive or negative).
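To make the idea of a single artificial neuron concrete, here is a minimal sketch of a perceptron in plain Python. It is not the code from the article: the function names, the learning rate, the number of epochs and the toy AND-gate data are all assumptions chosen for illustration.

```python
def train_perceptron(samples, epochs=10, lr=1.0):
    """Train a perceptron on (features, label) pairs with labels in {0, 1}.

    epochs and lr are illustrative defaults, not values from the article.
    """
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0
            error = y - prediction
            if error:
                # Perceptron rule: only update on a misclassified sample.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy data (an assumption for this sketch): the logical AND function.
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
```

The same update rule applies to text if each review is first turned into a vector of word features, but that step is left out here.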
In logic there are no morals. Everyone is at liberty to build up his own logic, i.e., his own form of language, as he wishes. – Rudolf Carnap (1934) “Logical Syntax of Language”
The number of fake webshops is rising. From 2010 to 2012 the Dutch authority on internet scams received 81,000 complaints. Spammers have moved from running their own webshops to hacking websites or registering expired domain names. This makes classification more difficult.
In this series we will experiment with machine learning to automatically classify the trustworthiness of a webshop (and by extension, to detect any malicious website). The focus will be on fake webshops hosted under the Dutch TLD (.nl) or catering to Dutch users.
It’s been a year since I joined Kaggle for my first competition. Back then I didn’t know what an Area Under the Curve was. How did I manage to predict my way to Kaggle Master?
Toying with datasets and tools
I was already downloading datasets from Kaggle purely for my own entertainment and study before I started competing. Kaggle is one of the few places on the internet where you can get quality datasets in the context of a commercial machine learning problem.
Wisdom of the crowds and ensemble machine learning techniques are similar in principle. Could insights into group learning provide insights into machine learning and vice versa? In this article we will touch upon a variety of more (or less) related concepts and try to build an ensemble view of our own.
“Of all the offspring of Time, Error is the most ancient, and is so old and familiar an acquaintance, that Truth, when discovered, comes upon most of us like an intruder, and meets the intruder’s welcome.” – Charles Mackay (1841), Extraordinary Popular Delusions and the Madness of Crowds
Wisdom of the crowds
The concept of Wisdom of the crowds originated with the book ‘The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations’.
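The shared principle behind crowds and ensembles can be illustrated with a small simulation. This is only a sketch under strong assumptions: every guesser is independent and unbiased, with Gaussian noise around the true value. The function name, the parameters and the numbers are hypothetical and do not come from the book or the article.

```python
import random


def crowd_vs_individuals(truth=100.0, n_guessers=500, noise=20.0, seed=0):
    """Compare the error of the averaged guess with the average individual error.

    Assumes independent, unbiased guessers with Gaussian noise (illustrative only).
    """
    rng = random.Random(seed)
    guesses = [truth + rng.gauss(0, noise) for _ in range(n_guessers)]
    crowd_guess = sum(guesses) / len(guesses)
    crowd_error = abs(crowd_guess - truth)
    mean_individual_error = sum(abs(g - truth) for g in guesses) / len(guesses)
    return crowd_error, mean_individual_error


crowd_error, mean_individual_error = crowd_vs_individuals()
```

The error of the averaged guess can never exceed the average individual error (this follows from the triangle inequality), and with independent errors it is usually far smaller. Averaging the predictions of several models in an ensemble relies on the same effect, which weakens when the models make correlated mistakes.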
Good clicklog datasets are hard to come by. Luckily CriteoLabs released a week’s worth of data (a whopping ~11GB!) for a new Kaggle contest. The task is to predict the click-through rate for ads. We will use online machine learning with Vowpal Wabbit to beat the logistic regression benchmark and get a number 1 position on the leaderboard.
A demo with data from this contest has been added to Vowpal Wabbit. Now that this contest is over, go here if you want to download the dataset, which Criteo has made freely available.
The highest-scoring team using this benchmark (with cubic features) was Silogram, in 14th place.
Our team got 29th place out of 718 competing teams.
tinrtgu posted a very cool benchmark on the forums that uses only standard Python libraries and under 200MB of memory. Now is your chance to play around with online learning, the hashing trick, adaptive learning rates and logistic loss, and to get a score of ~0.46902 on the public leaderboard.
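The ingredients named above can be combined in a few lines of standard-library Python. What follows is a minimal sketch written for this page, not tinrtgu's benchmark code: the table size D, the learning rate alpha, the use of zlib.crc32 as the hash function, and the toy rows are all assumptions.

```python
import zlib
from math import exp, log, sqrt

D = 2 ** 18  # number of weight slots (assumed size for this sketch)


def hash_features(row):
    """Hashing trick: map each 'field=value' string to an index in [0, D)."""
    return [zlib.crc32(f"{k}={v}".encode()) % D for k, v in row.items()]


def predict(indices, w):
    """Logistic regression prediction over binary hashed features."""
    z = sum(w[i] for i in indices)
    z = max(min(z, 20.0), -20.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + exp(-z))


def logloss(p, y):
    """Logistic loss for a single prediction."""
    p = max(min(p, 1.0 - 1e-12), 1e-12)
    return -log(p) if y == 1 else -log(1.0 - p)


def update(indices, w, n, p, y, alpha=0.1):
    """One online step; the per-feature rate shrinks as a feature is seen more."""
    for i in indices:
        w[i] -= alpha * (p - y) / (sqrt(n[i]) + 1.0)
        n[i] += 1.0


# Toy click log (hypothetical field names and values).
rows = [({"site": "a", "ad": "x"}, 1), ({"site": "b", "ad": "y"}, 0)] * 50

w = [0.0] * D  # weights
n = [0.0] * D  # per-feature update counts for the adaptive learning rate
for row, y in rows:  # a single pass: each example is seen once, then discarded
    idx = hash_features(row)
    p = predict(idx, w)
    update(idx, w, n, p, y)

p_click = predict(hash_features({"site": "a", "ad": "x"}), w)
p_noclick = predict(hash_features({"site": "b", "ad": "y"}), w)
```

Because features are hashed into a fixed-size table, memory use does not grow with the number of distinct feature values, and each example is processed once and then dropped, which is what keeps this style of model cheap on large click logs.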
FastML wrote a blog about this competition with some tips to improve this benchmark.
A competition on optimizing online advertisements with machine learning is like strawberries with chocolate and vanilla: you have large amounts of data, an almost endless variety of features to engineer, and profitable patterns waiting to be discovered.
Kaggle hosts certain InClass contests that are free for everyone to join. The problems can be simpler than the main competition problems, so this offers a lot of opportunity to experiment and learn. I’ll walk you through two competitions that dealt with spam, and tell you how I won them.
Fascination with spam
I have an unnatural fascination with (web) spam. I am trying to familiarize myself with the spammer mindset and their signatures. At times I actually enjoy reading my spam folder.