[Intro image: Thoracic Anatomy illustration]

How we won 3rd Prize in CrowdAnalytix COPD competition

I recently competed in a CrowdAnalytix competition to predict worsening symptoms of COPD. Our team (Marios Michailidis, Phil Culliton, Vivant Shen and me) finished in the money. Here is how we managed this.

COPD

COPD (Chronic Obstructive Pulmonary Disease) is a lung disease that makes it hard to breathe. People with COPD experience exacerbations: sudden worsenings of their symptoms. Symptoms of a COPD exacerbation include:

  • Shortness of breath
  • Noisy or irregular breathing
  • Worry and muscle tension
  • Trouble getting to sleep
  • Swollen ankles

Predicting Exacerbation

Our task in this CrowdAnalytix competition was to predict whether COPD patients will show an onset of exacerbation.

Knowing that a patient will have an exacerbation can aid medical professionals in making better, more informed choices.
For instance, medication can help reduce inflammation, possibly shortening or easing the exacerbation.

A solution for this problem could be part of a more general decision making tool.

The data

To predict COPD exacerbation we have a small dataset. Small datasets are common in medicine: gathering data from different patients over prolonged periods is a lot of work, and more data may simply not be available.

There are two types of machine learning practitioner:

1) Those who can generalise from limited data.
@ML_Hipster

Every observation is a patient in the study. A single patient appears in either the train set or the test set, never in both. This ensures that solutions generalize to other patients: we train on different patients than the ones we make predictions for.
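In this competition every row is a single patient, so the organizers' split takes care of this automatically. If one patient could contribute several rows, you would have to enforce it yourself; here is a minimal sketch with scikit-learn's GroupShuffleSplit, on made-up data:

    # Minimal sketch of a patient-level split, assuming a dataset where one patient
    # can contribute several rows (not the case in this competition, where each row
    # is a distinct patient). The data below is synthetic, for illustration only.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 60)                      # 60 features, as in the competition
    patient_id = rng.randint(0, 200, size=1000)  # several rows per patient
    y = rng.randint(0, 2, size=1000)             # exacerbation yes/no

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))
    # No patient ends up on both sides of the split:
    assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])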

The data is anonymized so as to protect the privacy of the participating patients (3309 in total).

The features

We have 60 features of varying usefulness, including:

  • Patient demographics
  • Disease stage
  • Lung function
  • Disease history
  • Questionnaire results
  • Smoking history

To be eligible for a prize, we were required to show which variables were important.

Plan of action

We divvied up the task so we could attack the problem from multiple angles.

Benchmark

Marios Michailidis created our team benchmark/baseline: a single Logistic Regression model with solid cross-validation and a decent score (a minimal sketch of such a baseline follows the list below).

Such linear benchmarks are useful to:

  • spot good features,
  • tell you if the problem is more linear or non-linear,
  • show you if a model is worth improving on or if it should be discarded.
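A minimal sketch of such a baseline with scikit-learn. The synthetic data only mimics the shape of the competition data (3309 patients, 60 features), and the preprocessing and parameters are my assumptions rather than the team's actual code:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in with the same shape as the competition data:
    # 3309 patients, 60 features, roughly 25 of them informative.
    X, y = make_classification(n_samples=3309, n_features=60, n_informative=25,
                               random_state=0)

    baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    auc = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
    print("5-fold AUC: %.3f +/- %.3f" % (auc.mean(), auc.std()))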

None of our models showed much promise vs. this baseline, except for a very basic Extremely Randomized Trees model from Vivant Shen. She was the only one to experiment with feature selection. This turned out to be a winning intuition.

Extremely Randomized Trees

Extremely Randomized Trees is inspired by Perfect Random Trees. Perfect Random Tree Ensembles grow random trees until every leaf is pure, i.e. contains only samples of a single class.

Extremely Randomized Trees is very similar to the Perfect Random Trees principle and to the Random Forests algorithm. It builds trees whose splits are chosen more randomly, in the extreme case building completely randomly split trees.

It depends on the problem/data if ExtraTrees outperforms RandomForests. In my practical experience with this algorithm it does so more often than not (provided you can use a large amount of estimators), making ExtraTrees a solid choice for a wide variety of dense data sets.

ExtraTrees is also not too popular an algorithm (compared to, say, XGBoost). This may provide you with an edge over other competitors using more popular algorithms.

The added element of randomness makes ExtraTrees more robust against overfitting: while individual trees may have high variance, they are much less correlated with one another. Ensembles of trees reduce variance, and using less correlated trees reduces the generalization error.
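A minimal sketch of how such a comparison might look in scikit-learn, again on synthetic stand-in data; the hyperparameters are illustrative, not what we actually used:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=3309, n_features=60, n_informative=25,
                               random_state=0)

    # Cross-validate ExtraTrees and Random Forest under identical settings.
    for name, model in [
        ("ExtraTrees", ExtraTreesClassifier(n_estimators=1000, n_jobs=-1, random_state=0)),
        ("RandomForest", RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)),
    ]:
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print("%s: %.4f" % (name, auc))

With enough estimators the extra randomness of the individual trees averages out, which is why the large number of estimators mentioned above matters.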

Feature Selection

Vivant noticed that halving the dimensionality improved her results.

With tools like Vowpal Wabbit, adding a few uninformative features will not damage your results too much. Tree-based models, however, really do benefit from removing these noisy features. If kept in, they are sure to appear in a few trees, muddying the results.

Eliminating even more features, down to just the top 25, gave us the model with the best score.
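As mentioned in the comments below, this selection was done with scikit-learn's SelectKBest, I believe with f_classif as the scoring function. A minimal sketch of that step, chained into a pipeline on synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=3309, n_features=60, n_informative=25,
                               random_state=0)

    # Keep only the 25 features with the highest ANOVA F-score, then fit ExtraTrees.
    model = make_pipeline(
        SelectKBest(f_classif, k=25),
        ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=0),
    )
    print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())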

“It seems that perfection is reached not when there is nothing left to add, but when there is nothing left to take away” — Antoine de Saint-Exupéry

Bagging

The score was made more solid by averaging the results of 10 ExtraTrees models, each trained with a different seed/random state.
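A minimal sketch of that seed-averaging step, again on synthetic stand-in data; the single train/test split and the hyperparameters are illustrative assumptions:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3309, n_features=60, n_informative=25,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    # Reduce to the top 25 features, as above.
    selector = SelectKBest(f_classif, k=25).fit(X_tr, y_tr)
    X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)

    # Train 10 ExtraTrees models that differ only in their random seed
    # and average their predicted probabilities.
    preds = []
    for seed in range(10):
        et = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=seed)
        et.fit(X_tr, y_tr)
        preds.append(et.predict_proba(X_te)[:, 1])

    avg_pred = np.mean(preds, axis=0)
    print("AUC of the seed-averaged ensemble: %.4f" % roc_auc_score(y_te, avg_pred))

Averaging over seeds keeps the model family identical and only smooths out the randomness in the tree construction.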

I completely overlooked this modest modeling approach. I had expected that stacked generalization, or at least averaging over different algorithms, would help. This served as a warning to never ignore the basics.

It seems the KISS-principle does apply to data science: You should never underestimate simplicity and elegance, especially when it beats your complex approaches.

About the competition

What I liked about this competition was the way prize money was awarded: even though we ended up in 7th spot, we still won 3rd prize. This rewards all the well-performing teams, resulting in more diverse model documentation, fewer statistically insignificant winners, and increased engagement and satisfaction.

I did not like that our best model was picked automatically. This favors a shotgun approach over looking at CV and carefully selecting your final models. It is, though, an improvement over previous competitions, where there was no private leaderboard at all (thus rewarding overfitting to the leaderboard).

If their software allows for it, I’d recommend that CrowdAnalytix let us select our final models ourselves.

Conclusion

“Medical diagnostics is, at its heart, a data problem – turning images, lab tests, patient histories, and so forth into a diagnosis and proposed intervention. Recent applied machine learning breakthroughs, especially using deep learning, have shown that computers can rapidly turn large amounts of data of this kind into deep insights, and find subtle patterns. This is the biggest opportunity for positive impact using data that I’ve seen in my 20+ years in the field.” — Jeremy Howard

The simple, modest and solid approach won vs. overkill solutions. Bigger is not always better; elegance ruled over brute force. This is a good thing: our solution would have few problems in a real-life implementation. An ensemble of thousands of models may have won this competition, but it would not be very useful. Model interpretability is very important for medical models.

In general, it would be nice if more data like this were made available. More and better data can aid research and reduce uncertainty. A lot of this data is already out there, hidden in the databases of different hospitals.

Combining this information (the way Enlitic is doing for medical imaging) is valuable. It should be easier for patients to give permission: “Yes! Use my (anonymized) data for science!”. Besides an improved technical infrastructure, such sharing initiatives need a change in mindset for all professionals (and patients) involved.

We each picked a good cause and donated our share of the prize money to it.

Thanks to my team mates, CrowdAnalytix, competition host Mohan S, Pierre Geurts et al., and everyone contributing to Scikit-learn.

The intro image for this post came from Wikimedia Commons and is in the Public Domain, courtesy of the medical illustrator Patrick J. Lynch and uploaded by MaterialScientist.

9 thoughts on “How we won 3rd Prize in CrowdAnalytix COPD competition”

  1. Kudos to all the team for winning and donating prize money for the cause. I am learning ML from Andrew sir these days. Just curious: did you use PCA for dimensionality reduction?

    1. Thanks.

      We did not do dimensionality reduction, we did feature selection with http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html and I think http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif to select top 25 features.

      According to the Extremely Randomized Trees paper this top k selection (removing noisy features) should increase the variance (of individual trees) a bit, while reducing the bias a lot, which would explain why this worked for us to improve accuracy, and why subsequent bagging of differently seeded ExtraTrees improved score by lowering the variance.

  2. Hi all,

    I just wanted to know one thing: does domain knowledge make much of a difference? Did you guys read the existing research papers on COPD exacerbation prediction?

    In your post above you mention you reduced the number of features to 25. Did you add any relevant features of your own, i.e. did you do some feature engineering before jumping to modelling, or did you keep the 60 features as they were? When I was doing some reading on COPD, I read that the BODE index could be handy in predicting COPD. So did you guys calculate the BODE index, or anything of that sort, based on the available features and use it as an extra feature?

    1. We did not use any domain knowledge specific to COPD, nor did we read COPD papers. But I can recommend doing that if you are working on a COPD project; just let cross-validation do the talking and use domain knowledge for inspiration. With a lot of domain knowledge there is a risk of preconceived notions (“this feature must be important, so we keep it in”). On Kaggle, benchmarks made by domain experts are beaten by competitors who disregard expertise and just focus on performing well.

      We did not add any features. I don’t know if that is even allowed (since you’d be using external data that other competitors do not have access to). But as said: the BODE index could be helpful in predicting COPD, so if you have access to it (or can calculate it), it could be worth checking whether it actually increases performance.

      1. Thanks, that was helpful.
        Do you have any suggestions on how to work on problems with severe class imbalance? Right now I am working on a similar COPD exacerbation prediction problem and 88% of the labels are negative. I am able to achieve very good specificity but the sensitivity sucks.

        Also, something tells me that the dataset I am working on is similar to the dataset CrowdAnalytix gave you, because my dataset also has some 60 variables and a lot else in common.
        So I would like to know, if you remember, what AUC you guys actually achieved.

        Right now I have been able to reach 0.77 without creating any ensemble.

        1. I was going to suggest using AUC/ranking for imbalanced classes, but you are already using this, so I am not sure. It can also be fruitful to look at extremely imbalanced classification problems as outlier detection. There is also sample importance weighting and up- and/or downsampling. I am not comfortable being more specific (consult an ML dr. 🙂).
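          A minimal sketch of two of those options (class weighting and simple upsampling) with scikit-learn; the synthetic data only mimics an 88%-negative problem and the parameters are illustrative:

              import numpy as np
              from sklearn.datasets import make_classification
              from sklearn.ensemble import ExtraTreesClassifier
              from sklearn.utils import resample

              # Synthetic stand-in: roughly 88% negative labels, as in your dataset.
              X, y = make_classification(n_samples=5000, n_features=60, n_informative=25,
                                         weights=[0.88], random_state=0)

              # Option 1: reweight classes inside the model.
              weighted = ExtraTreesClassifier(n_estimators=500, class_weight="balanced",
                                              n_jobs=-1, random_state=0)

              # Option 2: upsample the minority (positive) class before fitting.
              X_pos, X_neg = X[y == 1], X[y == 0]
              X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)
              X_bal = np.vstack([X_neg, X_pos_up])
              y_bal = np.hstack([np.zeros(len(X_neg)), np.ones(len(X_pos_up))])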

          I think we got 0.77280 on the private leaderboard, where the winner got ~0.78. I remember this problem being “easy” to get started with, but with little room for improvement. There may simply not be enough information to get a good class separation. Still… ranking patients for COPD exacerbation could be useful, and for that you need a good ROC AUC score.

  3. @Aditya

    I should have that file somewhere (or if not, someone else on the team may have), but like on Kaggle, I don’t think we are allowed to share this data outside of the competition.

    Perhaps you can try emailing Mohan S. (the competition host), to get clearance to keep working on this (possibly beneficial) problem.
