I recently competed in a CrowdAnalytix competition to predict worsening symptoms of COPD. Our team (Marios Michailidis, Phil Culliton, Vivant Shen and me) finished in the money. Here is how we managed this.
COPD (Chronic Obstructive Pulmonary Disease) is a lung disease that makes it hard to breath. People with COPD experience exacerbation: A sudden worsening of the symptoms. Symptoms of COPD exacerbation include:
- Shortness of breath
- Noisy or irregular breathing
- Worry and muscle tension
- Trouble getting to sleep
- Swollen ankles
Knowing that a patient will have exacerbation can aid medical professionals in making better, more informed, choices.
For instance, medication can help reduce inflammation, possibly shortening or easing the exacerbation.
A solution for this problem could be part of a more general decision making tool.
To predict COPD exacerbation we have a small dataset. Small datasets are common with medical datasets: gathering data from different patients over prolonged periods is a lot of work. Also more data may simply not be available.
There are two types of machine learning practitioner:
1) Those who can generalise from limited data.
Every observation is a patient in the study. A single patient only appears in the train or test set, but never in both. This ensures that solutions generalize to other patients: we train on different patients than the patients we are making predictions for.
The data is anonymized so as to protect the privacy of the participating patients (3309 in total).
We have 60 features of varying usefulness, including:
- Patient demographics
- Disease stage
- Lung function
- Disease history
- Questionnaire results
- Smoking history
Showing which variables were important was required to be eligible for a prize.
Plan of action
We divvied up the task so we could attack the problem from multiple angles.
- Marios Michailidis: Forests and Linear/Logistic Regression
- Phil Culliton: Neural networks (ANN, MLP, DL)
- Vivant Shen: XGBoost and Scikit-learn .ensemble
- HJ van Veen: Vowpal Wabbit and Hodor-ML (Automated stacking with many different algorithms)
Marios Michailidis created our team benchmark/baseline. This was a single Logistic Regression model which showed solid cross-validation and a decent score.
Such linear benchmarks are useful to:
- spot good features,
- tell you if the problem is more linear or non-linear
- show you if a model is worth improving on or if it should be discarded
None of our models showed much promise vs. this baseline, except for a very basic Extremely Randomized Trees model from Vivant Shen. She was the only one to experiment with feature selection. This turned out to be a winning intuition.
Extremely Randomized Trees
Extremely Randomized Trees is inspired by Perfect Random Trees. Perfect Random Tree Ensembles create random trees that continue to grow until all leafs have a 100% ratio of the same class.
Extremely Randomized Trees is very similar to the Perfect Random Trees principle and the Random Forests algorithm. It creates trees which are split more randomly, in the extreme case building completely randomly split trees.
It depends on the problem/data if ExtraTrees outperforms RandomForests. In my practical experience with this algorithm it does so more often than not (provided you can use a large amount of estimators), making ExtraTrees a solid choice for a wide variety of dense data sets.
ExtraTrees is also not too popular an algorithm (compared to, say, XGBoost). This may provide you with an edge over other competitors using more popular algorithms.
The added element of randomness makes ExtraTrees more robust against overfit: While individual trees may be highly variant, the trees themselves are much less correlated to another. Ensembles of trees reduce variance and using less correlated trees reduces the generalization error.
Vivant noticed that halving the dimensionality improved her results.
With tools like Vowpal Wabbit adding a few uninformative features will not damage your results too much. Tree-based models really do benefit by removing these noisy features. If kept in they are sure to appear in a few trees, muddying the results.
Eliminating even more features, to just the top 25 best features gave us the model with the best score.
“It seems that perfection is reached not when there is nothing left to add, but when there is nothing left to take away” — Antoine de Saint-Exupéry
The score was made more solid by averaging the result of 10 ExtraTrees models using a different seed/random state.
This modest modeling approach I completely overlooked. I had expected that stacked generalization or at least averaging with different algorithms would help. This served as a warning to never ignore the basics.
It seems the KISS-principle does apply to data science: You should never underestimate simplicity and elegance, especially when it beats your complex approaches.
About the competition
What I liked about this competition was the way prize money was awarded: Even though we ended up in the 7th spot, we still won 3rd prize. This rewards all the well-performing teams, resulting in more diverse model documentation, less statistically insignificant winners, and increased engagement and satisfaction.
I did not like that our best model was picked automatically. This favors using a shotgun approach over looking at CV and carefully selecting your final models. Though it is an improvement over the previous competitions, where there was no private leaderboard at all (thus rewarding overfitting to the leaderboard).
If their software allows for this, I’d recommend that CrowdAnalytix have us select our final models.
Medical diagnostics is, at its heart, a data problem – turning images, lab tests, patient histories, and so forth into a diagnosis and proposed intervention. Recent applied machine learning breakthroughs, especially using deep learning, have shown that computers can rapidly turn large amounts of data of this kind into deep insights, and find subtle patterns. This is the biggest opportunity for positive impact using data that I’ve seen in my 20+ years in the field. Jeremy Howard
The simple, modest and solid approach won vs. overkill solutions. Bigger is not always better. Elegance ruled over brute-force. This is a good thing: Our solution would have little problems in real-life implementation. An ensemble of 1000s of models may have won this competition, but it would not be very useful. Model interpretability is very important for medical models.
In general, it would be nice if more data like this is made available. More and better data can aid research and removes uncertainty. A lot of this data is already out there, hidden in the databases of different hospitals.
Combining this information (in a way like Enlitic is doing for medical photography) is valuable. It should be easier for patients to give permission: “Yes! Use my (anonymized) data for science!”. Besides an improved technical infrastructure such sharing initiatives need a change in mindset for all professionals (and patients) involved.
We each picked a good cause and donated the prize money to:
Thanks to my team mates, CrowdAnalytix, competition host Mohan S, Pierre Geurts et al., and everyone contributing to Scikit-learn.