Wisdom of the crowds and ensemble machine learning techniques are similar in principle. Could insights in group learning provide insights in machine learning and vice versa? In this article we will touch upon a variety of more (or less) related concepts and try to build an ensemble view of our own.
“Of all the offspring of Time, Error is the most ancient, and is so old and familiar an acquaintance, that Truth, when discovered, comes upon most of us like an intruder, and meets the intruder’s welcome.” – Charles Mackay (1841), Extraordinary Popular Delusions and the Madness of Crowds
Wisdom of the crowds
The concept of Wisdom of the crowds originated with the book ‘The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations’.
This book was published in 2005. The author, James Surowiecki, argues that decisions made by groups are often better than decisions made by a single member of a group.
Surowiecki lists four elements required to form a wise crowd, or a good human ensemble:
- Diversity of opinion. Each person should have private information even if it’s just an eccentric interpretation of the known facts.
- Independence. People’s opinions aren’t determined by the opinions of those around them.
- Decentralization. People are able to specialize and draw on local knowledge.
- Aggregation. Some mechanism exists for turning private judgments into a collective decision.
Oinas-Kukkonen (2008 – Network analysis and crowds of people as sources of new organisational knowledge.) follows up with these observations:
- It is possible to describe how people in a group think as a whole.
- In some cases, groups are remarkably intelligent and are often smarter than the smarter people in them.
- The three conditions for a group to be intelligent are diversity, independence and decentralization.
- The best decisions are a product of disagreement and contest.
- Too much communication can make the group as a whole less intelligent.
- Information aggregation functionality is needed.
- The right information needs to be delivered to the right people in the right place, at the right time, and in the right way.
- There is no need to chase the expert.
Sometimes crowds can throw a fit and produce terrible decisions. These extremes are known to cause failures:
- Extreme homogeneity. There needs to be diversity in a crowd to ensure enough variance in approach, process and private information.
- Extreme centralization. An hierarchical bureaucracy can close off to wisdom of lower level engineers.
- Extreme division. The United States Intelligence Community failed to prevent the 11 September 2001 attacks partly because information held by one subdivision was not accessible by another. The CIA has created Intellipedia to prevent such failures.
- Extreme imitation. When choices are visible to anyone and made in sequence, an “information cascade” can form in which only the first few decision makers gain anything by making a choice, the rest will just follow along and copy.
- Extreme emotionality. Peer pressure, herd instinct and other emotional factors can lead to collective hysteria and bad decisions.
“If others would think as hard as I did, then they would get similar results.” Isaac Newton
Ensemble learning or model averaging is used in machine learning to improve performance by grouping individual models.
An example of voting used in ensemble learning is taking the majority vote. Voting is useful for binary yes-no classification problems. Let’s say you have 5 spam filters. 3 filters predict an email is ham, 2 filters predict an email is spam. The majority vote ensemble would say this email is ham, but with a low certainty.
If all models agree they all vote the same — the ensemble is sure. If there is an even or very close vote, the ensemble is not so sure. Samples where the voting ensemble is not so sure can be interesting to train further models on, in an effort to increase their discriminatory power.
Averaging is averaging all predictions. This can be useful for regression problems. Let’s say you want to predict CPU load in a range of 0-100. You could simply average the prediction output from multiple regression models.
A weighted average is when you give certain models more or less weight. For example when a model with 0.5 weight predicts 0.75 and a model with 1.25 weight predicts 0.6 the ensemble prediction would be adjusted to ( (0.5*0.75 ) + (1.25*0.6) ) / (0.5+1.25) = ~0.64
Instead of averaging the predictions you could also rank these predictions first, and then average the ranks. This approach can help when ensembling predictions from different algorithms, which may have predictions in a different range. Our team used this approach for our 5th place in the KDD-cup 2014.
With binning an individuals model’s output is put into a bucket. For example with 10 buckets all outputs between 0.1 and 0.2 could be put into the second bin. The number and scale of bins is chosen experimentally. An ensemble could consist of counting the contents of each buckets and comparing.
Binning was used by competitors in the KDD-cup ’98. When you want probability estimates, you can not reliably average the ranked outputs. This pdf paper shows their approach with Naive Bayes and SVM. Another use for binning we noticed when Kaggler Guocong Song used binning to turn a regression problem into a classification problem.
Bagging or bootstrap aggregating is a popular approach to ensembling that often increases performance. Many classifiers are trained, each on only subsets of the dataset. Their outputs are then combined through model averaging.
Breiman (1994 – Bagging Predictors) was one of the first to show theoretically and empirically that aggregating multiple versions of an estimator could increase performance and reduce overfitting/complexity.
A random forest is an ensemble estimator that fits a number of individual decision tree classifiers on various subsets of the train dataset. A random forest uses averaging to improve the predictive accuracy and control over-fitting. For scikit-learn see SKLearn.Ensemble.RandomForest and new in version 0.15: sklearn.ensemble.BaggingClassifier. See also the PhD thesis “Understanding Random Forests” from Kaggler Gilles Louppe.
Boosting trains models by sequentially training predictors on samples based on their error rate/certainty. The main idea: Focus new experts on samples that others get wrong. The errors of earlier predictors reveal the samples that are hard to get right, later predictors then focus on getting these samples right.
AdaBoosting is an example of learning from many weaker models to increase the complexity. Gradient boosting is another ensemble boosting technique used by algo’s like Gradient Boosting Machines (GBM).
Stacking uses a model to predict the better performing models in an ensemble. Every individual model creates predictions for the samples in both the train and the test set. These are then combined into a blended train and test set. Then a predictor (for example logistic regression or GBM) is stacked on top of this and it learns which predictors are closer to the labeled truth.
A variant of stacking, employed by competitors in the Netflix Challenge, is feature-weighted linear stacking. Then the stacked model can learn which predictor is often correct for specific features in the sample. See here another Netflix competitor paper on blending and the winner solution on their ensemble blending approach.
Good ensemble practices
Good ensembles are made from models that are:
- Diverse. Different algorithms could be trained on different features and different samples. When correlation is low between two models, ensembling these models usually gives better results, opposed to adding already closely correlated models together.
- Independent. Algorithms will overfit (“memorize”) when they are trained on some data, and their predictions are used to train on that same data. (stratified) K-fold training should be employed when stacking.
- Decentralized. Algorithms can be trained to focus on a small aspect of a classification or regression problem.
A gambler, frustrated by persistent horse-racing losses and envious of his friends’ winnings, decides to allow a group of his fellow gamblers to make bets on his behalf.
He decides he will wager a fixed sum of money in every race, but that he will apportion his money among his friends based on how well they are doing.
Certainly, if he knew psychically ahead of time which of his friends would win the most, he would naturally have that friend handle all his wagers.
Lacking such clairvoyance, however, he attempts to allocate each race’s wager in such a way that his total winnings for the season will be reasonably close to what he would have won had he bet everything with the luckiest of his friends. The 1996 paper introducing AdaBoost “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting“.
The Delphi Method
The Delphi Method was developed at the beginning of the Cold War to forecast the impact of technology on warfare. In 1944, General Henry H. Arnold ordered the first creation of such a forecasting report for the U.S. Army Air Corps. Multiple approaches were tried and tested, and in 1968 the Delphi Method was created at RAND. It is in use as of today.
- Analysis of the Future – The Delphi Method
- An Experimental Study of Group Opinion – The Delphi Method
- On the Epistemology of the Inexact Sciences
In the Delphi Method questionnaires are given for two or multiple rounds. After every round a summery is given where the participants provide reasoning for their judgments. The hope is that the group will converge iteratively to the optimal judgment.
The Delphi method relies on the assumption that group judgments are more valid than individual judgments. And structured group judgments are better than disorganized groups.
Characteristic for the Delphi Method is:
- The anonymity of all participants. This prevents a “bandwagon effect” and peer pressure.
- Structured information flow. A panel director makes sure that problems arising from group dynamics are dealt with, controlling the interactions between the participants.
- Regular feedback. Participants comment on their own forecasts, the responses of others and on the progress of the panel as a whole.
- Role of the facilitator. The person coordinating the Delphi method is usually known as a facilitator or Leader. He or she identifies conflicting viewpoints and works to creating a group consensus.
The Good Judgment Project
The Good Judgment project is an experiment set up by psychologists and intelligence experts. The experiment lets civilians answer questions like “Will North Korea launch a new multistage missile before August 10, 2014?”.
When certain civilians turn out to be correct more often than their (expert) peers, their answers are given increasingly more weight. After a while, one can identify a smaller team of top civilians that may collectively (or even individually) outperform CIA analysts.
“I’m just a pharmacist. Nobody cares about me, nobody knows my name, I don’t have a professional reputation at stake. And it’s this anonymity which actually gives me freedom to make true forecasts.” Elaine Rich on being smarter than a CIA analyst
Decision theory in economics, psychology, philosophy, mathematics, and statistics is concerned with identifying the values, uncertainties and other issues relevant in a given decision, its rationality, and the resulting optimal decision.
Garbage Can Model of Organizational Choice
In 1972 a novel approach to decision theory was proposed by March, Cohen & Olsen. Instead of the static and rigid problem-solver-solution path, they wrote a paper that disconnected problems from solutions and decision makers.
Giant institutions often struggle to act in a timely or effective manner. Take for instance a university. A problem may arise: students complain about lack of sporting facilities. After a lot of actions a final decision is made: build a basketball court on campus.
A year later after budgeting, the construction begins and new problems arise: recalculations show it’s not economically viable, unsafe work traffic, noise complaints. The new decision makers in office are now faced with an older problem and an unrewarding solution.
Meanwhile the students have already solved this problem by joining the nearby fitness gym (and really don’t even like basketball anymore). What does it take for an organization to make a rational choice: To arrange a student discount with the fitness facility and halt construction?
This document describes a proposed design for a globally distributed artificial general intelligence (AGI) for the purpose of automating the world economy. The estimated value is on the order of US $1 quadrillion. The cost of a solution would be of the same order if we assume a million-fold decrease in the costs of computation, memory, and bandwidth, solutions to the natural language, speech, and vision problems, and an environment of pervasive public surveillance.
The high cost implies decentralized ownership and a funding model that rewards intelligence and usefulness in a hostile environment where information has negative value and owners compete for attention, reputation, and resources.A Proposed Design for Distributed Artificial General Intelligence
Marcus Hutter has put up a prize of 50.000$ for the best compression algorithm. Some AI researchers, like Matt Mahoney, view compression as related to understanding. If you can impute words in a sentence, you understand that sentence. If you can impute words in a sentence you won’t have to store these words. Optimal compression of text will require a deep understanding of semantics, syntax, style, culture etc. Better compressors bring us closer to AI.
One of the best compression libraries is the PAQ series. The code trades computational cost and memory vs. a high compression ratio. The PAQ series uses what is known as context mixing: Two or more statistical models are combined (using averaging/bagging, stacking with a Random Forest, Bayesian updates, adaptive weighing, or, in more recent versions neural networks) to form better predictions.
“All data are created equal but some data are more alike than others” Clustering by compression
Robots on Mars
In 1990 Luc Steels wrote the paper ‘Cooperation between distributed agents through self-organisation‘. In the paper he presents a case study for resource gathering on a distant planet using autonomous agents.
For mobile robots to perform well as a group on a distant planet the following criteria are relevant:
- Robustness: The system should be able to recover when a certain action is not correctly executed. When a sample is not picked up, while the command to pick it up was given, this should not lead to further malfunction.
- Graceful performance degradation: Loss of one robot should not be fatal for the group.
- Flexibility: When conditions in the environment change, this should not cause an incapability to function. When resources become scarce in one area, the group system should adapt.
- Hardware, Communication and Cognitive economy: This refers to the trade-offs between increased complexity and the increased energy needed to keep a system running. Less complexity is better. We see this in human communication too: text strings can usually be compressed by a lot, they are not complex. Using short and simple words for often used expressions saves us energy.
- Predictability: This refers to the amount of regularity that has to be present in the environment for the total system to keep functioning. An efficient group is able to cope with unpredictable situations.
- Prior knowledge: The amount of information that has to be known for the system to operate. For example, does a map of the terrain need to generated first? Requiring less prior knowledge is better.
Luc Steels envisioned agents operating in a group, much like ants do when gathering resources: Robots could leave little pellets of radioactive material behind, to create slowly decaying trails for other robots to follow.
We find that whole communities suddenly fix their minds upon one object, and go mad in its pursuit; that millions of people become simultaneously impressed with one delusion, and run after it, till their attention is caught by some new folly more captivating than the first.Charles Mackay (1841), Extraordinary Popular Delusions and the Madness of Crowds
The problem of gathering resources in the most efficient way is an old (and difficult) one. From 5 slot machines with different payouts, how do you find out the best slot machine to play?
One technique is an epsilon-greedy Multi-Armed Bandit (MAB). And epsilon-greedy MAB starts out by exploring: trying many different slot machines. The exploiting rate is still low: it will not try the same slot machine over and over. After it has played some more the exploiting rate goes up. In the end it will play the slot machine with the highest expected payout.
You can set the MAB to be flexible to environment changes. Simply add a decay factor to the memory of the payouts. And don’t be too greedy: always explore a little to find good alternatives.
[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.Peter Whittle
Swarm intelligence is the collective behavior of decentralized, self-organized systems, natural or artificial. The concept was introduced by Beni & Wang in a 1989 paper on cellular robotic systems.
Ant colony optimization algorithms can solve the travelling salesman problem (finding the shortest paths between a number of resources). Particle Swarm Optimization uses insights from social behavior to create very efficient algorithms, more recently improved with local search.
“Ants are the history of social organization and the future of computers.” Kevin Kelly
Kaggle connects data science problems with data scientists. When you ensemble the models made by individual team members this team model is often better than any individual member. After a contest is over, incorporating the shared approaches from other teams often increases this score even more. Human decisions on the team model, such as which submission to select for final ranking, or what the weight of individual models should be, is often put to a vote.
Amazon Mechanical Turk
With Amazon’s Mechanical Turk (artificial artificial intelligence) one can automate crowdsourcing all sorts of tasks. Even tasks that may seem impossible for individuals to get right. Here is an experiment using Mechanical Turk to transcribe blurry text.
Can you read this text? The ensemble of Mechanical Turks made just 1 mistake.
“Winners are generally not domain experts but machine learning experts, we often have people working on 2-3 competitions at the same time” Howard on Kaggle
The “China Brain” is a thought experiment (philosophy of mind) to try to ridicule the thought that non-biological organisms can have feelings and cognition. If you give the entire nation of China two-way radio’s and make them relay signals, just like neurons would do, would this form a consciousness or mental state?
In Consciousness Explained Dennett argues that the China Brain would indeed have mental states. He and other functionalists do not agree that biological neurons are the only way to create an information processing intelligence.
In the multiple drafts model our consciousness is distributed and many parts contribute together in parallel. The individual consciousness — a centralized theater where consciousness resides is an illusion. Our consciousness could be a group effort already, where some parts (“observing light and dark”) are faster (less complex, less energy consuming) to execute and parse, than others (“abstract thought and self-realization”).
A China Brain could be viewed as a supermeme, a discarnate information processing being (A group mind, memeplex, or egregore). Where the judgment to go to war as a country is not made by a few individuals at the top, and collective group judgments outperform the judgments of individual members.
“There is a species of primate in South America more gregarious than most other mammals, with a curious behavior.The members of this species often gather in groups, large and small, and in the course of their mutual chattering , under a wide variety of circumstances, they are induced to engage in bouts of involuntary, convulsive respiration, a sort of loud, helpless, mutually reinforcing group panting that sometimes is so severe as to incapacitate them.
Far from being aversive, however, these attacks seem to be sought out by most members of the species, some of whom even appear to be addicted to them.
…the species is Homo sapiens, and the behavior is laughter.” Daniel Dennett – Consciousness Explained
The intro image came from the Microsoft Research paper: Low-Cost Audiance Polling using Computer Vision.