Ask a Data Scientist: Ensemble Methods

Welcome back to our series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideAI News feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a data science question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insideAINews.com. This week’s question is from a reader who asks about ensemble methods and how you use them.

Q: What are ensemble methods? How do you use them?

A: Ensemble methods have become an important addition to the data scientist’s toolbox. The technique is so popular that it is often the strategy of choice in machine learning competitions such as Kaggle, where most of the winning solutions are based on ensemble methods. Basically, ensemble methods are statistical learning algorithms that achieve enhanced predictive performance by constructing a set of classifiers, usually small decision trees, and then classifying new data points by taking a weighted vote of their predictions.
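To make the “weighted vote” idea concrete, here is a minimal sketch in Python using scikit-learn. The article doesn’t prescribe a particular library, and the models, dataset, and vote weights below are illustrative choices only:

```python
# A minimal sketch of an ensemble that classifies by a weighted vote.
# The base models, dataset, and weights are illustrative, not prescriptive.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different base classifiers, combined by a weighted "soft" vote
# over their predicted class probabilities.
vote = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),  # a small decision tree
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
    weights=[2, 1, 1],  # the tree's vote counts double (arbitrary choice)
)
print("mean cross-validated accuracy:", cross_val_score(vote, X, y, cv=5).mean())
```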

To conceptualize the ensemble process, think of how groups of people can often make better decisions than individuals, especially when each group member brings a different bias to the table. The same is true in machine learning. A nice characteristic of ensembles is that you can often get away with using much simpler learners and still achieve great performance. In essence, an ensemble is a supervised learning technique that combines many weak learners in an attempt to produce a single strong learner.

Evaluating the prediction produced by an ensemble is typically more computationally intensive than evaluating the prediction of a single model, so in a sense ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. In practice, fast learning algorithms such as classification trees are commonly used with ensembles. A good example of how ensemble methods are commonly used to solve tough data science problems is the random forest algorithm.

There are many types of ensemble methods. The original ensemble method is Bayesian averaging. Unfortunately, because of its computational complexity, this method is practical only for the simplest problems.

One popular ensemble method that’s frequently used in practice is called bagging (more formally, bootstrap aggregating), which involves having each model in the ensemble vote with equal weight. In order to promote diversity among the models, bagging trains each model in the ensemble on a randomly drawn subset of the training set. Aggregating these diverse models reduces variance and helps to avoid overfitting. As an example, the random forest algorithm combines collections of random decision trees (i.e., forests) with bagging to achieve very high classification accuracy. If you have enough trees, the poorly performing ones wash out as noise, and only the good trees have an effect on the final classification.
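A brief sketch of bagging with scikit-learn is shown below; the dataset and parameter values are arbitrary, and by default each member of the ensemble is a decision tree:

```python
# An illustrative bagging example: each model is trained on a bootstrap
# sample of the training data and the models vote with equal weight.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default, each member of the ensemble is a decision tree fit on a
# bootstrap sample (drawn with replacement) of the training set.
bag = BaggingClassifier(n_estimators=100, random_state=0)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))
```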

The power of bagging is similar to posing a question to the audience on “Who Wants to Be a Millionaire.” Each individual audience member has some information about what the correct answer is, and has a better-than-random chance of guessing it. Even if each individual audience member has a fairly high probability of guessing incorrectly, it is still unlikely that the plurality of the audience will be wrong. The analogy also points out some limitations of bagging. First, the individual classifiers have to be informative. Asking the audience doesn’t work if everyone in the audience has no idea and guesses completely at random. Second, bagging doesn’t gain you much, if anything, when some of the composite classifiers are already strong predictors on their own. If you knew that one person in the audience had a high probability of getting the answer correct, you’d rather ask that one person than poll the whole audience. Bagging is most appropriate in situations where there are many “ok” classifiers, but it’s difficult to find a single good one.

Another ensemble framework is called boosting, which involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. Boosting starts by fitting a simple classifier on the original data, giving each observation an equal weight. After fitting the first model, prediction errors are computed and the data are reweighted so that previously misclassified examples receive a higher weight than correctly classified examples. This reweighting is repeated at each step, so that each new model concentrates on the observations the previous models got wrong. While bagging works by reducing variance, boosting works by incrementally refining the decision boundary of the aggregate classifier (bagging also improves the decision boundary; boosting just does so in a more directed way).
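That reweighting loop can be sketched directly. The code below is a stripped-down, illustrative version of the AdaBoost-style update discussed next; the number of rounds, the dataset, and the stump depth are arbitrary choices:

```python
# A minimal sketch of boosting's reweighting idea using decision stumps.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_signed = np.where(y == 1, 1, -1)           # work with labels in {-1, +1}

n = len(y_signed)
weights = np.full(n, 1.0 / n)                # start with equal weights
stumps, alphas = [], []

for _ in range(10):                          # 10 boosting rounds (arbitrary)
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y_signed, sample_weight=weights)
    pred = stump.predict(X)

    err = np.sum(weights[pred != y_signed])              # weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))      # this stump's vote weight

    # Misclassified points get larger weights, correct ones smaller,
    # so the next stump focuses on the hard cases.
    weights *= np.exp(-alpha * y_signed * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: the sign of the weighted vote over all stumps.
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(ensemble_pred == y_signed))
```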

In some cases, boosting has been shown to yield better predictive accuracy than bagging, but it also tends to be more likely to overfit the training data. By far the most common implementation of boosting is AdaBoost (the adaptive boosting algorithm). AdaBoost is popular because it was one of the first practical boosting algorithms: it runs in polynomial time and doesn’t require you to specify a large number of parameters. Its name reflects this last advantage – the algorithm automatically adapts to the data it’s given.
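In practice you would rarely hand-roll the loop above; scikit-learn ships an AdaBoostClassifier that handles the reweighting for you. A short usage example (the dataset and parameter values are illustrative only):

```python
# An illustrative AdaBoost example; by default the base model is a
# depth-1 decision tree ("stump"). Parameter values are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print("mean cross-validated accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```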

Blending is the simplest and most intuitive way to combine different models. Blending involves taking the predictions of several composite models and feeding them into a larger model, such as a second-stage linear or logistic regression. Blending can be used with any type of composite model, but it is most appropriate when the composite models are fewer in number and more complex than in boosting or bagging. The properties of a blend depend on exactly how the models are combined. If a simple linear regression is used, blending is equivalent to taking a weighted average of all the predictions and works simply by reducing variance. If the models are combined with a logistic regression, a neural network, or even a linear regression with interaction terms, then the composite models can have multiplicative effects on one another.
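One way to illustrate this two-stage idea is scikit-learn’s StackingClassifier, which fits a second-stage model (here a logistic regression) on the composite models’ predictions. Stacking is a close cousin of the blending described above, and the model choices below are purely illustrative:

```python
# An illustrative two-stage combination: a random forest and an SVM
# feed a second-stage logistic regression that learns how to combine them.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

blend = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-stage model
)
print("mean cross-validated accuracy:", cross_val_score(blend, X, y, cv=5).mean())
```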

Whether it’s appropriate to use an ensemble, and what kind, depends on the specific problem you’re trying to solve. Throwing lots of models at a problem and combining all of their predictions is rarely a good substitute for domain knowledge, but it can be quite useful when domain knowledge or highly salient features are in short supply. Ensembling also tends to make models harder to interpret, though usually still easier to interpret than black-box models like neural networks.

A good way to get up to speed with the power of ensemble methods is by experimenting with the open source R statistical environment. R has a randomForest package for classification and regression that’s a port of the original Fortran code by Leo Breiman and Adele Cutler.
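For readers who prefer Python, a roughly equivalent starting point is scikit-learn’s RandomForestClassifier; this is an alternative to the R package mentioned above, and the dataset and parameters below are illustrative only:

```python
# An illustrative random forest experiment in Python (alternative to the
# R randomForest package mentioned in the article).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print("mean cross-validated accuracy:", scores.mean())
```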


Data Scientist: Dr. Michael Alton is a Data Scientist with Intel’s Graph Analytics Operation working on applying analytics to network security. Alton received his PhD from The California Institute of Technology and has worked in analytic consulting to solve a variety of business problems such as financial risk management, fraud detection, and churn analysis.
