Machine Learning - Stacking



Stacking, also known as stacked generalization, is an ensemble learning technique in machine learning where multiple models are combined in a hierarchical manner to improve prediction accuracy. The technique involves training a set of base models on the original training dataset, and then using the predictions of these base models as inputs to a meta-model, which is trained to make the final predictions.

The basic idea behind stacking is to leverage the strengths of multiple models by combining them in a way that compensates for their individual weaknesses. By using a diverse set of models that make different assumptions and capture different aspects of the data, we can improve the overall predictive power of the ensemble.

The stacking technique can be divided into two stages −

  • Base Model Training − In this stage, a set of base models is trained on the original training data. These models can be of any type, such as decision trees, random forests, support vector machines, or neural networks. Each model is typically trained on cross-validation folds of the training data and produces out-of-fold predictions for the held-out points, so that no model predicts on data it was shown during training.

  • Meta-model Training − In this stage, the out-of-fold predictions of the base models are used as input features to a meta-model, which learns how to combine them into a final, more accurate prediction. The meta-model can be of any type, such as linear regression or logistic regression. Using out-of-fold rather than in-sample predictions is what keeps the meta-model from overfitting to information leaked from the base models' training data, as the sketch after this list shows.
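
To make these two stages concrete, here is a minimal sketch of manual stacking on the iris dataset. The fold count, choice of base models, and variable names are illustrative assumptions, not part of a fixed recipe −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Stage 1: generate out-of-fold predictions from each base model.
# cross_val_predict ensures every prediction comes from a model that
# did not see that data point during training.
base_models = [
    RandomForestClassifier(n_estimators=10, random_state=42),
    SVC(probability=True, random_state=42),
]
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models
])

# Stage 2: train the meta-model on the base models' predictions.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y)

At prediction time, each base model (refit on the full training set) produces probabilities for the new points, and stacking those probabilities column-wise in the same way yields the input for the meta-model.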

Once the meta-model is trained, it can be used to make predictions on new data points by passing the base models' predictions on those points as its inputs. As simpler alternatives to a learned meta-model, the base models' predictions can also be combined by fixed rules, such as a simple average, a weighted average, or the class with the maximum predicted probability.
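
For instance, the following snippet (with made-up class probabilities from two hypothetical base models for a single data point) contrasts a simple average with a weighted average −

import numpy as np

# Hypothetical class probabilities from two base models for one point
p_rf = np.array([0.7, 0.2, 0.1])
p_gb = np.array([0.5, 0.4, 0.1])

avg = (p_rf + p_gb) / 2              # simple average
weighted = 0.6 * p_rf + 0.4 * p_gb   # weighted average (weights are illustrative)

print(avg.argmax(), weighted.argmax())   # index of the predicted class under each rule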

Example

Here is an example implementation of stacking in Python using scikit-learn together with the mlxtend library, which provides a scikit-learn-compatible StackingClassifier −

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define the base models
rf = RandomForestClassifier(n_estimators=10, random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# Define the meta-model
lr = LogisticRegression()

# Define the stacking classifier
stack = StackingClassifier(classifiers=[rf, gb], meta_classifier=lr)

# Use cross-validation to generate predictions for the meta-model
y_pred = cross_val_predict(stack, X, y, cv=5)

# Evaluate the performance of the stacked model
acc = accuracy_score(y, y_pred)
print(f"Accuracy: {acc}")

In this code, we first load the iris dataset and define the base models, which are a random forest and a gradient boosting classifier. We then define the meta-model, which is a logistic regression model.

We create a StackingClassifier object from the base models and the meta-model, and use cross_val_predict to generate out-of-sample predictions from the stacked ensemble. Finally, we evaluate the ensemble's performance using the accuracy score.

Output

When you execute this code, it will produce the following output −

Accuracy: 0.9666666666666667
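
scikit-learn also ships its own stacking estimator, sklearn.ensemble.StackingClassifier, which generates the cross-validated meta-features internally. Below is a minimal sketch of the same ensemble built with it; the fold count and max_iter value are illustrative choices, and the resulting accuracy may differ slightly from the run above −

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The estimators are passed as (name, estimator) pairs; the
# final_estimator plays the role of the meta-model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # folds used internally to build the meta-features
)

# Evaluate the whole stacked ensemble with an outer cross-validation loop
scores = cross_val_score(stack, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.4f}")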