Learn how to boost your model's performance by combining multiple learners. We cover voting, bagging, and how Random Forest delivers robust predictions.
Previously in this course, we explored managing model complexity to prevent overfitting in individual algorithms. In this lesson, we shift our focus from optimizing a single model to the power of the crowd: ensemble learning.
By combining multiple base models into a single predictive engine, we can often achieve higher accuracy and better stability than any individual model could provide on its own.
At its core, an ensemble is a group of models working together. Think of it like a committee: if you ask ten experts to solve a problem, the consensus is usually more reliable than the opinion of a single person who might have a specific bias.
In machine learning, we typically use two types of voting:
Soft voting is generally preferred because it accounts for how confident each model is in its decision, rather than just the final binary result.
If we just train ten identical models on the same data, they will all make the same mistakes. To make an ensemble effective, we need diversity.
Bagging (short for Bootstrap Aggregating) is the technique we use to force this diversity. Here is the process:
The Random Forest is the most famous implementation of bagging. It builds a collection of Decision Trees. To ensure the trees are truly different, Random Forest doesn't just sample the rows; it also randomly selects a subset of features at each split in the tree. This prevents one dominant feature from making every tree look identical.
Let’s see how to implement this in scikit-learn. We’ll use the RandomForestClassifier on our ongoing project dataset.
PYTHONfrom sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score # Assume X and y are already prepared from our previous lessons X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Instantiate the ensemble # n_estimators is the number of trees in the forest rf_model = RandomForestClassifier(n_estimators=100, random_state=42) # Train the model rf_model.fit(X_train, y_train) # Predict predictions = rf_model.predict(X_test) print(f"Accuracy: {accuracy_score(y_test, predictions)}")
By setting n_estimators=100, we are training 100 different decision trees. Because of the bagging and feature randomness, each tree learns a slightly different perspective of the data, leading to a much more robust final prediction.
RandomForestClassifier or RandomForestRegressor.n_estimators parameter to 10, 50, and 200. How does the training time change as you increase this number?RandomForest is powerful, its default settings aren't always optimal. Always perform hyperparameter tuning as discussed in implementing grid search to find the best max_depth and n_estimators.Pipeline is still best practice for reproducibility, especially if you decide to swap the ensemble for a distance-based model later.Ensemble methods allow us to combine the strengths of multiple models to reduce variance and improve generalization. By using bagging—specifically within the Random Forest algorithm—we create a collection of diverse, uncorrelated models that act as a safety net against the errors of any single tree. As you continue your final project review, consider how these ensemble techniques might help you squeeze out that last bit of performance.
Up next: We will explore how to automate feature pruning using Recursive Feature Elimination with Cross-Validation (RFECV).
Master advanced hyperparameter tuning with RandomizedSearchCV and Bayesian optimization. Learn to scale your experiments efficiently for better ML models.
Read moreLearn how to demystify your models using linear coefficients and SHAP values. Understand why transparency is essential for trust and debugging in production.
Ensemble Methods Overview