Mahamudul Hasan Rubel
Lesson 12 of the AI/ML Foundations: Core Concepts & First Models course
AI/ML · June 25, 2026 · 3 min read

Training and Testing Data Splits: A Practical Guide

Learn how to use a train-test split to prevent models from memorizing the past. Discover why generalization is the key to building robust ML models.

machine learning, python, scikit-learn, data science, tutorial, ai, machine-learning

Previously in this course, we discussed Loss Functions and Model Objectives to understand how models learn to minimize error. Now that you know how a model "learns" from data, we must address the most common failure mode in machine learning: memorization.

If you train a model on your entire dataset and then evaluate it on those same rows, a flexible enough model can score near-perfectly by simply "remembering" every row rather than learning the underlying patterns, and you have no way to tell the difference. To build a system that works on new, unseen data, we must use a train-test split.

Why We Need a Train-Test Split

In software engineering, we write unit tests to ensure our code handles edge cases correctly. In machine learning, the train-test split is our primary mechanism for "testing" our model's ability to handle data it has never seen before.

When we talk about generalization, we are referring to a model's performance on new data. If a model performs perfectly on the training set but fails on the test set, it has likely overfitted: it learned the noise in your training data instead of the signal. By withholding a portion of your data (the test set) during training, you create a "blind test" that forces the model to prove it has actually learned a predictive relationship.
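To make the gap between memorization and generalization concrete, here is a minimal sketch. It assumes synthetic data from make_classification with 20% of labels randomly reassigned, and an unconstrained DecisionTreeClassifier chosen only because it memorizes easily; neither is part of the course project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 injects label noise
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With no depth limit, the tree keeps splitting until it fits every training row
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

The training score alone would suggest a flawless model; only the held-out score reveals how much of that was memorized noise.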

Implementing the Split with Scikit-Learn

Manually slicing arrays with NumPy is error-prone, especially when the rows have an inherent order (e.g., sorted by class label or target value) and must be shuffled before partitioning so that both sets reflect the same distribution.
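As a rough illustration of how a manual slice can go wrong, the sketch below builds a hypothetical 100-row dataset sorted by label and compares a naive slice with train_test_split. The toy arrays are assumptions made for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset sorted by label: 50 rows of class 0, then 50 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# Naive slice: first 80 rows for training, last 20 rows for testing
y_train_naive, y_test_naive = y[:80], y[80:]
print("Naive test classes:   ", np.unique(y_test_naive))

# train_test_split shuffles before partitioning (shuffle=True is the default)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Shuffled test classes:", np.unique(y_test))
```

The naive test set contains only class 1, so any metric computed on it says nothing about class 0.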

Scikit-Learn provides a robust utility called train_test_split that handles shuffling and partitioning in one step. Here is how you apply it to your project dataset.

PYTHON
from sklearn.model_selection import train_test_split
import pandas as pd

# Load your cleaned project dataset (from previous lessons)
df = pd.read_csv("cleaned_project_data.csv")

# Separate features (X) and target (y)
X = df.drop(columns=['target_variable'])
y = df['target_variable']

# Perform the split
# test_size=0.2 means 20% of data is held back for testing
# random_state ensures your split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]} rows")
print(f"Testing set size: {X_test.shape[0]} rows")

Choosing the Right Ratio

The "standard" split is often 80/20 or 75/25. However, the exact ratio depends on your dataset size:

  • Small datasets (< 10,000 rows): You might need a larger training set (e.g., 90/10) so the model has enough examples to learn. The trade-off is a smaller test set and therefore a noisier evaluation.
  • Large datasets (> 100,000 rows): A smaller test set (e.g., 99/1) is often sufficient, because even 1% of the data leaves more than 1,000 rows for a reasonably stable evaluation.
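One way to reason about these ratios is to estimate how noisy an accuracy score is for a given test-set size. The sketch below uses the binomial standard error, sqrt(p(1 - p) / n), assuming independent test rows; the helper name accuracy_std_error and the 0.90 accuracy are hypothetical values for illustration.

```python
import math


def accuracy_std_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


# Hypothetical true accuracy of 0.90, measured on test sets of different sizes
for n_test in (5, 100, 1_000, 10_000):
    se = accuracy_std_error(0.90, n_test)
    print(f"{n_test:>6} test rows -> accuracy 0.90 +/- {se:.3f}")
```

The uncertainty shrinks with the square root of the test-set size, which is why a 1% slice of a very large dataset can be enough while a 10% slice of a tiny one may not be.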

Hands-on Exercise

Using your project dataset from Project Dataset Initialization, implement the train_test_split as shown above.

  1. Set your random_state to 42. Any fixed integer works; 42 is just a popular choice. Fixing the seed makes the split reproducible when you run the notebook again.
  2. Check the shape of your resulting variables to confirm the 80/20 split was applied correctly.
  3. Challenge: What happens to your model's evaluation if you remove the random_state? (Hint: Try running the split multiple times and look at the first few rows of X_test).
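If you want a starting point for the challenge, here is a small sketch. It assumes a toy 100-row array in place of your project data, and held_out_rows is a hypothetical helper, not a Scikit-Learn function.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the project data: 100 rows identified by their index
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)


def held_out_rows(seed=None):
    """Return the row ids that land in the test set for a given random_state."""
    _, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    return X_test.ravel().tolist()


print("seed=42:", held_out_rows(42)[:5])
print("seed=42:", held_out_rows(42)[:5])  # same rows as the line above
print("no seed:", held_out_rows()[:5])    # typically differs between runs
```

Because the test rows change when no seed is set, any metric computed on them will shift slightly from run to run as well.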

Common Pitfalls

  • Data Leakage: This is the most dangerous trap. If you fit preprocessing steps (like imputing missing values or scaling) on the entire dataset before splitting, statistics computed from the test rows "leak" into the training process. Always split first, fit the scaler or imputer on the training set only, then apply it to both sets.
  • Ignoring Shuffling: train_test_split shuffles by default (shuffle=True), which protects you when the dataset is sorted by label or some other column. If your data is ordered by time, the opposite usually applies: you want to train on the past and test on the future, so pass shuffle=False or split by date rather than mixing future rows into the training set.
  • Small Test Sets: If your test set is too small (e.g., only 5 rows), your evaluation metrics will be highly volatile. A single outlier in the test set can drastically skew your reported accuracy.
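The data leakage pitfall is easier to avoid once you have seen the order of operations in code. This is a minimal sketch, assuming randomly generated numeric features in place of the project dataset and StandardScaler as one example of a preprocessing step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features standing in for the project dataset
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_train)

# 3. Apply the training-set statistics to both partitions
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The test means will not be exactly zero, and that is expected: the test rows are scaled with statistics they did not contribute to, just as new data would be in production.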

Recap

The train-test split is the foundation of reliable model evaluation. By holding out part of your data, you can detect memorization and get an honest estimate of how the model will generalize to real-world data. Always remember: split your data before you fit any feature engineering or scaling step.

Up next: We will discuss Data Scaling Techniques and why many algorithms, such as regularized or gradient-based linear models, work best when your features are on a similar numerical scale.
