Learn how to use a train-test split to prevent models from memorizing the past. Discover why generalization is the key to building robust ML models.
Previously in this course, we discussed Loss Functions and Model Objectives to understand how models learn to minimize error. Now that you know how a model "learns" from data, we must address the most common failure mode in machine learning: memorization.
If you train and evaluate a model on your entire dataset, a sufficiently flexible model can score almost perfectly by simply "remembering" every row rather than learning the underlying patterns, and you will have no way to tell the difference. To build a system that works on new, unseen data, we must use a train-test split.
In software engineering, we write unit tests to ensure our code handles edge cases correctly. In machine learning, the train-test split is our primary mechanism for "testing" our model's ability to handle data it has never seen before.
When we talk about generalization, we are referring to a model's performance on new data. If a model performs perfectly on the training set but fails on the test set, it has likely overfitted: it learned the noise in your training data instead of the signal. By withholding a portion of your data (the test set) during training, you create a "blind test" that forces the model to prove it has actually learned a predictive relationship.
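To make the gap between training and test performance concrete, here is a minimal sketch. It assumes a synthetic dataset from make_classification with deliberately noisy labels and an unconstrained decision tree; neither comes from the course project, they are chosen only because such a tree can memorize its training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 assigns
# random labels to about 20% of the rows, which simulates noise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With no depth limit, the tree keeps splitting until it classifies
# every training row correctly, noise included.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

The training score alone would suggest a flawless model. Only the held-out rows reveal how much of that score came from memorized noise.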
Manually slicing arrays with NumPy is prone to error, especially when you need to ensure the data is shuffled to avoid bias (e.g., if your data is sorted by date).
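As a rough illustration of that pitfall, the sketch below uses a toy array sorted by label (an assumption made for the example, not the project data). Slicing off the last 20% without shuffling produces a test set that contains only one class, while shuffling the row indices first avoids this.

```python
import numpy as np

# Toy data sorted by label: 50 rows of class 0 followed by 50 rows of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# Naive split: the last 20% of rows become the test set, with no shuffling.
cut = int(len(X) * 0.8)
y_train_naive, y_test_naive = y[:cut], y[cut:]
print("Naive test labels:   ", np.unique(y_test_naive))

# Shuffled split: permute the row indices first, then slice.
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print("Shuffled test labels:", np.unique(y_test))
```

Even this short version has several places to go wrong: forgetting the permutation, indexing X and y with different indices, or miscounting the cut point.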
scikit-learn provides a robust utility called train_test_split that handles shuffling and partitioning in one step. Here is how you apply it to your project dataset.
```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Load your cleaned project dataset (from previous lessons)
df = pd.read_csv("cleaned_project_data.csv")

# Separate features (X) and target (y)
X = df.drop(columns=['target_variable'])
y = df['target_variable']

# Perform the split
# test_size=0.2 means 20% of data is held back for testing
# random_state ensures your split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]} rows")
print(f"Testing set size: {X_test.shape[0]} rows")
```
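To see what random_state buys you, the following sketch compares which rows land in the test set across repeated splits. The small DataFrame and the helper held_out_rows are hypothetical stand-ins so the example runs without the project CSV.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the project dataset: 100 rows, one feature, one target.
df = pd.DataFrame({"feature": np.arange(100), "target_variable": np.arange(100) % 2})
X = df.drop(columns=["target_variable"])
y = df["target_variable"]


def held_out_rows(seed):
    """Return the row labels that land in the test set for a given seed."""
    _, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    return list(X_test.index)


same_seed_matches = held_out_rows(42) == held_out_rows(42)
different_seed_matches = held_out_rows(42) == held_out_rows(7)
print("Same seed gives the same test rows:     ", same_seed_matches)
print("Different seed gives the same test rows:", different_seed_matches)
```

The value 42 has no special meaning here. Any fixed integer gives a repeatable split.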
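The "standard" split is often 80/20 or 75/25. However, the right ratio depends on your dataset size. What matters is the absolute number of test rows: a small dataset may need a larger test fraction to get a stable estimate, while a very large one can hold out a smaller fraction and still have plenty of test rows.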
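One way to reason about the ratio is to look at the absolute row counts it produces. The sketch below uses placeholder arrays of three hypothetical sizes (picked only for the arithmetic) and records how many rows each test_size leaves for training and testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset sizes, chosen only to illustrate the arithmetic.
sizes = {}
for n_rows in (200, 1_000, 100_000):
    X = np.zeros((n_rows, 1))
    y = np.zeros(n_rows)
    for test_size in (0.2, 0.25):
        X_train, X_test, _, _ = train_test_split(
            X, y, test_size=test_size, random_state=42
        )
        sizes[(n_rows, test_size)] = (len(X_train), len(X_test))
        print(f"{n_rows:>7} rows, test_size={test_size}: "
              f"{len(X_train)} train / {len(X_test)} test")
```

With 200 rows, a 20% split leaves only 40 rows to evaluate on, so the test score can swing noticeably from one split to another. With 100,000 rows, the same 20% yields 20,000 test rows.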
Using your project dataset from Project Dataset Initialization, implement the train_test_split as shown above.
Set random_state to 42. Any fixed integer works; 42 is simply a common convention. Fixing it ensures that your results are reproducible if you run the notebook again.
What happens if you remove random_state? (Hint: Try running the split multiple times and look at the first few rows of X_test.)

The train-test split is the foundation of reliable model evaluation. By partitioning your data, you guard against memorization and can check whether your model is capable of real-world generalization. Always remember: split your data before you fit any scaler or other data-dependent feature engineering step, so that no information from the test set leaks into training.
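The "split first" rule can be sketched as follows, assuming StandardScaler as the preprocessing step and randomly generated features in place of the project data. The scaler learns its mean and standard deviation from the training rows only, and those same statistics are then applied to the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated features and labels (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y = rng.integers(0, 2, size=500)

# 1. Split first, so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training statistics to transform the test rows.
X_test_scaled = scaler.transform(X_test)

print("Scaler means (from training rows):", scaler.mean_.round(2))
print("Scaled test column means:         ", X_test_scaled.mean(axis=0).round(2))
```

The scaled test columns will usually not have a mean of exactly zero. That is expected: the test set is being measured against statistics it did not contribute to.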
Up next: We will discuss Data Scaling Techniques and why algorithms like Linear Regression require your features to be on a similar numerical scale.
Training and Testing Data Splits