Learn to initialize your ML project dataset with a rigorous data audit and cleaning workflow, ensuring your data is ready for predictive modeling.
Previously in this course, we covered Loading and Inspecting Datasets with Pandas: A Practical Guide. Now that you know how to peek into a CSV, this lesson moves from inspection to action: we are initializing our project dataset by performing a formal data audit and saving a cleaned version to serve as the "ground truth" for all subsequent modeling.
In a professional environment, you never work directly on raw data. Instead, you establish an initialization script that converts the "dirty" source into a "clean" foundation. This practice ensures reproducibility: every experiment starts from the same cleaned file rather than from ad hoc edits to the source.
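A data audit is not just looking at the head() of a DataFrame. It is a systematic process of identifying the "health" of your features. Before we build a single model, we must answer three questions, which the audit below addresses: are any values missing, is each column stored with the right data type, and are any rows duplicated?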
We will use a hypothetical raw_housing_data.csv for our project. We will load the data, perform the audit, and save a clean_housing_data.csv.
```python
import pandas as pd
import numpy as np

# 1. Load the raw project dataset
df = pd.read_csv('raw_housing_data.csv')


# 2. Perform the initial data audit
def audit_dataset(df):
    print("--- Dataset Audit ---")
    print(f"Shape: {df.shape}")
    print("\nMissing Values:\n", df.isnull().sum())
    print("\nData Types:\n", df.dtypes)
    # Check for duplicates
    print(f"\nDuplicate rows: {df.duplicated().sum()}")


audit_dataset(df)

# 3. Basic Cleaning Actions
# Drop duplicates and ensure correct types.
# .copy() makes df_clean explicitly independent of df before we assign to it.
df_clean = df.drop_duplicates().copy()
df_clean['date'] = pd.to_datetime(df_clean['date'])

# 4. Save the cleaned version for future lessons
df_clean.to_csv('clean_housing_data.csv', index=False)
print("\nSuccess: Cleaned data saved to 'clean_housing_data.csv'")
```
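The audit function above prints its findings. A variant that returns them as a table can be easier to scan and to save alongside the project. The following is a minimal sketch, not part of the lesson's script: `audit_summary` is a hypothetical helper, and the small in-memory frame (with assumed `price` and `date` columns) stands in for the housing data.

```python
import numpy as np
import pandas as pd


def audit_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing percent, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical stand-in for the housing data (column names are assumptions)
sample = pd.DataFrame({
    "price": [250000.0, np.nan, 310000.0, 310000.0],
    "date": ["2023-01-05", "2023-02-11", "2023-03-20", "2023-03-20"],
})

report = audit_summary(sample)
print(report)
print(f"Duplicate rows: {sample.duplicated().sum()}")
```

Returning a DataFrame rather than printing means the same report can be filtered (for example, to columns with any missing values) or written to disk next to the cleaned file.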
This workflow ensures that every time you start a new notebook for The Machine Learning Workflow: From Data to Deployment, you are importing a consistent, verified snapshot of your data.
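One detail worth keeping in mind when importing that snapshot: a CSV file does not store data types, so the datetime conversion done in the script is not preserved on disk and the `date` column has to be parsed again on load. The sketch below shows one possible way to reload the file and fail fast if it no longer matches expectations. `load_clean_snapshot` is a hypothetical helper, and the demo writes a throwaway file to a temporary directory rather than touching real project data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_clean_snapshot(path):
    """Load the cleaned CSV and raise if it no longer matches expectations."""
    # CSV does not store dtypes, so the date column is parsed again here
    df = pd.read_csv(path, parse_dates=["date"])
    assert not df.duplicated().any(), "cleaned file contains duplicate rows"
    assert pd.api.types.is_datetime64_any_dtype(df["date"]), "date column did not parse"
    return df


# Demo with a throwaway file (values are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "clean_housing_data.csv"
    pd.DataFrame({
        "price": [250000, 310000],
        "date": pd.to_datetime(["2023-01-05", "2023-03-20"]),
    }).to_csv(demo_path, index=False)
    snapshot = load_clean_snapshot(demo_path)

print(snapshot.dtypes)
```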
Run your script, initialize_data.py.
Confirm that clean_data.csv appears in your directory.
Avoid hardcoded absolute paths such as C:/Users/Name/Desktop/..., which will break as soon as you share the code or move to a server.
Never modify the raw file. Your cleaning script should be an idempotent transformation that reads the raw file and writes to a new destination; never overwrite your source of truth.
A numeric column stored as the object type will prevent you from performing math on it later. Always force conversion or clean the symbols first.

In this lesson, we established the "Initialization" phase of our project. By performing a rigorous data audit and saving a clean version of our data, we ensure that subsequent lessons, like those using Introduction to NumPy for Data Handling: Arrays and Vectorization, have a reliable input. A clean data workflow is the difference between a project that runs smoothly and one that is plagued by "dirty" bugs in production.
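As a concrete illustration of the path and data-type pitfalls discussed in this lesson, here is a minimal sketch. The `data` folder name, the `to_numeric_price` helper, and the sample price strings are all assumptions made for the example, not part of the lesson's dataset.

```python
from pathlib import Path

import pandas as pd

# Paths relative to the working directory instead of an absolute, machine-specific path.
# The "data" folder name is an assumption for this sketch.
DATA_DIR = Path("data")
RAW_PATH = DATA_DIR / "raw_housing_data.csv"      # read-only source of truth
CLEAN_PATH = DATA_DIR / "clean_housing_data.csv"  # separate destination for output


def to_numeric_price(series: pd.Series) -> pd.Series:
    """Strip currency symbols, commas, and whitespace, then convert to numbers."""
    stripped = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(stripped, errors="coerce")


# Made-up values showing how symbols force a column to be read as text
raw_prices = pd.Series(["$250,000", "310000", "N/A"])
prices = to_numeric_price(raw_prices)
print(prices)
```

Coercing unparseable entries to NaN keeps the conversion from crashing, but it also hides bad values, so it is worth re-running the missing-value audit after a conversion like this.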
Up next: We will begin our predictive work by exploring the Mechanics of Linear Regression.
Project Dataset Initialization