Introduction to NumPy for Data Handling: Arrays and Vectorization

Master NumPy arrays to handle numerical data efficiently. Learn how to perform fast element-wise operations and indexing for your ML projects.


Previously in this course, we covered the Machine Learning Workflow and ensured your Python ML Environment was ready for heavy lifting. Now, we move from theory to the engine room: NumPy.

In machine learning, you don't process data one record at a time using Python lists. That is slow, memory-intensive, and impractical. Instead, we use NumPy, the library that provides high-performance, multidimensional arrays. When you understand NumPy, you understand how data actually flows through your models.

Why NumPy? First Principles of Vectorization

Python lists are flexible: they can hold integers, strings, and objects in the same container. This flexibility comes at a cost: a list stores references to separate Python objects, so every operation on an element has to follow a pointer and check the object's type at runtime.

NumPy arrays are different. They are homogeneous, meaning every element must be the same type (usually a 64-bit float or integer). Because the type is fixed, NumPy stores these elements in contiguous memory blocks. This allows for vectorization: the ability to perform operations on entire arrays at once without writing explicit for loops.
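To make the fixed-type idea concrete, here is a minimal sketch that inspects an array's dtype and memory layout. The variable names (mixed, ints, floats) are illustrative only, and the exact default integer type can vary by platform and NumPy version.

```python
import numpy as np

# A Python list can mix types freely.
mixed = [1, "two", 3.0]

# A NumPy array settles on one dtype for every element.
ints = np.array([1, 2, 3])
floats = np.array([1, 2.5, 3])  # the integers are upcast to float64

print(ints.dtype)                    # a platform-dependent integer type, typically int64
print(floats.dtype)                  # float64
print(floats.itemsize)               # 8 bytes per element
print(floats.nbytes)                 # 24 bytes for the three elements
print(floats.flags["C_CONTIGUOUS"])  # True: stored as one contiguous block
```

Because every element has the same size, NumPy can locate any element with simple arithmetic on the array's starting address rather than chasing per-object references.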

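When you multiply an array by 2, NumPy runs the loop over the elements in compiled, optimized C code instead of in the Python interpreter. On large arrays this is often one or two orders of magnitude faster than an equivalent Python loop, and across a full training pipeline those savings add up quickly.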
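A rough way to see the gap yourself is to time both approaches. The sketch below is an informal comparison, not a rigorous benchmark: the array size n is an arbitrary choice, and the measured times will depend on your machine.

```python
import time
import numpy as np

n = 1_000_000
values = list(range(n))  # plain Python list
arr = np.arange(n)       # NumPy array with the same values

# Explicit Python-level loop (list comprehension)
start = time.perf_counter()
doubled_list = [v * 2 for v in values]
loop_seconds = time.perf_counter() - start

# Vectorized: one expression, the loop runs in compiled code
start = time.perf_counter()
doubled_arr = arr * 2
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```

Both approaches produce the same values; only the place where the loop runs is different.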

Creating and Manipulating Arrays


To get started, you'll need to import the library. The convention is import numpy as np.

1. Creating Arrays

You can create arrays from standard Python lists or use built-in functions for common patterns.


PYTHON
import numpy as np

# From a list
data = np.array([1, 2, 3, 4])

# A 2x3 matrix of zeros (common for initializing weights)
zeros = np.zeros((2, 3))

# Evenly spaced numbers: start at 0, stop before 10, step by 2
range_arr = np.arange(0, 10, 2)  # values: 0, 2, 4, 6, 8
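A few other constructors follow the same pattern. This is a short sketch with illustrative variable names (ones, grid, table); it also shows how to inspect the shape of what you created.

```python
import numpy as np

ones = np.ones((2, 2))              # 2x2 array filled with 1.0
grid = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points, both endpoints included
table = np.arange(6).reshape(2, 3)  # values 0..5 arranged as 2 rows, 3 columns

print(grid)         # values: 0, 0.25, 0.5, 0.75, 1
print(table.shape)  # (2, 3)
print(table.ndim)   # 2
```

Note the difference between the two range helpers: np.arange takes a step size and excludes the stop value, while np.linspace takes a number of points and includes the stop value by default.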

2. Element-wise Arithmetic

Vectorization means you treat the array as a single entity. If you have a dataset of feature values, you can normalize them or scale them in one line.


PYTHON
prices = np.array([100, 200, 300])

# Add 50 to every element simultaneously
taxed_prices = prices + 50 

# Multiply by a scalar
discounted = prices * 0.9

# Element-wise multiplication of two arrays of the same shape
base = np.array([10, 20, 30])
multiplier = np.array([1, 2, 3])
result = base * multiplier  # values: 10, 40, 90
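The one-line normalization and scaling mentioned above could look like the following sketch. Min-max scaling and standardization are shown as two common options, not as the only ways to prepare features; the prices array is redefined here as floats so the block runs on its own.

```python
import numpy as np

prices = np.array([100.0, 200.0, 300.0])

# Min-max scaling: map the values into the range [0, 1]
scaled = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization: shift to zero mean and scale to unit standard deviation
standardized = (prices - prices.mean()) / prices.std()

print(scaled)        # values: 0, 0.5, 1
print(standardized)  # symmetric around 0
```

Each expression combines an array with scalars computed from that same array, so no explicit loop is needed.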

3. Indexing and Slicing

Indexing in NumPy follows the [row, column] syntax. For a 2D array, arr[0, :] selects the entire first row, while arr[:, 1] selects the second column.


PYTHON
matrix = np.array([[1, 2, 3], 
                   [4, 5, 6]])

print(matrix[0, 1])    # 2 (row 0, col 1)
print(matrix[1, :])    # [4 5 6] (all columns in row 1)
print(matrix[:, 0:2])  # rows [1 2] and [4 5] (first two columns of every row)
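In ML code, the same slicing syntax is often used to separate feature columns from a label column. The sketch below assumes a hypothetical dataset whose last column is the label; the names dataset, features, and labels are illustrative.

```python
import numpy as np

# Hypothetical dataset: two feature columns followed by one label column
dataset = np.array([[5.1, 3.5, 0.0],
                    [6.2, 2.9, 1.0],
                    [4.7, 3.2, 0.0]])

features = dataset[:, :-1]  # every row, all columns except the last
labels = dataset[:, -1]     # every row, last column only

print(features.shape)  # (3, 2)
print(labels.shape)    # (3,)
```

Negative indices count from the end, so -1 refers to the last column no matter how many feature columns the dataset has.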

Hands-on Exercise: Preparing Feature Data

Imagine you have a small dataset representing the square footage and number of rooms for three houses.

Create a 3x2 NumPy array called house_data where the first column is square footage [1000, 1500, 2000] and the second column is the number of rooms [2, 3, 4].
Multiply the square footage column by 0.0929 to convert it to square meters (do this using slicing).
Print the resulting array.

Solution:


PYTHON
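# dtype=float so the converted square-meter values keep their decimals;
# an integer array would silently truncate 92.9 to 92 on assignment
house_data = np.array([[1000, 2], [1500, 3], [2000, 4]], dtype=float)
house_data[:, 0] = house_data[:, 0] * 0.0929
print(house_data)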
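One detail worth checking in a solution like this is the array's dtype. The sketch below, using the hypothetical names int_data and float_data, illustrates how assigning float results into an integer array drops the fractional part, and how starting from a float array avoids that.

```python
import numpy as np

# Built from integers only, so NumPy infers an integer dtype
int_data = np.array([[1000, 2], [1500, 3], [2000, 4]])
int_data[:, 0] = int_data[:, 0] * 0.0929  # floats are cast back to integers
print(int_data[:, 0])                     # fractional parts are gone: 92, 139, 185

# Requesting a float dtype up front preserves the decimals
float_data = np.array([[1000, 2], [1500, 3], [2000, 4]], dtype=float)
float_data[:, 0] = float_data[:, 0] * 0.0929
print(float_data[:, 0])                   # approximately 92.9, 139.35, 185.8
```

The multiplication itself returns floats in both cases; the truncation happens when the result is written back into the integer array.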

Common Pitfalls

Shape Mismatches: If you try to add a (3,) array to a (2,) array, NumPy will raise a ValueError because the shapes cannot be broadcast together. Check array.shape when you're unsure.
Broadcasting Confusion: NumPy can "broadcast" a smaller array across a larger one (e.g., adding a scalar to a matrix). Shapes are compared from the last dimension backward, and each pair of dimensions must either match or include a 1. An operation that broadcasts without an error can still produce a result with an unintended shape, so check the output's shape as well as the inputs'.
Modifying Views: Basic slicing does not create a copy; it creates a view. If you modify a slice, you modify the original array. If you need an independent copy, call .copy() on the slice, for example arr[0:2].copy().
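The three pitfalls above can be reproduced in a few lines. This is a minimal sketch with illustrative variable names, intended for experimenting in a scratch session.

```python
import numpy as np

# 1. Shape mismatch: (3,) and (2,) cannot be combined
a = np.array([1, 2, 3])
b = np.array([1, 2])
try:
    a + b
except ValueError as err:
    print("ValueError:", err)

# 2. Broadcasting: a (3,) row is applied to each row of a (2, 3) matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])
summed = matrix + row
print(summed.shape)  # (2, 3)

# 3. Views vs. copies
original = np.array([1, 2, 3, 4])
view = original[:2]  # a view: shares memory with original
view[0] = 99
print(original[0])   # 99, the original changed too

safe = original[:2].copy()  # an independent copy
safe[0] = -1
print(original[0])   # still 99
```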

Recap

NumPy is the backbone of efficient numerical computation in Python. By using arrays instead of lists, you gain access to vectorization, which makes your code faster and more concise. We've covered creating arrays, performing element-wise arithmetic, and using slicing to isolate specific rows, columns, and values.

Up next: Loading and Inspecting Datasets with Pandas.
