Implementing a Multi-Layer Neural Network from Scratch
Implementing a Multi-Layer Neural Network from Scratch
In this tutorial, we'll build a neural network with an arbitrary number of hidden layers using only NumPy. By the end, you'll understand how forward and backward propagation actually work, with no PyTorch or TensorFlow, just matrix math.
Prerequisites: Basic Python and NumPy. Some linear algebra and calculus is helpful, but I'll provide all the formulas. If you want to see where backprop comes from, I derived it from scratch in my notes: Page 12-13.
What we'll cover:
- What neural networks are and why we use activation functions
- Defining multiple activation functions
- Setting up layer dimensions
- Initializing weights with proper scaling
- Forward propagation: computing predictions
- Backward propagation: computing gradients via the chain rule
- Updating parameters with gradient descent
- Putting it all together in a training loop
- Making predictions and visualizing decision boundaries
- Testing on multiple non-linear problems
What is a neural network?
A neural network is a function that maps inputs to outputs by stacking layers of weighted sums and non-linear activations.
Each layer takes the previous layer's output, multiplies it by a weight matrix, adds a bias, and applies a non-linear function. The output of one layer becomes the input to the next.
The key insight: with enough hidden layers and units, a neural network can approximate almost any function. This makes them much more expressive than logistic regression alone, which can only learn linear decision boundaries.
We'll build a network that supports binary classification using cross-entropy loss.
Why do we need activation functions?
A natural question that comes up when learning about neural network is why do we need activation functions?
Without an activation function, transformation from one layer to the becomes a linear transformation. For example, with two layers:
This is just another linear function, which can be represented by a single layer and this turns to linear/logistic regression. The depth gives us nothing.
Adding a non-linear activation breaks this collapse and lets the network learn complex, curved decision boundaries.
Setting up
We only need NumPy for the math and matplotlib for visualization.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors
# Set seed for reproducibility
np.random.seed(42)Activation functions
We'll define a few common activations so we can pick which one to use:
Sigmoid squashes values into (0, 1). Common at the output layer for binary classification. Used to be common in hidden layers, but causes vanishing gradients in deep networks.
Tanh is like sigmoid but maps to (-1, 1) and is centered at zero. Often works better than sigmoid in hidden layers.
ReLU outputs . The default choice for hidden layers in modern networks. Fast to compute and doesn't suffer from vanishing gradients on the positive side.
For backpropagation, we also need the derivative of each activation. The derivatives have clean closed forms:
- if , else
def sigmoid(z):
"""
Sigmoid activation: maps any real number to (0, 1).
Formula: sigma(z) = 1 / (1 + exp(-z))
"""
return 1 / (1 + np.exp(-z))
def sigmoid_derivative(A):
"""
Derivative of sigmoid, expressed in terms of the activation A.
Since sigma'(z) = sigma(z) * (1 - sigma(z)), and A = sigma(z),
we can compute the derivative directly from A without needing z.
"""
return A * (1 - A)
def tanh(z):
"""
Tanh activation: maps any real number to (-1, 1).
"""
return np.tanh(z)
def tanh_derivative(A):
"""
Derivative of tanh, expressed in terms of A.
tanh'(z) = 1 - tanh(z)^2, and A = tanh(z), so derivative = 1 - A^2.
"""
return 1 - A**2
def relu(z):
"""
ReLU activation: outputs max(0, z) element-wise.
"""
return np.maximum(0, z)
def relu_derivative(Z):
"""
Derivative of ReLU.
ReLU'(z) = 1 if z > 0, else 0. This depends on Z (pre-activation),
not A (post-activation), which is why we pass Z here rather than A.
"""
return (Z > 0).astype(float)Activation dispatcher
To make it easy to switch between activations, we register each one in a dictionary keyed by name. Each entry stores both the function and its derivative as a pair.
This way, our forward and backward propagation functions can accept a string like 'relu' and look up the right function at runtime.
# Map activation names to their function and derivative.
# This lets us specify activations by name throughout the rest of the code.
ACTIVATIONS = {
'sigmoid': (sigmoid, sigmoid_derivative),
'tanh': (tanh, tanh_derivative),
'relu': (relu, relu_derivative)
}Visualizing the activations
Let's see how each activation behaves across a range of inputs. The shape of each function determines how the network learns.
z = np.linspace(-5, 5, 200)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(z, sigmoid(z), label='sigmoid')
plt.plot(z, tanh(z), label='tanh')
plt.plot(z, relu(z), label='relu')
plt.title('Activation functions')
plt.xlabel('z')
plt.ylabel('a')
plt.legend()
plt.grid(alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(z, sigmoid_derivative(sigmoid(z)), label="sigmoid'")
plt.plot(z, tanh_derivative(tanh(z)), label="tanh'")
plt.plot(z, relu_derivative(z), label="relu'")
plt.title('Derivatives')
plt.xlabel('z')
plt.ylabel("a'")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()Defining the network architecture
The architecture is described by a list of layer sizes. For example, [2, 4, 3, 1] means:
- Input layer: 2 features per example
- Hidden layer 1: 4 neurons
- Hidden layer 2: 3 neurons
- Output layer: 1 neuron (binary classification)
The first entry is determined by your data (number of features), and the last is determined by your task (number of outputs). The middle entries are choices you make. More hidden units gives the network more capacity to learn complex patterns.
This function builds that list from the input data and a list of hidden layer sizes.
def layer_sizes(X, Y, n_hs):
"""
Build the full list of layer sizes for the network.
Args:
X (np.ndarray): Input data, shape (n_features, m)
Y (np.ndarray): Labels, shape (n_outputs, m)
n_hs (list): Sizes of the hidden layers, e.g. [4, 3]
Returns:
list: All layer sizes from input to output, e.g. [2, 4, 3, 1]
"""
n_x = X.shape[0] # Number of input features
n_y = Y.shape[0] # Number of output units
return [n_x] + list(n_hs) + [n_y]Initializing the weights
The weights are the parameters the network learns. Each layer has:
- A weight matrix of shape
- A bias vector of shape
How we initialize these matters a lot. If we set them all to zero, all neurons would compute the same thing and update identically. The network would never learn. If we set them to large random values, the activations and gradients can explode.
The right initialization scheme depends on the activation function:
Xavier initialization for sigmoid and tanh:
He initialization for ReLU:
The factor of 2 in He init compensates for the fact that ReLU zeros out half the activations on average, so weights need to be slightly larger to maintain variance across layers.
Biases are always initialized to zero.
def init_params(ls, activation='relu'):
"""
Initialize weights and biases for every layer.
Uses He initialization (factor 2) for ReLU activations and
Xavier initialization (factor 1) for sigmoid and tanh.
Args:
ls (list): All layer sizes, including input and output
activation (str): The hidden activation, controls the init scheme
Returns:
dict: Parameters with keys 'W1', 'b1', 'W2', 'b2', etc.
"""
parameters = {}
# He init for ReLU, Xavier for everything else
factor = 2.0 if activation == 'relu' else 1.0
# Loop over each layer (skipping the input, which has no weights)
for i in range(1, len(ls)):
# W has shape (current_layer_size, previous_layer_size)
W = np.random.randn(ls[i], ls[i-1]) * np.sqrt(factor / ls[i-1])
# b has shape (current_layer_size, 1), one bias per neuron
b = np.zeros((ls[i], 1))
parameters[f"W{i}"] = W
parameters[f"b{i}"] = b
return parametersForward propagation
Forward propagation computes the network's output for a given input. We pass the input through each layer in sequence, computing:
Where (the input is the activation of layer 0) and is the activation function for layer .
We save all the intermediate and values in a cache so backpropagation can use them later. This avoids recomputing them during the backward pass.
The final activation is the network's prediction.
We allow different activations for hidden layers and the output layer. For binary classification, sigmoid at the output is what we use regardless of what we use in hidden layers.
def forward_propagation(parameters, ls, X, hidden_activation='relu', output_activation='sigmoid'):
"""
Compute the network's predictions through forward propagation.
Starting from the input, each layer's output becomes the next layer's input.
All intermediate values are cached so backpropagation can reuse them.
Different activations can be used for hidden layers vs the output layer.
Args:
parameters (dict): Weights and biases from init_params
ls (list): All layer sizes
X (np.ndarray): Input data, shape (n_features, m)
hidden_activation (str): Activation for hidden layers
output_activation (str): Activation for the output layer
Returns:
tuple: (final_activations, cache)
final_activations: predictions of the network, shape (n_outputs, m)
cache: dict of intermediate Z and A values for each layer
"""
cache = {"A0": X} # The input is the "activation" of layer 0
L = len(ls) - 1 # Index of the output layer
# Look up the activation functions by name
hidden_act, _ = ACTIVATIONS[hidden_activation]
output_act, _ = ACTIVATIONS[output_activation]
# Walk forward through every layer
for i in range(1, len(ls)):
W = parameters[f"W{i}"]
b = parameters[f"b{i}"]
A_prev = cache[f"A{i-1}"]
# Linear step: weighted sum plus bias
Z = W @ A_prev + b
# Non-linear step: apply the appropriate activation
if i == L:
A = output_act(Z)
else:
A = hidden_act(Z)
# Save both for backprop
cache[f"Z{i}"] = Z
cache[f"A{i}"] = A
# The output is the activation of the last layer
return cache[f"A{L}"], cacheBackward propagation
Backward propagation computes how each weight should change to reduce the loss. It applies the chain rule from calculus, working backward from the output layer to the input.
At each layer, we compute three things:
Where is element-wise multiplication.
For the output layer, the formula for simplifies dramatically because of how sigmoid and binary cross-entropy interact:
Notice the recursive structure: for layer depends on for layer . This is why we must walk backward. We can't compute earlier gradients without the later ones.
def backward_propagation(parameters, cache, ls, X, Y, hidden_activation='relu'):
"""
Compute gradients for every parameter using the chain rule.
Walks backward from the output to the input, computing the gradient of
the loss with respect to each layer's pre-activation (dZ), weights (dW),
and biases (db).
The output layer assumes binary cross-entropy plus sigmoid pairing, which
gives dZ = A - Y directly.
Args:
parameters (dict): Weights and biases
cache (dict): Cached Z and A values from forward propagation
ls (list): All layer sizes
X (np.ndarray): Input data
Y (np.ndarray): True labels
hidden_activation (str): Activation used in hidden layers
Returns:
dict: Gradients for each layer's weights and biases
"""
grads = {}
m = X.shape[1] # Number of training examples
L = len(ls) - 1 # Output layer index
# Get the derivative function for hidden layers
_, hidden_act_derivative = ACTIVATIONS[hidden_activation]
# Walk backward through layers
for i in reversed(range(1, len(ls))):
A = cache[f"A{i}"]
Z = cache[f"Z{i}"]
A_prev = cache[f"A{i-1}"]
if i == L:
# Output layer: simplified formula from sigmoid plus cross-entropy
dZ = A - Y
else:
# Hidden layer: propagate error from the next layer back
dZ_next = grads[f"dZ{i+1}"]
W_next = parameters[f"W{i+1}"]
# ReLU's derivative needs Z; sigmoid and tanh need A
if hidden_activation == 'relu':
dZ = (W_next.T @ dZ_next) * hidden_act_derivative(Z)
else:
dZ = (W_next.T @ dZ_next) * hidden_act_derivative(A)
# Gradients for this layer's parameters
dW = (1/m) * dZ @ A_prev.T
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
grads[f"dW{i}"] = dW
grads[f"db{i}"] = db
grads[f"dZ{i}"] = dZ # Cached for the previous layer's computation
return gradsUpdating the parameters
Once we have the gradients, we apply gradient descent. For each parameter:
Where is the learning rate. The gradient tells us the direction that increases the loss, so we move in the opposite direction to decrease it.
The learning rate controls step size:
- Too small: training is slow
- Too large: training overshoots and may not converge
- Just right: training converges steadily
For sigmoid networks with Xavier initialization, learning rates around 0.1 to 1.0 typically work well. For ReLU networks with He initialization, slightly smaller rates around 0.01 to 0.1 are often better.
def update_params(parameters, grads, ls, learning_rate):
"""
Update all parameters using gradient descent.
For each layer, subtract learning_rate times the gradient from each
parameter. This moves the parameter in the direction that reduces loss.
Args:
parameters (dict): Current weights and biases
grads (dict): Gradients from backward_propagation
ls (list): All layer sizes
learning_rate (float): Step size for gradient descent
Returns:
dict: Updated parameters
"""
for i in range(1, len(ls)):
parameters[f"W{i}"] -= learning_rate * grads[f"dW{i}"]
parameters[f"b{i}"] -= learning_rate * grads[f"db{i}"]
return parametersPutting it all together
The training loop repeats the three steps (forward prop, backward prop, parameter update) many times until the network converges.
Each iteration:
- Forward prop: compute predictions on the training data
- Backward prop: compute gradients
- Update: adjust parameters using the gradients
We also track the cost (loss averaged over all examples) so we can see if the network is actually learning. The cost should decrease over time.
We're using full-batch gradient descent here, where every iteration uses all training examples. For larger datasets you'd switch to mini-batch, but for small problems this is simpler and works fine.
def multi_layer_neural_network(X, Y, n_hs, num_epochs=10000, learning_rate=0.01,
hidden_activation='relu', output_activation='sigmoid', print_cost=False):
"""
Train a neural network on data X with labels Y.
Args:
X (np.ndarray): Input data, shape (n_features, m)
Y (np.ndarray): Labels, shape (n_outputs, m)
n_hs (list): Sizes of hidden layers, e.g. [4, 3]
num_epochs (int): Number of training iterations
learning_rate (float): Step size for gradient descent
hidden_activation (str): Activation for hidden layers
output_activation (str): Activation for the output layer
print_cost (bool): If True, print cost every 200 epochs
Returns:
tuple: (parameters, layer_sizes) for the trained model
"""
# Set up the architecture and initial parameters
ls = layer_sizes(X, Y, n_hs)
parameters = init_params(ls, activation=hidden_activation)
# Training loop
for epoch in range(num_epochs):
# Forward: compute predictions
A_out, cache = forward_propagation(parameters, ls, X, hidden_activation, output_activation)
# Backward: compute gradients
grads = backward_propagation(parameters, cache, ls, X, Y, hidden_activation)
# Update: take a gradient step
parameters = update_params(parameters, grads, ls, learning_rate)
# Optionally track progress
if print_cost and epoch % 200 == 0:
# Binary cross-entropy loss
cost = -np.mean(Y * np.log(A_out + 1e-15) + (1 - Y) * np.log(1 - A_out + 1e-15))
print(f"Epoch {epoch}: cost = {cost:.4f}")
return parameters, lsMaking predictions
To predict, we run forward propagation on new data and threshold the output. For binary classification with sigmoid output:
- If the network outputs > 0.5, predict class 1
- Otherwise, predict class 0
Different thresholds shift the precision/recall trade-off, but 0.5 is the standard default.
def predict(parameters, ls, X, hidden_activation='relu', output_activation='sigmoid'):
"""
Predict classes for new examples.
Args:
parameters (dict): Trained weights and biases
ls (list): Layer sizes
X (np.ndarray): New examples, shape (n_features, m)
hidden_activation (str): Activation used during training
output_activation (str): Output activation used during training
Returns:
np.ndarray: Predicted classes (0 or 1)
"""
A_out, _ = forward_propagation(parameters, ls, X, hidden_activation, output_activation)
return A_out > 0.5Visualizing the decision boundary
For 2D data, we can visualize what the network actually learned. We:
- Create a fine grid covering the input space
- Predict the class for every point on the grid
- Color regions by prediction
- Overlay the actual training data
This shows the decision boundary as a colored region. Linear models produce straight boundaries; multi-layer networks can produce curves, multiple regions, and complex shapes.
def plot_decision_boundary(predict, parameters, ls, X, Y, hidden_activation='relu', output_activation='sigmoid'):
"""
Visualize the network's decision boundary on 2D data.
Args:
predict (callable): The predict function
parameters (dict): Trained parameters
ls (list): Layer sizes
X (np.ndarray): Input data, shape (2, m)
Y (np.ndarray): True labels for coloring scatter points
hidden_activation (str): Activation used during training
output_activation (str): Output activation used during training
"""
# Define the visualization region with padding
min1, max1 = X[0, :].min() - 1, X[0, :].max() + 1
min2, max2 = X[1, :].min() - 1, X[1, :].max() + 1
# Build a fine grid over this region
x1grid = np.arange(min1, max1, 0.1)
x2grid = np.arange(min2, max2, 0.1)
xx, yy = np.meshgrid(x1grid, x2grid)
# Flatten the grid to feed to the network
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((1, len(r1))), r2.reshape((1, len(r2)))
grid = np.vstack((r1, r2))
# Predict every grid point
predictions = predict(parameters, ls, grid, hidden_activation, output_activation)
zz = predictions.reshape(xx.shape)
# Color the decision regions, then plot the data on top
plt.contourf(xx, yy, zz, cmap=plt.cm.Spectral.reversed())
plt.scatter(X[0, :], X[1, :], c=Y, cmap=colors.ListedColormap(['blue', 'red']))
plt.show()Sanity check: a simple linear problem
Let's verify everything works on the simplest possible task: a 2D linearly separable dataset. The network should easily hit 100% accuracy.
# Generate linearly separable data
n_features = 2
m = 100
X = np.random.randn(n_features, m)
Y = (X[0] + X[1] > 0).reshape(1, m).astype(int)
# Train a small network with ReLU hidden layers
parameters, ls = multi_layer_neural_network(X, Y, n_hs=[4, 3], num_epochs=2000, learning_rate=0.1,
hidden_activation='relu', print_cost=True)
# Evaluate
preds = predict(parameters, ls, X)
accuracy = np.mean(preds == Y) * 100
print(f"\nAccuracy: {accuracy:.1f}%")
# Visualize the learned boundary
plot_decision_boundary(predict, parameters, ls, X, Y)Testing on a non-linear problem
The whole point of multi-layer networks is to learn things linear models can't. Let's try the classic XOR-style problem: classify by whether the two features have the same sign.
A single perceptron (or logistic regression) cannot solve this. There's no straight line that separates the classes. But a multi-layer network with hidden units should handle it easily.
# Generate XOR-style data: positive class when features have same sign
n_features = 2
m = 1000
X = np.random.randn(n_features, m) * 2
Y = (X[0] * X[1] > 0).reshape(1, m).astype(int)
# Train a slightly bigger network
parameters, ls = multi_layer_neural_network(X, Y, n_hs=[8, 4], num_epochs=10000, learning_rate=0.1,
hidden_activation='relu')
# Evaluate
preds = predict(parameters, ls, X)
accuracy = np.mean(preds == Y) * 100
print(f"\nAccuracy: {accuracy:.1f}%")
plot_decision_boundary(predict, parameters, ls, X, Y)Concentric circles
Another classic non-linear problem: classify points based on whether they fall inside or outside a circle. The decision boundary should be circular, which is impossible for any linear model and trivial for a neural network.
from sklearn.datasets import make_circles
# Generate two concentric rings of points
X_raw, Y_raw = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X = X_raw.T # Reshape to (2, m)
Y = Y_raw.reshape(1, -1) # Reshape to (1, m)
# Train
parameters, ls = multi_layer_neural_network(X, Y, n_hs=[16, 8], num_epochs=15000, learning_rate=0.1,
hidden_activation='relu')
# Evaluate
preds = predict(parameters, ls, X)
accuracy = np.mean(preds == Y) * 100
print(f"\nAccuracy: {accuracy:.1f}%")
plot_decision_boundary(predict, parameters, ls, X, Y)Comparing activation functions
Now that we have multiple activations available, let's compare them on the same problem. We'll train three networks with identical architectures but different hidden activations and see how they perform.
# Use the circles dataset for this comparison
X_raw, Y_raw = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X = X_raw.T
Y = Y_raw.reshape(1, -1)
# Train one network per activation
results = {}
for activation in ['sigmoid', 'tanh', 'relu']:
# Use a slightly lower learning rate for sigmoid since it can be unstable
lr = 1.0 if activation == 'sigmoid' else 0.1
parameters, ls = multi_layer_neural_network(X, Y, n_hs=[16, 8], num_epochs=10000,
learning_rate=lr, hidden_activation=activation)
preds = predict(parameters, ls, X, hidden_activation=activation)
acc = np.mean(preds == Y) * 100
results[activation] = acc
print(f"{activation}: {acc:.1f}% accuracy")Summary
In this tutorial, we implemented a multi-layer neural network entirely from scratch:
- Architecture described by a list of layer sizes
- Three activation functions (sigmoid, tanh, ReLU) we can swap between
- Initialization that scales based on the activation choice
- Forward propagation to compute predictions
- Backward propagation using the chain rule
- Gradient descent to update parameters
- Training loop combining all of the above
- Prediction and visualization to interpret the results
The same fundamental ideas power every neural network, from this small implementation to GPT-scale models. The math doesn't change. Only the scale (more layers, more units, more data, different activation functions, and many engineering optimizations).
Limitations of this implementation:
- No mini-batching, we use full-batch gradient descent
- No regularization (L2, dropout, etc.) to prevent overfitting
- No advanced optimizers like Adam
- Only supports binary classification
Next steps to explore:
- Add L2 regularization or dropout for better generalization
- Implement mini-batch gradient descent for larger datasets
- Add support for multi-class classification with softmax
- Implement Adam or another adaptive optimizer
- Try training on MNIST or another real dataset
Once you understand the math here, frameworks like PyTorch and TensorFlow are mostly about scale and convenience. The core mechanics are exactly what we just built.