Phase 2 — Classical ML

May 20, 2026

Continue Machine Learning Specialization (Advanced Learning Algorithms)

What I Did

Implemented a 2-layer neural network from scratch in Python using NumPy (link)

Implemented a multilayer neural network from scratch in Python using NumPy (link)

Derived backpropagation for the general case (multilayer neural network) where the activation function is sigmoid

Watched most of week 2 of course 2 of the Machine Learning specialization

Implemented a 2-layer NN from scratch in NumPy (no Keras)

Re-derived backprop for the 2-layer case and also for the general case (beautiful result)

Implemented a multilayer NN from scratch in NumPy

What I Learned

Learned the steps involved in training a neural network: specify the model, specify the loss and cost functions, then train the model on the training dataset to minimize the cost function.

The loss function measures how wrong the model is on a single training example

The cost function is the average of the loss across the entire training dataset
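
To make the loss vs. cost distinction concrete for myself, here is a minimal sketch using binary cross-entropy as the loss. The labels, predictions and function names are made up for illustration; they are not from the course.

```python
import numpy as np

def binary_cross_entropy_loss(y_true, y_pred):
    """Loss: one value per training example."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def cost(y_true, y_pred):
    """Cost: the average of the per-example losses."""
    return np.mean(binary_cross_entropy_loss(y_true, y_pred))

# Made-up labels and predicted probabilities for 4 examples
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])

losses = binary_cross_entropy_loss(y_true, y_pred)  # shape (4,), one per example
J = cost(y_true, y_pred)                            # a single scalar
print(losses, J)
```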

Activation functions are non-linear functions applied to the output of each neuron. Without activation functions, stacking layers would collapse into a single linear transformation

There are different types of activation functions, such as sigmoid, linear, ReLU and softmax

We need non-linear activation functions because without them, the transformation from one layer to the next is just a linear transformation, and the network gains nothing from its depth. The non-linearity lets the model learn arbitrarily complex functions by composing very simple non-linear pieces

A network with no non-linear activation anywhere is basically linear regression
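
A quick check of the collapse claim, as a sketch with random made-up weights: two stacked layers without activation give exactly the same output as one linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation, random weights for illustration only
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
X = rng.normal(size=(3, 5))  # 3 features, 5 examples stored as columns

# Stacked linear layers
two_layer_out = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer: W = W2 W1, b = W2 b1 + b2
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_out = W_combined @ X + b_combined

print(np.allclose(two_layer_out, one_layer_out))  # True
```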

It's a bit confusing that there is a linear activation function, since activation functions are supposed to be non-linear; we can just think of it as no activation

The choice of activation function for the output layer depends on the type of problem we are trying to solve: binary classification → sigmoid, multi-class classification → softmax, regression → ReLU (non-negative targets) or linear (any real number)

The default choice of activation function for hidden layers is ReLU

So why ReLU? It is faster to compute than sigmoid, and it doesn't saturate on the positive side, so it suffers less from vanishing gradients than sigmoid. In practice, neural networks with ReLU hidden layers tend to train faster and often reach better accuracy.
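
To see the saturation point numerically, a small sketch comparing the two derivatives on a few made-up inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z): 1 for z > 0, 0 otherwise
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))  # tiny at both ends, never above 0.25
print(relu_grad(z))     # 0 for negative inputs, exactly 1 for positive inputs
```

The sigmoid gradient shrinks toward zero for large |z| on both sides, while the ReLU gradient stays at 1 for any positive input (it is still 0 on the negative side).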

We use multi-class classification when we have more than two classes; the output activation function is softmax and the loss function is categorical cross-entropy.

Softmax converts logits into a probability distribution over K classes

P(y = k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_{j}}}
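
A minimal NumPy sketch of the formula above, on made-up logits. Subtracting the max before exponentiating is a standard trick that doesn't change the result but avoids overflow.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])  # made-up logits for K = 3 classes
probs = softmax(logits)
print(probs, probs.sum())
```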

Categorical cross-entropy measures how far the predicted distribution is from the true (one-hot) distribution

L = -\sum_{k=1}^{K} y_{k} \log(\hat{y}_{k})
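
A small sketch of this loss for a single example. The predictions are made up; since y is one-hot, only the true-class term survives, so the loss is just the negative log of the probability assigned to the true class.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat):
    """L = -sum_k y_k * log(y_hat_k) for a single example."""
    return -np.sum(y_onehot * np.log(y_hat))

y_onehot = np.array([0.0, 0.0, 1.0])      # true class is index 2
confident = np.array([0.05, 0.05, 0.90])  # made-up predictions
unsure = np.array([0.40, 0.30, 0.30])

loss_confident = categorical_cross_entropy(y_onehot, confident)
loss_unsure = categorical_cross_entropy(y_onehot, unsure)
print(loss_confident, loss_unsure)  # the unsure prediction gets the larger loss
```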

Creating the model in TensorFlow:

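from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
  Dense(25, activation='relu'),
  Dense(15, activation='relu'),
  Dense(10, activation='linear'), # 10 classes, outputs logits (no softmax here)
])

# from_logits=True: the loss applies softmax internally to the logits
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))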

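A numerically less stable version would be to use Dense(10, activation='softmax') as the final layer together with loss=SparseCategoricalCrossentropy(). With from_logits=True the softmax is folded into the loss computation, which reduces round-off error. The model then outputs logits, so to get probabilities at prediction time we apply softmax (e.g. tf.nn.softmax) to its output.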
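
To convince myself why working from logits is more stable, here is a NumPy sketch of the idea. This is my own illustration with deliberately extreme made-up logits, not TensorFlow's actual implementation: computing softmax first and then taking the log can overflow, while computing the loss directly from the logits with the log-sum-exp trick stays finite.

```python
import numpy as np

def naive_loss(logits, true_class):
    # Step 1: compute softmax probabilities; step 2: take the log
    exp_z = np.exp(logits)
    probs = exp_z / np.sum(exp_z)
    return -np.log(probs[true_class])

def stable_loss(logits, true_class):
    # Work with the logits directly: L = log(sum_j exp(z_j)) - z_k,
    # subtracting the max before exponentiating (log-sum-exp trick)
    z_max = np.max(logits)
    log_sum_exp = z_max + np.log(np.sum(np.exp(logits - z_max)))
    return log_sum_exp - logits[true_class]

logits = np.array([1000.0, 0.0, -1000.0])  # extreme made-up logits

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    print(naive_loss(logits, 0))  # exp(1000) overflows, so the result is nan
print(stable_loss(logits, 0))     # finite: 0.0
```

For ordinary logits both versions agree; the difference only shows up when the intermediate probabilities overflow or round to 0.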

The Adam optimizer adjusts the learning rate per parameter based on past gradients

We use Adam optimizer for the following reasons:

Adapts the learning rate for each weight individually

Larger learning rates for weights with consistent gradients

Smaller learning rates for weights with noisy gradients

It also often converges faster and more reliably than plain SGD.
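
A minimal sketch of one Adam update in NumPy, using the commonly cited default hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8). The function name `adam_step` and the toy quadratic problem are my own, for illustration only.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v

# Toy problem: minimize f(w) = sum(w^2), whose gradient is 2w
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
print(w)  # should end up close to [0, 0]
```

Dividing by the square root of the running squared gradient is what gives each weight its own effective learning rate.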

Epoch is one complete pass through the entire training dataset

We can also split the training dataset into batches; common batch sizes are 32, 64 and 128

Iteration is one gradient update on one batch
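
How epoch, batch and iteration relate, as a sketch of a training loop skeleton. The dataset size is made up and the actual gradient update is left out.

```python
import math
import numpy as np

m = 1000         # made-up number of training examples
batch_size = 32
epochs = 3

iterations_per_epoch = math.ceil(m / batch_size)  # the last batch may be smaller

rng = np.random.default_rng(0)
iteration = 0
for epoch in range(epochs):
    order = rng.permutation(m)  # reshuffle the examples every epoch
    for start in range(0, m, batch_size):
        batch_idx = order[start:start + batch_size]
        # ... one gradient update on the examples in batch_idx goes here ...
        iteration += 1

print(iterations_per_epoch, iteration)  # 32 96
```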

Forward propagation:

\begin{aligned} Z^{[l]} &= W^{[l]}A^{[l-1]} + b^{[l]} \\ A^{[l]} &= g^{[l]}(Z^{[l]}) \end{aligned}
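
The two equations above map to a short loop in NumPy. This is a sketch under my own assumptions (sigmoid in every layer, examples stored as columns, a made-up architecture), not a copy of my linked implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass; params is a list of (W, b) pairs, one per layer.

    Returns the activations [A0, A1, ..., AL] and pre-activations [Z1, ..., ZL].
    """
    activations, pre_activations = [X], []
    for W, b in params:
        Z = W @ activations[-1] + b     # Z[l] = W[l] A[l-1] + b[l]
        pre_activations.append(Z)
        activations.append(sigmoid(Z))  # A[l] = g(Z[l])
    return activations, pre_activations

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2, 1]  # made-up architecture: 3 inputs, 1 output
params = [(rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

X = rng.normal(size=(3, 5))  # 5 examples stored as columns
activations, _ = forward(X, params)
print([A.shape for A in activations])  # [(3, 5), (4, 5), (2, 5), (1, 5)]
```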

Backward propagation:

\begin{aligned} dZ^{[l]} &= ({W^{[l+1]}}^T \cdot dZ^{[l+1]}) \odot g'^{[l]}(Z^{[l]}) \\ dW^{[l]} &= \frac{1}{m} dZ^{[l]} \cdot (A^{[l-1]})^T \\ db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \end{aligned}
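
A sketch of these equations in NumPy, with a finite-difference check on one weight. Assumptions that go beyond the formulas above: sigmoid in every layer, a binary cross-entropy cost, and therefore dZ^{[L]} = A^{[L]} - Y at the output layer. The helper names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    activations = [X]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def cost(X, Y, params):
    A_out = forward(X, params)[-1]
    return -np.mean(Y * np.log(A_out) + (1 - Y) * np.log(1 - A_out))

def backward(X, Y, params):
    """Gradients of the cost for every layer, as a list of (dW, db) pairs."""
    m = X.shape[1]
    activations = forward(X, params)
    grads = [None] * len(params)
    # Output layer (assumption: sigmoid + binary cross-entropy): dZ[L] = A[L] - Y
    dZ = activations[-1] - Y
    for l in reversed(range(len(params))):
        A_prev = activations[l]
        grads[l] = (dZ @ A_prev.T / m, np.sum(dZ, axis=1, keepdims=True) / m)
        if l > 0:
            W = params[l][0]
            # For sigmoid, g'(Z) = A * (1 - A), so Z itself is not needed
            dZ = (W.T @ dZ) * A_prev * (1 - A_prev)
    return grads

rng = np.random.default_rng(0)
sizes = [3, 4, 1]  # made-up architecture
params = [(rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(3, 6))
Y = rng.integers(0, 2, size=(1, 6)).astype(float)

grads = backward(X, Y, params)

# Numerical check on one weight using a centered finite difference
eps = 1e-6
W0 = params[0][0]
W0[0, 0] += eps
cost_plus = cost(X, Y, params)
W0[0, 0] -= 2 * eps
cost_minus = cost(X, Y, params)
W0[0, 0] += eps  # restore the original weight
numeric = (cost_plus - cost_minus) / (2 * eps)
print(numeric, grads[0][0][0, 0])  # the two values should agree closely
```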

The way we initialize the weights is very important and can determine whether we will have vanishing or exploding gradients.

We can use Xavier initialization (best for sigmoid/tanh) and He initialization (best for ReLU)

Xavier (Glorot) initialization — for sigmoid and tanh:

W^{[l]} \sim \mathcal{N}\left(0, \frac{1}{n^{[l-1]}}\right)

Or equivalently in code form:

W^{[l]} = \mathcal{N}(0, 1) \times \sqrt{\frac{1}{n^{[l-1]}}}

He (Kaiming) initialization — for ReLU and variants:

W^{[l]} \sim \mathcal{N}\left(0, \frac{2}{n^{[l-1]}}\right)

Or equivalently:

W^{[l]} = \mathcal{N}(0, 1) \times \sqrt{\frac{2}{n^{[l-1]}}}
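
Both schemes in NumPy, as a sketch with made-up layer sizes. Note that this follows the simplified Xavier form written above (variance 1/n_in); the original Glorot formulation uses both fan-in and fan-out.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """W ~ N(0, 1/n_in): the simplified Xavier variant written above."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    """W ~ N(0, 2/n_in)."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

n_in, n_out = 512, 256  # made-up layer sizes
W_xavier = xavier_init(n_in, n_out)
W_he = he_init(n_in, n_out)

print(W_xavier.var(), 1.0 / n_in)  # empirical variance vs. target
print(W_he.var(), 2.0 / n_in)
```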

We also have multi-label classification, where each example can belong to multiple classes at the same time. Compare this with multi-class classification, where each example belongs to exactly one class.

Multi-label targets are multi-hot encoded ([1, 0, 1, 1, 0]) while multi-class targets are one-hot encoded ([0, 0, 1, 0, 0])
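
A sketch of how the same output-layer logits would be turned into predictions in each setting. The logits are made up, and the independent-sigmoid-per-label setup with a 0.5 threshold is the usual approach for multi-label outputs as far as I know, not something from the notes above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, -1.0, 0.5, 3.0, -2.0])  # made-up output-layer logits

# Multi-class: softmax, probabilities compete and sum to 1, pick one class
multi_class_probs = softmax(logits)
one_hot = np.zeros_like(logits)
one_hot[np.argmax(multi_class_probs)] = 1.0

# Multi-label: independent sigmoid per label, threshold each one separately
multi_label_probs = sigmoid(logits)
multi_hot = (multi_label_probs >= 0.5).astype(float)

print(one_hot)    # [0. 0. 0. 1. 0.]
print(multi_hot)  # [1. 0. 1. 1. 0.]
```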

Bugs & Blockers

N/A

Concepts That Need More Time

How to set up the output of multi-class classification to avoid numerical round-off errors (passing the expression for the output a into the loss instead of computing a first and using it directly) still feels a bit confusing to me for now

Tomorrow

Watch week 2 of course 2 of the Machine Learning specialization

Implement a 2-layer NN from scratch in NumPy (no Keras)

Re-derive backprop for the 2-layer case

Implement a simple multi-class (3 or 4 classes) NN using softmax as the output activation function

Wins

Implemented a multilayer NN from scratch in NumPy

#ml