Phase 2 — Classical ML
May 20, 2026
Continue Machine Learning Specialization (Advanced Learning Algorithms)
What I Did
Implemented a 2-layer neural network from scratch in Python using NumPy (link)
Implemented a multilayer neural network from scratch in Python using NumPy (link)
Derived backpropagation for the general case (multilayer neural network) where the activation function is sigmoid
Watched most of week 2 of course 2 of the Machine Learning specialization
Implemented a 2-layer NN from scratch in NumPy (no Keras)
Re-derived backprop for the 2-layer case and also the general case (beautiful result)
Implemented a multilayer NN from scratch in NumPy
What I Learned
Training a neural network involves three steps: specify the model, specify the loss and cost functions, then train the model on the training dataset to minimize the cost function.
The loss function measures how wrong the model is on a single training example
The cost function is the average of the loss across the entire training dataset
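A minimal sketch of the loss/cost distinction, assuming binary cross-entropy as the loss (my choice for illustration; the function names and the numbers are made up):

```python
import numpy as np

def binary_cross_entropy_loss(y_true, y_pred):
    """Loss: one value per training example."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def cost(y_true, y_pred):
    """Cost: the mean of the per-example losses over the whole dataset."""
    return np.mean(binary_cross_entropy_loss(y_true, y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])   # made-up predictions for illustration

losses = binary_cross_entropy_loss(y_true, y_pred)   # shape (3,)
total_cost = cost(y_true, y_pred)                     # a single number
```

The confident correct prediction (0.9 for a true 1) gets a smaller loss than the hesitant one (0.6 for a true 1), and the cost is just their average.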
Activation functions are non-linear functions applied to the output of each neuron. Without activation functions, stacked layers would collapse into a single linear transformation
Different types of activation functions exist, such as sigmoid, linear, ReLU and softmax
We need non-linear activation functions because without them, the transformation from one layer to the next is just a linear transformation and the network gains nothing from its depth. The non-linear functions let the model learn arbitrarily complex functions by composing very simple non-linear pieces
A network with no non-linear activation is basically linear regression
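A quick NumPy check of the collapse, as a sketch with made-up shapes and random weights: two layers without activation give exactly the same output as one linear layer with W = W2 W1 and b = W2 b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))            # one input with 4 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))

# Two layers with no (i.e. linear) activation
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

same = np.allclose(two_layer, one_layer)
```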
It's a bit confusing that we have a linear activation function, since activation functions are supposed to be non-linear; we can just refer to it as no activation
The choice of activation function for the output layer depends on the type of problem we are trying to solve: binary classification → sigmoid, multi-class classification → softmax, regression → ReLU (non-negative targets) or linear (any real number)
The default choice of activation function for hidden layers is ReLU
So why ReLU? It is faster to compute than sigmoid, and it doesn't saturate on the positive side, so it is less prone to vanishing gradients than sigmoid. In practice, neural networks with ReLU hidden layers tend to train faster and often reach better accuracy.
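A small sketch of the saturation argument, comparing the derivatives of sigmoid and ReLU at a few made-up inputs (function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # derivative of max(0, z); taken to be 0 at z = 0 here
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 1.0, 10.0])
sg = sigmoid_grad(z)   # tiny at both ends: sigmoid saturates
rg = relu_grad(z)      # exactly 1 for every positive input
```

The sigmoid derivative never exceeds 0.25 and is close to zero for large |z|, while the ReLU derivative stays at 1 on the positive side.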
We use multi-class classification when we have more than two classes; the output activation is softmax and the loss function is categorical cross-entropy.
Softmax converts logits into a probability distribution over K classes
Categorical cross-entropy measures how far the predicted distribution is from the true (one-hot) distribution
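A minimal NumPy sketch of both pieces, assuming K = 3 and made-up logits (function names are mine):

```python
import numpy as np

def softmax(logits):
    # subtracting the max does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs):
    # loss per example: -sum_k y_k * log(p_k)
    return -np.sum(y_onehot * np.log(probs), axis=-1)

logits = np.array([[2.0, 1.0, 0.1]])      # made-up logits, K = 3
y_onehot = np.array([[1.0, 0.0, 0.0]])    # true class is index 0

probs = softmax(logits)                   # each row sums to 1
loss = categorical_cross_entropy(y_onehot, probs)
```

Because the target is one-hot, the loss reduces to minus the log of the probability assigned to the true class.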
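Creating the model in TensorFlow:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(10, activation='linear'),  # 10 classes; this layer outputs logits
])
# from_logits=True: the loss applies softmax internally, which is more numerically stable
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
A less stable version would be to use Dense(10, activation='softmax') as the final layer together with loss=SparseCategoricalCrossentropy()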
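To see why working from logits can be more stable, here is a NumPy sketch with deliberately extreme, made-up logits. It is not TensorFlow's actual implementation, only an illustration of the kind of rearrangement (the log-sum-exp trick) that becomes possible when the loss sees the logits instead of already-computed probabilities:

```python
import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])   # extreme made-up logits
true_class = 2

# Route 1: compute softmax probabilities first, then take the log
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    probs = np.exp(logits) / np.sum(np.exp(logits))
    naive_loss = -np.log(probs[true_class])   # overflows: not a finite number

# Route 2: work on the logits directly with the log-sum-exp trick
m = np.max(logits)
log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
stable_loss = log_sum_exp - logits[true_class]   # finite
```

Mathematically both routes compute the same loss; only the second survives the floating-point arithmetic.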
The Adam optimizer adjusts the learning rate per parameter based on past gradients
We use the Adam optimizer for the following reasons:
Adapts the learning rate for each weight individually
Larger learning rates for weights with consistent gradients
Smaller learning rates for weights with noisy gradients
It also typically converges faster and more reliably than plain SGD.
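A sketch of one Adam update in NumPy, using the commonly quoted default hyperparameters (the function name and the toy gradients are assumptions for illustration). Parameter 0 gets the same gradient every step, parameter 1 gets a gradient whose sign keeps flipping:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step size
    return w, m, v

w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

# parameter 0: consistent gradient; parameter 1: sign-flipping (noisy) gradient
grads = [np.array([1.0, 1.0]), np.array([1.0, -1.0])] * 50
for t, g in enumerate(grads, start=1):
    w, m, v = adam_step(w, g, m, v, t)
```

After 100 steps the consistent parameter has moved steadily in one direction, while the noisy one has barely moved because its averaged gradient stays close to zero.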
An epoch is one complete pass through the entire training dataset
We can also split our training dataset into batches; common batch sizes are 32, 64 and 128
An iteration is one gradient update on one batch
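How the three terms relate, as a small sketch with a hypothetical dataset of 1000 examples (the gradient update itself is left out):

```python
import math

num_examples = 1000   # hypothetical dataset size
batch_size = 32
epochs = 3

iterations_per_epoch = math.ceil(num_examples / batch_size)

iterations = 0
for epoch in range(epochs):
    for start in range(0, num_examples, batch_size):
        end = min(start + batch_size, num_examples)  # last batch may be smaller
        # one gradient update on examples[start:end] would happen here
        iterations += 1
```

With these numbers there are 32 iterations per epoch (31 full batches plus one batch of 8), so 96 iterations in total.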
Forward propagation:
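In the usual vectorized notation (assumed here: examples stacked as columns of X, layers numbered 1 to L, g^{[l]} the activation of layer l), the forward pass can be written as:

```latex
\begin{aligned}
A^{[0]} &= X \\
Z^{[l]} &= W^{[l]} A^{[l-1]} + b^{[l]}, \qquad l = 1, \dots, L \\
A^{[l]} &= g^{[l]}\!\left(Z^{[l]}\right)
\end{aligned}
```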
Backward propagation:
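A sketch of the backward pass for the sigmoid case, under assumptions not stated in these notes: m examples stacked as columns, sigmoid activations in every layer, binary cross-entropy as the loss, and dZ^{[l]} denoting the per-example gradient of the loss with respect to Z^{[l]}. With A^{[0]} = X:

```latex
\begin{aligned}
dZ^{[L]} &= A^{[L]} - Y \\
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} \left(A^{[l-1]}\right)^{T} \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dZ^{[l-1]} &= \left(\left(W^{[l]}\right)^{T} dZ^{[l]}\right) \odot A^{[l-1]} \odot \left(1 - A^{[l-1]}\right)
\end{aligned}
```

The last line uses the sigmoid derivative g'(Z) = A(1 - A); for another activation that factor would be replaced by its own derivative.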
The way we initialize the weights is very important and can determine whether we get vanishing or exploding gradients.
We can use Xavier initialization (best for sigmoid/tanh) and He initialization (best for ReLU)
Xavier (Glorot) initialization — for sigmoid and tanh:
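One common statement of it, assuming n_in and n_out denote the number of inputs and outputs of the layer (a simplified variant with variance 1/n_in is also used in some courses):

```latex
W^{[l]} \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)
\quad\text{or}\quad
W^{[l]} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
```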
Or equivalently in code form:
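A minimal NumPy sketch of the Glorot normal variant (variance 2 / (n_in + n_out)), assuming weights of shape (n_out, n_in); the function name xavier_init and the layer sizes are hypothetical:

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Glorot normal: zero mean, variance 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_init(n_in=256, n_out=128, rng=rng)
```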
He (Kaiming) initialization — for ReLU and variants:
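In its usual normal form, assuming n_in denotes the number of inputs to the layer:

```latex
W^{[l]} \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right)
```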
Or equivalently:
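A minimal NumPy sketch of He normal initialization (variance 2 / n_in), assuming weights of shape (n_out, n_in); the function name he_init and the layer sizes are hypothetical:

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # He normal: zero mean, variance 2 / n_in
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_init(n_in=256, n_out=128, rng=rng)
```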
We also have multi-label classification, where each example can belong to multiple classes at the same time. Compare this with multi-class classification, where each example belongs to exactly one class.
Multi-label targets are multi-hot encoded ([1, 0, 1, 1, 0]) while multi-class targets are one-hot encoded ([0, 0, 1, 0, 0])
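The two encodings, built in NumPy for 5 classes (helper names are mine):

```python
import numpy as np

num_classes = 5

def one_hot(class_index, num_classes):
    y = np.zeros(num_classes, dtype=int)
    y[class_index] = 1
    return y

def multi_hot(class_indices, num_classes):
    y = np.zeros(num_classes, dtype=int)
    y[list(class_indices)] = 1
    return y

multi_class_target = one_hot(2, num_classes)             # [0, 0, 1, 0, 0]
multi_label_target = multi_hot([0, 2, 3], num_classes)   # [1, 0, 1, 1, 0]
```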
Bugs & Blockers
N/A
Concepts That Need More Time
The choice of output layer for multi-class classification to reduce errors in calculation (using the expression for the output a instead of using a directly) still feels a bit confusing to me for now
Tomorrow
Watch week 2 of course 2 of the Machine Learning specialization
Implement a 2-layer NN from scratch in NumPy (no Keras)
Re-derive backprop for the 2-layer case
Implement a simple multi-class (like 3 or 4 classes) NN using the softmax function as the activation function
Implemented a simple multi-class (like 3 or 4 classes) NN using the softmax function as the activation function
Wins
Implemented a multilayer NN from scratch in NumPy