Phase 2 — Classical ML
May 22, 2026
Continue Machine Learning Specialization(Advanced Learning Algorithms)
What I Did
Implemented decision tree from scratch and used it in a sample training dataset
Watched a couple week 4 lectures but didn't finish it due to time constraint
What I Learned
Learned a whole bunch about decision trees.
Decision tree is a model that makes prediction by answering a series of yes/no question about the input data
There are steps to which a decision tree learns:
- We start at the root level with all training examples
- We pick the best feature to use in splitting(the aim is to maximize the purity and we can pick the feature that gives the highest information gain)
- We partition the data based on that feature
- We repeat the process to each of the child nodes(recursively)
- We stop when a certain criterion is met
The stopping criterion may be one or all of the following:
- The node has 100% purity
- We reached maximum depth
- The information gain by further splits is below a threshold
- The number of examples in a node is below a minimum
The stopping criteria is actually very useful because it helps prevent the model from memorizing the training example(overfitting)
The way we measure purity is using the entropy:
where is the fraction of positive examples in the node.
Some key properties of entropy:
- → maximum entropy
- → maximum purity
Splitting a node helps reduce entropy, the amount of reduction is called information gain.
For a split that produces a node with weight and right weight
The split with the highest information gain is chosen.
Information gain is a splitting criterion.
If we have a categorical feature that can take on values, we create binary features(0 and 1)
For continuous features like weights, we need to pick a certain threshold with the highest information gain. The split becomes weight < threshold → yes/no
The procedure for using continuous features follows:
- We sort the examples by the feature value\
- We try each midpoint between consecutive distinct values as a candidate threshold
- We compute the information gain for each threshold
- We pick the threshold with the highest information gain
We can also use decision trees for regression, this way we can predict continuous values. The way we do this follows:
- Each leaf predicts the average of the target values of the examples in that leaf
- The splitting criterion we use here is the variance reduction
Decision trees are very sensitive when there is a small change in training data, they tend to produce different trees for each little change made to the training data. This is a high variance problem and the way to solve this is to use tree ensembles where we combine many trees to reduce the variance
A tree ensemble is basically a collection of trees
We can build an ensemble by sampling data from the training set with replacement. This gives us enough data to create multiple trees and train them with sampled data, we can then choose the one prediction that exists the most in all the tree for classification problems and we use the average in case of regression
Random forest is an ensemble decision tree with two sources of randomness, each tree is trained on a different bootstrap sample and at each node, we only consider a random set of features . Typically
Random features is what forces the trees to be more diverse and reduce the correlation between them. More diversity ⇒ better variance reduction
We also have boosted trees which unlike random forest are trained independently in parallel, boosting trees train sequentially. The aim of boosted tree is to sample in a way that the misclassified examples in the previous trees are better classified in the new tree.
XGBoost is the most important open source implementation of boosted trees(would love to contribute one day)
Decision trees are good for some problem and are bad at other, they are very good for tabular data with labels where neural networks can learn from any sort of data, decision trees are also comparably faster than neural network and so they train faster.
Bugs & Blockers
N/A
Concepts That Need More Time
The choosing the output of the multi-class classification due to errors in calculation(using the expression for the output a instead of using a directly) actually feels a bit confusing to me for now(still haven't figured this out man, forgot again)
Tomorrow
Implement a more robust decision tree from scratch(maybe implement random forest or boosted tree)
Figure out the standing concepts that need more time
Go through things I've learnt so far and revise and fill in some missing gaps
Read the grokking ML book chapter 8-10
Wins
Implemented decision tree from scratch and used it in a sample training dataset(google colab)
Learned a new set of things and models
Finished the Advanced Learning Algorithms course

Advanced Learning Algorithm Notes: Advanced Learning Algorithms Notes.pdf