Skip to content
truthxify
← Journal

Phase 2 — Classical ML

May 22, 2026

Continue Machine Learning Specialization(Advanced Learning Algorithms)

What I Did

Implemented decision tree from scratch and used it in a sample training dataset

Watched a couple week 4 lectures but didn't finish it due to time constraint

What I Learned

Learned a whole bunch about decision trees.

Decision tree is a model that makes prediction by answering a series of yes/no question about the input data

There are steps to which a decision tree learns:

  • We start at the root level with all training examples
  • We pick the best feature to use in splitting(the aim is to maximize the purity and we can pick the feature that gives the highest information gain)
  • We partition the data based on that feature
  • We repeat the process to each of the child nodes(recursively)
  • We stop when a certain criterion is met

The stopping criterion may be one or all of the following:

  • The node has 100% purity
  • We reached maximum depth
  • The information gain by further splits is below a threshold
  • The number of examples in a node is below a minimum

The stopping criteria is actually very useful because it helps prevent the model from memorizing the training example(overfitting)

The way we measure purity is using the entropy:

H(p1)=p1log2(p1)(1p1)log2(1p1)H(p_1) = -p_1log_2(p_1) - (1-p_1)log_2(1-p_1)

where p1p_1 is the fraction of positive examples in the node.

Some key properties of entropy:

  • H(0.5)=1H(0.5) = 1 → maximum entropy
  • H(0)=H(1)=1H(0) = H(1) = 1 → maximum purity

Splitting a node helps reduce entropy, the amount of reduction is called information gain.

For a split that produces a node with weight wleftw^{left} and right weight wrightw^{right}

Information Gain=H(p1parent)[wleftH(p1left)+wrightH(p1right)]Information \ Gain = H(p_1^{parent}) - \left[w^{left}H(p_1^{left}) + w^{right}H(p_1^{right})\right]

The split with the highest information gain is chosen.

Information gain is a splitting criterion.

If we have a categorical feature that can take on KK values, we create KK binary features(0 and 1)

For continuous features like weights, we need to pick a certain threshold with the highest information gain. The split becomes weight < threshold → yes/no

The procedure for using continuous features follows:

  • We sort the examples by the feature value\
  • We try each midpoint between consecutive distinct values as a candidate threshold
  • We compute the information gain for each threshold
  • We pick the threshold with the highest information gain

We can also use decision trees for regression, this way we can predict continuous values. The way we do this follows:

  • Each leaf predicts the average of the target values of the examples in that leaf
  • The splitting criterion we use here is the variance reduction
Variance Reduction=V(parent)[wleftV(left)+wrightV(right)]Variance \ Reduction = V(parent) - \left[w^{left}V(left) + w^{right}V(right)\right]

Decision trees are very sensitive when there is a small change in training data, they tend to produce different trees for each little change made to the training data. This is a high variance problem and the way to solve this is to use tree ensembles where we combine many trees to reduce the variance

A tree ensemble is basically a collection of trees

We can build an ensemble by sampling data from the training set with replacement. This gives us enough data to create multiple trees and train them with sampled data, we can then choose the one prediction that exists the most in all the tree for classification problems and we use the average in case of regression

Random forest is an ensemble decision tree with two sources of randomness, each tree is trained on a different bootstrap sample and at each node, we only consider a random set of features k<nk < n. Typically k=nk = \sqrt{n}

Random features is what forces the trees to be more diverse and reduce the correlation between them. More diversity ⇒ better variance reduction

We also have boosted trees which unlike random forest are trained independently in parallel, boosting trees train sequentially. The aim of boosted tree is to sample in a way that the misclassified examples in the previous trees are better classified in the new tree.

XGBoost is the most important open source implementation of boosted trees(would love to contribute one day)

Decision trees are good for some problem and are bad at other, they are very good for tabular data with labels where neural networks can learn from any sort of data, decision trees are also comparably faster than neural network and so they train faster.

Bugs & Blockers

N/A

Concepts That Need More Time

The choosing the output of the multi-class classification due to errors in calculation(using the expression for the output a instead of using a directly) actually feels a bit confusing to me for now(still haven't figured this out man, forgot again)

Tomorrow

Implement a more robust decision tree from scratch(maybe implement random forest or boosted tree)

Figure out the standing concepts that need more time

Go through things I've learnt so far and revise and fill in some missing gaps

Read the grokking ML book chapter 8-10

Wins

Implemented decision tree from scratch and used it in a sample training dataset(google colab)

Learned a new set of things and models

Finished the Advanced Learning Algorithms course

Advanced Learning Algorithms Certificate.jpeg

Advanced Learning Algorithm Notes: Advanced Learning Algorithms Notes.pdf