Phase 2 — Classical ML

June 8, 2026

Continued course 3 of the Machine Learning Specialization (Recommender Systems)

What I Did

Implemented reinforcement learning for a lunar lander example

What I Learned

Reinforcement Learning (RL) trains an agent to take actions in an environment. The agent is not told which action is correct; it gets a reward signal instead.

Some ML tasks are a poor fit for supervised learning, like flying a helicopter or running a Mars rover, so we use reinforcement learning techniques for them.

The return is the sum of the rewards the agent collects from a given starting point, with future rewards discounted:

$$G = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$$

where $R_t$ is the reward received at step $t$ and $\gamma \in [0,1)$ is the discount factor.
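To make the discounting concrete, here is a minimal sketch that computes the return for a list of rewards. The function name `discounted_return` and the sample rewards are made up for illustration; they are not from the course code.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma*R_2 + gamma^2*R_3 + ... for a list of rewards.

    rewards[0] plays the role of R_1, so the k-th list entry (0-indexed)
    is weighted by gamma**k.
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical episode: three steps with no reward, then a reward of 100.
example_rewards = [0, 0, 0, 100]
print(discounted_return(example_rewards, gamma=0.5))  # 12.5
```

With $\gamma = 0.5$ the reward of 100 arrives three steps late, so it is worth only $0.5^3 \cdot 100 = 12.5$. A smaller $\gamma$ makes the agent more impatient.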

A policy $\pi$ is a function that says, given a state $s$, which action to take: $\pi(s)=a$

The main goal of RL is to find a policy $\pi$ that maximizes expected return

This framework is a Markov Decision Process (MDP). It is Markov because the next state depends only on the current state and the action, not on the history before it.

The state-action value function $Q(s,a)$ is the return we get if we:

start in state $s$
take action $a$ once
then behave optimally from then on

The point here is that if we knew $Q(s, a)$ for every state and action, we would know the optimal policy: in each state, just pick the action with the highest $Q$.

$$\pi(s) = \arg\max_a Q(s, a)$$
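As a small sketch of the argmax rule, assume $Q$ is stored as a nested dictionary (a "Q-table"). The state names, action names, and numbers below are invented for illustration only.

```python
def greedy_policy(q_table, state):
    """Return the action with the highest Q value in the given state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


# Hypothetical Q-table: q_table[state][action] = Q(state, action)
q_table = {
    "s0": {"left": 12.5, "right": 6.25},
    "s1": {"left": 25.0, "right": 50.0},
}

print(greedy_policy(q_table, "s0"))  # left
print(greedy_policy(q_table, "s1"))  # right
```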

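The Bellman equation expresses $Q(s,a)$ as the reward for the current state plus the discounted best value available from the next state $s'$ (the state reached by taking action $a$ in $s$):

$$Q(s,a) = R(s) + \gamma \max_{a'} Q(s', a')$$

If $s$ is a terminal state, $Q(s,a) = R(s)$.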
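The Bellman equation can be turned into a simple solver for a tiny problem by applying it repeatedly until the values stop changing. The sketch below assumes a made-up deterministic environment: six states in a row, terminal states at both ends with rewards 100 and 40, and $\gamma = 0.5$. All names and numbers are assumptions for illustration, not the lunar lander setup.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]       # R(s) for states 0..5 (assumed values)
TERMINAL = {0, 5}                     # episodes end at either edge
ACTIONS = {"left": -1, "right": +1}   # deterministic moves along the row


def solve_q(n_sweeps=50):
    """Apply the Bellman equation repeatedly to every (state, action) pair."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(len(REWARDS))}
    for _ in range(n_sweeps):
        for s in range(len(REWARDS)):
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    # Terminal case: Q(s, a) = R(s)
                    q[s][a] = float(REWARDS[s])
                else:
                    s_next = s + step
                    # Q(s, a) = R(s) + gamma * max_a' Q(s', a')
                    q[s][a] = REWARDS[s] + GAMMA * max(q[s_next].values())
    return q


q = solve_q()
for s in range(len(REWARDS)):
    print(s, q[s])
```

In this toy setup, state 4 ends up with $Q(4, \text{right}) = 20$ and $Q(4, \text{left}) = 6.25$, so the greedy policy heads for the nearby reward of 40 rather than the distant reward of 100.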

When the state space is continuous, as it is in real life, we use a neural network to approximate the state-action value function $Q(s,a)$, which in turn lets us pick good actions.

Several refinements help when doing RL: an improved neural network architecture, an $\epsilon$-greedy policy, mini-batches, and soft updates.
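Two of these refinements are short enough to sketch in plain Python. The $\epsilon$-greedy rule picks a random action with probability $\epsilon$ and the greedy action otherwise. A soft update moves the target Q-network's weights a small step toward the Q-network's weights instead of copying them outright. The function names, the flat lists of weights, and the `tau` value are simplifying assumptions for illustration; a real implementation would operate on the network's weight tensors.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q value
    return max(range(len(q_values)), key=lambda i: q_values[i])


def soft_update(target_weights, online_weights, tau):
    """Blend weights: target <- (1 - tau) * target + tau * online."""
    return [
        (1 - tau) * t + tau * w
        for t, w in zip(target_weights, online_weights)
    ]


# Hypothetical Q values for three actions in one state.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)

# Hypothetical weights; a small tau means the target changes slowly.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
```

Keeping `tau` small means the target used in the Bellman update changes gradually, which is the usual motivation for keeping a separate target Q-network.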

RL works well in simulated environments such as games and in robotics under controlled settings, but most of these systems perform worse in real life. It also has far fewer applications than supervised and unsupervised learning, though it is currently seeing heavy use in the post-training of LLMs.

Bugs & Blockers

N/A

Concepts That Need More Time

Currently having trouble with some concepts, such as the two different networks (Q-network and target Q-network)

Will also need to do a deep dive into the RL code, work through more examples, and train agents on more problems (look closely at the Gymnasium Python library)

Still need to review the full set of RL terms and concepts to understand how it works and what we are optimizing for.

Tomorrow

Take some time around RL to fully understand the code and some concepts as well

Wins

Finished the Machine Learning Specialization

Machine Learning Certificate.jpeg

Unsupervised Learning, Recommenders, Reinforcement Learning.jpeg

Unsupervised Learning Note: Unsupervised Machine Learning.pdf

#ml