truthxify

Phase 2 — Classical ML

June 8, 2026

Continued Course 3 of the Machine Learning Specialization (Recommender Systems)

What I Did

Implemented reinforcement learning for a lunar lander example

What I Learned

Reinforcement learning (RL) trains an agent to take actions in an environment. The agent is never told which action is correct; it only receives a reward signal.

Some tasks, such as flying a helicopter or driving a Mars rover, are hard to frame as supervised learning because there is no labeled dataset of correct actions, so we use reinforcement learning instead.

The return is the sum of the rewards the agent collects from a given starting point, with future rewards discounted:

$$G = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots$$

where $R_t$ is the reward received at step $t$ and $\gamma \in [0,1)$ is the discount factor.
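To make the discounting concrete, here is a minimal sketch that evaluates the return for a finite list of rewards. The function name `discounted_return` and the example rewards are made up for illustration; they are not taken from the course code.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite reward list."""
    total = 0.0
    for step, reward in enumerate(rewards):
        # step is 0 for R_1, so the first reward is not discounted
        total += (gamma ** step) * reward
    return total


# Hypothetical episode: no reward for three steps, then 100 at the end.
example_rewards = [0, 0, 0, 100]
print(discounted_return(example_rewards, gamma=0.5))  # 12.5
```

With $\gamma = 0.5$ the final reward is multiplied by $0.5^3$, which is why a reward that arrives later is worth less than the same reward received immediately.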

A policy $\pi$ is a function that, given a state $s$, says which action to take: $\pi(s) = a$.

The main goal of RL is to find a policy $\pi$ that maximizes the expected return.

This framework is a Markov Decision Process (MDP). It is Markov because the next state depends only on the current state and the action, not on the history.

The state-action value function $Q(s,a)$ is the return we get if we:

  • start in state $s$
  • take action $a$ once
  • then behave optimally from then on

If we know $Q(s,a)$ for every state and action, we also know the optimal policy: in each state, pick the action with the highest $Q$:

$$\pi(s) = \arg\max_a Q(s, a)$$
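As a small sketch of the argmax step, assume the Q-values are already known and stored in a nested dictionary. The state names, action names, and numbers below are hypothetical placeholders.

```python
def greedy_policy(q_table, state):
    """Return the action with the highest Q-value in the given state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


# Hypothetical Q-table: q_table[state][action] = Q(state, action)
q_table = {
    "s2": {"left": 50.0, "right": 12.5},
    "s5": {"left": 6.25, "right": 20.0},
}

print(greedy_policy(q_table, "s2"))  # left
print(greedy_policy(q_table, "s5"))  # right
```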

The Bellman equation expresses $Q(s,a)$ in terms of the reward for the current state and the best value available from the next state $s'$ reached by taking action $a$:

$$Q(s,a) = R(s) + \gamma \max_{a'} Q(s', a')$$

If $s$ is a terminal state, $Q(s,a) = R(s)$.
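One way to see the Bellman equation at work is to apply it repeatedly on a tiny deterministic world until the values stop changing. The sketch below assumes a made-up line of six states where the two end states are terminal, with rewards of 100 and 40 and $\gamma = 0.5$; the layout, rewards, and names are illustrative assumptions, not something the lunar lander code uses.

```python
GAMMA = 0.5
# Hypothetical world: states 1..6 in a row, reward only at the two ends.
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}


def solve_q(sweeps=50):
    """Repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a')."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a, move in ACTIONS.items():
                if s in TERMINAL:
                    # Terminal case: Q(s,a) = R(s)
                    q[(s, a)] = REWARDS[s]
                else:
                    s_next = s + move
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q


q = solve_q()
for s in (2, 3, 4, 5):
    print(s, q[(s, "left")], q[(s, "right")])
```

In this toy world, state 4 prefers going left (12.5) over right (10.0), even though the reward of 40 is closer, because the discounted 100 is still worth more.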

When the state space is continuous, as it usually is in real life, we use a neural network to approximate the state-action value function $Q(s,a)$, which in turn lets us pick good actions.
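A network like this is usually trained on targets built from the Bellman equation. The sketch below shows only that target computation, under some assumptions: each experience is a tuple `(state, action, reward, next_state, done)`, `done` marks the end of an episode, and `toy_q` is a hypothetical stand-in for a real network that returns one Q-value per action.

```python
def td_targets(batch, q_next, gamma):
    """Build one training target per experience tuple.

    target = reward                                  if the episode ended
    target = reward + gamma * max_a' Q(next_state)   otherwise
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(q_next(next_state)))
    return targets


def toy_q(state):
    """Hypothetical stand-in for a Q-network with two actions."""
    return [state * 1.0, state * 2.0]


batch = [
    (0, 1, 1.0, 3, False),  # ordinary step
    (3, 0, 10.0, 4, True),  # step that ends the episode
]
print(td_targets(batch, toy_q, gamma=0.5))  # [4.0, 10.0]
```

The network would then be fit so that its prediction for each `(state, action)` pair moves toward the corresponding target.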

Several refinements help in practice: the neural network architecture, an $\epsilon$-greedy policy, mini-batches, and soft updates.
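Two of these refinements are small enough to sketch directly. The code below is a simplified illustration, assuming Q-values and network weights are plain Python lists; the function names and the parameter `tau` are choices made for this example. The $\epsilon$-greedy rule explores with probability $\epsilon$ and otherwise picks the best-known action, and the soft update nudges the target Q-network's weights a small step toward the Q-network's weights instead of copying them outright.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def soft_update(target_weights, online_weights, tau):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_online, w_target in zip(online_weights, target_weights)
    ]


print(epsilon_greedy([1.0, 5.0, 2.0], epsilon=0.0))   # 1 (always greedy)
print(soft_update([0.0, 0.0], [1.0, 2.0], tau=0.5))   # [0.5, 1.0]
```

In practice `tau` is typically much smaller than 0.5, so the target network changes slowly and the training targets stay more stable.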

RL works well in simulated environments such as games, and in robotics under controlled settings, but it often performs worse in the real world. It also has far fewer applications than supervised and unsupervised learning, although it is now seeing wide use in the post-training of LLMs.

Bugs & Blockers

N/A

Concepts That Need More Time

Still unclear on some concepts, such as why there are two different networks (the Q-network and the target Q-network).

Need to do a deep dive into the RL code, work through more examples, and train agents on more problems (look closely at the Gymnasium Python library).

Still need to review the full set of RL terms and concepts to understand how it works and what we are optimizing for.

Tomorrow

Spend more time on RL to fully understand the code and the concepts.

Wins

Finished the Machine Learning Specialization

Machine Learning Certificate.jpeg

Unsupervised Learning, Recommenders, Reinforcement Learning.jpeg

Unsupervised Learning Note: [Unsupervised Learning Notes]( Unsupervised Machine Learning.pdf )