Phase 2 — Classical ML
June 8, 2026
Continue Course 3 of the Machine Learning Specialization (Recommender Systems)
What I Did
Implemented reinforcement learning for a lunar lander example
What I Learned
Reinforcement learning (RL) trains an agent to take actions in an environment. The agent is not told which action is correct; it gets a reward signal instead.
There are some ML tasks we can't use supervised learning for, like flying a helicopter or driving a Mars rover, so we need reinforcement learning techniques there.
The return is the sum of the rewards the agent collects from a given starting point, but with future rewards discounted:
$$\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots$$
where $R_t$ is the reward received at step $t$ and $\gamma$ is the discount factor
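A quick sketch to check my understanding of the discounted return. The function name and the reward sequence are made up for illustration; the only idea used is that the reward at each later step gets multiplied by one more factor of the discount.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward at 0-indexed step t by gamma**t."""
    total = 0.0
    # Work backwards: return from here = reward now + gamma * return from the next step.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: no reward for three steps, then 100 at the end.
rewards = [0, 0, 0, 100]
print(discounted_return(rewards, gamma=0.5))  # 12.5
```

With a discount factor of 0.5, the final reward of 100 is worth only 0.5^3 * 100 = 12.5 from the starting point, which is why a smaller discount factor makes the agent more impatient.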
A policy $\pi$ is a function that says, given a state $s$, what action $a$ to take: $\pi(s) = a$
The main goal of RL is to find a policy that maximizes expected return
It is also important to note that this framework is a Markov Decision Process (MDP). It is Markov because the next state depends only on the current state and the action, not on the history
The state-action value function $Q(s, a)$ is the return we get if we:
- start in state $s$
- take action $a$ once
- then behave optimally from then on
The point here is that if we know $Q(s, a)$ for every state and action, we know the optimal policy: in each state $s$ we just pick the action $a$ with the highest $Q(s, a)$
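A small sketch of reading a policy off a table of state-action values. The table below is hypothetical (6 states, 2 actions, numbers chosen by hand), and `greedy_policy` is just a name I picked; the technique is a plain argmax over actions for each state.

```python
import numpy as np

# Hypothetical value table: one row per state, one column per action
# (column 0 = "left", column 1 = "right").
Q = np.array([
    [100.0, 100.0],
    [50.0, 12.5],
    [25.0, 6.25],
    [12.5, 10.0],
    [6.25, 20.0],
    [40.0, 40.0],
])

def greedy_policy(Q):
    """For each state, return the index of the action with the highest value."""
    # np.argmax breaks ties by taking the first (lowest-index) action.
    return np.argmax(Q, axis=1)

print(greedy_policy(Q))  # [0 0 0 0 1 0]
```

Only state 4 prefers going right here; every other state either prefers left or is tied, in which case the argmax falls back to the first action.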
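The Bellman equation expresses $Q(s, a)$ recursively in the form below:
$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$
where $s'$ is the state we reach by taking action $a$ in state $s$ and $a'$ is an action we could take in $s'$.
If $s$ is a terminal state, $Q(s, a) = R(s)$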
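To see the Bellman equation in action, here is a minimal sketch that applies it repeatedly on a tiny made-up problem until the values settle. Everything about the environment is an assumption for illustration: a line of 6 states, terminal states at both ends with rewards 100 and 40, deterministic left/right moves, and a discount factor of 0.5. The update itself is "reward of the current state plus the discounted best value of the next state", with terminal states getting just their reward.

```python
import numpy as np

# Hypothetical 6-state line world: states 0 and 5 are terminal,
# actions are 0 = left and 1 = right, and moves are deterministic.
rewards = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])
terminal = np.array([True, False, False, False, False, True])
gamma = 0.5
n_states, n_actions = 6, 2

def next_state(s, a):
    """Move one cell left (a=0) or right (a=1), staying inside the line."""
    return max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)

def q_iteration(n_sweeps=50):
    """Repeatedly apply the Bellman update to every (state, action) pair."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        new_Q = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                if terminal[s]:
                    # Terminal state: the value is just the reward.
                    new_Q[s, a] = rewards[s]
                else:
                    # Reward now plus the discounted best value of the next state.
                    s_next = next_state(s, a)
                    new_Q[s, a] = rewards[s] + gamma * Q[s_next].max()
        Q = new_Q
    return Q

Q = q_iteration()
print(Q)
```

For example, from state 1 going left reaches the terminal state worth 100, so the value is 0 + 0.5 * 100 = 50. This table-based approach only works because the state space is tiny; it is not the method used for the lunar lander.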
When we have continuous state spaces, as we do in real life, we use a neural network to approximate the state-action value function $Q(s, a)$, which in turn lets us pick good actions
We can also make some refinements when doing RL, such as improving the neural network architecture, adjusting the action-selection policy (e.g. $\epsilon$-greedy), training on mini-batches and using soft updates.
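Two of these refinements are small enough to sketch in plain NumPy. This is my own simplified version, not the course code: the weight lists stand in for the parameters of a Q-network and a target Q-network, and `epsilon`, `tau` and both function names are assumptions for illustration. The first function picks a random action some of the time so the agent keeps exploring; the second nudges the target weights a small step towards the online weights instead of copying them outright.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def soft_update(target_weights, online_weights, tau):
    """Move each target weight a small fraction (tau) towards the online weight."""
    return [(1 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]

# Hypothetical weights standing in for the two networks' parameters.
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

target = soft_update(target, online, tau=0.01)
print(target[1])  # each entry has moved 1% of the way from 0 towards 1

# With epsilon = 0 the choice is always the action with the highest value.
print(epsilon_greedy([1.0, 5.0, 2.0], epsilon=0.0))  # 1
```

Because the target weights only drift slowly, the values the Q-network is trained against change gradually rather than jumping after every update.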
RL works well in simulated environments such as games and robotics in controlled settings, but many of these systems perform worse in real life. It also has far fewer applications than supervised and unsupervised learning, though it is seeing a lot of use in the post-training of LLMs right now
Bugs & Blockers
N/A
Concepts That Need More Time
Currently having trouble with some concepts, like why there are two different networks (Q-network and target Q-network)
Will also need to do a deep dive into the RL code, work through more examples and train agents on more problems (look closely at the Gymnasium Python library)
Still need to review the full set of RL terms and concepts to fully understand how it works and what we are optimizing for.
Tomorrow
Spend some time on RL to fully understand the code and some of the concepts
Wins
Finished the Machine Learning Specialization


Unsupervised Learning Note: [Unsupervised Learning Notes]( )