Mathematics Behind Reinforcement Learning: Key Concepts and Applications
Raj Shaikh

1. Markov Decision Processes (MDPs): The Stage for Decision-Making 🎭
An MDP is a formal framework used in RL to describe decision-making in situations where outcomes are partly random and partly under the agent’s control.
Key Ingredients of an MDP 🧪
- States (\( S \)):
  - The “where” of the agent.
  - Example: The grid location of a robot vacuum. 🧹
- Actions (\( A \)):
  - The “what” the agent can do.
  - Example: Move left, right, up, or down.
- Transition Probabilities (\( P(s'|s, a) \)):
  - The “how likely” the agent will land in a new state \( s' \) after taking action \( a \) in state \( s \).
  - Example: On a slippery floor, moving right might only succeed 70% of the time.
- Rewards (\( R(s, a, s') \)):
  - The “why” the agent does anything: it’s all about maximizing rewards.
  - Example: +10 points for reaching a goal, -5 for falling into a pit.
- Discount Factor (\( \gamma \)):
  - Balances short-term and long-term rewards.
  - Example: A cookie now 🍪 (reward = +10) vs. a dozen cookies later (reward = +120 but delayed).
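To see transition probabilities in action, here is a minimal sketch that samples the slippery-floor move with NumPy; the 70/30 split and the grid coordinates are illustrative assumptions, not part of the original example:
import numpy as np
# Illustrative assumption: moving Right from (0, 0) succeeds 70% of the time;
# otherwise the agent slips and stays where it is.
transition_probs = {((0, 0), "Right"): [((0, 1), 0.7), ((0, 0), 0.3)]}
rng = np.random.default_rng(0)
def sample_next_state(state, action):
    outcomes = transition_probs[(state, action)]
    next_states = [s for s, _ in outcomes]
    probs = [p for _, p in outcomes]
    idx = rng.choice(len(next_states), p=probs)
    return next_states[idx]
print(sample_next_state((0, 0), "Right"))  # usually (0, 1), sometimes (0, 0)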
MDP Dynamics in Action
Imagine teaching a robot to fetch coffee ☕:
- States (\( S \)):
  - \( s_1 \): Robot is at the starting point.
  - \( s_2 \): Robot is in the kitchen.
  - \( s_3 \): Robot has coffee.
- Actions (\( A \)):
  - Move to the kitchen, grab coffee, deliver coffee.
- Rewards (\( R \)):
  - \( +10 \) for delivering coffee.
  - \( -1 \) for wandering around aimlessly.
The Markov Property
An MDP assumes the Markov Property, which means:
- The future state \( s' \) depends only on the current state \( s \) and action \( a \), not the past:
\[ P(s' | s, a, \text{history}) = P(s' | s, a) \]
In simpler terms: No grudges! The agent doesn’t care about how it got here—only where it is and what it does next. 🚶♂️
Numerical Example: MDP in a Grid World
Suppose an agent is navigating a 2x2 grid.
| State | Action | Next State | Reward | Probability |
|---|---|---|---|---|
| (0,0) | Right | (0,1) | +0 | 1.0 |
| (0,1) | Down | (1,1) | +10 | 1.0 |
| (1,1) | Left | (1,0) | +0 | 1.0 |
- Initial State: (0,0)
- Action: Move Right → New State: (0,1), Reward: +0
- Action: Move Down → New State: (1,1), Reward: +10 🎉
Mathematical Formulation
- Transition Model:
  \[ P(s'|s,a) = \text{Probability of landing in } s' \text{ after taking } a \text{ in } s. \]
- Reward Function:
  \[ R(s, a, s') = \text{Immediate reward received after transitioning.} \]
- Goal: Find a policy \( \pi(a|s) \) that maximizes the expected sum of rewards:
  \[ \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t, s_{t+1})\right] \]
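To make the goal concrete, here is a tiny sketch that computes the discounted return for the grid-world trajectory above; \( \gamma = 0.9 \) is an assumed value, since the grid example does not fix a discount factor:
def discounted_return(rewards, gamma=0.9):
    # G = sum over t of gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))
# Rewards collected along (0,0) -> (0,1) -> (1,1): +0, then +10
print(discounted_return([0, 10]))  # 0 + 0.9 * 10 = 9.0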
Why MDPs Matter
- Foundation for RL:
  - MDPs formalize decision-making in uncertain environments.
- General Framework:
  - MDPs apply to games, robotics, resource management, and more.
- Sets the Stage:
  - Everything in RL, from Q-learning to policy gradients, builds on MDPs.
Code Example: MDP Simulation
Here’s a basic Python simulation of an MDP:
import numpy as np
# Define MDP components
states = ["Start", "Kitchen", "Goal"]
actions = ["Move", "Grab", "Deliver"]
rewards = {
    ("Start", "Move", "Kitchen"): 0,
    ("Kitchen", "Grab", "Kitchen"): 0,
    ("Kitchen", "Deliver", "Goal"): 10
}
transitions = {
    ("Start", "Move"): "Kitchen",
    ("Kitchen", "Grab"): "Kitchen",
    ("Kitchen", "Deliver"): "Goal"
}
# Simulate MDP
state = "Start"
total_reward = 0
for action in ["Move", "Grab", "Deliver"]:
    next_state = transitions[(state, action)]
    reward = rewards[(state, action, next_state)]
    total_reward += reward
    state = next_state
    print(f"State: {state}, Reward: {reward}")
print("Total Reward:", total_reward)
Fun Analogy
An MDP is like playing a board game 🎲:
- States: The squares on the board.
- Actions: Move left, right, up, or down.
- Rewards: Pick up coins or avoid traps.
- Transition Probabilities: Dice rolls determine where you land next.
The agent? It’s the player trying to win by strategizing its moves! 🏆
Mermaid.js Diagram: MDP Flow
graph TD
    Start[Start State] --> Action[Choose Action]
    Action --> Transition[Apply Transition Probabilities]
    Transition --> NextState[Move to Next State]
    NextState --> Reward[Receive Reward]
    Reward --> Action
2. Bellman Equations: The Crystal Ball of RL 🔮✨
What Are the Bellman Equations?
The Bellman Equations are the backbone of RL, describing the relationship between the value of a state (or state-action pair) and the rewards and transitions that follow. They tell the agent: “If you’re here, this is how good it’s likely to be, considering both immediate rewards and future prospects.”
Types of Bellman Equations
1. State-Value Function (\( V(s) \)): The Yelp for States ⭐
The state-value function measures how good it is to be in a state \( s \), assuming the agent follows a specific policy \( \pi \):
\[ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \, | \, s_0 = s\right] \]
Simpler Form (Recursive Bellman Equation):
\[ V^\pi(s) = \mathbb{E}_\pi[R(s, a) + \gamma V^\pi(s') \, | \, s] \]
Here:
- \( R(s, a) \): Immediate reward.
- \( \gamma \): Discount factor (how much the agent values future rewards).
- \( V^\pi(s') \): Value of the next state.
2. Action-Value Function (\( Q(s, a) \)): Yelp for Actions in States ⭐⚡
The action-value function measures how good it is to take action \( a \) in state \( s \) and then follow policy \( \pi \):
\[ Q^\pi(s, a) = \mathbb{E}_\pi[R(s, a) + \gamma Q^\pi(s', a') \, | \, s, a] \]
It’s like asking: “If I do this now, how good will things turn out?”
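The two functions are tied together by a standard identity worth keeping in mind when switching between them: the state value is the policy-weighted average of the action values.
\[ V^\pi(s) = \sum_a \pi(a|s) \, Q^\pi(s, a) \]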
Breaking Down the Bellman Equations
- Immediate Reward: What do I get for being here and doing this?
  \[ R(s, a) \]
- Future Value: How good is the next state?
  \[ \gamma \cdot \text{Value of Next State} \]
- Combine:
  \[ \text{Value} = \text{Immediate Reward} + \text{Discounted Future Value} \]
Numerical Example
Suppose an agent is navigating a 3x1 grid world:
| State | Action | Next State | Reward |
|---|---|---|---|
| \( s_1 \) | Right | \( s_2 \) | \( +1 \) |
| \( s_2 \) | Right | \( s_3 \) | \( +10 \) |
| \( s_3 \) | None | \( s_3 \) | \( 0 \) |
Let \( \gamma = 0.9 \).
Compute \( V(s_3) \):
\[ V(s_3) = R(s_3) + \gamma V(s_3) = 0 + 0.9 \cdot 0 = 0 \]
Compute \( V(s_2) \):
\[ V(s_2) = R(s_2) + \gamma V(s_3) = 10 + 0.9 \cdot 0 = 10 \]
Compute \( V(s_1) \):
\[ V(s_1) = R(s_1) + \gamma V(s_2) = 1 + 0.9 \cdot 10 = 10 \]
Why Bellman Equations Matter
- Planning:
  - Help agents evaluate the long-term value of states and actions.
- Policy Optimization:
  - Used to improve policies by choosing actions with higher \( Q(s, a) \).
- Foundation for RL Algorithms:
  - From Q-learning to deep Q-networks (DQNs), Bellman equations are everywhere.
Code Example: Bellman Equations in Python
Here’s how to compute state values iteratively:
# Define states, rewards, and transitions
states = [1, 2, 3]
rewards = {1: 1, 2: 10, 3: 0}
transitions = {1: 2, 2: 3, 3: 3}
gamma = 0.9
# Initialize values
values = {s: 0 for s in states}
# Value iteration
for i in range(10):  # Iterate 10 times
    new_values = {}
    for s in states:
        next_state = transitions[s]
        new_values[s] = rewards[s] + gamma * values[next_state]
    values = new_values
    print(f"Iteration {i + 1}: {values}")
Fun Analogy
The Bellman Equations are like deciding whether to eat cake 🍰 or go to the gym 🏋️♂️:
- Immediate reward: Cake = +10 (yum!), Gym = -5 (sweat, ugh!).
- Future reward: Cake → regret (-20), Gym → fit body (+50).
- Bellman says: Maximize total value. Choose the gym (most of the time).
Mermaid.js Diagram: Bellman Equation Flow
graph TD
    State[Current State s] --> Action[Choose Action a]
    Action --> Reward["Receive Reward R(s, a)"]
    Action --> NextState[Transition to Next State s']
    NextState --> FutureValue["Discounted Future Value gamma * V(s')"]
    Reward --> TotalValue[Compute Total Value]
    FutureValue --> TotalValue
3. Policy Gradient Methods: Teaching AI to Roll the Dice with Style 🎲🤖
What Are Policy Gradient Methods?
In Policy Gradient Methods, we directly parameterize the policy \( \pi_\theta(a|s) \), which is the probability of taking action \( a \) in state \( s \) given parameters \( \theta \). The goal is to optimize these parameters to maximize the expected cumulative reward.
The Policy Gradient Objective
The agent aims to maximize the expected return \( J(\theta) \):
\[ J(\theta) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t)\right] \]
Instead of summing up rewards, we compute the gradient of this objective to tweak the policy parameters \( \theta \) in the right direction.
The Policy Gradient Theorem
The Policy Gradient Theorem provides a clever way to compute the gradient of \( J(\theta) \) without needing to differentiate through the environment dynamics:
\[ \nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right] \]
Where:
- \( \nabla_\theta \log \pi_\theta(a_t|s_t) \): How the policy changes with respect to \( \theta \).
- \( G_t \): The return (cumulative discounted reward) from timestep \( t \).
In simpler terms:
- Log probabilities tell us how to adjust the policy.
- Rewards tell us how much to adjust.
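The theorem follows from the log-derivative (score-function) trick. Writing \( \tau \) for a whole trajectory and \( R(\tau) \) for its return (illustrative notation, not used elsewhere in this post):
\[ \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \int R(\tau) \, \nabla_\theta p_\theta(\tau) \, d\tau = \int R(\tau) \, p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \, \nabla_\theta \log p_\theta(\tau) \right] \]
Because the environment’s transition probabilities inside \( p_\theta(\tau) \) do not depend on \( \theta \), only the \( \nabla_\theta \log \pi_\theta(a_t|s_t) \) terms survive the gradient, which gives exactly the expression above.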
Steps in Policy Gradient Methods
1. Sample Trajectories
Run the agent in the environment to collect trajectories (state, action, reward sequences):
\[ \text{Trajectory: } (s_0, a_0, r_1, s_1, a_1, r_2, \ldots) \]
2. Compute Returns
For each timestep \( t \), compute the cumulative reward:
\[ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \]
3. Update Policy
Use the gradient to nudge the policy parameters \( \theta \):
\[ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \]
Here, \( \alpha \) is the learning rate.
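Step 2 is easy to get wrong in code, so here is a minimal sketch that computes the returns \( G_t \) by working backwards through a reward list; the reward values and \( \gamma = 0.99 \) are illustrative choices:
def compute_returns(rewards, gamma=0.99):
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
print(compute_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]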
Numerical Example
Let’s train an agent in a simple environment:
Environment:
- States: \( s_1, s_2 \).
- Actions: \( a_1, a_2 \).
- Rewards: \( +1 \) for reaching the goal, \( 0 \) otherwise.
Trajectory:
- \( (s_1, a_1, r_1 = 1) \).
Policy:
Suppose the policy is:
\[ \pi_\theta(a|s) = \text{softmax}(\theta \cdot \phi(s, a)) \]
Where \( \phi(s, a) \) is a feature vector.
Update Rule:
For \( G_t = 1 \):
\[ \nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_1|s_1) \cdot 1 \]
For the softmax policy above, \( \nabla_\theta \log \pi_\theta(a_1|s_1) = \phi(s_1, a_1) - \sum_a \pi_\theta(a|s_1) \, \phi(s_1, a) \), so the update increases the probability of the rewarded action \( a_1 \) in state \( s_1 \).
Why Policy Gradient Methods Are Cool
- Direct Optimization:
  - Learn policies directly, skipping the value function detours.
- Continuous Action Spaces:
  - Works seamlessly with actions like steering angles or velocities.
- Stochastic Policies:
  - Encourages exploration, helping the agent discover better strategies.
Challenges
- High Variance:
  - Gradient estimates can be noisy, slowing down learning.
  - Solution: Use tricks like baseline subtraction (e.g., Advantage Actor-Critic).
- Sample Inefficiency:
  - Requires lots of environment interactions to compute gradients.
Code Example: Policy Gradient in Python
Here’s a simple implementation using PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim
# Define policy network
class PolicyNet(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, output_dim)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        return self.softmax(self.fc(x))
# Example environment
policy = PolicyNet(input_dim=4, output_dim=2) # Example state/action sizes
optimizer = optim.Adam(policy.parameters(), lr=0.01)
# Sample trajectory
states = torch.tensor([[1.0, 0.0, 0.0, 0.0]]) # Example state
actions = torch.tensor([0]) # Example action
rewards = torch.tensor([1.0]) # Example reward
# Compute policy loss
log_probs = torch.log(policy(states))
loss = -log_probs[0, actions] * rewards # Policy Gradient Loss
# Update policy
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("Updated Policy Parameters:", list(policy.parameters()))
Fun Analogy
Policy Gradient is like training a dog 🐶:
- Sample trajectories: The dog tries random tricks.
- Compute returns: Reward the good tricks (fetch the ball) with a treat. Ignore bad tricks (chew your shoe).
- Update policy: The dog learns to fetch instead of destroying your wardrobe. 🦴🎉
Mermaid.js Diagram: Policy Gradient Workflow
graph TD
    Start[Initialize Policy Network] --> Sample[Sample Trajectories]
    Sample --> Returns[Compute Returns G_t]
    Returns --> Gradient[Calculate Policy Gradient]
    Gradient --> Update[Update Policy Parameters]
    Update --> Repeat[Repeat Until Convergence]