Monte Carlo Methods in Reinforcement Learning: A Deep Dive
Monte Carlo Methods in Reinforcement Learning: A Deep Dive
A comprehensive walkthrough of Monte Carlo prediction — from theoretical foundations to implementation pitfalls.
1. The Core Motivation: Sampling Easy, Distributions Hard
A key insight that motivates Monte Carlo methods is this observation from Sutton & Barto:
“In surprisingly many cases it is easy to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.”
What does this mean in practice? Here are 10 concrete examples where sampling is easy but writing down the distribution is impossible:
| # | Domain | How to Sample | Why Explicit Form is Hard |
|---|---|---|---|
| 1 | Reinforcement Learning | Agent interacts with environment | State space explosion ($10^{170}$ in Go) |
| 2 | Protein Folding | Run MD simulation | Partition function $Z = \int e^{-E(\mathbf{x})/kT} d\mathbf{x}$ is intractable |
| 3 | Weather Systems | Run numerical forecast model | Nonlinear PDE (Navier-Stokes), no closed form |
| 4 | Natural Images | Take a photo | Distribution over $\mathbb{R}^{H \times W \times 3}$ is impossibly complex |
| 5 | Financial Markets | Replay historical data / simulate | Jump processes + stochastic volatility |
| 6 | Bayesian Posterior | MCMC (HMC/NUTS) | Normalizing constant $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)d\theta$ intractable |
| 7 | LLM Outputs | Autoregressive sampling | Hundreds of billions of parameters, no closed form |
| 8 | Chemical Reaction Networks | Gillespie algorithm | Chemical Master Equation explodes in dimensionality |
| 9 | Traffic Simulation | Multi-agent simulator (SUMO) | Joint state distribution too complex |
| 10 | Combinatorial Optimization | Simulated annealing / GA | $(n-1)!/2$ permutations for TSP |
2. Can Monte Carlo Solve All of These?
Not equally well. They fall into three categories:
✅ Directly Solvable with MC (Examples 2, 3, 5, 6, 10)
These have a well-defined target distribution $\pi(x)$ (even if only known up to a normalizing constant), and MC integration applies naturally:
- Protein folding → Metropolis-Hastings MCMC over conformations
- Option pricing → simulate price paths, average payoffs
- Bayesian posterior → HMC/NUTS (the heart of Stan/PyMC)
⚠️ Applicable but with Severe Limitations (Examples 1, 4, 8, 9)
Reinforcement Learning (Example 1): Monte Carlo Policy Evaluation works — run full episodes and average returns. But high variance over long episodes motivated the invention of Temporal Difference (TD) learning as a lower-variance alternative.
Natural Images (Example 4): In theory you could sample randomly in pixel space, but the probability of landing on a realistic image is astronomically small. In practice, GANs and Diffusion Models are used to learn the distribution structure first.
Chemical Reactions (Example 8): The Gillespie algorithm is itself an exact MC method, but it processes one reaction event at a time — prohibitively slow for high-frequency systems. tau-leaping approximations are needed.
Traffic Simulation (Example 9): A single simulation run is already expensive. MC over uncertain parameters is possible but impractical at scale due to correlated multi-agent interactions.
❌ MC Cannot Directly Solve (Example 7)
LLM output distributions have no fixed target distribution to integrate against. The output space is all possible token sequences — size $\sim {10^5}^{10^3}$. There is no $\pi(x)$ to evaluate. Instead:
- Autoregressive sampling (softmax at each token step) replaces MC
- RLHF uses policy gradient methods, not MC integration
The key distinction:
MC works when: ✅ You know what distribution to sample from
MC fails when: ❌ The distribution itself needs to be learned
3. First-Visit Monte Carlo Prediction
The Algorithm
The standard First-Visit MC algorithm for estimating state values $V(s)$:
Initialize:
V(s) arbitrarily for all s
Returns(s) ← empty list for all s
Loop forever (each episode):
Generate episode: S0, A0, R1, S1, A1, R2, ..., ST
G ← 0
Loop for each step t = T-1, T-2, ..., 0:
G ← G + R_{t+1}
Unless St appears in S0, S1, ..., S_{t-1}:
Append G to Returns(St)
V(St) ← average(Returns(St))
Understanding “Unless”
The Unless keyword trips up many people. Translated to plain logic:
# "Unless St appears in S0...S_{t-1}" means:
if St NOT IN {S0, S1, ..., S_{t-1}}: # This is the FIRST visit
Returns[St].append(G)
V[St] = average(Returns[St])
- ✅ First visit → append G, update V
- ❌ Already appeared earlier → skip
4. A Critical Implementation Bug
Here is a common buggy implementation:
def learn(self, policy, num_episodes=10000, first_visit=True):
env = BlackjackEnv()
for episode_num in range(num_episodes):
episode = self.generate_episode(policy, env)
G = 0
visited_states = set()
for t in range(len(episode) - 1, -1, -1):
state, action, reward = episode[t]
G = reward + G
# BUG: This checks visited_states populated from LATER steps,
# not EARLIER steps in the episode timeline!
if first_visit and state in visited_states:
continue
visited_states.add(state)
self.returns[state].append(G)
self.V[state] = sum(self.returns[state]) / len(self.returns[state])
Why is this wrong?
The loop runs backward (t = T-1 → 0). When we encounter a state at time t, visited_states contains states from later time steps (t+1, t+2, …), not earlier ones. So the check is inverted — it accidentally keeps the last visit instead of the first visit.
✅ The Correct Implementation
def learn(self, policy, num_episodes=10000, first_visit=True):
env = BlackjackEnv()
for episode_num in range(num_episodes):
episode = self.generate_episode(policy, env)
G = 0
# Pre-compute first visit index for each state (forward pass)
if first_visit:
first_visit_indices = {}
for t, (state, action, reward) in enumerate(episode):
if state not in first_visit_indices:
first_visit_indices[state] = t
# Process backwards to accumulate returns
for t in range(len(episode) - 1, -1, -1):
state, action, reward = episode[t]
G = reward + G # reward here is R_{t+1}
# Only update at the first visit
if first_visit and first_visit_indices[state] != t:
continue
self.returns[state].append(G)
self.V[state] = sum(self.returns[state]) / len(self.returns[state])
if (episode_num + 1) % 10000 == 0:
print(f"Episode {episode_num + 1}/{num_episodes} completed")
return self.V
5. Return Calculation: $R_{t+1}$, Not $R_t$
This is a subtle but important detail. The pseudocode says:
G ← G + R_{t+1}
The reward received at step t is $R_{t+1}$ — the reward obtained after leaving state $S_t$ by taking action $A_t$. Each element of the episode tuple is $(S_t, A_t, R_{t+1})$.
Concrete Example
Episode states: 1 → 2 → 3 → 4 → 3 → 5 → 2 → 6 (terminal)
Assuming state number = reward value received when leaving that state:
| t | $S_t$ | $R_{t+1}$ |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 3 |
| 2 | 3 | 4 |
| 3 | 4 | 3 |
| 4 | 3 | 5 |
| 5 | 5 | 2 |
| 6 | 2 | 6 |
| 7 | 6 | — (terminal) |
First-visit indices: state 1→t=0, state 2→t=1, state 3→t=2, state 4→t=3, state 5→t=5, state 6→t=7
Backward return calculation:
| t | $S_t$ | $R_{t+1}$ | G = G + $R_{t+1}$ | First Visit? | Append? |
|---|---|---|---|---|---|
| 7 | 6 | — | 0 | ✅ | ✅ G=0 |
| 6 | 2 | 6 | 0+6=6 | ❌ | ❌ skip |
| 5 | 5 | 2 | 6+2=8 | ✅ | ✅ G=8 |
| 4 | 3 | 5 | 8+5=13 | ❌ | ❌ skip |
| 3 | 4 | 3 | 13+3=16 | ✅ | ✅ G=16 |
| 2 | 3 | 4 | 16+4=20 | ✅ | ✅ G=20 |
| 1 | 2 | 3 | 20+3=23 | ✅ | ✅ G=23 |
| 0 | 1 | 2 | 23+2=25 | ✅ | ✅ G=25 |
Verification:
- State 2 (first visit at t=1): G = 3+4+3+5+2+6 = 23 ✅
- State 3 (first visit at t=2): G = 4+3+5+2+6 = 20 ✅
- State 5 (t=5): G = 2+6 = 8 ✅
Note: even though state 3 is visited again at t=4 (G=13 at that point), we skip it because t=4 ≠ first_visit_index[3]=2. G continues accumulating though — it reaches 20 by the time we process t=2.
6. Why Does Averaging Returns Converge?
The theoretical guarantee comes from the Law of Large Numbers:
\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n}X_i \xrightarrow{n\to\infty} \mathbb{E}[X]\]Each time the agent visits state $s$ across different episodes, it generates a return $G_i$ drawn from the distribution $p(G \mid s, \pi)$. Since:
- Same distribution: same policy $\pi$ + same environment dynamics → $G_i$ are identically distributed
- Independent: episodes are independent of each other
The Law of Large Numbers guarantees:
\[V(s) = \frac{1}{n}\sum_{i=1}^{n} G_i \xrightarrow{n\to\infty} \mathbb{E}_\pi[G \mid S_t = s] = V^\pi(s)\]Intuition with numbers (true value $V^*(s) = 20$, unknown to us):
After episode 1: G=15 → estimate = 15.0
After episode 2: G=25 → estimate = 20.0
After episode 3: G=12 → estimate = 17.3
After episode 4: G=28 → estimate = 20.0
...
After episode ∞: → estimate → 20.0 ✅
Required conditions for convergence:
- Every state is visited infinitely often (sufficient exploration)
- Returns are i.i.d. (fixed policy, stationary environment)
- Variance of G is finite (bounded rewards)
MC is essentially replacing an intractable integral with a tractable sample average:
\[\underbrace{\frac{1}{n}\sum_{i=1}^{n}G_i}_{\text{MC estimate}} \approx \underbrace{\mathbb{E}_\pi[G \mid s]}_{\text{true value (integral)}}\]7. Why Discount Future Rewards?
The discounted return is defined as:
\[G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad \gamma \in [0, 1)\]There are four independent justifications for this formulation:
📐 Mathematical: Guaranteeing Convergence
Without discounting ($\gamma = 1$), infinite-horizon returns can diverge:
\[G_t = R_{t+1} + R_{t+2} + \cdots = \infty\]| With $\gamma < 1$ and bounded rewards $ | R | \leq R_{\max}$: |
💰 Economic: Time Value of Money
Directly borrowed from finance. A dollar today is worth more than a dollar tomorrow because today’s dollar can be invested. The discount rate $1 - \gamma$ plays the role of the interest rate. Future rewards are “less valuable” because they are less certain.
🎲 Probabilistic: Modeling Uncertainty
$\gamma$ can be interpreted as the probability that the episode continues to the next step:
\[\gamma = P(\text{episode continues})\]With a $(1 - \gamma)$ chance of termination at every step, the expected contribution of a reward $k$ steps in the future is naturally $\gamma^k R_{t+k+1}$.
🧠 Behavioral: Biological Time Preference
Psychological experiments consistently show that animals (including humans) prefer immediate rewards. The classical pigeon experiment: given a choice between 1 food pellet now vs. 3 pellets in 10 seconds, pigeons often choose the immediate reward. $\gamma$ quantifies this temporal preference:
| $\gamma$ | Behavior |
|---|---|
| 0.0 | Completely myopic — only cares about immediate reward |
| 0.9 | 10-step future reward weighted at $0.9^{10} \approx 0.35$ |
| 0.99 | 100-step future reward weighted at $0.99^{100} \approx 0.37$ |
| 1.0 | Fully far-sighted — all future rewards equal weight |
Summary
| Concept | Key Takeaway |
|---|---|
| Motivation | Sampling from a distribution is often feasible even when the distribution has no closed form |
| MC applicability | Works when target distribution is known; fails when it must be learned |
| First-visit vs every-visit | First-visit: only count the return from the first time a state appears per episode |
| Common bug | Backward loop + running visited_states set checks future steps, not past ones |
| $R_{t+1}$ vs $R_t$ | The reward in $(S_t, A_t, R_{t+1})$ is earned by leaving $S_t$, not entering it |
| Convergence | Law of Large Numbers: sample mean → true expectation as $n \to \infty$ |
| Discounting | Mathematical necessity + economic intuition + probabilistic uncertainty + biological preference |
The philosophy of Monte Carlo: You don’t need a model of the world. You just need enough experience. Statistical regularity will emerge.