Monte Carlo Methods in Reinforcement Learning: A Deep Dive

A comprehensive walkthrough of Monte Carlo prediction — from theoretical foundations to implementation pitfalls.

1. The Core Motivation: Sampling Easy, Distributions Hard

A key insight that motivates Monte Carlo methods is this observation from Sutton & Barto:

“In surprisingly many cases it is easy to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.”

What does this mean in practice? Here are 10 concrete examples where sampling is easy but writing down the distribution is impossible:

#	Domain	How to Sample	Why Explicit Form is Hard
1	Reinforcement Learning	Agent interacts with environment	State space explosion ($10^{170}$ in Go)
2	Protein Folding	Run MD simulation	Partition function $Z = \int e^{-E(\mathbf{x})/kT} d\mathbf{x}$ is intractable
3	Weather Systems	Run numerical forecast model	Nonlinear PDE (Navier-Stokes), no closed form
4	Natural Images	Take a photo	Distribution over $\mathbb{R}^{H \times W \times 3}$ is impossibly complex
5	Financial Markets	Replay historical data / simulate	Jump processes + stochastic volatility
6	Bayesian Posterior	MCMC (HMC/NUTS)	Normalizing constant $p(\mathcal{D}) = \int p(\mathcal{D}\|\theta)p(\theta)d\theta$ intractable
7	LLM Outputs	Autoregressive sampling	Hundreds of billions of parameters, no closed form
8	Chemical Reaction Networks	Gillespie algorithm	Chemical Master Equation explodes in dimensionality
9	Traffic Simulation	Multi-agent simulator (SUMO)	Joint state distribution too complex
10	Combinatorial Optimization	Simulated annealing / GA	$(n-1)!/2$ permutations for TSP

2. Can Monte Carlo Solve All of These?

Not equally well. They fall into three categories:

✅ Directly Solvable with MC (Examples 2, 3, 5, 6, 10)

These have a well-defined target distribution $\pi(x)$ (even if only known up to a normalizing constant), and MC integration applies naturally:

Protein folding → Metropolis-Hastings MCMC over conformations
Option pricing → simulate price paths, average payoffs
Bayesian posterior → HMC/NUTS (the heart of Stan/PyMC)

⚠️ Applicable but with Severe Limitations (Examples 1, 4, 8, 9)

Reinforcement Learning (Example 1): Monte Carlo Policy Evaluation works — run full episodes and average returns. But high variance over long episodes motivated the invention of Temporal Difference (TD) learning as a lower-variance alternative.

Natural Images (Example 4): In theory you could sample randomly in pixel space, but the probability of landing on a realistic image is astronomically small. In practice, GANs and Diffusion Models are used to learn the distribution structure first.

Chemical Reactions (Example 8): The Gillespie algorithm is itself an exact MC method, but it processes one reaction event at a time — prohibitively slow for high-frequency systems. tau-leaping approximations are needed.

Traffic Simulation (Example 9): A single simulation run is already expensive. MC over uncertain parameters is possible but impractical at scale due to correlated multi-agent interactions.

❌ MC Cannot Directly Solve (Example 7)

LLM output distributions have no fixed target distribution to integrate against. The output space is all possible token sequences — size $\sim {10^5}^{10^3}$. There is no $\pi(x)$ to evaluate. Instead:

Autoregressive sampling (softmax at each token step) replaces MC
RLHF uses policy gradient methods, not MC integration

The key distinction:

MC works when:  ✅ You know what distribution to sample from
MC fails when:  ❌ The distribution itself needs to be learned

3. First-Visit Monte Carlo Prediction

The Algorithm

The standard First-Visit MC algorithm for estimating state values $V(s)$:

Initialize:
    V(s) arbitrarily for all s
    Returns(s) ← empty list for all s

Loop forever (each episode):
    Generate episode: S0, A0, R1, S1, A1, R2, ..., ST
    G ← 0
    Loop for each step t = T-1, T-2, ..., 0:
        G ← G + R_{t+1}
        Unless St appears in S0, S1, ..., S_{t-1}:
            Append G to Returns(St)
            V(St) ← average(Returns(St))

Understanding “Unless”

The Unless keyword trips up many people. Translated to plain logic:

# "Unless St appears in S0...S_{t-1}" means:
if St NOT IN {S0, S1, ..., S_{t-1}}:   # This is the FIRST visit
    Returns[St].append(G)
    V[St] = average(Returns[St])

✅ First visit → append G, update V
❌ Already appeared earlier → skip

4. A Critical Implementation Bug

Here is a common buggy implementation:

def learn(self, policy, num_episodes=10000, first_visit=True):
    env = BlackjackEnv()
    for episode_num in range(num_episodes):
        episode = self.generate_episode(policy, env)
        G = 0
        visited_states = set()

        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = reward + G

            # BUG: This checks visited_states populated from LATER steps,
            # not EARLIER steps in the episode timeline!
            if first_visit and state in visited_states:
                continue

            visited_states.add(state)
            self.returns[state].append(G)
            self.V[state] = sum(self.returns[state]) / len(self.returns[state])

Why is this wrong?

The loop runs backward (t = T-1 → 0). When we encounter a state at time t, visited_states contains states from later time steps (t+1, t+2, …), not earlier ones. So the check is inverted — it accidentally keeps the last visit instead of the first visit.

✅ The Correct Implementation

def learn(self, policy, num_episodes=10000, first_visit=True):
    env = BlackjackEnv()
    for episode_num in range(num_episodes):
        episode = self.generate_episode(policy, env)
        G = 0

        # Pre-compute first visit index for each state (forward pass)
        if first_visit:
            first_visit_indices = {}
            for t, (state, action, reward) in enumerate(episode):
                if state not in first_visit_indices:
                    first_visit_indices[state] = t

        # Process backwards to accumulate returns
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = reward + G  # reward here is R_{t+1}

            # Only update at the first visit
            if first_visit and first_visit_indices[state] != t:
                continue

            self.returns[state].append(G)
            self.V[state] = sum(self.returns[state]) / len(self.returns[state])

        if (episode_num + 1) % 10000 == 0:
            print(f"Episode {episode_num + 1}/{num_episodes} completed")

    return self.V

5. Return Calculation: $R_{t+1}$, Not $R_t$

This is a subtle but important detail. The pseudocode says:

G ← G + R_{t+1}

The reward received at step t is $R_{t+1}$ — the reward obtained after leaving state $S_t$ by taking action $A_t$. Each element of the episode tuple is $(S_t, A_t, R_{t+1})$.

Concrete Example

Episode states: 1 → 2 → 3 → 4 → 3 → 5 → 2 → 6 (terminal)

Assuming state number = reward value received when leaving that state:

t	$S_t$	$R_{t+1}$
0	1	2
1	2	3
2	3	4
3	4	3
4	3	5
5	5	2
6	2	6
7	6	— (terminal)

First-visit indices: state 1→t=0, state 2→t=1, state 3→t=2, state 4→t=3, state 5→t=5, state 6→t=7

Backward return calculation:

t	$S_t$	$R_{t+1}$	G = G + $R_{t+1}$	First Visit?	Append?
7	6	—	0	✅	✅ G=0
6	2	6	0+6=6	❌	❌ skip
5	5	2	6+2=8	✅	✅ G=8
4	3	5	8+5=13	❌	❌ skip
3	4	3	13+3=16	✅	✅ G=16
2	3	4	16+4=20	✅	✅ G=20
1	2	3	20+3=23	✅	✅ G=23
0	1	2	23+2=25	✅	✅ G=25

Verification:

State 2 (first visit at t=1): G = 3+4+3+5+2+6 = 23 ✅
State 3 (first visit at t=2): G = 4+3+5+2+6 = 20 ✅
State 5 (t=5): G = 2+6 = 8 ✅

Note: even though state 3 is visited again at t=4 (G=13 at that point), we skip it because t=4 ≠ first_visit_index[3]=2. G continues accumulating though — it reaches 20 by the time we process t=2.

6. Why Does Averaging Returns Converge?

The theoretical guarantee comes from the Law of Large Numbers:

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n}X_i \xrightarrow{n\to\infty} \mathbb{E}[X]\]

Each time the agent visits state $s$ across different episodes, it generates a return $G_i$ drawn from the distribution $p(G \mid s, \pi)$. Since:

Same distribution: same policy $\pi$ + same environment dynamics → $G_i$ are identically distributed
Independent: episodes are independent of each other

The Law of Large Numbers guarantees:

\[V(s) = \frac{1}{n}\sum_{i=1}^{n} G_i \xrightarrow{n\to\infty} \mathbb{E}_\pi[G \mid S_t = s] = V^\pi(s)\]

Intuition with numbers (true value $V^*(s) = 20$, unknown to us):

After episode 1:  G=15  → estimate = 15.0
After episode 2:  G=25  → estimate = 20.0
After episode 3:  G=12  → estimate = 17.3
After episode 4:  G=28  → estimate = 20.0
...
After episode ∞:         → estimate → 20.0 ✅

Required conditions for convergence:

Every state is visited infinitely often (sufficient exploration)
Returns are i.i.d. (fixed policy, stationary environment)
Variance of G is finite (bounded rewards)

MC is essentially replacing an intractable integral with a tractable sample average:

\[\underbrace{\frac{1}{n}\sum_{i=1}^{n}G_i}_{\text{MC estimate}} \approx \underbrace{\mathbb{E}_\pi[G \mid s]}_{\text{true value (integral)}}\]

7. Why Discount Future Rewards?

The discounted return is defined as:

\[G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad \gamma \in [0, 1)\]

There are four independent justifications for this formulation:

📐 Mathematical: Guaranteeing Convergence

Without discounting ($\gamma = 1$), infinite-horizon returns can diverge:

\[G_t = R_{t+1} + R_{t+2} + \cdots = \infty\]

With $\gamma < 1$ and bounded rewards $

\leq R_{\max}$:

\[|G_t| \leq \frac{R_{\max}}{1 - \gamma} < \infty\]

💰 Economic: Time Value of Money

Directly borrowed from finance. A dollar today is worth more than a dollar tomorrow because today’s dollar can be invested. The discount rate $1 - \gamma$ plays the role of the interest rate. Future rewards are “less valuable” because they are less certain.

🎲 Probabilistic: Modeling Uncertainty

$\gamma$ can be interpreted as the probability that the episode continues to the next step:

\[\gamma = P(\text{episode continues})\]

With a $(1 - \gamma)$ chance of termination at every step, the expected contribution of a reward $k$ steps in the future is naturally $\gamma^k R_{t+k+1}$.

🧠 Behavioral: Biological Time Preference

Psychological experiments consistently show that animals (including humans) prefer immediate rewards. The classical pigeon experiment: given a choice between 1 food pellet now vs. 3 pellets in 10 seconds, pigeons often choose the immediate reward. $\gamma$ quantifies this temporal preference:

$\gamma$	Behavior
0.0	Completely myopic — only cares about immediate reward
0.9	10-step future reward weighted at $0.9^{10} \approx 0.35$
0.99	100-step future reward weighted at $0.99^{100} \approx 0.37$
1.0	Fully far-sighted — all future rewards equal weight

Summary

Concept	Key Takeaway
Motivation	Sampling from a distribution is often feasible even when the distribution has no closed form
MC applicability	Works when target distribution is known; fails when it must be learned
First-visit vs every-visit	First-visit: only count the return from the first time a state appears per episode
Common bug	Backward loop + running `visited_states` set checks future steps, not past ones
$R_{t+1}$ vs $R_t$	The reward in $(S_t, A_t, R_{t+1})$ is earned by leaving $S_t$, not entering it
Convergence	Law of Large Numbers: sample mean → true expectation as $n \to \infty$
Discounting	Mathematical necessity + economic intuition + probabilistic uncertainty + biological preference

The philosophy of Monte Carlo: You don’t need a model of the world. You just need enough experience. Statistical regularity will emerge.