Is the Convergence of Monte Carlo ES Still an Open Problem?

An investigation into one of the most fundamental open theoretical questions in reinforcement learning — the convergence of Monte Carlo Exploring Starts to the optimal fixed point.

1. The Original Question

In Chapter 5 of Reinforcement Learning: An Introduction, Sutton & Barto describe the Monte Carlo Exploring Starts (MCES) algorithm and make the following remark:

“Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but has not yet been formally proved. In our opinion, this is one of the most fundamental open theoretical questions in reinforcement learning.”

The argument for why MCES should converge is intuitive:

All returns for each state–action pair are accumulated and averaged, irrespective of what policy was in force when they were observed.
If MCES converged to a suboptimal policy, the value function would eventually converge to the value function for that policy.
That would cause the policy to change (because greedy improvement over a suboptimal value function yields a different policy).
Therefore, stability is achieved only when both the policy and the value function are optimal.

The logic is compelling — yet turning it into a formal proof has proven remarkably difficult.

2. Why Is This So Hard to Prove?

The core difficulty lies in the structure of the algorithm itself:

Challenge	Explanation
Coupled dynamics	The policy and value function change simultaneously, forming a coupled dynamical system
Incomplete evaluation	Each iteration uses only a single episode to update — policy evaluation is never “finished” before improvement happens
Non-uniform update rates	Different state–action pairs are visited at different frequencies, so their estimates converge at different rates
Exploring Starts assumption	While it guarantees coverage, it doesn’t provide the uniform sampling rates needed by standard stochastic approximation theory

Standard proof techniques — contraction mappings, stochastic approximation, ODE methods — all struggle with this combination of challenges.

3. Key Papers and Partial Results

The problem has attracted significant research attention. Here is a timeline of key results:

3.1 Tsitsiklis (2002) — Modified MCES Converges

Paper: “On the Convergence of Optimistic Policy Iteration,” JMLR, 3:59–72, 2002
Result: Proved convergence of a modified version of MCES where:
- Q-function estimates are updated at the same rate for all state–action pairs
- Discount factor $\gamma < 1$
Limitation: Requires modification to the original algorithm — the equal update rate assumption is not realistic for standard MCES

3.2 Jun Liu (2021) — Undiscounted Case

Paper: “On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts,” Automatica, 2021
Result: Investigated convergence for the stochastic shortest path problem (undiscounted costs)
Significance: Complements Tsitsiklis (2002) by settling the problem for a specific undiscounted setting

3.3 Wang, Yuan, Shao & Ross (2022) — Feed-Forward MDPs

Paper: “On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning,” ICLR 2022
Result: Proved almost sure convergence of the original, unmodified MCES algorithm for Optimal Policy Feed-Forward MDPs — MDPs where states are not revisited within any episode under an optimal policy
Key insight: Uses a novel inductive proof technique based only on the strong law of large numbers, avoiding stochastic approximation entirely
Coverage: Includes all deterministic environments and many episodic environments with monotonically changing state features
Limitation: Does not cover all finite MDPs

3.4 Winnicki & Srikant (2023) — Lookahead Policy Improvement

Paper: “On the Convergence of Policy Iteration-Based Reinforcement Learning with Monte Carlo Policy Evaluation,” AISTATS 2023
Result: Proved convergence when the policy improvement step uses lookahead rather than simple greedy improvement
Limitation: Requires a modified policy improvement step, so does not resolve the original open problem

3.5 Winnicki & Srikant (2024) — TD-Learning Variant

Paper: “Convergence of Monte Carlo Exploring Starts with TD-Learning,” IEEE CDC 2024
Result: Studied convergence of MCES when TD($\lambda$) is used for policy evaluation combined with lookahead for policy improvement

3.6 Chen, Ross & Youssef (2024) — Finite-Sample Analysis

Paper: “Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning,” arXiv:2410.02994, 2024
Result: Goes beyond convergence to analyze the convergence rate of a modified MCES, proving sample complexity of $\tilde{O}(SAK^3 \log^3(1/\delta))$ episodes for the stochastic shortest path problem

4. Current Status at a Glance

Variant	Status	Reference
Unmodified MCES + Arbitrary finite MDPs	Still Open	Sutton & Barto (2018)
Modified MCES (equal update rates, $\gamma < 1$)	Proved	Tsitsiklis (2002)
Unmodified MCES + Feed-Forward MDPs	Proved	Wang et al. (ICLR 2022)
MCES + Lookahead policy improvement	Proved	Winnicki & Srikant (AISTATS 2023)
MCES + TD($\lambda$) + Lookahead	Proved	Winnicki & Srikant (CDC 2024)
Undiscounted / Stochastic shortest path	Proved	Liu (Automatica 2021)
Finite-sample bounds (modified MCES)	Proved	Chen et al. (2024)

5. Why It Matters

This isn’t just a theoretical curiosity. The MCES convergence question sits at the heart of Generalized Policy Iteration (GPI) — the fundamental paradigm underlying most RL algorithms. GPI interleaves incomplete policy evaluation with policy improvement, and proving that this process converges in the simplest tabular setting (Monte Carlo + greedy improvement) would provide foundational assurance for the entire family of RL methods.

The fact that MCES always converges in practice despite lacking a proof makes this a fascinating gap between theory and practice — reminiscent of how the simplex method in linear programming worked reliably for decades before smoothed analysis explained its behavior.

6. Conclusion

As of 2026, the original question posed by Sutton & Barto — whether the unmodified Monte Carlo Exploring Starts algorithm with greedy policy improvement converges to the optimal policy for all finite MDPs — remains an open problem. However, the research community has made substantial progress:

Convergence has been proved for important subclasses (feed-forward MDPs, stochastic shortest paths)
Modified versions of the algorithm have been shown to converge
Finite-sample complexity bounds have been established for variants

The problem is now much better understood than when it was first posed, but the general proof continues to elude us — making it one of the most enduring open questions in reinforcement learning theory.

References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Tsitsiklis, J. N. (2002). On the Convergence of Optimistic Policy Iteration. JMLR, 3, 59–72.
Wang, J., Yuan, B., Shao, H., & Ross, S. (2022). On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning. ICLR 2022.
Liu, J. (2021). On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts. Automatica.
Winnicki, A., & Srikant, R. (2023). On the Convergence of Policy Iteration-Based Reinforcement Learning with Monte Carlo Policy Evaluation. AISTATS 2023.
Winnicki, A., & Srikant, R. (2024). Convergence of Monte Carlo Exploring Starts with TD-Learning. IEEE CDC 2024.
Chen, W., Ross, S., & Youssef, A. (2024). Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning. arXiv:2410.02994.