Stochastic Approximation
for Reinforcement Learning
§1 Motivation: Why SA Appears in RL
The Fundamental Problem
In RL, we want to solve the Bellman equation for the value function \(V^\pi\). But we don't have the transition model \(P\). We only have samples: transitions \((s, a, r, s')\) from interacting with the environment.
Think of finding the mean of a distribution without knowing the distribution — you just draw samples. You could estimate the mean by averaging samples. Now replace "mean" with "value function" and "samples" with "environment transitions." That's the essence of stochastic approximation in RL.
The Bellman equation \(V = \mathcal{T}V\) is a fixed-point equation. We want to solve it, but applying \(\mathcal{T}\) exactly requires knowing the full model. Instead, we can apply a noisy version of \(\mathcal{T}\) using samples — and iterating this noisy update is precisely stochastic approximation.
The Bellman equation for policy \(\pi\) is:

\[V^\pi = \mathcal{T}^\pi V^\pi, \qquad (\mathcal{T}^\pi V)(s) = \mathbb{E}_{a \sim \pi(\cdot|s),\, s' \sim P(\cdot|s,a)}\big[r(s,a,s') + \gamma V(s')\big].\]
This is a root-finding problem: find \(V^\pi\) such that \(h(V^\pi) = 0\) where \(h(V) = \mathcal{T}^\pi V - V\). We cannot compute \(h(V)\) exactly — but given a sample \((s, a, r, s')\), we can form a noisy estimate:

\[\hat{h}(V)(s) = r + \gamma V(s') - V(s).\]
Stochastic approximation iterates \(V \leftarrow V + \alpha \hat{h}(V)\) and asks: does this converge to the root of \(h\)?
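To make this concrete, here is a minimal numpy sketch (all numbers hypothetical) of the simplest SA instance: estimating an unknown mean by the iteration \(\theta \leftarrow \theta + \alpha_t(x_t - \theta)\). With \(\alpha_t = 1/t\), this reproduces the running sample average.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                # initial guess for the unknown mean
for t in range(1, 10_001):
    x = rng.normal(loc=3.0, scale=2.0)     # noisy sample; the true mean is 3.0
    alpha = 1.0 / t                        # Robbins-Monro step size
    theta += alpha * (x - theta)           # SA step: h(theta) = 3.0 - theta, root at 3.0

print(theta)   # ~3.0; with alpha_t = 1/t the iterate equals the running sample mean
```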
Three Roles of SA in RL
SA appears in (at least) three guises in what follows: policy evaluation (TD learning, §8), control (Q-learning, §9), and coupled optimization (actor-critic as two-timescale SA, §10).
§2 The SA View of RL
RL Algorithms as Noisy Fixed-Point Iteration
The Bellman optimality operator \(\mathcal{T}^*\) has a unique fixed point \(Q^*\). Exact value iteration applies \(\mathcal{T}^*\) repeatedly. But we can't compute \(\mathcal{T}^* Q\) without the model. Instead, RL algorithms apply noisy estimates.
Consider the goal: find \(\theta^*\) such that \(h(\theta^*) = 0\). We can't evaluate \(h\) exactly, but we can draw a noisy sample \(H(\theta, \xi)\) where \(\mathbb{E}[H(\theta,\xi) | \theta] = h(\theta)\). The SA iteration is:
\(\theta_{t+1} = \theta_t + \alpha_t H(\theta_t, \xi_t)\)
For RL: the mean field \(h(\theta)\) encodes what the algorithm is trying to do (project toward Bellman fixed point), and the noise \(H(\theta,\xi) - h(\theta)\) comes from sample randomness and Markov chain mixing.
| SA Component | TD(0) | Q-Learning |
|---|---|---|
| \(\theta_t\) | Weight vector \(w_t\) | Q-table \(Q_t\) |
| \(h(\theta)\) | \(\Phi^\top D(\mathcal{T}^\pi \hat{V} - \hat{V})\) | \(\mathcal{T}^* Q - Q\) |
| Noise \(M_{t+1}\) | TD error minus its expectation | Sampled Bellman backup minus its expectation |
| Fixed point \(\theta^*\) | TD fixed point \(A^{-1}b\) | \(Q^*\) (optimal Q) |
| Step size \(\alpha_t\) | \(1/t\) or constant | Same schedule |
§3 Canonical Stochastic Approximation Form
Breaking Down Each Component
1. The Iterate \(\theta_t\)
The parameter being learned. In tabular RL, this is the Q-table or value table. In linear function approximation, it's the weight vector \(w \in \mathbb{R}^d\). The index \(t\) tracks the number of updates (each new sample gives one update).
2. The Mean Field \(h(\theta)\)
The mean field (also called the "drift" or expected update direction) is \(h(\theta) = \mathbb{E}[H(\theta_t, \xi_t) \mid \theta_t = \theta]\): the average update the algorithm would take given infinitely many samples at a fixed \(\theta\). It encodes the algorithm's intent, where it wants to push \(\theta\), and the ODE \(d\theta/dt = h(\theta)\) governs the macroscopic dynamics. For TD(0), \(h(\theta) = b - A\theta\) (see §8). For convergence, we need \(h(\theta^*) = 0\) at the desired fixed point.
3. The Noise \(M_{t+1}\)
The martingale difference noise is \(M_{t+1} = H(\theta_t, \xi_t) - h(\theta_t)\). Key property: \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0\), where \(\mathcal{F}_t\) is the filtration up to time \(t\), making \(\{M_t\}\) a martingale difference sequence (MDS): conditionally mean-zero, though possibly dependent across time (unlike i.i.d. noise). This zero-mean conditional property is crucial — it ensures noise doesn't systematically bias the algorithm.
§4 Robbins-Monro: The Foundation
The original 1951 paper by Robbins and Monro addressed a specific problem: find \(\theta^*\) such that \(h(\theta^*) = \alpha\) (a root-finding problem), given only noisy observations of \(h\).
Imagine you're calibrating a chemical process. You want to find the temperature \(\theta^*\) that gives a reaction yield of exactly \(\alpha = 50\%\). You can't measure yield exactly — your measurements are noisy. RM says: start with some \(\theta_0\), measure (noisily), and adjust \(\theta\) in the direction that reduces the error. Take smaller and smaller steps over time. Under appropriate conditions, you converge to \(\theta^*\).
For RL: the "reaction yield" is the expected TD error, and we're finding the weights \(\theta^*\) that make it zero.
Problem: Find \(\theta^* \in \mathbb{R}\) such that \(h(\theta^*) = 0\), given access only to noisy observations \(Y(\theta) = h(\theta) + \epsilon\) where \(\mathbb{E}[\epsilon] = 0\).
RM Iteration:

\[\theta_{t+1} = \theta_t + \alpha_t Y(\theta_t).\]

(Since \(h\) is decreasing, a positive observation pushes \(\theta\) up toward \(\theta^*\); a negative one pushes it down.)
Original conditions (simplified):
- \(h\) is monotone decreasing, \(h(\theta^*) = 0\)
- \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)
- \(\mathbb{E}[\epsilon_t^2] \leq C(1 + \theta_t^2)\) (bounded noise variance)
Result: \(\theta_t \to \theta^*\) almost surely.
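A minimal sketch of the calibration story above, with a hypothetical monotone yield curve (the sigmoid and all constants are invented for illustration): we observe \(Y(\theta) = h(\theta) + \epsilon\) with \(h(\theta) = 0.5 - \text{yield}(\theta)\) decreasing in \(\theta\), and run the RM iteration.

```python
import numpy as np

rng = np.random.default_rng(1)
THETA_STAR = 72.0   # true calibration temperature (used only to simulate measurements)

def yield_measurement(theta):
    """Hypothetical noisy yield, increasing in temperature, 50% at THETA_STAR."""
    true_yield = 1.0 / (1.0 + np.exp(-0.2 * (theta - THETA_STAR)))
    return true_yield + rng.normal(scale=0.05)

theta = 60.0
for t in range(1, 5_001):
    alpha = 10.0 / t                       # RM step sizes: sum = inf, sum of squares < inf
    y = 0.5 - yield_measurement(theta)     # noisy observation of h(theta)
    theta += alpha * y                     # yield too low -> raise temperature, and vice versa

print(theta)   # approaches THETA_STAR = 72.0
```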
Why This Matters for RL
RL generalizes RM in two critical ways: (1) the noise is Markov-chain noise, not i.i.d., and (2) the parameter \(\theta\) is a vector (or function). The RM insight — decreasing step sizes + mean-zero noise → convergence — survives both generalizations under appropriate conditions, but requires much more sophisticated proof tools (the ODE method, §6).
§5 Martingale Difference Noise
Definition
A martingale difference sequence (MDS) is a sequence of random variables \(\{M_t\}\) where each term has mean zero given everything that happened before. It's a generalization of i.i.d. zero-mean noise that allows for temporal dependence.
Gambling analogy: Your net gain per round in a fair casino is an MDS — it can depend on how much you bet (which depends on past results), but its conditional expectation given your history is always zero (fair game). Past wins don't predict future gains in expectation.
Let \((\mathcal{F}_t)_{t \geq 0}\) be a filtration (increasing sequence of σ-algebras encoding the history). A sequence \(\{M_t\}\) is a martingale difference sequence w.r.t. \(\mathcal{F}\) if:

\[\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0 \quad \text{for all } t \geq 0,\]
and \(M_{t+1}\) is \(\mathcal{F}_{t+1}\)-measurable. The key property for SA: MDS noise does not bias the algorithm's direction, but it can slow convergence based on its variance.
Why RL Noise is NOT i.i.d.
In RL, samples \((s_t, a_t, r_t, s_{t+1})\) come from a Markov chain. The states \(s_t\) are correlated — today's state depends on yesterday's transition. This means the noise terms are dependent across time, so classical i.i.d.-noise arguments do not apply directly. What survives is the conditional property: given the history, the noise can still be mean-zero, and that (the MDS property) is what the convergence proofs actually use.
The Martingale Decomposition
The SA update \(H(\theta_t, \xi_t)\) decomposes as:

\[H(\theta_t, \xi_t) = h(\theta_t) + M_{t+1},\]
where \(M_{t+1} = H(\theta_t, \xi_t) - h(\theta_t)\). Since \(\mathbb{E}[H(\theta_t, \xi_t) \mid \mathcal{F}_t] = h(\theta_t)\), we have \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0\). The MDS structure is essential — it means noise is unbiased conditional on current information, even if correlated across time.
Theorem (Martingale Convergence): Let \(\{M_t\}\) be a martingale with \(\sup_t \mathbb{E}[M_t^2] < \infty\). Then \(M_t \to M_\infty\) a.s. for some finite random variable \(M_\infty\).
This is used via the Robbins-Siegmund Theorem: if \(V_t \geq 0\) satisfies \(\mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \leq (1+a_t)V_t - b_t + c_t\) with \(\sum a_t < \infty\), \(\sum c_t < \infty\), then \(V_t \to V_\infty\) a.s. and \(\sum b_t < \infty\) a.s. In SA convergence proofs, \(V_t = \|\theta_t - \theta^*\|^2\) and the theorem yields convergence of the squared distance.
§6 The ODE Method ★
This is the most powerful tool for proving convergence of SA algorithms in RL. Almost every RL convergence result uses it.
Intuition: SA as a Noisy ODE Solver
Consider the discrete SA iteration \(\theta_{t+1} = \theta_t + \alpha_t(h(\theta_t) + M_{t+1})\). Rewrite as:
\(\frac{\theta_{t+1} - \theta_t}{\alpha_t} = h(\theta_t) + M_{t+1}\)
As step sizes \(\alpha_t \to 0\), the noise averages out (LLN), and this looks like the Euler discretization of the ODE:

\[\dot{\theta} = h(\theta).\]
The ODE method (Ljung 1977, Kushner & Clark 1978) makes this precise: the interpolated trajectory of the SA iterates converges to a solution of the limiting ODE. So to understand where SA converges, just analyze the ODE!
Setup: Define the continuous-time interpolation \(\bar{\theta}(t)\) of the SA iterates by: \(\bar{\theta}(t) = \theta_n\) for \(t \in [t_n, t_{n+1})\) where \(t_n = \sum_{k=0}^{n-1} \alpha_k\).
ODE Method Theorem (informal, Borkar & Meyn 2000): Under appropriate conditions, for any \(T > 0\):

\[\lim_{n \to \infty} \sup_{t \in [t_n,\, t_n + T]} \|\bar{\theta}(t) - \phi^{(n)}(t)\| = 0 \quad \text{a.s.},\]

where \(\phi^{(n)}\) is the solution of \(\dot{\phi} = h(\phi)\) started at \(\phi^{(n)}(t_n) = \theta_n\). The SA iterates track the ODE solution on compact time intervals, up to vanishing error from noise and discretization.
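The theorem can be watched in action. Below is a numerical sketch (matrix and noise level are arbitrary choices, not from the text) for the linear mean field \(h(\theta) = b - A\theta\): the SA iterates and a noise-free Euler solution of the ODE run on the same step-size clock and stay close.

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # positive definite drift matrix (assumed)
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A, b)    # root of h(theta) = b - A @ theta

theta = np.zeros(2)                   # SA iterate
phi = np.zeros(2)                     # Euler discretization of the noise-free ODE
for t in range(1, 20_001):
    alpha = 1.0 / t
    noise = rng.normal(size=2)                 # i.i.d. MDS noise
    theta += alpha * (b - A @ theta + noise)   # noisy SA step
    phi += alpha * (b - A @ phi)               # matching Euler step of d(phi)/dt = h(phi)

print(np.linalg.norm(theta - phi))          # small: the iterates track the ODE
print(np.linalg.norm(theta - theta_star))   # both approach theta* = A^{-1} b
```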
The Limiting ODE and Its Attractors
The SA iteration converges to an attractor of the ODE \(\dot{\theta} = h(\theta)\). An attractor is a set \(A\) such that:
- Every ODE trajectory starting near \(A\) stays near \(A\) and converges to it.
- \(A\) is the smallest such set (not reducible to a smaller attractor).
For SA convergence to a single point \(\theta^*\), we need \(\theta^*\) to be a globally asymptotically stable equilibrium of the ODE — every trajectory converges to it.
Lyapunov Stability: The Key Tool
A Lyapunov function is an "energy function" \(V(\theta) \geq 0\) that decreases along the ODE trajectories until the equilibrium is reached; if you can find such a \(V\), stability is proven without solving the ODE explicitly. Think of it like a ball rolling downhill — if you have a bowl-shaped energy landscape, the ball always rolls to the bottom (the equilibrium).
For TD with linear FA: \(V(w) = \|w - w^*\|^2_M\) for an appropriate norm \(M\) serves as a Lyapunov function. Its time derivative along the ODE \(\dot{w} = b - Aw = -A(w-w^*)\) is \(\dot{V} = -2(w-w^*)^\top M A (w-w^*) \leq 0\) when the symmetric part of \(MA\) is positive definite.
Lyapunov Stability Theorem: Consider \(\dot{\theta} = h(\theta)\) with equilibrium \(\theta^*\) (i.e., \(h(\theta^*)=0\)). If there exists \(V: \mathbb{R}^d \to \mathbb{R}_+\) such that:

- \(V(\theta^*) = 0\) and \(V(\theta) > 0\) for all \(\theta \neq \theta^*\),
- \(V\) is radially unbounded: \(V(\theta) \to \infty\) as \(\|\theta\| \to \infty\),
- \(\dot{V}(\theta) = \nabla V(\theta)^\top h(\theta) < 0\) along trajectories, for all \(\theta \neq \theta^*\),
then \(\theta^*\) is globally asymptotically stable. Every trajectory of the ODE converges to \(\theta^*\), and hence (by ODE method) the SA iterates converge a.s.
Standard choice for linear SA: \(V(\theta) = (\theta - \theta^*)^\top M (\theta - \theta^*)\) for positive definite \(M\).
The standard proof has four steps:
- Identify the mean field: Compute \(h(\theta) = \mathbb{E}[H(\theta,\xi) \mid \theta]\). Show \(h(\theta^*) = 0\).
- Verify noise is MDS: Show \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0\) and \(\mathbb{E}[\|M_{t+1}\|^2 \mid \mathcal{F}_t] \leq C(1 + \|\theta_t\|^2)\).
- Bounded iterates: Prove \(\sup_t \|\theta_t\| < \infty\) a.s., typically either by projecting the iterates onto a compact set (projected SA) or by a stochastic Lyapunov argument showing that escape to infinity has probability zero.
- ODE stability: Construct Lyapunov function \(V\), show \(\dot{V} < 0\), conclude \(\theta^* \) is GAS, invoke ODE method.
§7 Convergence Conditions
None of these conditions is decorative: dropping any one can break convergence. Here's exactly why each is required.
If only \(\sum \alpha_t = \infty\): Take constant step sizes \(\alpha_t = c > 0\). Then \(\sum \alpha_t = \infty\) but \(\sum \alpha_t^2 = \infty\). The iterates oscillate around \(\theta^*\) due to persistent noise, never settling. This is why constant-step-size SA converges only to a neighborhood, not to \(\theta^*\) exactly.
If only \(\sum \alpha_t^2 < \infty\): Take \(\alpha_t = 1/t^2\). Then \(\sum \alpha_t^2 < \infty\), but also \(\sum \alpha_t < \infty\): the iterates can travel only a bounded total distance, so the algorithm can't correct a bad initialization. Convergence is not guaranteed from arbitrary starting points.
The Robbins-Monro conditions \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\) are the sweet spot: enough total step size to reach \(\theta^*\) from anywhere, but decaying fast enough to suppress the noise.
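The three regimes are easy to reproduce. A small sketch (problem and constants hypothetical): estimate \(\theta^* = 3\) from noisy samples under the three schedules discussed above.

```python
import numpy as np

rng = np.random.default_rng(3)

def run_sa(step_size, n=50_000):
    """1-d SA for the root of h(theta) = 3.0 - theta, observed with unit noise."""
    theta = 0.0
    for t in range(1, n + 1):
        sample = 3.0 + rng.normal()           # noisy observation of theta* = 3.0
        theta += step_size(t) * (sample - theta)
    return theta

print(run_sa(lambda t: 0.1))           # constant: hovers in an O(alpha) neighborhood of 3
print(run_sa(lambda t: 0.5 / t**2))    # sum alpha_t < inf: total movement finite, stalls short of 3
print(run_sa(lambda t: 1.0 / t))       # Robbins-Monro: converges to 3
```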
§8 TD Learning as Stochastic Approximation ★
This is the central connection. We derive explicitly how TD(0) fits the canonical SA form.
The TD(0) Update
With linear function approximation \(\hat{V}(s; w) = \phi(s)^\top w\), the TD(0) update is:

\[w_{t+1} = w_t + \alpha_t\big[r_t + \gamma\, \phi(s_{t+1})^\top w_t - \phi(s_t)^\top w_t\big]\,\phi(s_t).\]
Identifying the SA Components
The TD update can be written as \(w_{t+1} = w_t + \alpha_t H(w_t, s_t, s_{t+1})\) where:

\[H(w, s, s') = \big[r(s, s') + \gamma\, \phi(s')^\top w - \phi(s)^\top w\big]\,\phi(s).\]
Define \(D = \text{diag}(d^\pi)\) (stationary distribution under \(\pi\)), \(\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}\) (feature matrix), and \(P^\pi\) (transition matrix under \(\pi\)). Taking the expectation over \((s, s') \sim d^\pi \times P^\pi\) (stationary distribution × transition):

\[h(w) = \Phi^\top D\big(r + \gamma P^\pi \Phi w - \Phi w\big) = b - Aw,\]

where \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\) and \(b = \Phi^\top Dr\). The limiting ODE is:

\[\dot{w} = b - Aw = -A(w - w^*).\]

Fixed point: \(w^* = A^{-1}b\) (the TD fixed point). The ODE is globally asymptotically stable when \(A\) is positive definite (more precisely, when every eigenvalue of \(A\) has positive real part).
Noise decomposition:

\[M_{t+1} = H(w_t, s_t, s_{t+1}) - h(w_t).\]

Check, in the idealized setting where \((s_t, s_{t+1})\) is drawn afresh from \(d^\pi \times P^\pi\): \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = \mathbb{E}[H(w_t, s_t, s_{t+1}) \mid w_t] - h(w_t) = h(w_t) - h(w_t) = 0\). ✓ MDS property holds. Under genuine Markov sampling the conditional expectation depends on \(s_t\), and the correlated part is handled by additional (Poisson-equation) arguments; the conclusion survives.
Stability of ODE: \(A\) is PD under on-policy sampling (with \(\Phi\) full column rank). Because \(d^\pi\) is stationary (\((d^\pi)^\top P^\pi = (d^\pi)^\top\)), \(P^\pi\) is a non-expansion in the \(D\)-weighted norm: \(\|P^\pi v\|_D \leq \|v\|_D\). Then for \(v = \Phi x \neq 0\), \(x^\top A x = \|v\|_D^2 - \gamma\, v^\top D P^\pi v \geq (1-\gamma)\|v\|_D^2 > 0\).
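The whole derivation can be checked numerically. The sketch below (a hypothetical 3-state Markov reward process with invented transition probabilities, rewards, and features) computes \(A\), \(b\), and \(w^* = A^{-1}b\) directly, then runs TD(0) along a sampled trajectory and compares.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 3-state Markov reward process with 2-d linear features.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])       # transition matrix under pi
r = np.array([1.0, 0.0, -1.0])        # expected reward on leaving each state
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])          # feature matrix, full column rank
gamma = 0.9

# Stationary distribution d^pi: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d /= d.sum()
D = np.diag(d)

A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ r
w_star = np.linalg.solve(A, b)        # TD fixed point

# TD(0) along one on-policy trajectory.
w, s = np.zeros(2), 0
for t in range(1, 500_001):
    s_next = rng.choice(3, p=P[s])
    delta = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w   # TD error
    w += (10.0 / (1_000 + t)) * delta * Phi[s]            # RM step sizes
    s = s_next

print(w, w_star)   # the TD iterates approach A^{-1} b
```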
§9 Q-Learning as Stochastic Approximation
The Tabular Q-Learning Update

\[Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t\big[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\big],\]

with all other entries left unchanged.
Identifying the SA Structure
Q-learning is a nonlinear SA due to the max operator. The mean field is:

\[h(Q) = \mathcal{T}^* Q - Q,\]
where \(\mathcal{T}^*\) is the Bellman optimality operator. Fixed point: \(\mathcal{T}^* Q^* = Q^*\), so \(h(Q^*) = 0\). ✓
The Bellman optimality operator \(\mathcal{T}^*\) is a \(\gamma\)-contraction in \(L^\infty\): \(\|\mathcal{T}^* Q - \mathcal{T}^* Q'\|_\infty \leq \gamma \|Q - Q'\|_\infty\). This means the ODE \(\dot{Q} = h(Q) = \mathcal{T}^*Q - Q\) has \(Q^*\) as its unique globally asymptotically stable equilibrium. The Lyapunov function \(V(Q) = \|Q - Q^*\|_\infty\) works: \(\dot{V} \leq -(1-\gamma)\|Q - Q^*\|_\infty < 0\).
With function approximation \(Q(s,a;\theta)\), the fixed-point equation becomes \(\Pi \mathcal{T}^* \hat{Q}(\cdot;\theta) = \hat{Q}(\cdot;\theta)\), where \(\Pi\) is the projection onto the function class. The composed operator \(\Pi \mathcal{T}^*\) is NOT necessarily a contraction! This is because:

- \(\Pi\) is a non-expansion in the \(\mu\)-weighted norm \(L^2_\mu\) (a projection never expands distances in its own norm);
- \(\mathcal{T}^*\) is a \(\gamma\)-contraction in \(L^\infty\).
But the norm must be the same for this to work. Under the \(\mu\)-weighted norm, \(\|\Pi\|_\mu \leq 1\) but \(\|\mathcal{T}^*\|\) is controlled in \(L^\infty\), not \(L^2_\mu\). The norms don't match, and the composition can be non-contractive → divergence possible.
Watkins and Dayan proved tabular Q-learning converges a.s. if (1) all state-action pairs are visited infinitely often, (2) step sizes satisfy RM conditions. The proof uses the contraction property of \(\mathcal{T}^*\) directly:
Let \(\Delta_t(s,a) = Q_t(s,a) - Q^*(s,a)\). Define \(F_t = \max_{s,a} |\Delta_t(s,a)|\). The Q-learning update gives:

\[|\Delta_{t+1}(s,a)| \leq (1 - \alpha_t)\,|\Delta_t(s,a)| + \alpha_t\big(\gamma F_t + |M_{t+1}|\big),\]

since \(|(\mathcal{T}^* Q_t)(s,a) - Q^*(s,a)| \leq \gamma F_t\) by the contraction property, with \(M_{t+1}\) the sampling noise.
Taking sup and applying the Robbins-Siegmund theorem with \(V_t = F_t\), one shows \(F_t \to 0\) a.s., i.e., \(Q_t \to Q^*\) uniformly.
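A sanity check of the Watkins-Dayan result on a made-up 2-state, 2-action MDP (all numbers invented): compute \(Q^*\) by exact value iteration, then verify that Q-learning with per-pair \(1/n\) step sizes and persistent exploration drives the sup-norm error toward zero.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical MDP: P[s, a] is the next-state distribution, R[s, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, nS, nA = 0.9, 2, 2

# Q* via exact value iteration (repeated application of the Bellman optimality operator).
Q_star = np.zeros((nS, nA))
for _ in range(2_000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Tabular Q-learning with uniform exploration (every pair visited infinitely often).
Q, visits, s = np.zeros((nS, nA)), np.zeros((nS, nA)), 0
for _ in range(500_000):
    a = rng.integers(nA)
    s_next = rng.choice(nS, p=P[s, a])
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]                   # per-pair RM step sizes
    target = R[s, a] + gamma * Q[s_next].max()   # sampled Bellman optimality backup
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

print(np.abs(Q - Q_star).max())   # sup-norm error, close to 0
```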
§10 Two-Timescale Stochastic Approximation
The Core Idea: Fast and Slow Iterates
When two interacting SA recursions need to be run simultaneously — and one converges much faster than the other — two-timescale SA provides the right framework.
Imagine a manager (actor) and an analyst (critic). The analyst computes reports (value estimates) much faster than the manager makes strategy decisions (policy updates). When the manager looks at the analyst's latest report, it's essentially the optimal report for the current strategy — because the analyst has had time to converge.
This timescale separation lets us decouple two coupled SA iterations: the fast one sees the slow one as approximately constant, and the slow one sees the fast one as already at its equilibrium.
Two-timescale SA (Borkar, 1997):

\[w_{t+1} = w_t + \alpha_t\, H_1(w_t, \theta_t, \xi_t) \quad \text{(fast)},\]
\[\theta_{t+1} = \theta_t + \beta_t\, H_2(w_t, \theta_t, \zeta_t) \quad \text{(slow)},\]
with \(\beta_t / \alpha_t \to 0\) (i.e., \(\alpha_t \gg \beta_t\)).
Theorem (Borkar 1997): Under appropriate conditions, the coupled system converges as if (a toy numerical sketch follows the list):
- The fast iterate \(w_t\) tracks its ODE \(\dot{w} = h_1(w, \theta)\) with \(\theta\) frozen.
- It converges to \(\lambda(\theta)\) (the fast-system equilibrium given \(\theta\)).
- The slow iterate \(\theta_t\) then tracks the "reduced ODE" \(\dot{\theta} = h_2(\lambda(\theta), \theta)\).
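Here is that toy sketch, with dynamics invented for illustration: the fast iterate \(w\) chases \(\lambda(\theta) = \theta\), and the slow iterate \(\theta\) follows the reduced ODE \(\dot{\theta} = -\lambda(\theta) = -\theta\), whose equilibrium is 0.

```python
import numpy as np

rng = np.random.default_rng(6)

w, theta = 5.0, 5.0
for t in range(1, 200_001):
    alpha = 1.0 / t**0.6                      # fast step size
    beta = 1.0 / t                            # slow step size: beta_t / alpha_t -> 0
    w += alpha * (theta - w + rng.normal())   # fast: h1(w, theta) = theta - w, theta ~ frozen
    theta += beta * (-w + rng.normal())       # slow: sees w ~ lambda(theta) = theta

print(w, theta)   # both approach 0, the equilibrium of the reduced ODE
```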
Actor-Critic as Two-Timescale SA
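In actor-critic, the critic is the fast iterate: with step size \(\alpha_t\) it runs TD updates to track the value of the current policy \(\pi_\theta\). The actor is the slow iterate: with step size \(\beta_t \ll \alpha_t\) it takes policy-gradient steps using the critic's estimate. The two-timescale theorem licenses the standard analysis: treat the critic as already converged to \(\lambda(\theta)\), the value estimate for the current policy, so the actor effectively follows the reduced policy-gradient ODE.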
§11 Where SA Breaks in RL
The Three Failure Modes
1. Function Approximation Changes the Mean Field
With linear FA, the TD update projects the Bellman target onto the function class. The mean field becomes \(h(w) = b - Aw\) where \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\). The critical question: is \(A\) positive definite? Under on-policy sampling, yes. Under off-policy, no — and SA with non-PD drift matrices can diverge (exploding iterates).
2. Off-Policy Sampling Breaks the Stability Argument
When using a behavior policy \(\mu \neq \pi\), the stationary distribution in the mean field is \(d^\mu\), but the Bellman equation targets \(V^\pi\). The matrix \(A\) becomes \(A = \Phi^\top D^\mu (I - \gamma P^\pi)\Phi\), which is not guaranteed PD because \(D^\mu\) and \(P^\pi\) correspond to different policies. The ODE can have unstable equilibria.
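The classic two-state example makes this concrete. Below is a numerical sketch in the spirit of the well-known \(w \to 2w\) counterexample (features and distributions chosen for illustration): the same \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\) is positive under the on-policy weighting and negative under an off-policy one, and with \(A < 0\) the expected TD update diverges.

```python
import numpy as np

# Two states, one feature: phi(s1) = 1, phi(s2) = 2; both states jump to s2.
Phi = np.array([[1.0], [2.0]])
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])            # target-policy transitions
gamma = 0.9

def A_matrix(d):
    """A = Phi^T D (I - gamma P) Phi for state weighting d."""
    return Phi.T @ np.diag(d) @ (np.eye(2) - gamma * P) @ Phi

print(A_matrix(np.array([0.0, 1.0])))    # on-policy d^pi = (0, 1):    A = [[ 0.4]] > 0
print(A_matrix(np.array([0.9, 0.1])))    # off-policy d^mu = (.9, .1): A = [[-0.68]] < 0

# With A < 0 the ODE  dw/dt = b - A w  (here b = 0: zero rewards) is unstable.
w = 1.0
a = A_matrix(np.array([0.9, 0.1])).item()
for _ in range(50):
    w += 0.1 * (-a * w)               # expected TD update under the off-policy weighting
print(w)                              # grows exponentially: divergence
```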
3. Nonlinear FA: No Fixed-Point Guarantee
With neural networks, \(\hat{Q}(\cdot;\theta)\) is nonlinear in \(\theta\). The mean field \(h(\theta)\) may have multiple fixed points, saddle points, or no fixed point at all within the function class. The ODE analysis becomes intractable without strong assumptions on the network architecture and initialization.
Restating the deadly triad in SA language:

- FA: Mean field \(h\) may not have a stable fixed point in the function class
- Bootstrapping: The target \(r + \gamma \hat{V}(s';\theta)\) depends on \(\theta\), making the mean field calculation involve the current iterate — not just data
- Off-policy: Sampling from \(d^\mu\) replaces the weighting matrix \(D^\pi\) with \(D^\mu\) in the mean field, destroying the positive-definiteness argument that stabilized the on-policy ODE
GTD (Sutton et al. 2009) fixes the off-policy SA breakdown by minimizing the Projected Bellman Error (PBE): \(\overline{PBE}(\theta) = \|\Pi(\mathcal{T}^\pi \hat{V}(\cdot;\theta) - \hat{V}(\cdot;\theta))\|^2_\mu\). This is a true objective with a well-defined gradient. The SA update, in one standard two-timescale (GTD2-style) form with an auxiliary fast iterate \(u_t\), is:

\[u_{t+1} = u_t + \alpha_t\,\big(\rho_t \delta_t - \phi_t^\top u_t\big)\,\phi_t \quad \text{(fast)},\]
\[\theta_{t+1} = \theta_t + \beta_t\, \rho_t\,(\phi_t - \gamma \phi_{t+1})\,(\phi_t^\top u_t) \quad \text{(slow)},\]
where \(\rho_t = \pi(a_t|s_t)/\mu(a_t|s_t)\) is the IS ratio. This is a two-timescale SA with provable convergence under off-policy sampling — the mean field now corresponds to the gradient of PBE, which always points toward the minimum.
§12 Practical Insights from SA Theory
Learning Rate Selection
| Schedule | Formula | Properties | Use Case |
|---|---|---|---|
| Polynomial decay | \(\alpha_t = c/(t+t_0)^\rho\) | RM conditions for \(\rho \in (0.5, 1]\) | Tabular RL, theoretical guarantees |
| Constant | \(\alpha_t = \alpha\) | Converges to neighborhood \(\mathcal{O}(\alpha)\) | Deep RL, non-stationary targets |
| Square root | \(\alpha_t = c/\sqrt{t}\) | \(\sum \alpha_t = \infty\) but \(\sum \alpha_t^2 = \infty\): RM fails; pairs well with iterate averaging | Compromise: faster early progress |
| Adam/RMSProp | Adaptive per-parameter | No formal RM guarantees in RL setting | Deep RL practice (DQN, PPO, SAC) |
Why Divergence Happens in Practice
To diagnose (or prevent) divergence, walk through the SA checklist below (a numerical sketch of the checkable items follows the list):

- Identify \(h(\theta)\) — what does the algorithm optimize in expectation?
- Check: does \(h\) have a stable fixed point? (Use Lyapunov or PD matrix argument)
- Is noise MDS? (Check: is sampling on-policy? Are IS ratios bounded?)
- Are iterates bounded? (Project onto compact set or verify via Lyapunov)
- Do step sizes satisfy RM? (Use \(1/(t+1)\) for theory, tune constant for practice)
- If two coupled recursions: is there timescale separation? (\(\beta_t/\alpha_t \to 0\))
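For the checkable items, a hypothetical helper (names and example inputs invented) covering steps 2 and 5 in the linear case, where the mean field is \(h(\theta) = b - A\theta\):

```python
import numpy as np

def sa_sanity_checks(A, alphas):
    """Diagnostics for linear SA  theta_{t+1} = theta_t + alpha_t (b - A theta_t + noise)."""
    # Step 2: the ODE is stable if the symmetric part of A is positive definite.
    min_eig = np.linalg.eigvalsh((A + A.T) / 2).min()
    print("min eigenvalue of sym(A):", min_eig, "(want > 0)")
    # Step 5: Robbins-Monro sums, approximated on a finite horizon.
    print("sum alpha_t  :", alphas.sum(), "(should grow without bound with the horizon)")
    print("sum alpha_t^2:", (alphas ** 2).sum(), "(should stay bounded)")

t = np.arange(1, 100_001)
sa_sanity_checks(np.array([[2.0, 0.5], [0.5, 1.0]]), 1.0 / t)
```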