
Stochastic Approximation for Reinforcement Learning

A precise, RL-focused treatment of stochastic approximation — covering exactly the tools needed to understand TD convergence, interpret RL algorithms as SA recursions, and read RL theory papers. Every concept is tied directly to an RL algorithm.

§1 Motivation: Why SA Appears in RL

The Fundamental Problem

In RL, we want to solve the Bellman equation for the value function \(V^\pi\). But we don't have the transition model \(P\). We only have samples: transitions \((s, a, r, s')\) from interacting with the environment.

Think of finding the mean of a distribution without knowing the distribution — you just draw samples. You could estimate the mean by averaging samples. Now replace "mean" with "value function" and "samples" with "environment transitions." That's the essence of stochastic approximation in RL.

The Bellman equation \(V = \mathcal{T}V\) is a fixed-point equation. We want to solve it, but applying \(\mathcal{T}\) exactly requires knowing the full model. Instead, we can apply a noisy version of \(\mathcal{T}\) using samples — and iterating this noisy update is precisely stochastic approximation.

The Bellman equation for policy \(\pi\) is:

$$V^\pi(s) = \sum_a \pi(a|s)\left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')\right]$$

This is a root-finding problem: find \(V^\pi\) such that \(h(V^\pi) = 0\) where \(h(V) = \mathcal{T}^\pi V - V\). We cannot compute \(h(V)\) exactly — but given a sample \((s, a, r, s')\), we can form a noisy estimate:

$$\hat{h}(V)(s) = r + \gamma V(s') - V(s) \quad \text{(TD error)}$$

Stochastic approximation iterates \(V \leftarrow V + \alpha \hat{h}(V)\) and asks: does this converge to the root of \(h\)?
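To make the recursion concrete, here is a minimal tabular sketch (the 3-state chain, rewards, and discount factor are invented for illustration): TD(0) applied to a small Markov reward process, compared against the exact solution of \(V = \mathcal{T}^\pi V\).

```python
import numpy as np

# Minimal tabular sketch (3-state chain, rewards, and discount invented for
# illustration): TD(0) as the noisy fixed-point iteration V <- V + alpha * h_hat(V).
rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],    # transition matrix under the fixed policy pi
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])     # expected one-step reward in each state

V_true = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact solution of V = T^pi V

V = np.zeros(3)
visits = np.zeros(3)
s = 0
for _ in range(500_000):
    s_next = rng.choice(3, p=P[s])
    visits[s] += 1
    td_error = r[s] + gamma * V[s_next] - V[s]       # noisy estimate h_hat(V)(s)
    V[s] += 10.0 / (9.0 + visits[s]) * td_error      # Robbins-Monro schedule per state
    s = s_next

print("TD estimate:", np.round(V, 2))
print("True V^pi  :", np.round(V_true, 2))   # the two should be close
```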

🎯 The SA Paradigm in RL
Every classical RL algorithm (TD, Q-learning, Actor-Critic) is a stochastic approximation algorithm. Understanding SA is understanding why these algorithms converge.

Three Roles of SA in RL

🔎 Root Finding: solve Bellman equations via noisy fixed-point iteration. TD learning is exactly this.
📉 Optimization: minimize value error via stochastic gradient descent. Policy gradient methods live here.
Estimation: estimate expectations from Markov chain samples. The noise structure requires careful analysis.

§2 The SA View of RL

RL Algorithms as Noisy Fixed-Point Iteration

The Bellman optimality operator \(\mathcal{T}^*\) has a unique fixed point \(Q^*\). Exact value iteration applies \(\mathcal{T}^*\) repeatedly. But we can't compute \(\mathcal{T}^* Q\) without the model. Instead, RL algorithms apply noisy estimates.

Consider the goal: find \(\theta^*\) such that \(h(\theta^*) = 0\). We can't evaluate \(h\) exactly, but we can draw a noisy sample \(H(\theta, \xi)\) where \(\mathbb{E}[H(\theta,\xi) | \theta] = h(\theta)\). The SA iteration is:

\(\theta_{t+1} = \theta_t + \alpha_t H(\theta_t, \xi_t)\)

For RL: the mean field \(h(\theta)\) encodes what the algorithm is trying to do (project toward Bellman fixed point), and the noise \(H(\theta,\xi) - h(\theta)\) comes from sample randomness and Markov chain mixing.
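In code, the recursion is a one-line loop. The sketch below is schematic (the helper names `run_sa`, `H`, and `sample_xi` are ours, not from any library), followed by a toy one-dimensional usage where the mean field is \(h(\theta) = 2 - \theta\).

```python
import numpy as np

def run_sa(theta0, H, sample_xi, alpha, n_steps):
    """Generic SA loop: theta_{t+1} = theta_t + alpha(t) * H(theta_t, xi_t)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, n_steps + 1):
        xi = sample_xi(theta)                    # draw a sample, e.g. a transition (s, a, r, s')
        theta = theta + alpha(t) * H(theta, xi)  # noisy update; E[H(theta, xi) | theta] = h(theta)
    return theta

# Toy usage: root of h(theta) = 2 - theta, observed with N(0, 1) noise.
rng = np.random.default_rng(1)
theta_hat = run_sa(theta0=0.0,
                   H=lambda th, xi: (2.0 - th) + xi,           # h(theta) + M_{t+1}
                   sample_xi=lambda th: rng.normal(0.0, 1.0),   # xi is pure noise here
                   alpha=lambda t: 1.0 / t,                     # Robbins-Monro schedule
                   n_steps=50_000)
print(theta_hat)   # close to theta* = 2
```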

| SA Component | TD(0) | Q-Learning |
| --- | --- | --- |
| \(\theta_t\) | Weight vector \(w_t\) | Q-table \(Q_t\) |
| \(h(\theta)\) | \(\Phi^\top D(\mathcal{T}^\pi \hat{V} - \hat{V})\) | \(\mathbb{E}[\delta \mid Q]\) |
| Noise \(M_{t+1}\) | TD error minus its expectation | Sample Bellman residual |
| Fixed point \(\theta^*\) | TD fixed point \(A^{-1}b\) | \(Q^*\) (optimal Q) |
| Step size \(\alpha_t\) | \(1/t\) or constant | Same schedule |

§3 Canonical Stochastic Approximation Form

\(\theta_{t+1} = \theta_t + \alpha_t \left[ h(\theta_t) + M_{t+1} \right]\)
\(\theta_t\) — parameter / iterate
\(\alpha_t\) — step size schedule
\(h(\theta)\) — mean field / drift
\(M_{t+1}\) — martingale noise

Breaking Down Each Component

1. The Iterate \(\theta_t\)

The parameter being learned. In tabular RL, this is the Q-table or value table. In linear function approximation, it's the weight vector \(w \in \mathbb{R}^d\). The index \(t\) tracks the number of updates (each new sample gives one update).

2. The Mean Field \(h(\theta)\)

The mean field (also called the "drift" or "expected update direction") is \(h(\theta) = \mathbb{E}[H(\theta_t, \xi_t) \mid \theta_t = \theta]\) — the direction the update would take if we could average infinitely many samples at a fixed \(\theta\). It encodes the algorithm's intent — where it wants to push \(\theta\) — and the ODE \(\mathrm{d}\theta/\mathrm{d}t = h(\theta)\) governs the macroscopic dynamics. For TD(0), \(h(\theta) = b - A\theta\) (see §8). For convergence, we need \(h(\theta^*) = 0\) at the desired fixed point.

3. The Noise \(M_{t+1}\)

The martingale difference noise is \(M_{t+1} = H(\theta_t, \xi_t) - h(\theta_t)\). A sequence \(\{M_t\}\) is a martingale difference sequence (MDS) if \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0\), where \(\mathcal{F}_t\) is the filtration up to time \(t\): the noise is conditionally mean-zero, though it may be correlated across time (unlike i.i.d. noise). This zero-mean conditional property is crucial — it ensures noise doesn't systematically bias the algorithm.

🔑 Why This Form is Universal
Every stochastic gradient descent, TD update, Q-learning update, and actor-critic update fits exactly this canonical form. The art is identifying \(h\), characterizing \(M_{t+1}\), and verifying convergence conditions on both.

§4 Robbins-Monro: The Foundation

The original 1951 paper by Robbins and Monro addressed a specific problem: find \(\theta^*\) such that \(h(\theta^*) = \alpha\) for a prescribed level \(\alpha\) (equivalently, root finding for \(h(\cdot) - \alpha\)), given only noisy observations of \(h\).

Imagine you're calibrating a chemical process. You want to find the temperature \(\theta^*\) that gives a reaction yield of exactly \(\alpha = 50\%\). You can't measure yield exactly — your measurements are noisy. RM says: start with some \(\theta_0\), measure (noisily), and adjust \(\theta\) in the direction that reduces the error. Take smaller and smaller steps over time. Under appropriate conditions, you converge to \(\theta^*\).

For RL: the "reaction yield" is the expected TD error, and we're finding the weights \(\theta^*\) that make it zero.

Problem: Find \(\theta^* \in \mathbb{R}\) such that \(h(\theta^*) = 0\), given access only to noisy observations \(Y(\theta) = h(\theta) + \epsilon\) where \(\mathbb{E}[\epsilon] = 0\).

RM Iteration:

$$\theta_{t+1} = \theta_t + \alpha_t Y(\theta_t) = \theta_t + \alpha_t(h(\theta_t) + \epsilon_t)$$

Original conditions (simplified):

  • \(h\) is monotone decreasing, \(h(\theta^*) = 0\)
  • \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)
  • \(\mathbb{E}[\epsilon_t^2] \leq C(1 + \theta_t^2)\) (bounded noise variance)

Result: \(\theta_t \to \theta^*\) almost surely.
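As a rough illustration of the RM recursion, the snippet below runs the calibration story from above with made-up numbers: a hypothetical sigmoidal yield curve, noisy measurements, and step sizes \(\alpha_t = c/t\).

```python
import numpy as np

# Illustrative Robbins-Monro run for the calibration story above (all numbers
# hypothetical): find the temperature theta* at which the expected yield is 50%.
rng = np.random.default_rng(2)

def noisy_yield(theta):
    """Unknown response curve, observed with measurement noise."""
    return 1.0 / (1.0 + np.exp(-(theta - 70.0) / 5.0)) + rng.normal(0.0, 0.05)

target = 0.5
theta = 40.0                            # initial guess
for t in range(1, 100_000):
    y = noisy_yield(theta)
    theta += (50.0 / t) * (target - y)  # h(theta) = target - E[yield(theta)] decreases through its root
print(round(theta, 2))                  # converges near 70, where the expected yield is 0.5
```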

Why This Matters for RL

RL generalizes RM in two critical ways: (1) the noise is Markov-chain noise, not i.i.d., and (2) the parameter \(\theta\) is a vector (or function). The RM insight — decreasing step sizes + mean-zero noise → convergence — survives both generalizations under appropriate conditions, but requires much more sophisticated proof tools (the ODE method, §6).

[Interactive figure: Robbins-Monro Convergence. The SA iteration finds the root of h(θ); controls adjust the shape of h and the noise level σ.]

§5 Martingale Difference Noise

Definition

A martingale difference sequence (MDS) is a sequence of random variables \(\{M_t\}\) where each term has mean zero given everything that happened before. It's a generalization of i.i.d. zero-mean noise that allows for temporal dependence.

Gambling analogy: Your net gain per round in a fair casino is an MDS — it can depend on how much you bet (which depends on past results), but its conditional expectation given your history is always zero (fair game). Past wins don't predict future gains in expectation.

Let \((\mathcal{F}_t)_{t \geq 0}\) be a filtration (increasing sequence of σ-algebras encoding the history). A sequence \(\{M_t\}\) is a martingale difference sequence w.r.t. \(\mathcal{F}\) if:

$$\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0 \quad \text{almost surely, for all } t \geq 0$$

and \(M_{t+1}\) is \(\mathcal{F}_{t+1}\)-measurable. The key property for SA: MDS noise does not bias the algorithm's direction, but it can slow convergence based on its variance.

Why RL Noise is NOT i.i.d.

In RL, samples \((s_t, a_t, r_t, s_{t+1})\) come from a Markov chain. The states \(s_t\) are correlated — today's state depends on yesterday's transition. This means:

🔗 Temporal Correlation: consecutive samples from the same trajectory are correlated. The noise at time t depends on s_t, which depends on s_{t-1}.
📊 Non-Stationary Distribution: early in training, the agent visits different states than later. The noise distribution shifts as the policy improves.
Markov Noise: the noise at time t is a function of (s_t, s_{t+1}), which form a Markov chain. Analysis requires mixing time arguments.
⚠ Markov Noise: A Key Subtlety
In TD learning, the noise \(M_{t+1} = \delta_t - \mathbb{E}[\delta_t \mid s_t]\) is an MDS because \(\mathbb{E}[M_{t+1} \mid s_t, w_t] = 0\). But the full sequence \((s_t, r_t, s_{t+1})\) is Markov-chain dependent, not i.i.d. Convergence proofs must handle this Markov noise structure.
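A quick numerical sanity check of the conditional-mean-zero property in the callout (the 2-state chain, features, rewards, and frozen weight are all made up): averaging the realized noise separately for each value of \(s_t\) gives approximately zero, even though the sampled states are Markov-correlated.

```python
import numpy as np

# Toy numerical check (all numbers made up): the TD noise from the callout,
# M_{t+1} = delta_t*phi(s_t) - E[delta_t*phi(s_t) | s_t], has zero conditional mean
# even though the sampled states form a correlated Markov chain.
rng = np.random.default_rng(3)
gamma = 0.9
P = np.array([[0.2, 0.8], [0.6, 0.4]])   # 2-state chain under the policy
phi = np.array([1.0, 2.0])               # scalar feature per state
r = np.array([1.0, -1.0])
w = 0.5                                   # arbitrary frozen weight

def H(s, s_next):                         # sampled update direction delta * phi(s)
    return (r[s] + gamma * phi[s_next] * w - phi[s] * w) * phi[s]

h_cond = [sum(P[s, sp] * H(s, sp) for sp in range(2)) for s in range(2)]

sums, counts, s = [0.0, 0.0], [0, 0], 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P[s])
    sums[s] += H(s, s_next) - h_cond[s]   # realized noise M_{t+1}
    counts[s] += 1
    s = s_next
print([round(sums[s] / counts[s], 4) for s in range(2)])   # both approximately 0
```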

The Martingale Decomposition

The SA update \(H(\theta_t, \xi_t)\) decomposes as:

$$H(\theta_t, \xi_t) = \underbrace{h(\theta_t)}_{\text{mean field}} + \underbrace{M_{t+1}}_{\text{MDS noise}}$$

where \(M_{t+1} = H(\theta_t, \xi_t) - h(\theta_t)\). Since \(\mathbb{E}[H(\theta_t, \xi_t) \mid \mathcal{F}_t] = h(\theta_t)\), we have \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0\). The MDS structure is essential — it means noise is unbiased conditional on current information, even if correlated across time.

Theorem (Martingale Convergence): Let \(\{M_t\}\) be a martingale with \(\sup_t \mathbb{E}[M_t^2] < \infty\). Then \(M_t \to M_\infty\) a.s. for some finite random variable \(M_\infty\).

This is used via the Robbins-Siegmund Theorem: if \(V_t \geq 0\) satisfies \(\mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \leq (1+a_t)V_t - b_t + c_t\) for nonnegative sequences \(a_t, b_t, c_t\) with \(\sum a_t < \infty\) and \(\sum c_t < \infty\), then \(V_t \to V_\infty\) a.s. and \(\sum b_t < \infty\) a.s. In SA convergence proofs one takes \(V_t = \|\theta_t - \theta^*\|^2\), and the theorem yields convergence of the squared distance.

§6 The ODE Method ★

This is the most powerful tool for proving convergence of SA algorithms in RL. Almost every RL convergence result uses it.

Intuition: SA as a Noisy ODE Solver

Consider the discrete SA iteration \(\theta_{t+1} = \theta_t + \alpha_t(h(\theta_t) + M_{t+1})\). Rewrite as:

\(\frac{\theta_{t+1} - \theta_t}{\alpha_t} = h(\theta_t) + M_{t+1}\)

As step sizes \(\alpha_t \to 0\), the noise averages out (LLN), and this looks like the Euler discretization of the ODE:

$$\dot{\theta}(t) = h(\theta(t))$$

The ODE method (Ljung 1977, Kushner & Clark 1978) makes this precise: the interpolated trajectory of the SA iterates converges to a solution of the limiting ODE. So to understand where SA converges, just analyze the ODE!

Setup: Define the continuous-time interpolation \(\bar{\theta}(t)\) of the SA iterates by: \(\bar{\theta}(t) = \theta_n\) for \(t \in [t_n, t_{n+1})\) where \(t_n = \sum_{k=0}^{n-1} \alpha_k\).

ODE Method Theorem (informal, Borkar & Meyn 2000): Under appropriate conditions, for any \(T > 0\):

$$\sup_{0 \leq t \leq T} \|\bar{\theta}(t + t_n) - \phi(t)\| \to 0 \quad \text{a.s.}$$

where \(\phi(t)\) is the solution of \(\dot{\phi} = h(\phi)\) started at \(\phi(0) = \bar{\theta}(t_n) = \theta_n\). The SA iterates track the ODE solution on compact time intervals, up to an error from noise and discretization that vanishes as \(n \to \infty\).

[Interactive figure: SA Iterates vs ODE Trajectory. The noisy SA iterates (dots) track the smooth ODE solution; controls adjust the noise level σ and the step-size schedule α_t.]
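In place of the interactive demo, here is a one-dimensional sketch (mean field, noise level, and step sizes chosen arbitrarily) comparing the SA iterate at algorithmic time \(t_n = \sum_{k \le n} \alpha_k\) with the ODE solution started from \(\theta_0\). This is a loose illustration: the theorem itself compares against ODE trajectories restarted at \(\theta_n\), but for this globally stable linear ODE the picture is the same.

```python
import numpy as np

# 1-D sketch of the ODE-tracking picture (constants invented): SA with mean
# field h(theta) = -(theta - 1) is compared, at algorithmic time t_n = sum of
# step sizes, with the ODE solution phi(t) = 1 + (theta_0 - 1) * exp(-t).
rng = np.random.default_rng(4)
theta, theta0 = 5.0, 5.0
t_alg = 0.0
for n in range(1, 50_001):
    alpha = 1.0 / (n + 10)
    theta += alpha * (-(theta - 1.0) + rng.normal(0.0, 1.0))   # h(theta) + M_{n+1}
    t_alg += alpha
    if n in (10, 100, 1_000, 10_000, 50_000):
        phi = 1.0 + (theta0 - 1.0) * np.exp(-t_alg)            # ODE started at theta_0
        print(f"n={n:6d}  t_n={t_alg:6.2f}  SA={theta:6.3f}  ODE={phi:6.3f}")
```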

The Limiting ODE and Its Attractors

The SA iteration converges to an attractor of the ODE \(\dot{\theta} = h(\theta)\). An attractor is a set \(A\) such that:

  1. Every ODE trajectory starting near \(A\) stays near \(A\) and converges to it.
  2. \(A\) is the smallest such set (not reducible to a smaller attractor).

For SA convergence to a single point \(\theta^*\), we need \(\theta^*\) to be a globally asymptotically stable equilibrium of the ODE — every trajectory converges to it.

💡 Why This is Powerful for RL
To prove TD(0) converges, we just need to show the ODE \(\dot{w} = -Aw + b\) (where \(A = \Phi^\top D(I-\gamma P^\pi)\Phi\) and \(b = \Phi^\top Dr\)) has a globally stable equilibrium. This reduces to showing \(A\) is positive definite — a linear algebra fact that follows from the on-policy stationary distribution.

Lyapunov Stability: The Key Tool

A Lyapunov function is an "energy function" \(V(\theta) \geq 0\) for the ODE: energy decreases along trajectories until the equilibrium is reached, so if you can exhibit such a \(V\), stability is proven without solving the ODE explicitly. Think of a ball rolling downhill — if you have a bowl-shaped energy landscape, the ball always rolls to the bottom (the equilibrium).

For TD with linear FA: \(V(w) = \|w - w^*\|^2_M\) for an appropriate positive definite \(M\) serves as a Lyapunov function. Its time derivative along the ODE \(\dot{w} = -Aw + b = -A(w-w^*)\) is \(\dot{V} = -2(w-w^*)^\top M A (w-w^*)\), which is negative for \(w \neq w^*\) whenever the symmetric part of \(MA\) is positive definite.

Lyapunov Stability Theorem: Consider \(\dot{\theta} = h(\theta)\) with equilibrium \(\theta^*\) (i.e., \(h(\theta^*)=0\)). If there exists \(V: \mathbb{R}^d \to \mathbb{R}_+\) such that:

$$V(\theta) > 0 \;\forall\; \theta \neq \theta^*, \quad V(\theta^*) = 0, \quad V(\theta) \to \infty \text{ as } \|\theta\| \to \infty$$ $$\dot{V}(\theta) = \nabla V(\theta)^\top h(\theta) < 0 \;\forall\; \theta \neq \theta^*$$

then \(\theta^*\) is globally asymptotically stable. Every trajectory of the ODE converges to \(\theta^*\), and hence (by ODE method) the SA iterates converge a.s.

Standard choice for linear SA: \(V(\theta) = (\theta - \theta^*)^\top M (\theta - \theta^*)\) for positive definite \(M\).
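As a sanity check of the Lyapunov argument (with an arbitrary symmetric positive definite \(A\) and \(M = I\); TD's \(A\) is not symmetric, but only its symmetric part matters for the argument), the snippet below Euler-integrates the linear ODE and asserts that \(V(w) = \|w - w^*\|^2\) never increases.

```python
import numpy as np

# Small numerical illustration (matrices chosen arbitrarily): for the linear ODE
# w_dot = b - A w with A symmetric positive definite, the Lyapunov function
# V(w) = ||w - w*||^2 (i.e. M = I) decreases along an Euler-integrated trajectory.
rng = np.random.default_rng(5)
B = rng.normal(size=(3, 3))
A = B @ B.T + 0.1 * np.eye(3)        # symmetric positive definite by construction
b = rng.normal(size=3)
w_star = np.linalg.solve(A, b)

w = rng.normal(size=3) * 5.0
dt, prev_V = 0.01, np.inf
for _ in range(2000):
    w = w + dt * (b - A @ w)         # Euler step of the ODE
    V = float((w - w_star) @ (w - w_star))
    assert V <= prev_V + 1e-12, "V increased: Lyapunov decrease violated"
    prev_V = V
print("final V =", round(prev_V, 6), " (monotonically decreasing)")
```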

The standard proof has four steps:

  1. Identify the mean field: Compute \(h(\theta) = \mathbb{E}[H(\theta,\xi) \mid \theta]\). Show \(h(\theta^*) = 0\).
  2. Verify noise is MDS: Show \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0\) and \(\mathbb{E}[\|M_{t+1}\|^2 \mid \mathcal{F}_t] \leq C(1 + \|\theta_t\|^2)\).
  3. Bounded iterates: Prove \(\sup_t \|\theta_t\| < \infty\) a.s. (often via a Lyapunov argument on the stochastic system, using projected SA or showing instability implies the algorithm escapes to ∞ with zero probability).
  4. ODE stability: Construct Lyapunov function \(V\), show \(\dot{V} < 0\), conclude \(\theta^* \) is GAS, invoke ODE method.

§7 Convergence Conditions

Each of these conditions plays an essential role — dropping any one of them can break convergence. Here's exactly why each is required.

\(\sum \alpha_t = \infty\) (infinite travel budget): ensures the algorithm can move from any starting point to \(\theta^*\). If step sizes summed to a finite value, the algorithm might "run out of fuel" before reaching the target.
\(\sum \alpha_t^2 < \infty\) (vanishing noise effect): ensures noise is eventually suppressed. The variance of the SA iterate around the ODE trajectory is \(\mathcal{O}(\sum \alpha_t^2)\), which must vanish for exact convergence.
Lipschitz \(h\) (controlled dynamics): prevents runaway dynamics in the ODE. Without it, the ODE may escape to infinity or exhibit chaotic behavior before reaching \(\theta^*\).
Bounded iterates (compactness): the ODE method applies on compact sets. Without bounded iterates, the interpolated SA trajectory may not track the ODE. Often enforced via projection onto a compact set.
\(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t]=0\) (unbiased noise): ensures the algorithm aims at the right target. Biased noise would steer toward the wrong fixed point, and no amount of step-size decay can fix systematic bias.
\(\mathbb{E}[\|M_{t+1}\|^2 \mid \mathcal{F}_t] \leq C(1+\|\theta_t\|^2)\) (controlled variance): prevents noise from overwhelming the signal and allows the martingale convergence theorems to apply. In practice, bounded rewards ensure this for RL.
📐 Standard Step Size Choice
\(\alpha_t = \frac{1}{t+1}\) satisfies both conditions. A constant step size satisfies \(\sum_t \alpha_t = \infty\) but not \(\sum_t \alpha_t^2 < \infty\); it gives faster initial convergence and can track time-varying targets — used in DQN and modern deep RL, accepting approximate rather than exact convergence.

If only \(\sum \alpha_t = \infty\): Take constant step sizes \(\alpha_t = c > 0\). Then \(\sum \alpha_t = \infty\) but \(\sum \alpha_t^2 = \infty\). The iterates oscillate around \(\theta^*\) due to persistent noise, never settling. This is why constant-step-size SA converges only to a neighborhood, not to \(\theta^*\) exactly.

If only \(\sum \alpha_t^2 < \infty\): Take \(\alpha_t = 1/t^2\). Then \(\sum \alpha_t^2 < \infty\), but \(\sum \alpha_t = \sum 1/t^2 < \infty\) as well, violating the first condition. The algorithm can't travel far enough to correct a bad initialization, so convergence is not guaranteed from arbitrary starting points.

The Robbins-Monro conditions \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\) are the sweet spot: enough total step size to reach the target from any starting point, but decreasing fast enough to suppress the noise.
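The toy experiment below (one-dimensional, all constants invented) contrasts the three regimes just described: proper Robbins-Monro decay converges, a constant step hovers in a noise ball around \(\theta^*\), and a too-fast decay stalls before reaching it.

```python
import numpy as np

# Toy 1-D experiment (illustrative only). Mean field h(theta) = -(theta - 2),
# so theta* = 2; the noise is N(0, 2); the start is theta_0 = 10.
rng = np.random.default_rng(6)

def run(alpha_fn, n_steps=20_000, theta0=10.0):
    theta = theta0
    for t in range(1, n_steps + 1):
        theta += alpha_fn(t) * (-(theta - 2.0) + rng.normal(0.0, 2.0))
    return theta

schedules = [("1/(t+1)    (RM)       ", lambda t: 1.0 / (t + 1)),
             ("0.1        (constant) ", lambda t: 0.1),
             ("1/(t+1)^2  (too fast) ", lambda t: 1.0 / (t + 1) ** 2)]
for name, alpha_fn in schedules:
    finals = np.array([run(alpha_fn) for _ in range(10)])
    print(f"alpha_t = {name} mean={finals.mean():6.3f}  std={finals.std():5.3f}")
# Expected pattern: RM decay settles at 2; the constant step hovers in a noise
# ball around 2; 1/(t+1)^2 runs out of travel budget and stalls far from 2.
```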

§8 TD Learning as Stochastic Approximation ★

This is the central connection. We derive explicitly how TD(0) fits the canonical SA form.

The TD(0) Update

TD(0) with linear FA
$$w_{t+1} = w_t + \alpha_t \underbrace{\left(r_{t+1} + \gamma w_t^\top \phi(s_{t+1}) - w_t^\top \phi(s_t)\right)}_{\delta_t \text{ (TD error)}} \phi(s_t)$$

Identifying the SA Components

The TD update can be written as \(w_{t+1} = w_t + \alpha_t H(w_t, s_t, s_{t+1})\) where:

$$H(w, s, s') = (r(s) + \gamma w^\top\phi(s') - w^\top\phi(s))\phi(s)$$

Taking the expectation over \((s, s') \sim d^\pi \times P^\pi\) (stationary distribution × transition):

$$h(w) = \mathbb{E}_{s,s'}[H(w,s,s')] = \Phi^\top D(r + \gamma P^\pi \Phi w - \Phi w) = b - Aw$$

where \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\) and \(b = \Phi^\top Dr\). The limiting ODE is:

$$\dot{w} = h(w) = b - Aw$$

Fixed point: \(w^* = A^{-1}b\) (the TD fixed point). The ODE converges iff \(A\) is positive definite.

Here \(D = \text{diag}(d^\pi)\) is the diagonal matrix of the stationary distribution under \(\pi\), \(\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}\) is the feature matrix, and \(P^\pi\) is the transition matrix under \(\pi\).


Noise decomposition:

$$M_{t+1} = H(w_t, s_t, s_{t+1}) - h(w_t)$$ $$= \delta_t \phi(s_t) - h(w_t)$$

Check: \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = \mathbb{E}[H(w_t, s_t, s_{t+1}) \mid w_t, s_t] - h(w_t) = h(w_t) - h(w_t) = 0\). ✓ MDS property holds.

Stability of ODE: \(A\) is positive definite under on-policy sampling. Since \(d^\pi\) is stationary (\((d^\pi)^\top P^\pi = (d^\pi)^\top\)), the transition matrix \(P^\pi\) is a non-expansion in the \(D\)-weighted norm, so \(x^\top D(I - \gamma P^\pi)x \geq (1-\gamma)\|x\|^2_D > 0\) for all \(x \neq 0\); with full-rank features, \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\) inherits positive definiteness.

📐 Theorem: TD(0) Convergence (Tsitsiklis & Van Roy, 1997)
Under on-policy sampling, linear function approximation, Robbins-Monro step sizes, and bounded features: \(w_t \to w^* = A^{-1}b\) almost surely, where \(\overline{VE}(w^*) \leq \frac{1}{1-\gamma} \min_w \overline{VE}(w)\).
[Interactive figure: TD(0) Weight Evolution. Weights converge to the TD fixed point w* under a selectable step-size schedule.]
⚙ TD(0) as Canonical SA
SA form: w_{t+1} = w_t + α_t [h(w_t) + M_{t+1}]
Where:
h(w) = b - Aw // mean field (expected update)
M_{t+1} = δ_t φ(s_t) - h(w_t) // martingale noise
A = Φᵀ D(I - γP^π)Φ // positive definite (on-policy)
b = Φᵀ Dr // expected reward features
ODE: ẇ = b - Aw → w* = A⁻¹b // globally stable when A is PD
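A numerical sketch of the boxed derivation, on a small made-up MRP: build \(A\) and \(b\) from the model, check that the symmetric part of \(A\) is positive definite, and verify that simulated TD(0) approaches \(w^* = A^{-1}b\).

```python
import numpy as np

# Small invented MRP with linear features: compute A, b, and w*, then run TD(0).
rng = np.random.default_rng(7)
gamma = 0.9
P = np.array([[0.3, 0.7, 0.0, 0.0],
              [0.0, 0.3, 0.7, 0.0],
              [0.0, 0.0, 0.3, 0.7],
              [0.7, 0.0, 0.0, 0.3]])   # 4-state ring; doubly stochastic, so d^pi is uniform
r = np.array([0.0, 1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [1.0, 2.0]])            # feature matrix, d = 2

d_pi = np.full(4, 0.25)                 # stationary distribution of P
D = np.diag(d_pi)
A = Phi.T @ D @ (np.eye(4) - gamma * P) @ Phi
b = Phi.T @ D @ r
w_star = np.linalg.solve(A, b)
print("eigs of sym(A):", np.round(np.linalg.eigvalsh((A + A.T) / 2), 3))  # all positive

w, s = np.zeros(2), 0
for t in range(1, 300_001):
    s_next = rng.choice(4, p=P[s])
    delta = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w
    w += (10.0 / (t + 100)) * delta * Phi[s]        # Robbins-Monro schedule
    s = s_next
print("TD(0) w :", np.round(w, 2))
print("w* = A^-1 b:", np.round(w_star, 2))          # the two should be close
```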

§9 Q-Learning as Stochastic Approximation

The Tabular Q-Learning Update

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t\left[r + \gamma \max_{a'} Q_t(s',a') - Q_t(s,a)\right]$$

Identifying the SA Structure

Q-learning is a nonlinear SA due to the max operator. The mean field is:

$$h(Q)(s,a) = \mathbb{E}_{s'}\left[r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] = (\mathcal{T}^* Q)(s,a) - Q(s,a)$$

where \(\mathcal{T}^*\) is the Bellman optimality operator. Fixed point: \(\mathcal{T}^* Q^* = Q^*\), so \(h(Q^*) = 0\). ✓

The Bellman optimality operator \(\mathcal{T}^*\) is a \(\gamma\)-contraction in \(L^\infty\): \(\|\mathcal{T}^* Q - \mathcal{T}^* Q'\|_\infty \leq \gamma \|Q - Q'\|_\infty\). This means the ODE \(\dot{Q} = h(Q) = \mathcal{T}^*Q - Q\) has \(Q^*\) as its unique globally asymptotically stable equilibrium. The Lyapunov function \(V(Q) = \|Q - Q^*\|_\infty\) works: \(\dot{V} \leq -(1-\gamma)\|Q - Q^*\|_\infty < 0\).

With function approximation \(Q(s,a;\theta)\), the fixed-point equation becomes \(\Pi \mathcal{T}^* \hat{Q}(\cdot;\theta) = \hat{Q}(\cdot;\theta)\), where \(\Pi\) is the projection onto the function class. The composed operator \(\Pi \mathcal{T}^*\) is NOT necessarily a contraction! This is because:

$$\|\Pi \mathcal{T}^* Q - \Pi \mathcal{T}^* Q'\| \leq \|\Pi\| \cdot \|\mathcal{T}^* Q - \mathcal{T}^* Q'\| \leq \gamma \|Q - Q'\|$$

But the norm must be the same for this to work. Under the \(\mu\)-weighted norm, \(\|\Pi\|_\mu \leq 1\) but \(\|\mathcal{T}^*\|\) is controlled in \(L^\infty\), not \(L^2_\mu\). The norms don't match, and the composition can be non-contractive → divergence possible.

⚠ Off-Policy + Max Operator
Q-learning is inherently off-policy (uses \(\max_{a'}\) not the current policy). This breaks the on-policy stationary distribution argument needed for the mean field matrix \(A\) to be positive definite. With function approximation, this is one corner of the Deadly Triad.

Watkins and Dayan proved tabular Q-learning converges a.s. if (1) all state-action pairs are visited infinitely often, (2) step sizes satisfy RM conditions. The proof uses the contraction property of \(\mathcal{T}^*\) directly:

Let \(\Delta_t(s,a) = Q_t(s,a) - Q^*(s,a)\). Define \(F_t = \max_{s,a} |\Delta_t(s,a)|\). The Q-learning update gives:

$$|\Delta_{t+1}(s_t,a_t)| \leq (1-\alpha_t)|\Delta_t(s_t,a_t)| + \alpha_t \gamma F_t + \alpha_t |M_{t+1}|$$

Taking sup and applying the Robbins-Siegmund theorem with \(V_t = F_t\), one shows \(F_t \to 0\) a.s., i.e., \(Q_t \to Q^*\) uniformly.
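A tabular sanity check on a tiny invented MDP (two states, two actions): Q-learning with a uniform random behavior policy, so every \((s,a)\) is visited infinitely often, and per-entry step sizes \(1/N(s,a)^{0.6}\), which satisfy the RM conditions. The result is compared against \(Q^*\) from value iteration.

```python
import numpy as np

# Tiny made-up MDP: P[s, a, s'] and r[s, a]; gamma = 0.9.
rng = np.random.default_rng(8)
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# Q* via value iteration (uses the exact model)
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

Q = np.zeros((2, 2))
N = np.zeros((2, 2))
s = 0
for _ in range(500_000):
    a = rng.integers(2)                          # uniform random behavior policy
    s_next = rng.choice(2, p=P[s, a])
    N[s, a] += 1
    target = r[s, a] + gamma * Q[s_next].max()   # sampled Bellman optimality backup
    Q[s, a] += (target - Q[s, a]) / N[s, a] ** 0.6   # polynomial step, RM conditions hold
    s = s_next
print("Q-learning:\n", np.round(Q, 2))           # close to Q*, up to residual noise
print("Q* (value iteration):\n", np.round(Q_star, 2))
```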

§10 Two-Timescale Stochastic Approximation

The Core Idea: Fast and Slow Iterates

When two interacting SA recursions need to be run simultaneously — and one converges much faster than the other — two-timescale SA provides the right framework.

🐇 Fast Timescale: the critic (value function), with step size \(\alpha_t \gg \beta_t\). Updates \(w_t\) (weights of \(\hat{V}\)) at a fast rate and sees the actor \(\theta_t\) as approximately fixed.
🐢 Slow Timescale: the actor (policy), with step size \(\beta_t \ll \alpha_t\). Updates \(\theta_t\) (policy parameters) slowly and sees the critic \(w_t\) as having converged to its steady state given \(\theta_t\).

Imagine a manager (actor) and an analyst (critic). The analyst computes reports (value estimates) much faster than the manager makes strategy decisions (policy updates). When the manager looks at the analyst's latest report, it's essentially the optimal report for the current strategy — because the analyst has had time to converge.

This timescale separation lets us decouple two coupled SA iterations: the fast one sees the slow one as approximately constant, and the slow one sees the fast one as already at its equilibrium.

Two-timescale SA (Borkar, 1997):

$$w_{t+1} = w_t + \alpha_t [h_1(w_t, \theta_t) + M_{t+1}^w]$$ $$\theta_{t+1} = \theta_t + \beta_t [h_2(w_t, \theta_t) + M_{t+1}^\theta]$$

with \(\beta_t / \alpha_t \to 0\) (i.e., \(\alpha_t \gg \beta_t\)).

Theorem (Borkar 1997): Under appropriate conditions, the coupled system converges as if:

  1. The fast iterate \(w_t\) tracks its ODE \(\dot{w} = h_1(w, \theta)\) with \(\theta\) frozen.
  2. It converges to \(\lambda(\theta)\) (the fast-system equilibrium given \(\theta\)).
  3. The slow iterate \(\theta_t\) then tracks the "reduced ODE" \(\dot{\theta} = h_2(\lambda(\theta), \theta)\).

Actor-Critic as Two-Timescale SA

⚙ Actor-Critic: Two-Timescale Decomposition
Fast (Critic): w_{t+1} = w_t + α_t δ_t φ(s_t) // TD update for V̂
h₁(w,θ) = b(θ) - A(θ)w // converges to w*(θ) = A(θ)⁻¹b(θ)
Slow (Actor): θ_{t+1} = θ_t + β_t ∇_θ log π_θ(a_t|s_t) δ_t // policy gradient
h₂(w,θ) = ∇_θ J(θ)|_{w=w*(θ)} // gradient w.r.t. converged critic
Condition: β_t / α_t → 0 // actor slower than critic — essential!
💡 Why Step-Size Ratio Matters
If \(\beta_t \approx \alpha_t\) (same timescale), the critic hasn't converged when the actor updates — the actor uses a biased value estimate, breaking the policy gradient. In practice, using \(\alpha_t = 0.1\) for critic and \(\beta_t = 0.001\) for actor approximates the two-timescale condition.
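A minimal two-timescale sketch (problem and constants invented): the fast variable tracks \(\lambda(\theta) = \theta - 3\), the slow variable descends \(f(\theta) = \tfrac{1}{2}(\theta - 3)^2\) using the fast variable as a stand-in for the gradient, and the step-size exponents give \(\beta_t/\alpha_t \to 0\).

```python
import numpy as np

# Toy coupled SA (invented for illustration). Fast: w tracks lambda(theta) = theta - 3.
# Slow: theta descends using w in place of the gradient; reduced ODE is theta_dot = -(theta - 3).
rng = np.random.default_rng(9)
w, theta = 0.0, -5.0
for t in range(1, 200_001):
    alpha = 0.5 / t ** 0.6            # fast step size
    beta = 0.5 / t ** 0.8             # slow step size; beta/alpha = t^(-0.2) -> 0
    w += alpha * ((theta - 3.0) - w + rng.normal(0.0, 0.5))   # fast: track lambda(theta_t)
    theta += beta * (-w + rng.normal(0.0, 0.5))               # slow: h_2 = -lambda(theta) at w = w*(theta)
    if t in (10, 100, 1_000, 200_000):
        print(f"t={t:7d}  theta={theta:7.3f}  w={w:7.3f}  lambda(theta)={theta - 3.0:7.3f}")
# theta approaches 3 while w shadows lambda(theta_t), as the two-timescale theorem predicts.
```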

§11 Where SA Breaks in RL

The Three Failure Modes

1. Function Approximation Changes the Mean Field

With linear FA, the TD update projects the Bellman target onto the function class. The mean field becomes \(h(w) = b - Aw\) where \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\). The critical question: is \(A\) positive definite? Under on-policy sampling, yes. Under off-policy, no — and SA with non-PD drift matrices can diverge (exploding iterates).

2. Off-Policy Sampling Skews the Mean Field

When using a behavior policy \(\mu \neq \pi\), the stationary distribution in the mean field is \(d^\mu\), but the Bellman equation targets \(V^\pi\). The matrix \(A\) becomes \(A = \Phi^\top D^\mu (I - \gamma P^\pi)\Phi\), which is not guaranteed PD because \(D^\mu\) and \(P^\pi\) correspond to different policies. The ODE can have unstable equilibria.
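The classic two-state example (standard in the literature, not derived in this text; here with \(\phi = (1, 2)\) and \(\gamma = 0.99\)) makes the failure concrete: under a uniform behavior distribution the scalar \(A\) is negative, while under the on-policy stationary distribution it is positive.

```python
import numpy as np

# Classic two-state "w -> 2w" example: the drift matrix loses positive
# definiteness when the sampling distribution does not match pi.
gamma = 0.99
phi = np.array([1.0, 2.0])                 # one feature: V_hat(s) = phi(s) * w
P_pi = np.array([[0.0, 1.0],               # under pi, both states move to state 2
                 [0.0, 1.0]])

def drift_A(d):
    """A = Phi^T D (I - gamma P^pi) Phi for a sampling distribution d over states."""
    D = np.diag(d)
    Phi = phi.reshape(-1, 1)
    return Phi.T @ D @ (np.eye(2) - gamma * P_pi) @ Phi

d_mu = np.array([0.5, 0.5])                # off-policy behavior visits both states equally
d_pi = np.array([0.0, 1.0])                # on-policy stationary distribution of P^pi
print("off-policy A =", drift_A(d_mu))     # -0.47: the ODE w_dot = b - A w is unstable
print("on-policy  A =", drift_A(d_pi))     # +0.04: stable
```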

3. Nonlinear FA: No Fixed-Point Guarantee

With neural networks, \(\hat{Q}(\cdot;\theta)\) is nonlinear in \(\theta\). The mean field \(h(\theta)\) may have multiple fixed points, saddle points, or no fixed point at all within the function class. The ODE analysis becomes intractable without strong assumptions on the network architecture and initialization.

💀 The Deadly Triad as SA Breakdown
FA + Bootstrapping + Off-policy = Each of the three SA conditions breaks simultaneously:
  • FA: Mean field \(h\) may not have a stable fixed point in the function class
  • Bootstrapping: The target \(r + \gamma \hat{V}(s';\theta)\) depends on \(\theta\), making the mean field calculation involve the current iterate — not just data
  • Off-policy: samples are drawn from the behavior distribution \(d^\mu\) while the Bellman target involves \(P^\pi\), so the drift matrix \(\Phi^\top D^\mu(I - \gamma P^\pi)\Phi\) need not be positive definite and the expected update can push away from the target value function

GTD (Sutton et al. 2009) fixes the off-policy SA breakdown by minimizing the Projected Bellman Error (PBE): \(\overline{PBE}(\theta) = \|\Pi(\mathcal{T}^\pi \hat{V}(\cdot;\theta) - \hat{V}(\cdot;\theta))\|^2_\mu\). This is a true objective with a well-defined gradient. The SA update is:

$$\theta_{t+1} = \theta_t + \alpha_t \rho_t (\phi(s_t) - \gamma \phi(s_{t+1}))(\phi(s_t)^\top w_t)$$
$$w_{t+1} = w_t + \beta_t (\delta_t - \phi(s_t)^\top w_t)\phi(s_t)$$

where \(\rho_t = \pi(a_t|s_t)/\mu(a_t|s_t)\) is the IS ratio. This is a two-timescale SA with provable convergence under off-policy sampling — the mean field now corresponds to the gradient of PBE, which always points toward the minimum.

§12 Practical Insights from SA Theory

Learning Rate Selection

| Schedule | Formula | Properties | Use Case |
| --- | --- | --- | --- |
| Polynomial decay | \(\alpha_t = c/(t+t_0)^\rho\) | RM conditions hold for \(\rho \in (0.5, 1]\) | Tabular RL, theoretical guarantees |
| Constant | \(\alpha_t = \alpha\) | Converges to a neighborhood of size \(\mathcal{O}(\alpha)\) | Deep RL, non-stationary targets |
| Square root | \(\alpha_t = c/\sqrt{t}\) | \(\sum \alpha_t = \infty\) but \(\sum \alpha_t^2 = \infty\): boundary case, RM conditions not met | Compromise: faster early progress |
| Adam/RMSProp | Adaptive per-parameter | No formal RM guarantees in the RL setting | Deep RL practice (DQN, PPO, SAC) |

Why Divergence Happens in Practice

📈 Unbounded Iterates: if h(θ) doesn't point back toward the stable set for large ‖θ‖, iterates can explode. Gradient clipping is a practical fix.
🎯 Moving Targets: bootstrapped targets (TD, Q-learning) change as θ changes. Constant step sizes + changing targets = persistent oscillation.
Too Large α: step size too large → Euler discretization error too large → ODE tracking breaks → iterates diverge from the ODE trajectory.
🔗 Correlated Samples: sequential samples break variance assumptions. Experience replay decorrelates samples, restoring approximate i.i.d. behavior.
✓ Practical Checklist for RL Algorithms
  1. Identify \(h(\theta)\) — what does the algorithm optimize in expectation?
  2. Check: does \(h\) have a stable fixed point? (Use Lyapunov or PD matrix argument)
  3. Is noise MDS? (Check: is sampling on-policy? Are IS ratios bounded?)
  4. Are iterates bounded? (Project onto compact set or verify via Lyapunov)
  5. Do step sizes satisfy RM? (Use \(1/(t+1)\) for theory, tune constant for practice)
  6. If two coupled recursions: is there timescale separation? (\(\beta_t/\alpha_t \to 0\))

§13 Historical Timeline

1951
Robbins & Monro — Stochastic Approximation Founded
The seminal paper introduces the RM algorithm for root-finding under noise. Proves a.s. convergence with decreasing step sizes. Establishes the two-condition step-size rule. The starting point of the entire field.
1977
Ljung — The ODE Method
Lars Ljung proves that SA iterates track solutions of a deterministic ODE (the "associated ODE"). Converts stochastic convergence analysis to deterministic stability analysis. This is the foundational tool for all RL convergence proofs.
1978
Kushner & Clark — Weak Convergence Theory for SA
Extends Ljung's ODE method using weak convergence / martingale problem techniques. Handles Markov noise and more general noise models. Sets up the rigorous framework needed for RL convergence analysis.
1988
Sutton — TD Learning
TD(λ) introduced as a method bridging Monte Carlo and dynamic programming. First major application of SA ideas to RL (though convergence not formally proven then). Starts the RL-SA connection.
1989–1992
Watkins & Dayan — Q-Learning Convergence
Q-learning introduced (Watkins 1989) and convergence proven (Watkins & Dayan 1992). First formal proof that an off-policy RL algorithm converges — using contraction of the Bellman operator, directly connecting to SA theory.
1997
Tsitsiklis & Van Roy — TD with Function Approximation
Prove a.s. convergence of linear TD(0) under on-policy sampling using the ODE method. Show the bounded approximation error result and the role of off-policy sampling in divergence (a precursor of the Deadly Triad). This remains the gold standard for RL convergence proofs.
1997–2000
Borkar — Two-Timescale SA
Vijay Borkar formalizes two-timescale SA and its convergence. Provides the mathematical foundation for actor-critic methods. The "Borkar-Meyn theorem" (2000) strengthens the ODE method with better conditions for RL.
2009
Sutton, Maei et al. — Gradient TD
GTD and TDC introduced as true-gradient methods converging under off-policy sampling. Fixes the SA breakdown from the Deadly Triad for linear FA. Uses two-timescale SA internally.
2013–present
Deep RL Era — SA Theory Lagging Practice
DQN, A3C, PPO, SAC achieve remarkable empirical results using SA intuitions (target networks, replay buffers) without formal SA guarantees. Recent work (mean-field theory, NTK analysis, implicit regularization) attempts to close the gap.
🗺 What You Now Have: An SA Toolkit for RL
After this guide, you can: (1) write any RL algorithm update as θ_{t+1} = θ_t + α_t[h(θ_t) + M_{t+1}]; (2) identify h(θ), check for stable fixed points via ODE analysis; (3) verify MDS noise; (4) apply the ODE method + Lyapunov functions; (5) understand why the Deadly Triad breaks these conditions; (6) read and understand the convergence proofs in Tsitsiklis & Van Roy (1997), Borkar & Meyn (2000), and related work.