Research-Grade Guide

Model-Free Reinforcement Learning

A complete conceptual and theoretical treatment β€” from intuition to stochastic approximation theory β€” designed for MTech (Research) and PhD students.

Β§ 1

Motivation & Intuition

Consider a robot learning to walk. The robot does not know the physics of its body β€” its joint torques, friction coefficients, or how the ground responds. Yet, through repeated trial and error, it can still learn a good walking policy. This is the essence of model-free reinforcement learning: learning to act optimally without ever explicitly constructing a model of the environment.

Core Intuition

An agent interacts with an environment, observing states, taking actions, and receiving rewards. In model-free RL, the agent learns the value of states or actions directly from this experience stream, without ever fitting a transition model \(P(s'|s,a)\) or reward model \(R(s,a)\).

Why Avoid Models?

Model-based RL sounds appealing β€” if you know the model, you can plan. But models are hard to learn correctly, especially in high-dimensional continuous spaces. Errors in the model compound: a slightly wrong transition model leads to systematically biased value estimates, which leads to bad policies. Model-free methods bypass this entirely.

βš™οΈ Model-Based RL

Learns \(\hat{P}(s'|s,a)\) and \(\hat{R}(s,a)\), then plans using dynamic programming. More sample-efficient, but errors in the learned model can mislead planning. Works well when the model is learnable and accurate.

🎯 Model-Free RL

Learns value functions or policies directly from experience. More robust to complex dynamics. Works even when the environment is a black box. The focus of this guide.

Formally, in model-free RL, we never compute or estimate the transition kernel \(P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})\). Instead, we estimate value functions:

Value Function (Model-Free Target)
\[ V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \;\bigg|\; S_0 = s \right] \]

This expectation is estimated purely from sample trajectories \( (S_0, A_0, R_1, S_1, A_1, \ldots)\) obtained by running policy \(\pi\) in the environment.


Β§ 2

Historical Context

Model-free RL has a rich intellectual genealogy spanning more than half a century, with contributions from mathematics, psychology, neuroscience, and computer science.

Timeline of Key Ideas

Key Milestones

1950s β€” Bellman & Dynamic Programming

Richard Bellman formulated the optimality equations that bear his name. The Bellman equation is the theoretical heart of all value-based RL, but requires a known model to apply directly.

1950s–60s β€” Robbins-Monro & Stochastic Approximation

Herbert Robbins and Sutton Monro introduced stochastic approximation β€” iterative algorithms for root-finding under noise. This would become the theoretical backbone of all model-free RL algorithms.

1988 β€” Sutton's TD Learning

Richard Sutton introduced Temporal Difference learning in his landmark 1988 paper, bridging Monte Carlo methods and dynamic programming. This paper essentially founded modern model-free RL.

1989 β€” Q-Learning (Watkins)

Chris Watkins introduced Q-learning, the first provably convergent off-policy model-free control algorithm. The convergence proof, completed by Watkins and Dayan in 1992, used stochastic approximation theory.


Β§ 3

Formal Setup: MDPs & Value Functions

Markov Decision Process

A Markov Decision Process (MDP), a mathematical framework for sequential decision-making in which outcomes are partly random and partly under the agent's control, is a tuple \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)\):

MDP Components
\[ \begin{aligned} \mathcal{S} &: \text{State space} \\ \mathcal{A} &: \text{Action space} \\ P(s'|s,a) &: \text{Transition kernel (unknown in model-free RL)} \\ R(s,a) &: \text{Expected reward function} \\ \gamma \in [0,1) &: \text{Discount factor} \end{aligned} \]

Bellman Equations

The Bellman equation decomposes the value of a state into the immediate reward plus the discounted value of its successor states, a recursive relationship that expresses \(V^\pi\) in terms of itself:

Bellman Expectation Equation
\[ V^\pi(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right] \]
Bellman Optimality Equation
\[ V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right] \]

In model-free RL, we cannot evaluate these equations directly because we do not know \(P(s'|s,a)\). We must estimate these values from samples. This leads to Monte Carlo and TD methods.

Action-Value Functions

Q-Function
\[ Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t R_{t+1} \;\Big|\; S_0=s, A_0=a\right] \]

Q-functions are more convenient for control because the greedy policy update is \(\pi'(s) = \arg\max_a Q(s,a)\), requiring no model knowledge.


Β§ 4

Monte Carlo Methods

Intuition

Imagine you want to estimate the average height of people in a city. You can't measure everyone, but you can sample a large number of people and average the measurements. The law of large numbers guarantees your estimate converges to the truth.

Monte Carlo (MC) RL applies exactly this idea: run an episode to completion, collect the return, and use it as a sample estimate of the true value.

Key Insight

Because \(V^\pi(s) = \mathbb{E}[G_t | S_t = s]\), we can estimate it by averaging many observed returns \(G_t\) from visits to state \(s\). This is just empirical averaging β€” no model needed.

Mathematics

Return from time t
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} \]

For an episodic task (terminal state reached at time \(T\)), this simplifies to:

\[ G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \]

First-Visit MC Estimation

Algorithm: First-Visit MC Policy Evaluation
// Initialize
Initialize \(V(s) = 0\), Returns\((s) = []\) for all \(s \in \mathcal{S}\)

for each episode:
  Generate episode: \(S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T\)
  \(G \leftarrow 0\)
  for \(t = T-1, T-2, \ldots, 0\):
    \(G \leftarrow \gamma G + R_{t+1}\)
    if \(S_t\) not in \(\{S_0, \ldots, S_{t-1}\}\):
      Returns\((S_t)\).append\((G)\)
      \(V(S_t) \leftarrow \) average(Returns\((S_t)\))

Incremental Update Form

Instead of storing all returns, we can update incrementally. After the \(n\)-th visit to state \(s\) yields return \(G_t\) (so the visit counter satisfies \(N(s) = n\)):

Incremental MC Update
\[ V(s) \leftarrow V(s) + \frac{1}{N(s)} \left[ G_t - V(s) \right] \]

This has the form of a stochastic approximation update β€” a preview of the theory in Section 8.
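As a concrete illustration, here is a minimal Python sketch of first-visit MC evaluation using exactly this incremental \(1/N(s)\) update. The episode generator `run_episode` (returning a list of `(state, reward)` pairs produced by a fixed policy) is an assumed interface for illustration, not part of any particular library.

```python
from collections import defaultdict

def first_visit_mc(run_episode, gamma=0.99, num_episodes=5000):
    """First-visit MC policy evaluation with the incremental 1/N(s) update."""
    V = defaultdict(float)   # value estimates, default 0
    N = defaultdict(int)     # visit counts per state
    for _ in range(num_episodes):
        episode = run_episode()              # [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)     # earliest time step at which each state appears
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r                # return from time t
            if first_visit[s] == t:          # update only on the first visit to s
                N[s] += 1
                V[s] += (G - V[s]) / N[s]    # incremental average of observed returns
    return dict(V)
```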

🎲 Monte Carlo Return Estimation

Watch the running average of returns converge to the true value. Each episode generates one noisy return sample.


Pros & Cons

βœ… Advantages

β€’ Unbiased estimates of \(V^\pi(s)\)
β€’ No bootstrapping β€” no propagation of estimate errors
β€’ Simple and easy to implement
β€’ Works with non-Markov environments

❌ Disadvantages

β€’ High variance (full trajectories can vary enormously)
β€’ Must wait for episode completion
β€’ Slow convergence: estimation error shrinks only as \(O(1/\sqrt{n})\) in the number of sampled returns
β€’ Cannot handle continuing tasks without truncation


Β§ 5

Temporal Difference Learning

The Key Idea: Bootstrapping

MC methods wait for the full return. Can we do better? Yes, by using our current estimate of future values to update the current estimate. This is called bootstrapping: updating estimates based on other estimates rather than waiting for the final ground-truth outcome, like pulling yourself up by your own bootstraps, using current knowledge to improve current knowledge.

Bootstrapping Intuition

Instead of waiting until the end of an episode to see the total return \(G_t\), TD uses the one-step return: the observed reward plus the discounted estimated value of the next state. This is biased but lower variance.

TD(0): The Simplest TD Algorithm

After observing the transition \((S_t, A_t, R_{t+1}, S_{t+1})\), TD(0) performs the update:

TD(0) Update Rule
\[ V(S_t) \leftarrow V(S_t) + \alpha \underbrace{\left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]}_{\delta_t: \text{ TD error}} \]

The term \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is called the TD error. It measures the discrepancy between the current estimate and the one-step estimate. When \(\delta_t = 0\) for all states, the values satisfy the Bellman equation.
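A minimal tabular TD(0) sketch in Python. The environment interface `env_step(s, a) -> (reward, next_state, done)` and the fixed `policy(s)` are assumptions for illustration.

```python
from collections import defaultdict

def td0(env_step, policy, start_state, alpha=0.1, gamma=0.99, num_steps=100_000):
    """Tabular TD(0) policy evaluation from a stream of one-step transitions."""
    V = defaultdict(float)
    s = start_state
    for _ in range(num_steps):
        a = policy(s)
        r, s_next, done = env_step(s, a)                 # assumed environment interface
        target = r if done else r + gamma * V[s_next]    # one-step bootstrapped target
        V[s] += alpha * (target - V[s])                  # move toward the target by the TD error
        s = start_state if done else s_next              # restart episodes on termination
    return dict(V)
```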

Target vs. Estimate

MC target: \(G_t = R_{t+1} + \gamma R_{t+2} + \cdots\) (unbiased, high variance)
TD target: \(R_{t+1} + \gamma V(S_{t+1})\) (biased, lower variance)

⚑ TD(0) Value Update Animation

Each step shows a single TD update. The value estimates converge toward the true values over time.


Bias-Variance Tradeoff

This is one of the most important conceptual tensions in RL:

Property | Monte Carlo | TD(0)
Bias | Zero (uses the true return) | Biased (depends on current \(V\))
Variance | High (long horizon, many random steps) | Low (one-step lookahead only)
Data efficiency | Lower | Higher (online updates)
Convergence | Guaranteed (with enough samples) | Guaranteed (tabular, on-policy)
Markov property required? | No | Yes (to be unbiased in the limit)

Unifying View: TD(Ξ»)

TD(Ξ») interpolates between TD(0) and MC by mixing n-step returns:

Ξ»-return
\[ G_t^\lambda = (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}, \quad \text{where } G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n}) \]

At \(\lambda=0\): pure TD(0). At \(\lambda=1\): equivalent to MC. Values in between trade bias for variance.
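For episodic tasks the Ξ»-return can also be computed with the equivalent backward recursion \(G_t^\lambda = R_{t+1} + \gamma\left[(1-\lambda)V(S_{t+1}) + \lambda G_{t+1}^\lambda\right]\). A small sketch, assuming `rewards` and `values` are arrays collected from one completed episode:

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.9):
    """Forward-view lambda-returns G_t^lambda for one completed episode.

    rewards[t] = R_{t+1}; values[t] = current estimate V(S_{t+1}),
    with values[-1] = 0 for the terminal state.
    """
    T = len(rewards)
    G = [0.0] * T
    g_next = 0.0                                  # lambda-return beyond the terminal state
    for t in reversed(range(T)):
        # G_t^lam = R_{t+1} + gamma * [(1 - lam) * V(S_{t+1}) + lam * G_{t+1}^lam]
        g_next = rewards[t] + gamma * ((1.0 - lam) * values[t] + lam * g_next)
        G[t] = g_next
    return G

# lam = 0.0 reproduces the TD(0) targets; lam = 1.0 reproduces the Monte Carlo returns.
```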


Β§ 6

Monte Carlo vs. Temporal Difference (Critical Comparison)

This comparison is arguably the most important conceptual distinction in model-free RL.

Sutton's Key Insight (1988)

TD methods combine the sampling idea of Monte Carlo with the bootstrapping idea of Dynamic Programming. They are neither β€” and better than both in many practical settings.

When to prefer MC?

MC is preferred when: (1) episodes are short, (2) the Markov property may not hold, (3) you need unbiased estimates, or (4) the environment is non-stationary in complex ways.

When to prefer TD?

TD is preferred when: (1) episodes are long or continuing, (2) online learning is important, (3) the Markov property holds, or (4) sample efficiency matters.

πŸ“ Theoretical Comparison: Convergence Properties β–Ό

Both MC and TD(0) converge to \(V^\pi\) under appropriate conditions. However, their convergence properties differ:

MC Convergence (LLN)
\[ \hat{V}_n(s) = \frac{1}{n}\sum_{i=1}^n G_t^{(i)} \xrightarrow{a.s.} V^\pi(s) \text{ as } n\to\infty \]
TD(0) Convergence (Tsitsiklis, 1994)
\[ V_t(s) \xrightarrow{a.s.} V^\pi(s) \text{ under Robbins-Monro step sizes and ergodicity} \]

TD convergence is more subtle and requires stochastic approximation theory (Section 8). MC convergence is a direct application of the law of large numbers.


Β§ 7

TD Control: SARSA & Q-Learning

To find an optimal policy, we need control algorithms β€” algorithms that improve the policy, not just evaluate it. In model-free settings, we typically learn Q-values and apply greedy improvement.

SARSA: On-Policy TD Control

SARSA learns Q-values using the actual action taken by the current policy (on-policy):

SARSA Update (S, A, R, S', A' β†’ update)
\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] \]

SARSA converges to the optimal Q-function under GLIE conditions: all state-action pairs are visited infinitely often and the behavior policy becomes greedy in the limit (e.g., Ξ΅-greedy with Ξ΅ decaying to zero), together with Robbins-Monro step sizes.

Q-Learning: Off-Policy TD Control

Q-learning learns the optimal Q-function, regardless of the policy being followed (off-policy):

Q-Learning Update (Watkins, 1989)
\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right] \]
Key Difference

SARSA bootstraps off the next action actually taken \(A_{t+1} \sim \pi\). Q-learning bootstraps off the greedy best action \(\max_{a'} Q(S_{t+1}, a')\), making it off-policy.
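A tabular Q-learning sketch with Ξ΅-greedy exploration. The environment interface `env_step(s, a) -> (reward, next_state, done)` is an assumption for illustration; replacing the max target with the value of the action actually selected next turns the same loop into SARSA.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning(env_step, start_state, actions, alpha=0.1, gamma=0.99,
               eps=0.1, num_steps=100_000):
    """Tabular Q-learning: off-policy bootstrap on max_a' Q(s', a')."""
    Q = defaultdict(float)
    s = start_state
    for _ in range(num_steps):
        a = epsilon_greedy(Q, s, actions, eps)
        r, s_next, done = env_step(s, a)          # assumed environment interface
        # Off-policy target; for SARSA, bootstrap on Q[(s_next, a_next)] with a_next ~ eps-greedy.
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = start_state if done else s_next
    return Q
```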

πŸ”¬ Q-Learning Convergence Theorem

Theorem (Watkins & Dayan, 1992): Q-learning converges to \(Q^*\) almost surely if:

\[ \sum_{t=0}^\infty \alpha_t(s,a) = \infty, \quad \sum_{t=0}^\infty \alpha_t^2(s,a) < \infty \quad \forall (s,a) \]

and all state-action pairs are visited infinitely often. This is a consequence of stochastic approximation theory, which we study in depth next.


Β§ 8

Stochastic Approximation (Core Theory)

Stochastic approximation (SA) is the mathematical theory underlying the convergence of all model-free RL algorithms. Understanding it is essential for anyone doing RL research. This section is the theoretical core of the guide.

(A) Intuition: Learning from Noisy Observations

Suppose you want to find the temperature at which a chemical reaction reaches equilibrium β€” but your thermometer is noisy. You can't solve it analytically. Instead, you make repeated noisy measurements and average them. Each noisy measurement is a "stochastic approximation" to the truth.

Now generalize: suppose you want to find \(\theta^*\) such that \(h(\theta^*) = 0\), but you can only observe \(h(\theta) + \text{noise}\). Can you still find \(\theta^*\)? Yes β€” with the right iterative algorithm.

The deeper connection: TD learning is trying to solve the Bellman equation. The Bellman operator \(\mathcal{T}^\pi V\) satisfies \(\mathcal{T}^\pi V^\pi = V^\pi\), so finding \(V^\pi\) is equivalent to finding a fixed point. But we cannot apply \(\mathcal{T}^\pi\) directly β€” we only have noisy one-step samples from the environment.

Stochastic approximation is the framework for finding fixed points (or zeros) of operators when you only have noisy, sampled access to them.

Core Correspondence

DP: \(V \leftarrow \mathcal{T}^\pi V\) (requires model)
SA: \(V \leftarrow V + \alpha[\text{noisy sample of } \mathcal{T}^\pi V - V]\) (model-free)

(B) Canonical Form of Stochastic Approximation

The standard SA recursion takes the following form:

Stochastic Approximation Recursion
\[ \theta_{t+1} = \theta_t + \alpha_t \left( h(\theta_t) + \xi_t \right) \]
β€’ \(\theta_t\): current parameter estimate at iteration \(t\)
β€’ \(\alpha_t\): step size (learning rate) at time \(t\); must satisfy the Robbins-Monro conditions
β€’ \(h(\theta_t)\): the "mean field", what the update is trying to do on average; the goal is to solve \(h(\theta^*)=0\)
β€’ \(\xi_t\): noise term, a zero-mean perturbation due to stochasticity in the data/environment

The goal: under appropriate conditions on \(h\), \(\alpha_t\), and \(\xi_t\), we want \(\theta_t \to \theta^*\) almost surely, where \(h(\theta^*) = 0\).

(C) The Robbins-Monro Algorithm

The original SA paper by Robbins & Monro (1951) addressed the following problem: find \(\theta^*\) such that \(M(\theta^*) = c\) for a prescribed level \(c\), given only noisy observations of \(M(\theta)\).

Problem Setup (Robbins & Monro, 1951)

Given a function \(M: \mathbb{R} \to \mathbb{R}\) with \(M(\theta^*) = c\), we can only observe \(Y = M(\theta) + \epsilon\) with \(\mathbb{E}[\epsilon\,|\,\theta] = 0\). Find \(\theta^*\).

Robbins-Monro Update
\[ \theta_{t+1} = \theta_t - \alpha_t \left( Y_t - c \right), \quad Y_t = M(\theta_t) + \epsilon_t \]

Setting \(h(\theta) = c - M(\theta)\), this fits the canonical SA form. If \(M\) is non-decreasing and crosses the level \(c\) only at \(\theta^*\), the algorithm converges to \(\theta^*\).
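A small numerical sketch of the Robbins-Monro iteration, applied to a toy problem (find \(\theta\) with \(\tanh\theta = 0.5\) from noisy measurements); the function names are illustrative only.

```python
import math
import random

def robbins_monro(noisy_M, c, theta0=0.0, num_iters=100_000):
    """Drive M(theta) toward the level c using only noisy observations Y_t."""
    theta = theta0
    for t in range(1, num_iters + 1):
        alpha_t = 1.0 / t                 # harmonic steps: sum diverges, sum of squares converges
        y = noisy_M(theta)                # Y_t = M(theta_t) + noise
        theta -= alpha_t * (y - c)        # step against the observed residual
    return theta

# Toy problem: M(theta) = tanh(theta), target level c = 0.5, Gaussian measurement noise.
noisy_tanh = lambda th: math.tanh(th) + random.gauss(0.0, 0.3)
print(robbins_monro(noisy_tanh, c=0.5), math.atanh(0.5))   # estimate vs. true root ~0.549
```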

πŸ“ Convergence of Robbins-Monro (Sketch) β–Ό

Theorem (Robbins-Monro, 1951): If (1) \(M\) is non-decreasing with \(M(\theta^*) = c\), (2) \(\sum_t \alpha_t = \infty\), (3) \(\sum_t \alpha_t^2 < \infty\), and (4) \(\text{Var}(Y_t|\theta_t) \leq C\) for some constant \(C\), then:

\[ \mathbb{E}[(\theta_t - \theta^*)^2] \to 0 \text{ as } t \to \infty \]

Proof sketch: Define \(e_t = \theta_t - \theta^*\). Then:

\[ e_{t+1} = e_t + \alpha_t h(\theta_t) + \alpha_t \xi_t \]

Squaring this recursion and taking expectations, one uses the decay of \(\alpha_t\) to control the accumulated noise and the sign of \(h\) (which always points toward \(\theta^*\)) to show that the mean term drives \(e_t \to 0\).

(D) Connection to TD Learning

Now we reveal why TD learning works. Let us show that TD(0) is exactly a stochastic approximation algorithm.

Recall the TD(0) update for state \(s\):

\[ V(s) \leftarrow V(s) + \alpha \left[ R + \gamma V(s') - V(s) \right] \]

Now, what is the expected update? Taking expectation over the random transition \((R, s')\):

Mean Field of TD(0)
\[ \mathbb{E}[R + \gamma V(s') - V(s) \;|\; S_t = s] = \underbrace{\left(\mathcal{T}^\pi V\right)(s) - V(s)}_{h(V)(s)} \]

where \(\mathcal{T}^\pi\) is the Bellman operator. So TD(0) fits the SA form with:

β€’ \(\theta_t\): the current value function \(V_t\)
β€’ \(h(\theta_t)\): \(\mathcal{T}^\pi V_t - V_t\) (the Bellman residual, zero exactly at \(V^\pi\))
β€’ \(\xi_t\): \(\left[R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)\right] - h(V_t)(S_t)\) (deviation of the sampled update from its mean)
Fundamental Insight

TD learning is a stochastic approximation for solving the Bellman fixed-point equation \(\mathcal{T}^\pi V = V\). Each TD update is a noisy step in the direction of the Bellman operator. The noise comes from using a single sample transition instead of the full expectation.
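This identity can be checked numerically. The sketch below builds a tiny two-state chain with a known model (an assumption purely for illustration) and verifies that the average sampled TD error matches the exact Bellman residual \((\mathcal{T}^\pi V)(s) - V(s)\).

```python
import random

gamma = 0.9
V = {0: 0.2, 1: -0.1}                      # arbitrary current value estimates
# Known model, used ONLY to compute the exact Bellman residual for comparison:
# from state 0, go to state 1 (reward 1) w.p. 0.7 or stay (reward 0) w.p. 0.3; state 1 returns to 0.
P = {0: [(0.7, 1, 1.0), (0.3, 0, 0.0)], 1: [(1.0, 0, 0.0)]}

def sample_transition(s):
    u, acc = random.random(), 0.0
    for p, s_next, r in P[s]:
        acc += p
        if u < acc:
            return r, s_next
    return r, s_next                       # numerical safety fallback

def bellman_residual(s):
    """Exact mean field h(V)(s) = (T^pi V)(s) - V(s)."""
    return sum(p * (r + gamma * V[sn]) for p, sn, r in P[s]) - V[s]

def mean_td_error(s, n=200_000):
    """Average of the sampled TD error R + gamma * V(S') - V(S) over n transitions."""
    total = 0.0
    for _ in range(n):
        r, s_next = sample_transition(s)
        total += r + gamma * V[s_next] - V[s]
    return total / n

print(bellman_residual(0), mean_td_error(0))   # the two numbers agree up to sampling error
```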

(E) The ODE Method (Key Tool)

The most powerful technique for analyzing SA algorithms is the Ordinary Differential Equation (ODE) method, introduced by Ljung (1977) and developed by Kushner & Clark, Borkar, and others.

Core Idea

As the step sizes \(\alpha_t \to 0\), the SA trajectory \(\{\theta_t\}\) starts to look like the continuous-time flow of the ODE \(\dot{\theta} = h(\theta)\). The discrete stochastic system "shadows" the deterministic ODE.

Limiting ODE
\[ \frac{d\theta}{dt} = h(\theta) \]

Informal statement: If the ODE \(\dot{\theta} = h(\theta)\) has a globally asymptotically stable equilibrium \(\theta^*\) (i.e., \(h(\theta^*) = 0\) and every trajectory of the ODE converges to \(\theta^*\)), then under appropriate conditions, \(\theta_t \to \theta^*\) almost surely.

πŸŒ€ SA Convergence vs. ODE Flow

The stochastic SA iterate follows the deterministic ODE trajectory. As Ξ± decreases, the SA path hugs the ODE flow more closely.
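A rough sketch of this shadowing effect, comparing SA iterates with an Euler integration of the limiting ODE for a simple linear mean field \(h(\theta) = -(\theta - \theta^*)\). The time axes of the two paths are not aligned; the point is that both settle at \(\theta^*\).

```python
import random

theta_star = 2.0
h = lambda th: -(th - theta_star)          # linear mean field with stable equilibrium theta*

def sa_path(theta0=-3.0, num_iters=2000, noise_std=1.0):
    """SA iterates: theta <- theta + alpha_t * (h(theta) + noise), with alpha_t = 1/t."""
    theta = theta0
    for t in range(1, num_iters + 1):
        theta += (1.0 / t) * (h(theta) + random.gauss(0.0, noise_std))
    return theta

def ode_path(theta0=-3.0, num_steps=2000, dt=0.01):
    """Euler integration of the limiting ODE d(theta)/dt = h(theta)."""
    theta = theta0
    for _ in range(num_steps):
        theta += dt * h(theta)
    return theta

print(sa_path(), ode_path(), theta_star)   # both end near theta* = 2.0
```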

πŸ“ ODE Method: Formal Statement (Borkar-Meyn, 2000) β–Ό

Theorem: Consider the SA recursion \(\theta_{t+1} = \theta_t + \alpha_t(h(\theta_t) + \xi_t)\). Assume:

  1. \(h\) is Lipschitz continuous
  2. Robbins-Monro step sizes: \(\sum \alpha_t = \infty\), \(\sum \alpha_t^2 < \infty\)
  3. Noise: \(\mathbb{E}[\xi_{t+1} | \mathcal{F}_t] = 0\), \(\sup_t \mathbb{E}[\|\xi_t\|^2 | \mathcal{F}_{t-1}] < \infty\) a.s.
  4. The ODE \(\dot{\theta} = h(\theta)\) has a globally asymptotically stable equilibrium \(\theta^*\)
  5. The iterates \(\{\theta_t\}\) are bounded almost surely

Then \(\theta_t \to \theta^*\) almost surely.

The key technical tool is the "martingale difference" property of the noise and the Kushner-Clark lemma showing the iterate's interpolated trajectory stays close to the ODE flow.

(F) Conditions for Convergence

1. Robbins-Monro Step Size Conditions

The step sizes \(\{\alpha_t\}\) must satisfy:

Robbins-Monro Conditions
\[ \sum_{t=0}^\infty \alpha_t = \infty \quad \text{(divergence: enough total learning)} \] \[ \sum_{t=0}^\infty \alpha_t^2 < \infty \quad \text{(square-summability: noise dies out)} \]

Why \(\sum \alpha_t = \infty\)?

The algorithm must be able to "reach" the solution from any starting point. If the total step size is finite, the algorithm can only move a bounded distance from initialization β€” and may stop before reaching \(\theta^*\).

Why \(\sum \alpha_t^2 < \infty\)?

The noise contribution at step \(t\) is \(\alpha_t \xi_t\), whose variance is of order \(\alpha_t^2 \sigma^2\). For the total accumulated noise variance to remain finite (so that the noise eventually washes out rather than dominating), we need \(\sum \alpha_t^2 < \infty\).

Typical Step Size Choices

Schedule | Formula | Satisfies RM? | Notes
Harmonic (standard) | \(\alpha_t = \frac{1}{t+1}\) | βœ… Yes | Optimal asymptotic rate; \(\sum 1/t = \infty\), \(\sum 1/t^2 < \infty\)
Constant | \(\alpha_t = \alpha\) | ❌ No | Fails \(\sum \alpha_t^2 < \infty\); converges to a neighborhood, not the exact solution
Polynomial decay | \(\alpha_t = \frac{1}{(t+1)^\beta}\), \(\beta \in (0.5, 1]\) | βœ… Yes | Larger early steps for \(\beta < 1\), but slower noise averaging
Too-fast decay | \(\alpha_t = \frac{1}{(t+1)^{1.5}}\) | ❌ No | Fails \(\sum \alpha_t = \infty\); the algorithm freezes
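A quick simulation contrasting such schedules on the simplest SA problem, estimating a mean from noisy samples; the constants and labels are arbitrary choices for illustration.

```python
import random

def sa_mean_estimate(step_size, true_mean=5.0, noise_std=2.0, num_iters=50_000):
    """Estimate a mean by SA: theta <- theta + alpha_t * (sample - theta)."""
    theta = 0.0
    for t in range(1, num_iters + 1):
        sample = true_mean + random.gauss(0.0, noise_std)
        theta += step_size(t) * (sample - theta)
    return theta

schedules = {
    "harmonic 1/t (RM holds)":       lambda t: 1.0 / t,              # converges to 5.0
    "constant 0.1 (RM fails)":       lambda t: 0.1,                  # hovers in a noise ball around 5.0
    "too fast 1/(t+1)^2 (RM fails)": lambda t: 1.0 / (t + 1) ** 2,   # freezes well short of 5.0 (~2.5)
}
for name, sched in schedules.items():
    print(f"{name}: {sa_mean_estimate(sched):.3f}")
```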

2. Noise Conditions

The noise \(\xi_t\) must be a martingale difference sequence: \(\mathbb{E}[\xi_{t+1} | \mathcal{F}_t] = 0\), where \(\mathcal{F}_t\) is the natural filtration. Additionally, the second moments must be bounded: \(\mathbb{E}[\|\xi_t\|^2 | \mathcal{F}_{t-1}] \leq C(1 + \|\theta_t\|^2)\).

3. Boundedness

A technically subtle requirement: the iterate \(\theta_t\) must remain bounded almost surely. In practice, this is often ensured by projection onto a compact set, or by showing that \(h\) has certain stability properties. The Borkar-Meyn theorem gives conditions under which boundedness can be established.

(G) Challenges in Applying SA to RL

Standard SA theory assumes i.i.d. noise. RL violates this in fundamental ways.

πŸ”΄ Markov Noise

In RL, the noise \(\xi_t\) comes from the Markov chain of states. It is not i.i.d. β€” consecutive samples are correlated. The theory must be extended to handle "Markov noise" using mixing time arguments.

🟠 Bootstrapping

TD learning updates toward a target that itself depends on the current estimate. This creates a "moving target" effect β€” the function \(h\) is effectively changing over time, which is not handled by basic SA theory.

🟑 Function Approximation

With function approximation (e.g., neural networks), the SA iterate lives in parameter space, not value function space. The effective mean field \(h\) becomes non-contractive, and convergence can fail.

πŸ”΅ Off-Policy Distribution Shift

In off-policy learning, the data distribution (behavior policy) doesn't match the target policy. This violates the stationarity assumption needed for the noise to be martingale difference.

(H) The Deadly Triad

Sutton & Barto (2018): The Deadly Triad

Instability and divergence of value function estimation arise when all three of the following are combined:

1. Function Approximation

Using a parameterized function class (linear, neural net) to represent \(V\) or \(Q\). Necessary for large state spaces, but introduces approximation error.

2. Bootstrapping

Using current estimates as targets (as in TD). Efficient but creates feedback loops in the SA updates that can destabilize.

3. Off-Policy Learning

Learning about a target policy using data from a different behavior policy. Efficient for data reuse, but the distribution mismatch breaks the martingale difference structure of the noise.

From an SA perspective: the deadly triad violates the ODE method's key assumption that the mean field \(h\) drives the system to a stable equilibrium. When all three are combined, \(h\) can be non-contractive, leading to divergence. Baird's counterexample (1995) gives a simple linear FA + TD + off-policy setting that diverges provably.

πŸ’₯ SA Convergence: Good vs. Deadly Triad

A stable SA iteration (Robbins-Monro conditions satisfied) contrasted with a divergent one (constant \(\alpha\) and a non-contractive \(h\)), an instability analogous to the deadly triad.
πŸ’‘ (I) Practical Interpretation: Why Learning Rates Matter

Too large Ξ±

The noise term \(\alpha \xi_t\) dominates β€” the algorithm oscillates wildly. The SA iterate never settles. In deep RL, this manifests as divergence of Q-values or unstable training curves.

Too small Ξ± (decays too fast)

If \(\alpha_t \to 0\) too fast (e.g., \(\alpha_t = 1/t^2\)), the algorithm loses the ability to update. The iterate "freezes" before reaching the optimum. This is the failure mode in which \(\sum \alpha_t < \infty\).

Optimal: slow polynomial decay

The harmonic schedule \(\alpha_t = 1/(t+1)\) is asymptotically optimal for quadratic SA problems. In practice, constant step sizes are used with experience replay to break correlations and reduce effective variance β€” a pragmatic alternative to the theoretical schedule.


Β§ 9

Function Approximation

Tabular RL (one value per state) is impractical for large or continuous state spaces. Function approximation replaces the table with a parameterized function \(\hat{V}(s; \theta) \approx V^\pi(s)\).

Linear Function Approximation

Linear Value Function
\[ \hat{V}(s; \theta) = \theta^\top \phi(s) \]

where \(\phi(s) \in \mathbb{R}^d\) is a feature vector. The TD(0) update with linear FA becomes:

Semi-gradient TD(0) with Linear FA
\[ \theta_{t+1} = \theta_t + \alpha_t \underbrace{\left[ R_{t+1} + \gamma \theta_t^\top \phi(S_{t+1}) - \theta_t^\top \phi(S_t) \right]}_{\delta_t \text{ (TD error)}} \phi(S_t) \]

This is a stochastic approximation in \(\theta\)-space. The mean field becomes \(h(\theta) = \mathbb{E}[\delta_t \phi(S_t)]\), and the ODE method applies β€” provided we are on-policy (to avoid the deadly triad).

Why "Semi-Gradient"?

True gradient descent on \(\|\hat{V}(\cdot;\theta) - \mathcal{T}^\pi \hat{V}(\cdot;\theta)\|^2\) would require differentiating through the target \(\mathcal{T}^\pi \hat{V}\) (and, with stochastic transitions, two independent samples of the next state). Semi-gradient TD instead treats the target as fixed and follows only the gradient of the prediction; under on-policy sampling it still converges, to the TD fixed point.

Connection to SA Theory

With on-policy linear FA, the mean field can be written as \(h(\theta) = A\theta + b\) where:

\[ A = \mathbb{E}\left[\phi(S_t)\left(\gamma\phi(S_{t+1}) - \phi(S_t)\right)^\top\right], \quad b = \mathbb{E}\left[R_{t+1}\phi(S_t)\right] \]

Under on-policy sampling, \(A\) has eigenvalues with negative real parts β€” ensuring the ODE \(\dot{\theta} = A\theta + b\) is stable and converges to the unique fixed point \(\theta^* = -A^{-1}b\).
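A sketch of semi-gradient TD(0) with linear features, together with a sample-based estimate of \(A\), \(b\), and the fixed point \(\theta^* = -A^{-1}b\). The `sample_transition()` and `phi(s)` interfaces are assumptions for illustration.

```python
import numpy as np

def semi_gradient_td0(sample_transition, phi, d, alpha=0.05, gamma=0.9, num_steps=100_000):
    """Semi-gradient TD(0) with linear features: V_hat(s; theta) = theta . phi(s)."""
    theta = np.zeros(d)
    for _ in range(num_steps):
        s, r, s_next = sample_transition()               # assumed on-policy sampling
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * theta @ x_next - theta @ x   # TD error; target treated as fixed
        theta += alpha * delta * x                       # semi-gradient step
    return theta

def td_fixed_point(transitions, phi, gamma=0.9):
    """Sample estimates of A and b, and the TD fixed point theta* = -A^{-1} b."""
    X  = np.array([phi(s)  for s, _, _ in transitions])
    Xn = np.array([phi(sn) for _, _, sn in transitions])
    R  = np.array([r       for _, r, _ in transitions])
    A = X.T @ (gamma * Xn - X) / len(transitions)
    b = X.T @ R / len(transitions)
    return -np.linalg.solve(A, b)
```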


Β§ 10

Modern RL: Deep RL as Large-Scale SA

Deep RL uses neural networks as function approximators. From an SA perspective, this is simply replacing linear \(\theta^\top \phi(s)\) with a deep network \(f_\theta(s)\). The theoretical landscape becomes much harder, but the SA intuition remains.

DQN and Its Instability Fixes

Deep Q-Network (Mnih et al., 2015) combines Q-learning with deep neural networks. Two critical tricks stabilize the SA dynamics:

🎯 Target Networks

A separate "target" network \(\theta^-\) is used for bootstrap targets, updated slowly. This prevents the "moving target" problem β€” it stabilizes the mean field \(h\) by fixing it for a period, approximating the SA assumption of a time-invariant target.

πŸ”„ Experience Replay

Transitions are stored in a replay buffer and sampled randomly. This breaks temporal correlations in the Markov noise, approximately restoring the i.i.d. assumption needed for SA convergence. It also improves data efficiency.
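A schematic sketch of these two ingredients; `q_target(s)`, returning Q-values computed with the target parameters \(\theta^-\), is an assumed placeholder rather than any specific library API.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer; uniform sampling breaks temporal correlations in the data."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)
    def push(self, transition):                 # transition = (s, a, r, s_next, done)
        self.buf.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def dqn_targets(batch, q_target, gamma=0.99):
    """Bootstrap targets computed with the slowly updated target parameters theta^-."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(q_target(s_next)))
        targets.append(r + bootstrap)
    return np.array(targets)

# Periodically copy the online parameters into the target network (hard update),
# or track them slowly with a Polyak average (soft update).
```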

Actor-Critic as Two-Timescale SA

Actor-critic methods maintain two parameterized functions:

  • Critic: \(V_w(s)\) β€” estimates the value function (parameter \(w\))
  • Actor: \(\pi_\theta(a|s)\) β€” the policy being learned (parameter \(\theta\))
Two-Timescale SA (Borkar, 1997; Konda & Tsitsiklis, 2000)
\[ w_{t+1} = w_t + \beta_t \left[\delta_t \nabla_w V_w(S_t)\right] \quad \text{(fast timescale)} \] \[ \theta_{t+1} = \theta_t + \alpha_t \left[\delta_t \nabla_\theta \log \pi_\theta(A_t|S_t)\right] \quad \text{(slow timescale)} \] \[ \beta_t \gg \alpha_t \quad (\beta_t/\alpha_t \to \infty) \]

The key insight from two-timescale SA theory (Borkar, 1997): on the slow timescale, the fast iterate has already converged (approximately), so the slow iterate sees the converged fast value. This decouples the two updates and allows separate convergence analysis.
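A one-step actor-critic sketch with a tabular critic and a softmax actor; constant step sizes with \(\alpha \ll \beta\) stand in for the decaying two-timescale schedule above, and the `env_step` interface is assumed for illustration.

```python
import numpy as np

def actor_critic(env_step, start_state, n_states, n_actions,
                 gamma=0.99, alpha=1e-3, beta=1e-2, num_steps=200_000):
    """One-step actor-critic: fast tabular critic (beta), slow softmax actor (alpha)."""
    V = np.zeros(n_states)                        # critic
    prefs = np.zeros((n_states, n_actions))       # actor: softmax preferences
    s = start_state
    for _ in range(num_steps):
        logits = prefs[s] - prefs[s].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        a = np.random.choice(n_actions, p=probs)
        r, s_next, done = env_step(s, a)          # assumed environment interface
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # shared TD error
        V[s] += beta * delta                      # fast timescale: critic
        grad_log = -probs                         # gradient of log softmax ...
        grad_log[a] += 1.0                        # ... w.r.t. the preferences of state s
        prefs[s] += alpha * delta * grad_log      # slow timescale: actor
        s = start_state if done else s_next
    return prefs, V
```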

Modern A3C/PPO/SAC

All modern actor-critic algorithms (A3C, PPO, SAC, TD3) are variations of two-timescale SA, with sophisticated tricks to handle the deep network function approximation. The SA theoretical lens explains why they work and why they sometimes fail.


Β§ 11

Summary

We have built a complete picture of model-free RL, from intuition to deep theory. Here is the conceptual hierarchy:

πŸ—ΊοΈ Conceptual Map
The Big Picture

Model-free RL is fundamentally about approximating the Bellman operator from samples. Monte Carlo does this with full-trajectory estimates (unbiased, high variance). TD does it with one-step estimates (biased, low variance). Both are instances of stochastic approximation β€” iterative algorithms that converge almost surely under the Robbins-Monro step-size conditions, as analyzed via the ODE method.

Key Takeaways

Concept | One-line summary
Model-Free RL | Learn value/policy directly from experience; no transition model needed
Monte Carlo | Average full-episode returns; unbiased but high variance
TD Learning | Bootstrap off one-step estimates; biased but efficient and online
Q-Learning / SARSA | TD control: off-policy / on-policy Q-value estimation plus greedy improvement
SA recursion | \(\theta_{t+1} = \theta_t + \alpha_t(h(\theta_t) + \xi_t)\)
Robbins-Monro | \(\sum \alpha_t = \infty\), \(\sum \alpha_t^2 < \infty\) ensures convergence
ODE Method | The SA trajectory approximates the flow of \(\dot{\theta} = h(\theta)\)
Deadly Triad | FA + bootstrapping + off-policy together cause instability or divergence
Deep RL | Large-scale SA with nonlinear FA; target networks and replay restore stability

Further Reading

Essential References

β€’ Sutton & Barto (2018). Reinforcement Learning: An Introduction. 2nd ed.
β€’ Borkar, V.S. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint.
β€’ Tsitsiklis, J. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning.
β€’ Konda & Tsitsiklis (2000). Actor-critic algorithms. NIPS.
β€’ Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature.
β€’ Baird (1995). Residual algorithms. ICML. [Deadly triad counterexample]
