Function Approximation
in Reinforcement Learning
§1 Motivation & Intuition
The Curse of Dimensionality
Classical tabular RL — Q-tables, V-tables — represents the value function as a lookup table with one entry per state (or state-action pair). This is elegant and convergence-guaranteed, but collapses catastrophically as state spaces grow.
Imagine a robot navigating a room. If you discretize position into a 100×100 grid, that's 10,000 states — manageable. Add orientation (360 degrees), joint angles for each of 7 DOF, and object states, and you quickly reach states numbering in the trillions. You could never visit them all, let alone store a Q-value for each.
The curse of dimensionality says that the volume of a space grows exponentially with its dimension. If each feature takes \(k\) values, then \(d\) features require \(k^d\) table entries. For \(k=10, d=20\), that's \(10^{20}\) — more than the number of seconds that have elapsed since the Big Bang.
Suppose the state space is a grid over \(d\) continuous variables, each discretized into \(k\) bins. The table size is:
\[ |\mathcal{S}| = k^d \]
For \(k=100, d=10\): \(10^{20}\) entries. Even at 1 byte per entry, storage exceeds \(10^8\) terabytes. Moreover, sample complexity scales proportionally — we need \(\Omega(|\mathcal{S}|)\) samples simply to visit every state.
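The arithmetic is easy to check. A quick sketch (the function name is just illustrative):

```python
# Table size and storage for a k^d discretized state space.
def tabular_cost(k, d, bytes_per_entry=1):
    """Return (number of table entries, storage in terabytes)."""
    entries = k ** d
    terabytes = entries * bytes_per_entry / 1e12
    return entries, terabytes

entries, tb = tabular_cost(k=100, d=10)
print(entries)   # 10^20 entries
print(tb)        # 10^8 terabytes at one byte per entry
```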
Why Tabular RL Fails
Three fundamental failures of tabular RL in real-world problems: memory (the table itself is exponentially large), sample complexity (every state must be visited, repeatedly, to estimate its value), and zero generalization (learning about one state tells us nothing about its neighbors).
The Need for Generalization
The key insight is that nearby states should have similar values. A robot 1mm to the left vs 1mm to the right of a door threshold has nearly the same long-term reward potential. Function approximation exploits this smoothness by representing \(V\) or \(Q\) as a parametric function — learning generalizes across the entire state space from a subset of samples.
§2 Formal Setup
The MDP Framework
We work within a Markov Decision Process \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)\): state space, action space, transition kernel \(P(s'|s,a)\), reward function \(R(s,a)\), and discount factor \(\gamma \in [0,1)\).
The Objects We Approximate
Function approximation can target three fundamental objects:
State Value Function V(s; θ): How good is it to be in state \(s\)? This estimates the expected total discounted reward from state \(s\) under a given policy. Useful for policy evaluation and improvement.
Action-Value Function Q(s,a; θ): How good is it to take action \(a\) in state \(s\)? This allows greedy policy extraction without knowing the model.
Policy π(a|s; θ): Directly represent the policy as a parametric distribution over actions. Allows for smooth, stochastic policies which can be optimized via gradient ascent.
§3 Function Classes
Linear Function Approximation
The most theoretically tractable class. The approximated value function is linear in the feature representation of states:
\[ \hat{V}(s;\theta) = \theta^\top \phi(s) = \sum_{i=1}^d \theta_i\, \phi_i(s) \]
where \(\phi: \mathcal{S} \to \mathbb{R}^d\) is a feature map and \(\theta \in \mathbb{R}^d\) are learnable weights. The key property: the function is linear in \(\theta\), not necessarily in \(s\) — the features \(\phi(s)\) can be highly nonlinear.
Advantages: convergence guarantees with TD-learning (under on-policy sampling); a unique global minimum (convex optimization); efficient computation — only \(d\) parameters; a clear geometric interpretation as projection onto the linear function space.
Limitations: representational capacity is fundamentally limited — the true value function may not be linearly realizable; performance depends heavily on feature engineering; interaction effects cannot be modeled unless explicitly included as features.
Basis Functions
Polynomial Basis
For \(s \in \mathbb{R}^n\), polynomial features: \(\phi(s) = [1, s_1, s_2, \ldots, s_1^2, s_1 s_2, \ldots]\). Degree-\(k\) polynomials give \(\binom{n+k}{k}\) features, which grows as \(O(n^k)\) — polynomial in \(n\) for fixed degree, but quickly unmanageable as \(k\) increases.
Fourier Basis
Periodic features that capture oscillatory structure:
\[ \phi_i(s) = \cos\!\left(\pi\, c_i^\top s\right), \qquad c_i \in \{0, 1, \ldots, m\}^n \]
Radial Basis Functions (RBF)
Gaussian bumps centered at prototype states \(\{s_1, \ldots, s_d\}\):
\[ \phi_i(s) = \exp\!\left( -\frac{\|s - s_i\|^2}{2\sigma^2} \right) \]
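The three basis families above can be sketched in a few lines each (toy dimensions and hyperparameters chosen for illustration):

```python
import numpy as np

def poly_features(s):
    """Degree-2 polynomial basis for a 2-D state: [1, s1, s2, s1^2, s1*s2, s2^2]."""
    s1, s2 = s
    return np.array([1.0, s1, s2, s1**2, s1 * s2, s2**2])

def fourier_features(s, order=3):
    """Fourier basis phi_i(s) = cos(pi * c_i . s), one feature per integer
    coefficient vector c in {0..order}^d."""
    d = len(s)
    coeffs = np.array(np.meshgrid(*[range(order + 1)] * d)).T.reshape(-1, d)
    return np.cos(np.pi * coeffs @ s)

def rbf_features(s, centers, sigma=0.5):
    """Gaussian bumps phi_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * sigma**2))
```

Note how feature counts scale: the Fourier basis with `order=3` in 2-D already yields \(4^2 = 16\) features, one per coefficient vector.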
Neural Networks
Neural networks are universal function approximators — given sufficient width/depth, they can approximate any continuous function on a compact domain to arbitrary precision (Universal Approximation Theorem).
In practice, neural networks provide extraordinary expressiveness but come with optimization challenges: non-convex loss landscapes, saddle points, and — critically in RL — non-stationarity of training targets.
Unlike supervised learning, RL training targets (e.g., Bellman targets) are themselves functions of the current network weights. This creates a moving-target problem that can destabilize learning. The three elements of the Deadly Triad amplify this.
§4 Value Function Approximation
The Projection View
Let \(\mathcal{F} = \{\hat{V}(\cdot;\theta) : \theta \in \mathbb{R}^d\}\) be our function class. For linear FA, \(\mathcal{F}\) is a \(d\)-dimensional linear subspace of the space of all functions \(\mathcal{S} \to \mathbb{R}\).
Think of the space of all possible value functions as a very high-dimensional space. Our parameterized function class \(\mathcal{F}\) is a low-dimensional manifold (or subspace, for linear FA) within this space. The true value function \(V^\pi\) lives somewhere in this ambient space, often outside \(\mathcal{F}\).
The best we can do is find the point in \(\mathcal{F}\) closest to \(V^\pi\) — this is the projection \(\Pi V^\pi\). The distance between \(V^\pi\) and \(\Pi V^\pi\) is the approximation error, which we cannot reduce by training — it's the price we pay for using a restricted function class.
Define the \(\mu\)-weighted \(L^2\) norm: \(\|V\|_\mu^2 = \sum_s \mu(s) V(s)^2\), where \(\mu\) is a state distribution (often the on-policy distribution).
The projection operator \(\Pi\) onto the linear function space \(\mathcal{F}\) is:
\[ \Pi = \Phi \left( \Phi^\top D \Phi \right)^{-1} \Phi^\top D \]
where \(\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}\) is the feature matrix and \(D = \text{diag}(\mu)\). This is a weighted least-squares projection.
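To make the projection concrete, a small numeric sketch (toy features and an arbitrary distribution \(\mu\)) verifying that \(\Pi\) is idempotent and that the residual is \(D\)-orthogonal to the feature columns:

```python
import numpy as np

# Weighted least-squares projection Pi = Phi (Phi^T D Phi)^{-1} Phi^T D
# on a toy 4-state space with 2 features.
Phi = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])  # |S| = 4, d = 2
mu = np.array([0.1, 0.2, 0.3, 0.4])                       # state distribution
D = np.diag(mu)

Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

V = np.array([0.0, 1.5, 1.0, 4.0])   # an arbitrary "true" value function
V_proj = Pi @ V                       # closest point in span(Phi) under ||.||_mu
```

The residual `V - V_proj` is the irreducible approximation error: no choice of \(\theta\) can shrink it further.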
The mean squared value error is:
\[ \overline{VE}(\theta) = \|\hat{V}_\theta - V^\pi\|_\mu^2 = \sum_s \mu(s) \left( \hat{V}(s;\theta) - V^\pi(s) \right)^2 \]
(Figure: the black curve is the true value function \(V^\pi(s)\); the blue line is the best linear approximation \(\Pi V^\pi\); the red shaded region is the approximation error — the component of \(V^\pi\) orthogonal to the function class. No amount of training can reduce this error; it is the bias of the function class.)
The Bellman Operator
The Bellman operator \(\mathcal{T}^\pi\) is central to all RL algorithms. It maps value functions to value functions, and its fixed point is \(V^\pi\). For policy \(\pi\):
\[ (\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right] \]
Key property: \(\mathcal{T}^\pi\) is a \(\gamma\)-contraction under the \(L^\infty\) norm, meaning repeated application converges to \(V^\pi\):
\[ \|\mathcal{T}^\pi V - \mathcal{T}^\pi W\|_\infty \le \gamma\, \|V - W\|_\infty, \qquad (\mathcal{T}^\pi)^n V \to V^\pi \;\text{ for any } V \]
The Deadly Triad
Sutton and Barto identified three conditions whose simultaneous presence leads to instability and divergence in RL: (1) function approximation, (2) bootstrapping (updating estimates toward other estimates), and (3) off-policy training.
Any two elements together are safe. All three together can cause divergence — the approximated values grow without bound, even for simple MDPs (Baird's Counterexample). This is not a bug but a fundamental theoretical limitation.
Baird (1995) constructed a simple 7-state MDP where semi-gradient TD with linear features and a particular off-policy behavior policy diverges.
Let \(\phi(s_i) = e_i\) (standard basis), except \(\phi(s_7) = 2\mathbf{1}\). The target policy always transitions to \(s_7\), but the behavior policy transitions uniformly at random. Under these conditions the expected-update matrix \(-A\) has eigenvalues with positive real parts, causing exponential growth of \(\|\theta_t\|\).
Formally, the TD update is \(\theta_{t+1} = \theta_t + \alpha \delta_t \phi(s_t)\), where \(\delta_t\) is the TD error. The expected update in matrix form is \(\mathbb{E}[\theta_{t+1}] = (I - \alpha A)\theta_t + \alpha b\), where \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\). When \(A\) is not positive definite (which happens with off-policy \(D\)), the iteration diverges.
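The divergence mechanism can be reproduced in a few lines. A minimal sketch of a simplified two-state cousin of Baird's construction (numbers made up): \(V(s_1) = \theta\), \(V(s_2) = 2\theta\), a deterministic transition \(s_1 \to s_2\) with zero reward, and updates drawn only from \(s_1\) — an off-policy state weighting:

```python
# Updating only from s1 gives theta <- theta + alpha*(gamma*2*theta - theta),
# i.e. multiplication by 1 + alpha*(2*gamma - 1) each step: divergence for
# gamma > 0.5, even though the true value function (theta = 0) is representable.
gamma, alpha = 0.99, 0.1
theta = 1.0
history = []
for _ in range(100):
    delta = 0.0 + gamma * 2 * theta - theta   # TD error at s1 (phi(s1) = 1)
    theta += alpha * delta * 1.0              # semi-gradient update
    history.append(theta)
```

After 100 updates \(\theta\) has grown by roughly four orders of magnitude; sampling \(s_1\) and \(s_2\) in their on-policy proportions would instead contract.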
§5 Core Algorithms
TD(0) with Function Approximation
TD(0) is the simplest temporal difference method. After each transition \((s, a, r, s')\), we update our value estimate: the prediction for \(s\) should be consistent with the reward we received plus the discounted prediction for \(s'\). We're essentially doing stochastic gradient descent toward a bootstrap target.
The "temporal difference" is the error between our current prediction and this bootstrap target. It's like learning that if you know the value of tomorrow, you can update today's value estimate with just a one-step look-ahead, without needing the full episode return.
The TD target is \(y_t = r_t + \gamma \hat{V}(s_{t+1};\theta_t)\). The TD error is:
\[ \delta_t = r_t + \gamma \hat{V}(s_{t+1};\theta_t) - \hat{V}(s_t;\theta_t) \]
If we were doing true gradient descent on \(\overline{VE}(\theta)\), the gradient w.r.t. \(\theta\) would include a term from differentiating the target, which we drop. This gives the semi-gradient update:
\[ \theta_{t+1} = \theta_t + \alpha\, \delta_t\, \nabla_\theta \hat{V}(s_t; \theta_t) \]
For linear FA, \(\nabla_\theta \hat{V}(s;\theta) = \phi(s)\), so: \(\theta_{t+1} = \theta_t + \alpha \delta_t \phi(s_t)\)
TD(0) is NOT a true gradient method. The target \(r + \gamma \hat{V}(s';\theta)\) depends on \(\theta\), so the true gradient of the loss includes \(-\gamma \nabla_\theta \hat{V}(s';\theta)\). TD drops this term, treating the target as fixed. This makes the update biased but stable (under on-policy conditions).
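The linear semi-gradient update can be sketched end-to-end on a toy 3-state chain (all numbers made up for illustration; states cycle \(0 \to 1 \to 2 \to 0\) with probability 0.9, stay put with probability 0.1):

```python
import numpy as np

# Semi-gradient TD(0) with linear features on a toy Markov reward process.
rng = np.random.default_rng(0)
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])                    # doubly stochastic cycle
r = np.array([1.0, 0.0, -1.0])                     # reward on leaving each state
Phi = np.array([[1., 0.], [0., 1.], [1., 1.]])     # 3 states, 2 features
gamma, alpha = 0.9, 0.02

theta = np.zeros(2)
avg = np.zeros(2)                                  # tail-averaged iterate
s = 0
for t in range(50000):
    s2 = rng.choice(3, p=P[s])
    delta = r[s] + gamma * Phi[s2] @ theta - Phi[s] @ theta   # TD error
    theta += alpha * delta * Phi[s]                # semi-gradient TD(0) step
    if t >= 25000:
        avg += theta / 25000                       # average the last half
    s = s2
```

Tail-averaging the iterates smooths out the noise left by the constant step size; the average lands near the TD fixed point \(\theta^* = A^{-1}b\) discussed below.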
Convergence of Semi-Gradient TD
Under on-policy sampling, linear function approximation, and step-sizes satisfying the Robbins-Monro conditions (\(\sum \alpha_t = \infty\), \(\sum \alpha_t^2 < \infty\)), TD(0) converges almost surely to \(\theta^*\), where:
\[ \|\hat{V}_{\theta^*} - V^\pi\|_\mu \;\le\; \frac{1}{1-\gamma}\, \|\Pi V^\pi - V^\pi\|_\mu \]
This is a bounded approximation error result — TD finds a solution within a \(\frac{1}{1-\gamma}\) factor of the best possible in the function class.
The expected TD update direction is \(\mathbb{E}[\delta_t \phi(s_t)] = -A\theta + b\) where:
\[ A = \Phi^\top D (I - \gamma P^\pi) \Phi, \qquad b = \Phi^\top D R^\pi \]
where \(D = \text{diag}(\mu)\) with \(\mu\) the on-policy stationary distribution, and \(P^\pi\) the policy transition matrix.
Under on-policy sampling, \(\mu\) is the stationary distribution of \(P^\pi\), i.e. \(\mu^\top P^\pi = \mu^\top\). One can show this makes \(A\) positive definite (PD), which is the key condition for stability.
The ODE \(\dot{\theta} = -A\theta + b\) has a unique stable fixed point \(\theta^* = A^{-1}b\). By the ODE method (Benaïm, Borkar), stochastic approximation convergence follows. The bound follows from the projection error analysis.
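The expected-update view is easy to verify numerically. A sketch on a toy 3-state chain (made-up numbers): build \(A\) and \(b\) explicitly, check that \(A\) is positive definite, and iterate the discretized ODE \(\theta \leftarrow \theta + \alpha(b - A\theta)\):

```python
import numpy as np

# Expected TD(0) update: theta <- theta + alpha * (b - A theta), with
# A = Phi^T D (I - gamma P) Phi and b = Phi^T D r under on-policy weighting.
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])                    # doubly stochastic chain
mu = np.full(3, 1/3)                               # its stationary distribution
Phi = np.array([[1., 0.], [0., 1.], [1., 1.]])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9

D = np.diag(mu)
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ r
theta_star = np.linalg.solve(A, b)                 # the unique fixed point

theta = np.zeros(2)
for _ in range(5000):
    theta += 0.1 * (b - A @ theta)                 # discretized ODE step
```

Because the symmetric part of \(A\) is positive definite here, the iteration contracts to \(\theta^* = A^{-1}b\) regardless of initialization.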
Gradient TD Methods (GTD2, TDC)
To address off-policy divergence, Sutton et al. (2009) developed true gradient methods that converge under off-policy sampling by minimizing the projected Bellman error \(\|\Pi \mathcal{T}^\pi V - V\|_\mu^2\). Unlike \(\overline{VE}\), this objective's gradient includes the terms from differentiating the TD target, making these true gradient methods.
The insight behind GTD is to minimize the Projected Bellman Error (PBE) rather than the Mean Squared TD Error. PBE has a true gradient that can be computed efficiently. The trick is to introduce a secondary weight vector \(w\) that helps estimate the gradient direction. This doubles the number of parameters but enables principled off-policy learning.
The objective is the Projected Bellman Error (PBE):
\[ \text{PBE}(\theta) = \left\| \Pi\, \mathcal{T}^\pi \hat{V}_\theta - \hat{V}_\theta \right\|_\mu^2 \]
GTD2: maintains secondary weights \(w\) and updates:
\[ w_{t+1} = w_t + \beta_t\, \rho_t \left( \delta_t - \phi_t^\top w_t \right) \phi_t, \qquad \theta_{t+1} = \theta_t + \alpha_t\, \rho_t \left( \phi_t - \gamma \phi_{t+1} \right) \left( \phi_t^\top w_t \right) \]
where \(\rho_t = \pi(a_t|s_t)/b(a_t|s_t)\) is the importance sampling ratio for off-policy correction.
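A minimal GTD2 sketch on a two-state divergence example (\(V(s_1)=\theta\), \(V(s_2)=2\theta\), zero rewards, updates drawn only from \(s_1\), \(\rho_t = 1\) for simplicity). Plain semi-gradient TD diverges in this setting for \(\gamma > 0.5\); GTD2 converges to the true solution \(\theta = 0\):

```python
# GTD2 with scalar features: phi(s1) = 1, phi(s2) = 2, reward 0 everywhere.
gamma, alpha, beta = 0.99, 0.05, 0.05
theta, w = 1.0, 0.0
phi, phi_next = 1.0, 2.0
for _ in range(5000):
    delta = 0.0 + gamma * phi_next * theta - phi * theta   # TD error at s1
    w += beta * (delta - phi * w) * phi                    # secondary weights
    theta += alpha * (phi - gamma * phi_next) * (phi * w)  # corrected update
```

The secondary weights \(w\) track the expected TD error, and the corrected \(\theta\)-update descends the PBE, spiraling into the fixed point instead of blowing up.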
Fitted Value Iteration
An alternative: treat value estimation as a supervised regression problem. In each iteration, use the current value function to generate targets, then fit a new function.
§6 Q-Learning with Function Approximation
Semi-Gradient Q-Learning
Q-learning directly estimates the optimal Q-function \(Q^*\). With function approximation:
\[ \theta_{t+1} = \theta_t + \alpha \left[ r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta_t) - \hat{Q}(s_t, a_t; \theta_t) \right] \nabla_\theta \hat{Q}(s_t, a_t; \theta_t) \]
Deep Q-Network (DQN)
Mnih et al. (2015) made Q-learning with neural networks work in practice via two key tricks: experience replay (store transitions in a buffer and train on randomly sampled minibatches, breaking temporal correlations) and a target network (a periodically synced copy of the weights used to compute bootstrap targets, so the moving target moves slowly).
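A minimal sketch of these two mechanisms (hypothetical minimal interfaces, not the DQN paper's actual code; the network itself is abstracted to a parameter dict):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniformly sample past transitions to decorrelate training batches."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)   # old transitions fall off the end

    def push(self, transition):             # transition = (s, a, r, s2, done)
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

class TargetNetwork:
    """Keep a frozen copy of the weights for computing TD targets."""
    def __init__(self, params, sync_every=1000):
        self.online = params                # live weights, updated every step
        self.target = dict(params)          # frozen copy used for targets
        self.sync_every, self.steps = sync_every, 0

    def step(self):
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = dict(self.online) # hard sync
```

Between syncs, the bootstrap target \(r + \gamma \max_{a'} Q_{\text{target}}(s', a')\) is computed with stale weights, which damps the moving-target feedback loop described above.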
DQN Variants
| Algorithm | Key Innovation | Addresses |
|---|---|---|
| Double DQN | Separate networks for action selection and evaluation | Overestimation bias in max operator |
| Dueling DQN | Split stream for V(s) and A(s,a) = Q(s,a) - V(s) | Inefficient value estimation |
| Prioritized ER | Sample high-TD-error transitions more often | Sample efficiency |
| Rainbow | Combination of above + distributional RL + noisy nets | Multiple biases simultaneously |
| C51 / QR-DQN | Learn full return distribution, not just expectation | Risk-sensitivity, better gradients |
§7 Policy Gradient & Actor-Critic
REINFORCE
Instead of learning a value function and deriving a policy, directly optimize the policy parameters. The objective is the expected return \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)]\). We want to perform gradient ascent on \(J(\theta)\).
The trick: we can write the gradient of \(J\) as an expectation under the policy itself, which we can estimate by collecting rollouts. Actions that led to high returns get their probabilities increased; actions with low returns get decreased.
The Policy Gradient Theorem (Sutton et al., 2000):
\[ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a) \right] \]
REINFORCE uses the Monte Carlo return \(G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}\) as an unbiased estimate of \(Q^{\pi_\theta}(s_t, a_t)\):
\[ \theta_{t+1} = \theta_t + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t|s_t) \]
Adding a baseline \(b(s)\) reduces variance without introducing bias:
\[ \nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left( G_t - b(s_t) \right) \right] \]
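A minimal REINFORCE-with-baseline sketch on a 2-armed bandit (a degenerate one-step MDP; the payoffs and hyperparameters are made up). The softmax policy's score function is \(\nabla_\theta \log \pi_\theta(a) = e_a - \pi_\theta\), and a running-average baseline stands in for \(b(s)\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)              # softmax policy: pi(a) proportional to exp(theta_a)
baseline, alpha = 0.0, 0.1
means = np.array([1.0, 0.2])     # arm 0 pays more on average

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    G = means[a] + 0.1 * rng.standard_normal()   # noisy reward = the return
    grad_log = -p
    grad_log[a] += 1.0                           # score function e_a - pi
    theta += alpha * (G - baseline) * grad_log   # REINFORCE step
    baseline += 0.05 * (G - baseline)            # running-average baseline
```

After training, the policy concentrates almost all probability on the better arm; subtracting the baseline changes only the variance of each step, not its expectation.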
Actor-Critic Methods
Replace the Monte Carlo return with a bootstrapped critic estimate to reduce variance:
\[ \theta_{t+1} = \theta_t + \alpha\, \hat{A}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t) \]
where \(\hat{A}(s,a) = r + \gamma \hat{V}(s';\omega) - \hat{V}(s;\omega)\) is the advantage estimate. Two sets of parameters: \(\theta\) for the actor (policy), \(\omega\) for the critic (value function).
To avoid introducing bias from the critic, use a critic that is "compatible" with the actor: \(\hat{Q}(s,a;\omega) = \omega^\top \nabla_\theta \log \pi_\theta(a|s)\). Then the policy gradient theorem holds exactly with this critic. This motivates the Natural Policy Gradient and Fisher information matrix preconditioning.
PPO, TRPO, SAC
| Algorithm | Key Idea | Stability Mechanism |
|---|---|---|
| TRPO | Trust-region constraint on policy update size | KL divergence constraint |
| PPO | Clipped surrogate objective, simpler than TRPO | Ratio clipping: clip(ρ, 1-ε, 1+ε) |
| SAC | Max entropy RL — maximize return + policy entropy | Automatic temperature tuning |
| TD3 | Clipped double Q-learning for continuous actions | Delayed policy updates, target smoothing |
§8 Approximate Dynamic Programming
Fitted Q-Iteration (FQI)
FQI is the batch analog of Q-learning. Given a fixed dataset \(\mathcal{D}\), it iteratively applies the Bellman optimality operator and projects back onto the function class:
\[ Q_{k+1} = \arg\min_{f \in \mathcal{F}} \sum_{(s,a,r,s') \in \mathcal{D}} \left( f(s,a) - \left[ r + \gamma \max_{a'} Q_k(s', a') \right] \right)^2 \]
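A minimal FQI sketch on a made-up 2-state, 2-action deterministic MDP. With a one-hot (tabular) function class each regression step is exact, so the loop reduces to value iteration on the dataset:

```python
import numpy as np

gamma = 0.9
# model[(s, a)] = (reward, next_state); a=0 stays, a=1 switches states.
model = {(0, 0): (0.0, 0), (0, 1): (1.0, 1),
         (1, 0): (0.0, 1), (1, 1): (0.0, 0)}
dataset = [(s, a, *model[(s, a)]) for (s, a) in model]   # full coverage

Q = np.zeros((2, 2))
for k in range(200):
    targets = {}
    for (s, a, r, s2) in dataset:
        targets[(s, a)] = r + gamma * Q[s2].max()        # Bellman optimality target
    for (s, a), y in targets.items():                    # exact "regression"
        Q[s, a] = y
```

With a real (restricted) function class, the regression step is inexact, and it is exactly those per-iteration errors \(\epsilon_k\) that the bound below accumulates.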
Each FQI iteration introduces approximation error \(\epsilon_k\). These errors accumulate. Munos (2003, 2005) showed that the final policy's suboptimality is bounded by \(\frac{C \gamma}{(1-\gamma)^2} \max_k \epsilon_k\), where \(C\) is a concentrability coefficient that captures distributional mismatch.
Batch RL
Batch RL encompasses methods that learn from a fixed dataset without additional interaction. Key algorithms include:
- Fitted Q-Iteration (FQI): Regression-based Bellman iteration
- LSTD (Least-Squares TD): Direct solution to the linear TD fixed-point equation
- LSPI (Least-Squares Policy Iteration): Policy iteration using LSTD for evaluation
- NFQ (Neural Fitted Q-Iteration): FQI with neural network function approximator
LSTD: Direct Solution
For linear FA, the TD fixed point satisfies \(A\theta = b\). LSTD estimates \(A\) and \(b\) from data and solves directly:
\[ \hat{A} = \sum_t \phi_t \left( \phi_t - \gamma \phi_{t+1} \right)^\top, \qquad \hat{b} = \sum_t r_t\, \phi_t, \qquad \hat{\theta} = \hat{A}^{-1} \hat{b} \]
LSTD is data-efficient (no learning rate) but \(\mathcal{O}(d^2)\) in memory and \(\mathcal{O}(d^3)\) in computation per update. For large feature dimensions, LSTD(λ) or recursive variants are needed.
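An LSTD sketch on a toy 3-state chain (made-up dynamics and features): accumulate \(\hat{A}\) and \(\hat{b}\) along one trajectory, then do a single linear solve — no step size, no iteration over \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])                    # doubly stochastic chain
r = np.array([1.0, 0.0, -1.0])
Phi = np.array([[1., 0.], [0., 1.], [1., 1.]])
gamma = 0.9

A_hat = np.zeros((2, 2))
b_hat = np.zeros(2)
s = 0
for t in range(100000):
    s2 = rng.choice(3, p=P[s])
    A_hat += np.outer(Phi[s], Phi[s] - gamma * Phi[s2])   # accumulate A
    b_hat += r[s] * Phi[s]                                # accumulate b
    s = s2

theta = np.linalg.solve(A_hat, b_hat)   # one O(d^3) solve, no learning rate
```

The common scale factor \(1/T\) cancels in the solve, so the raw sums can be used directly; recursive (Sherman-Morrison) variants update \(\hat{A}^{-1}\) incrementally to avoid the cubic solve.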
§9 Offline RL Connection
The Offline RL Problem
Offline (or batch) RL aims to learn the best possible policy from a fixed dataset \(\mathcal{D} = \{(s, a, r, s')\}\) collected by some (possibly unknown) behavior policy \(\mu\), without further environment interaction.
Online RL can query the environment to resolve uncertainty. Offline RL cannot. If the dataset lacks coverage of certain state-action regions, the learned Q-function will have uncontrolled extrapolation error in those regions — and greedy policy extraction will exploit these errors catastrophically.
Distribution Shift
The learned policy \(\pi_\theta\) may visit state-action pairs not in the training distribution \(\mu\). Under the learned policy, the true Q-values in out-of-distribution (OOD) regions are unknown, but the model may assign arbitrarily high values there.
Think of the Q-function as a model that's only been trained on some regions of the state-action space. In those regions, it's accurate. In untrained regions, it extrapolates — and neural networks can extrapolate wildly. When we greedily extract a policy, it finds the highest-Q action, which may be in a completely unconstrained OOD region.
This is the "deadly combination" for offline RL: function approximation enables extrapolation, and greedy action selection actively seeks out the highest-valued extrapolations.
Let \(d^\pi\) be the state-action distribution under policy \(\pi\), and \(d^\mu\) under the behavior policy. The concentrability coefficient is:
\[ C^\pi_\mu = \sup_{s,a} \frac{d^\pi(s,a)}{d^\mu(s,a)} \]
The suboptimality of an offline-learned policy is bounded (Munos, 2003), schematically, by:
\[ J(\pi^*) - J(\hat{\pi}) \;\lesssim\; \frac{\gamma\, C^\pi_\mu}{(1-\gamma)^2}\, \epsilon, \]
where \(\epsilon\) is the per-iteration approximation error.
When \(d^\pi \not\ll d^\mu\) (policy explores OOD regions), \(C^\pi_\mu = \infty\) and the bound is vacuous.
Offline RL Solutions
| Approach | Algorithm | Mechanism |
|---|---|---|
| Policy Constraint | BCQ, BEAR | Constrain \(\pi\) to stay close to \(\mu\) |
| Value Pessimism | CQL, IQL | Penalize OOD Q-values during training |
| Model-Based | MOPO, MOREL | Use model uncertainty as reward penalty |
| Implicit Constraint | TD3+BC, DT | Behavioral cloning regularization |
Conservative Q-Learning (CQL)
CQL (Kumar et al., 2020) adds a regularization term to the standard Q-learning objective that pushes down Q-values for OOD actions while maintaining accuracy for in-distribution ones:
\[ \min_\theta\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_a \exp Q_\theta(s,a) \;-\; \mathbb{E}_{a \sim \hat{\mu}(\cdot|s)} \left[ Q_\theta(s,a) \right] \right] \;+\; \tfrac{1}{2}\, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_\theta(s,a) - \hat{\mathcal{B}} \bar{Q}(s,a) \right)^2 \right] \]
where \(\hat{\mu}\) is the empirical behavior policy and \(\hat{\mathcal{B}}\bar{Q}\) is the standard Bellman target.
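A numeric sketch of just the conservative penalty on a single state's Q-values (illustrative only — not the full CQL algorithm, which couples this penalty with the TD loss). Its gradient is \(\text{softmax}(Q) - e_{a_{\text{data}}}\): all actions get pushed down in proportion to their (soft) greediness, while the dataset action gets pushed back up:

```python
import numpy as np

def cql_penalty(q_row, a_data):
    """Conservative penalty at one state: logsumexp_a Q(s,a) - Q(s, a_data)."""
    m = q_row.max()
    logsumexp = m + np.log(np.exp(q_row - m).sum())
    return logsumexp - q_row[a_data]

q = np.array([1.0, 5.0, 0.5])        # an optimistic OOD spike at action 1
pen = cql_penalty(q, a_data=0)       # dataset only contains action 0

grad = np.exp(q - q.max())
grad /= grad.sum()                   # softmax(q) = d logsumexp / dq
grad[0] -= 1.0                       # minus indicator of the dataset action
q_new = q - 0.5 * grad               # one gradient-descent step on the penalty
```

One step suffices to see the intended effect: the spurious spike at action 1 shrinks while the in-distribution action's value rises.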
§10 Theory
Approximation vs Estimation Error
The total error in value function learning decomposes into two components:
\[ \underbrace{\|\hat{V}_{\hat{\theta}} - V^\pi\|_\mu}_{\text{total error}} \;\le\; \underbrace{\|\Pi V^\pi - V^\pi\|_\mu}_{\text{approximation error}} \;+\; \underbrace{\|\hat{V}_{\hat{\theta}} - \Pi V^\pi\|_\mu}_{\text{estimation error}} \]
The first term is the bias of the function class; the second comes from finite data and imperfect optimization.
Bellman Error Minimization
The Bellman residual or Bellman error at a function \(f\) is:
\[ \text{BE}(f) = \left\| \mathcal{T}^\pi f - f \right\|_\mu^2 \]
Minimizing BE directly is appealing but faces the double sampling problem: the gradient of BE requires two independent samples from the same state, which is typically unavailable in sequential data.
The gradient of BE w.r.t. \(\theta\) is:
\[ \nabla_\theta \text{BE}(\theta) = 2\, \mathbb{E}_{s \sim \mu} \left[ \left( \mathcal{T}^\pi \hat{V}(s;\theta) - \hat{V}(s;\theta) \right) \left( \nabla_\theta \mathcal{T}^\pi \hat{V}(s;\theta) - \nabla_\theta \hat{V}(s;\theta) \right) \right] \]
Expanding: \(\nabla_\theta \mathcal{T}^\pi \hat{V}(s;\theta) = \gamma \mathbb{E}_{s'}[\nabla_\theta \hat{V}(s';\theta)]\). Estimating this requires a separate \(s'\) sample from the same \(s\), independent of the one used to estimate \(\mathcal{T}^\pi \hat{V}\). In online RL, we observe only one transition from each state, causing a biased estimator. GTD methods sidestep this via the PBE objective.
Stability: Positive Definiteness of A
For linear TD, the key stability condition is that the matrix \(A = \Phi^\top D(I - \gamma P^\pi)\Phi\) be positive definite. This holds when:
- On-policy: \(D\) is built from the stationary distribution of \(\pi\) (\(\mu^\top P^\pi = \mu^\top\)), which makes \(A\) PD.
- Off-policy: No guarantee. \(A\) can have negative eigenvalues, causing divergence (Baird's example).
Concentrability Coefficients
Munos (2003, 2005) formalized how the mismatch between the training distribution \(\mu\) and the target distribution \(\nu\) affects error propagation in ADP. The \(m\)-step concentrability coefficient:
\[ c(m) = \sup_{\pi_1, \ldots, \pi_m} \left\| \frac{d\left( \nu P^{\pi_1} \cdots P^{\pi_m} \right)}{d\mu} \right\|_\infty \]
measures how quickly the policy occupancy can grow relative to the data distribution. When data lacks coverage, \(c(m)\) is large, and approximation errors compound over policy iterations. Schematically, the resulting bound takes the form
\[ \|Q^* - Q^{\hat{\pi}_K}\|_{1,\nu} \;\lesssim\; C\, \max_k \epsilon_k \;+\; O(\gamma^K), \]
where \(C = \sum_{m=1}^\infty m \gamma^m c(m)\) accumulates all concentrability over the horizon, \(\epsilon_k\) is the regression error per iteration, and \(\hat{\pi}_K\) is the greedy policy w.r.t. \(Q^K\).
Neural Tangent Kernel (NTK) Perspective
For overparameterized neural networks, training dynamics in the lazy training regime are governed by the Neural Tangent Kernel \(K(s, s') = \nabla_\theta f_\theta(s)^\top \nabla_\theta f_\theta(s')\). This kernel remains approximately constant during training, effectively reducing to linear FA in feature space defined by the network Jacobian.
§11 Modern Advances
Deep RL Scaling
Modern deep RL systems (AlphaZero, MuZero, Gato, RT-2) show that scale matters enormously. Larger networks, more diverse training environments, and longer training lead to emergent capabilities not seen in smaller systems. Key insights:
- Transformer-based policies (Decision Transformer, Gato) treat RL as sequence modeling
- World models (DreamerV3) enable efficient learning from raw pixels via imagined rollouts
- Foundation model pre-training for RL initialization
Representation Learning in RL
The quality of the state representation \(\phi(s)\) fundamentally determines the difficulty of function approximation. Modern approaches to learning representations:
- Auxiliary prediction tasks (reward, next-state, or pixel-control objectives)
- Contrastive and self-predictive objectives (e.g., CURL, SPR)
- Bisimulation metrics that group states with equivalent reward and transition structure
- Reusing pre-trained visual encoders as frozen feature maps
Offline RL Progress
Offline RL has seen dramatic progress since 2020, enabling learning from large pre-collected datasets:
- IQL (Implicit Q-Learning): Avoids evaluating OOD actions entirely via expectile regression
- Decision Transformer: Reframes offline RL as conditional sequence modeling
- Offline-to-Online: Fine-tuning offline-learned policies with online interaction
- Dataset scaling: D4RL, RoboMimic, Bridge Data — larger offline datasets enable better policies
Implicit Regularization in Neural FA
Recent work shows that gradient descent on neural networks in RL has implicit regularization effects. Overparameterized networks may converge to minimum-norm solutions, and the inductive biases of specific architectures (CNNs for spatial tasks, Transformers for sequential tasks) can be understood through the lens of function approximation.
§12 Historical Timeline
Summary
Function approximation is the bridge between tabular RL theory and real-world applicability. The core tension: richer function classes reduce approximation error but increase estimation error and destabilize training. The Deadly Triad (FA + bootstrapping + off-policy) is the central challenge. Modern deep RL navigates this through architectural choices (target networks, replay), algorithmic design (CQL, PPO), and implicit regularization. Theory increasingly lags practice, especially for neural FA, making this an active and exciting research frontier.
References
- [1] Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.)
- [2] Tsitsiklis & Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation
- [3] Munos, Error Bounds for Approximate Policy Iteration
- [4] Sutton et al., Fast Gradient-Descent Methods for Temporal-Difference Learning
- [5] Mnih et al., Human-Level Control through Deep Reinforcement Learning
- [6] Agarwal, Jiang, Kakade & Sun, Reinforcement Learning: Theory and Algorithms
- [7] Kumar et al., Conservative Q-Learning for Offline Reinforcement Learning
- [8] Levine et al., Offline Reinforcement Learning: Tutorial, Review, and Perspectives
- [9] Sutton et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation
- [10] Schrittwieser et al., Mastering Atari, Go, Chess and Shogi without Rules (MuZero)