From policy gradients and variance reduction to two-timescale stochastic approximation — a rigorous, intuitive treatment for the RL researcher.
1. Motivation & Intuition
How does a human chess player improve? Not by enumerating all possible games — but by playing, getting feedback, and gradually refining both how they decide (strategy/policy) and how they evaluate positions (value judgment/critic).
Pure value-based methods (Q-learning) work well for discrete actions but struggle in continuous action spaces. Pure policy-search methods (REINFORCE) are unbiased but suffer from extremely high variance. Actor–Critic methods combine both, getting the best of both worlds.
The key idea: maintain two learnable objects — a policy (the actor) that picks actions, and a value function (the critic) that evaluates them. The critic tells the actor how much better or worse its action was than expected, enabling lower-variance gradient estimates.
The RL objective is to maximize the expected discounted return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big]$$
Two classes of methods: value-based (learn $Q^*$ then extract $\pi^*$ greedily) and policy-based (directly optimize $J(\theta)$ via gradient ascent). Actor–critic occupies the intersection — direct policy optimization guided by a learned value function.
2. Policy Gradient Basics
The core idea of policy gradients: if an action led to high reward, increase its probability; if it led to low reward, decrease it. We want to compute $\nabla_\theta J(\theta)$ — the direction in parameter space that increases expected return.
The magic of the log-derivative trick, $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)$ (divide and multiply by $\pi_\theta$, turning the derivative of an integral into an expectation): it lets us write the gradient as an expectation over trajectories — something we can sample.
The Policy Gradient Theorem (Sutton et al., 2000):
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log\pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\big]$$
where $d^{\pi_\theta}(s)$ is the (discounted) stationary state distribution under $\pi_\theta$. The gradient is an expectation, so we can estimate it by sampling trajectories.
Monte Carlo estimate of the policy gradient (REINFORCE) — use the full-trajectory return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ as the weight:
$$\hat{\nabla}_\theta J(\theta) = \sum_{t \ge 0} G_t\, \nabla_\theta \log\pi_\theta(A_t|S_t)$$
Unbiased but very high variance. This is the starting point that motivates actor–critic.
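To make the estimator concrete, here is a minimal sketch for a tabular softmax policy over discrete actions; `reinforce_gradient`, its signature, and the `(s, a, r)` episode format are illustrative choices, not from any particular library.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(theta, episode, gamma=0.99):
    """Monte Carlo (REINFORCE) gradient estimate for a tabular softmax policy.

    theta: (n_states, n_actions) array of logits; episode: list of (s, a, r) tuples.
    The weight for each step is the full discounted return G_t, with no baseline.
    """
    grad = np.zeros_like(theta)
    rewards = np.array([r for (_, _, r) in episode], dtype=float)
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        G_t = np.sum(gamma ** np.arange(T - t) * rewards[t:])  # G_t = sum_k gamma^k r_{t+k}
        pi = softmax(theta[s])
        score = -pi                      # grad_{theta[s]} log pi(a|s) = one_hot(a) - pi
        score[a] += 1.0
        grad[s] += G_t * score           # accumulate G_t * score function
    return grad
```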
3. The Variance Problem in REINFORCE
Imagine training an agent on a game. If the agent wins, all actions in that game get a credit boost — even the bad ones that happened to precede good luck. The return $G_t$ lumps together hundreds of random events, creating extremely noisy gradient estimates.
High variance means: (1) we need exponentially more samples to get reliable estimates, (2) gradient steps can harm performance, (3) learning collapses in long-horizon or stochastic environments. This is the fundamental weakness of pure REINFORCE.
The return $G_t$ is a sum of many random rewards — its variance grows with the effective horizon $1/(1-\gamma)$, on the order of $\sigma_r^2 / (1-\gamma)^2$ in the worst case. For $\gamma = 0.99$ and long horizons, this is enormous.
The variance of the REINFORCE gradient estimator $\hat{g} = \sum_t G_t \nabla_\theta \log\pi_\theta(A_t|S_t)$:
$$\mathrm{Var}\big[\hat{g}\big] = \mathbb{E}\big[\|\hat{g}\|^2\big] - \|\nabla_\theta J(\theta)\|^2$$
The first term grows with $\mathrm{Var}[G_t]$: since $|G_t| \le r_{\max}/(1-\gamma)$ for a discounted return, it can be as large as $r_{\max}^2/(1-\gamma)^2$. As $\gamma \to 1$ (long effective horizon), this variance explodes.
Sample efficiency: to achieve an $\varepsilon$-accurate gradient estimate, REINFORCE requires $O(1/\varepsilon^2)$ samples; the rate is standard for Monte Carlo, but the hidden constant scales with the variance above and is enormous in practice.
4. Baselines and Advantage Functions
The key insight: we don't need the absolute return — we need to know if an action was better or worse than average. Subtract a baseline $b(s)$ from the return. If $b(s)$ approximates the expected return in state $s$, then actions better than average get a positive signal and worse-than-average get negative — much lower variance.
The optimal baseline (minimizing variance) is approximately $b(s) \approx V^{\pi_\theta}(s)$. This gives us the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$, which measures how much better action $a$ is than the average action in state $s$ under policy $\pi$. It is zero-centered, which dramatically reduces variance:
$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log\pi_\theta(a|s)\, A^{\pi_\theta}(s,a)\big]$$
For any baseline $b(s)$ that does not depend on $a$:
$$\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[\nabla_\theta \log\pi_\theta(a|s)\, b(s)\big] = b(s)\sum_a \nabla_\theta \pi_\theta(a|s) = b(s)\,\nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\,\nabla_\theta 1 = 0$$
Baselines do not introduce bias! So we can subtract any state-dependent $b(s)$ without affecting the expected gradient.
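A quick numerical check of this identity, assuming a three-action softmax policy with arbitrary logits: the expected score is the zero vector, so multiplying it by any $b(s)$ contributes nothing.

```python
import numpy as np

# E_{a~pi}[ grad log pi(a|s) ] = sum_a pi(a) (e_a - pi) = pi - pi = 0,
# so any state-dependent baseline b(s) multiplies a zero-expectation term.
logits = np.array([0.3, -1.2, 0.7])
pi = np.exp(logits - logits.max()); pi /= pi.sum()
scores = np.eye(3) - pi                 # row a: grad_logits log pi(a|s)
print(pi @ scores)                      # ~[0. 0. 0.]: the baseline term vanishes
```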
This is unbiased (same as before) but lower variance. The advantage is zero-centered — it only amplifies policy changes when actions deviate from average.
The problem: we don't know $V^{\pi_\theta}(s)$ exactly. We need to learn it. And the natural way to learn a value function is TD learning. This necessity gives birth to actor–critic methods.
5. The Birth of Actor–Critic
The logical progression: REINFORCE has high variance → we need a baseline → the best baseline is $V^\pi$ → we need to learn $V^\pi$ → a separate learning component is needed → this component is the critic.
The actor: the policy $\pi_\theta(a|s)$, parameterized by $\theta$. It selects actions, is updated slowly via the policy gradient, aims to maximize expected return, and uses the critic's feedback.
The critic: the value function $V_w(s)$ (or $Q_w(s,a)$), parameterized by $w$. It evaluates states (or state–action pairs), is updated faster via TD learning, and provides the baseline or advantage signal.
Instead of waiting until the end of an episode (Monte Carlo), the critic can give immediate feedback after each step via bootstrapping. This dramatically reduces variance at the cost of some bias — a favorable tradeoff in practice.
Architectural variants of actor–critic differ in what the critic learns:
| Critic Type | Learns | Advantage Estimate |
|---|---|---|
| State-value critic | $V_w(s)$ | $\delta_t = r_t + \gamma V_w(S_{t+1}) - V_w(S_t)$ (TD error) |
| Action-value critic | $Q_w(s,a)$ | $Q_w(s,a) - V_w(s)$ or $Q_w(s,a) - \mathbb{E}_{a'}Q_w(s,a')$ |
| Advantage critic | $A_w(s,a)$ directly | $A_w(s,a)$ (dueling architecture) |
6. Basic Actor–Critic Algorithm
At each timestep: (1) The actor picks an action using current policy. (2) Environment returns reward and next state. (3) Critic computes the TD error — how wrong was its value estimate? (4) Critic updates toward the correct value. (5) Actor updates using the TD error as a quality signal — positive TD error means "this was better than expected, do it more."
The TD error $\delta_t = r_t + \gamma V_w(S_{t+1}) - V_w(S_t)$ is an unbiased sample of the advantage $A^\pi(S_t, A_t)$ when the critic is correct — this is why it works as a gradient weight for the actor.
For t = 0, 1, 2, ... do
// Actor samples action
Aₜ ~ πθ(·|Sₜ)
// Environment step
Observe Rₜ, Sₜ₊₁
// Critic: compute TD error
δₜ ← Rₜ + γ · Vw(Sₜ₊₁) - Vw(Sₜ)
// Critic update (faster)
w ← w + βₜ · δₜ · ∇wVw(Sₜ)
// Actor update (slower)
θ ← θ + αₜ · δₜ · ∇θlog πθ(Aₜ|Sₜ)
Sₜ ← Sₜ₊₁
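A runnable sketch of one loop iteration, assuming a tabular softmax actor and a linear critic; the helper names (`actor_critic_step`, `env_step`, `phi`) are hypothetical and chosen only to mirror the pseudocode above.

```python
import numpy as np

def actor_critic_step(theta, w, s, env_step, phi, alpha, beta, gamma=0.99,
                      rng=np.random.default_rng()):
    """One online actor-critic step: tabular softmax actor, linear critic.

    theta: (n_states, n_actions) policy logits;  w: critic weights;
    phi(s) -> feature vector;  env_step(s, a) -> (r, s_next, done).
    """
    # Actor samples an action from pi_theta(.|s)
    logits = theta[s] - theta[s].max()
    pi = np.exp(logits)
    pi /= pi.sum()
    a = rng.choice(len(pi), p=pi)

    # Environment step
    r, s_next, done = env_step(s, a)

    # Critic: TD error with linear value function V_w(s) = w^T phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    delta = r + gamma * v_next - w @ phi(s)

    # Critic update on the fast timescale (beta), actor on the slow one (alpha)
    w = w + beta * delta * phi(s)
    score = -pi
    score[a] += 1.0                      # grad_{theta[s]} log pi(a|s) = one_hot(a) - pi
    theta = theta.copy()
    theta[s] = theta[s] + alpha * delta * score
    return theta, w, s_next, done
```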
7. TD Learning as the Critic
TD(0), temporal-difference learning with a one-step bootstrap (update $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$, learned online, step by step), is the simplest critic: after each step, move the value estimate a bit toward the bootstrapped target $r + \gamma V(s')$. The TD error $\delta = r + \gamma V(s') - V(s)$ measures the "surprise" — how much better or worse the outcome was than expected.
| Method | Bias | Variance | Data needed |
|---|---|---|---|
| Monte Carlo ($G_t$) | None | Very high | Full episode |
| TD(0) | Low (from bootstrap) | Low | 1 step |
| TD($n$) | Decreases with $n$ | Increases with $n$ | $n$ steps |
| TD(λ) | Tunable | Tunable | Online |
With linear function approximation $V_w(s) = w^\top \phi(s)$, the critic's target is the Bellman equation $V^\pi(s) = \mathbb{E}[r + \gamma V^\pi(s') \mid s]$; TD(0) takes semi-gradient steps toward this bootstrapped target. It is often motivated by the Mean Squared Bellman Error (MSBE), although it does not minimize it exactly (see the fixed-point caveat below).
The semi-gradient TD(0) update:
$$w \leftarrow w + \beta_t\,\big[r + \gamma V_w(s') - V_w(s)\big]\,\nabla_w V_w(s) = w + \beta_t\,\delta_t\,\phi(s)$$
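As a sketch under the same linear-FA assumption, one TD(0) update is a few lines of numpy; `td0_update` and its signature are illustrative, not from any library.

```python
import numpy as np

def td0_update(w, phi_s, phi_next, r, beta, gamma=0.99, terminal=False):
    """Semi-gradient TD(0) step with linear value function V_w(s) = w^T phi(s)."""
    v_next = 0.0 if terminal else w @ phi_next
    delta = r + gamma * v_next - w @ phi_s     # TD error (the "surprise")
    return w + beta * delta * phi_s            # grad_w V_w(s) = phi(s)
```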
Under linear FA and on-policy sampling, TD(0) converges to $w^* = A^{-1}b$ where $A = \mathbb{E}[\phi(\phi - \gamma\phi')^\top]$ and $b = \mathbb{E}[r\phi]$. This is the TD fixed point, not the true MSBE minimizer — it introduces a small bias.
8. Advantage Actor–Critic (A2C)
A2C (synchronous Advantage Actor-Critic, 2016) makes the advantage estimate explicit and clean. The actor gradient uses the advantage $\hat{A}_t$ rather than the raw TD error or return. Multiple parallel workers collect experience, stabilizing gradients.
Schulman et al. (2015) proposed a $\lambda$-weighted blend of n-step advantages:
$$\hat{A}_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$
With $\lambda = 0$ this reduces to the one-step TD error (TD(0)); with $\lambda = 1$ it becomes the Monte Carlo advantage. $\lambda$ controls the bias–variance tradeoff.
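A compact way to compute GAE for a rollout is the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$; the sketch below assumes a single non-terminating segment with a bootstrap value appended at the end (the function name is illustrative).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2015).

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T), length T+1 (bootstrap at the end).
    Returns hat{A}_t = sum_l (gamma*lam)^l delta_{t+l}, computed backwards in O(T).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        gae = delta + gamma * lam * gae                          # backward recursion
        adv[t] = gae
    return adv
```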
Add an entropy bonus to encourage exploration and prevent premature convergence:
$$\theta \leftarrow \theta + \alpha \nabla_\theta \left[\hat{A}_t \log \pi_\theta + c \cdot H(\pi_\theta)\right]$$
where $H(\pi) = -\sum_a \pi(a|s)\log\pi(a|s)$; the entropy bonus is standard in A3C, SAC, and PPO.
The critic minimizes the squared error $\big(\hat{R}_t - V_w(S_t)\big)^2$, where $\hat{R}_t = \sum_{k=0}^{T-1} \gamma^k r_{t+k} + \gamma^T V_w(S_{t+T})$ is the $T$-step (bootstrapped) return target; the actor maximizes the advantage-weighted log-probability.
9. The Stochastic Approximation View
Stochastic approximation (Robbins–Monro, 1951) provides the theoretical backbone for all iterative RL algorithms. The general form is:
$$x_{t+1} = x_t + \alpha_t\,\big[f(x_t) + \xi_t\big]$$
where $f(x) = 0$ defines the target (fixed point) and noise is zero-mean. Under appropriate conditions on step sizes $\alpha_t$ and $f$, $x_t \to x^*$.
Both the actor and critic updates are stochastic approximation recursions — noisy gradient steps. The question is: can both converge simultaneously? The answer requires separation of timescales.
Robbins–Monro conditions for convergence:
$$\sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^2 < \infty$$
Typical schedules: $\alpha_t = c/(t+1)$ satisfies both. The first condition ensures enough "movement" to reach the target; the second ensures updates shrink enough to converge (not oscillate).
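A tiny demonstration of the recursion with such a schedule, assuming the toy target $f(x) = 3 - x$ (so the fixed point is $x^* = 3$) and unit-variance Gaussian noise:

```python
import numpy as np

# Robbins-Monro in action: find the root of f(x) = 3 - x from noisy evaluations.
rng = np.random.default_rng(0)
x = 0.0
for t in range(1, 100_001):
    alpha = 1.0 / t                        # sum alpha_t = inf, sum alpha_t^2 < inf
    noisy_f = (3.0 - x) + rng.normal()     # zero-mean noise around f(x)
    x += alpha * noisy_f
print(x)                                   # converges to ~3 (the fixed point x*)
```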
Actor–critic involves two coupled stochastic approximation recursions (actor: $\theta_t$, critic: $w_t$). This coupling is the core theoretical challenge, resolved by the two-timescale framework.
10. Two-Timescale Stochastic Approximation
This section contains the rigorous theoretical backbone of actor–critic convergence — arguably the most important mathematical result in the field.
Consider two people adjusting two coupled knobs. If one person adjusts their knob much faster than the other, then from the slow person's perspective, the fast knob has already settled to its best position given the current slow-knob value. From the fast person's perspective, the slow knob barely moves.
Critic (fast): Updates rapidly with large step size $\beta_t \gg \alpha_t$. Given the current actor $\theta$, the critic converges to $w^*(\theta)$ — the best value estimate for that policy.
Actor (slow): Updates slowly. From its perspective, the critic is always converged. So the actor gradient uses an accurate (near-optimal) advantage estimate at each step.
This timescale separation means: even though both are updating simultaneously, theoretically we can analyze them as if the critic were solved first. This is what makes the coupled convergence proof tractable.
10.1 Formal Setup
The two coupled update recursions are:
$$\theta_{t+1} = \theta_t + \alpha_t\,\big[h(\theta_t, w_t) + \xi_t\big], \qquad w_{t+1} = w_t + \beta_t\,\big[g(\theta_t, w_t) + \zeta_t\big]$$
where $\xi_t, \zeta_t$ are noise terms, $h = \nabla_\theta J_w(\theta)$ (policy gradient using critic), and $g = \delta_t \nabla_w V_w$ (TD update).
The two-timescale condition is $\alpha_t / \beta_t \to 0$. Equivalently: $\beta_t \gg \alpha_t$ asymptotically (the critic step size dominates the actor step size). Both sequences must satisfy the Robbins–Monro conditions individually.
Example schedules satisfying this: $\alpha_t = 1/t$ and $\beta_t = 1/t^{2/3}$, for which $\alpha_t/\beta_t = t^{-1/3} \to 0$.
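A toy simulation of these schedules, assuming a made-up fast map $w^*(\theta) = 2\theta$ and slow objective $J(\theta) = -(\theta-1)^2$, shows the behavior the theory predicts: the fast iterate hugs $w^*(\theta_t)$ while the slow iterate climbs $J$.

```python
import numpy as np

# Toy two-timescale recursion.  The "critic" w tracks w*(theta) = 2*theta on the
# fast timescale; the "actor" theta ascends J(theta) = -(theta - 1)^2 on the slow
# timescale, but only sees theta through the critic's current estimate w.
rng = np.random.default_rng(0)
theta, w = 0.0, 0.0
for t in range(1, 200_001):
    alpha = 1.0 / t                   # slow (actor) step size
    beta = 1.0 / t ** (2 / 3)         # fast (critic) step size; alpha_t / beta_t -> 0
    w += beta * ((2.0 * theta - w) + 0.1 * rng.normal())             # fast: w -> w*(theta)
    theta += alpha * (-2.0 * (w / 2.0 - 1.0) + 0.1 * rng.normal())   # slow: noisy grad of J
print(theta, w)                        # theta -> ~1.0, w -> ~2.0 (= w*(theta*))
```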
10.2 ODE Interpretation (Borkar–Meyn Framework)
Fast Timescale ODE (Critic)
Viewing $w_t$ on its own timescale (stretch time by $1/\beta_t$), the fast iterate tracks the ODE
$$\dot{w}(t) = g\big(\theta, w(t)\big), \qquad \theta \text{ held fixed.}$$
With $\theta$ frozen (it moves infinitely slower), the critic ODE has a globally attracting equilibrium $w^*(\theta)$ for each $\theta$. The critic converges to this manifold.
Slow Timescale ODE (Actor)
On the actor's timescale, the critic appears converged. The actor tracks
$$\dot{\theta}(t) = h\big(\theta(t),\, w^*(\theta(t))\big).$$
This is the "true" policy gradient ODE, with advantages computed using the converged critic $w^*(\theta)$.
Convergence Conclusion
Under technical conditions (Lipschitz dynamics, bounded iterates or stability, ergodicity of the Markov chain), the Borkar–Meyn theorem gives
$$(\theta_t, w_t) \;\longrightarrow\; \big(\theta^*,\, w^*(\theta^*)\big) \quad \text{almost surely,}$$
where $\theta^*$ is a local maximum of $J(\theta)$ (a stationary point of the actor ODE). The actor converges to a (local) optimal policy.
Why this matters: Without two-timescale analysis, we cannot guarantee actor–critic convergence — the critic bias would compound. This framework justifies using $\beta_t \gg \alpha_t$ in practice and underpins convergence proofs in A2C, TD-actor-critic, and natural gradient methods.
11. Deterministic Policy Gradient
Stochastic policies work well for discrete actions but are inefficient in high-dimensional continuous action spaces — you'd need to sample many actions to estimate the gradient. Silver et al. (2014) showed that for deterministic policies $\mu_\theta: \mathcal{S} \to \mathcal{A}$, a simpler gradient formula exists.
Intuition: the gradient just points in the direction of action that the Q-function says is best. Instead of averaging over random actions, we follow the Q-gradient in action space.
The gradient of $J$ w.r.t. $\theta$ is the chain rule (how the action changes with $\theta$, times how the Q-value changes with the action):
$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim d^{\beta}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\big]$$
No need to average over actions. This requires off-policy training with an exploration policy $\beta \neq \mu_\theta$.
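In automatic-differentiation frameworks this chain rule comes for free: backpropagating $-\mathbb{E}[Q(s, \mu_\theta(s))]$ through the critic's action input yields the DPG estimate. A minimal PyTorch sketch with hypothetical toy networks:

```python
import torch

# Tiny illustrative networks; shapes and names are not from any specific codebase.
actor = torch.nn.Linear(3, 1)           # mu_theta(s): 3-dim state -> 1-dim action
critic = torch.nn.Linear(3 + 1, 1)      # Q_w(s, a)

states = torch.randn(32, 3)             # batch of states from the behavior policy
actions = actor(states)                 # a = mu_theta(s); keeps the graph back to theta
q_values = critic(torch.cat([states, actions], dim=1))

# Maximizing E[Q(s, mu_theta(s))]: backprop flows through the critic's action input
# into the actor, giving exactly  E[ grad_a Q(s,a) * grad_theta mu_theta(s) ].
(-q_values.mean()).backward()
print(actor.weight.grad)                # DPG gradient estimate w.r.t. the actor weights
# In a real DDPG step this gradient would be applied only to the actor's parameters.
```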
DPG is the limit of the stochastic policy gradient as the policy variance $\sigma \to 0$. For a Gaussian policy $\pi_{\theta,\sigma}(a|s) = \mathcal{N}(\mu_\theta(s), \sigma^2 I)$:
$$\lim_{\sigma \to 0} \nabla_\theta J(\pi_{\theta,\sigma}) = \nabla_\theta J(\mu_\theta)$$
So DPG is not a fundamentally different theorem — it's a special case of the PG theorem in the deterministic limit.
12. Deep Actor–Critic Methods
12.1 DDPG (Deep Deterministic Policy Gradient)
DPG + neural networks + replay buffer + target networks (Lillicrap et al., 2015). Key innovations: (1) off-policy experience replay for sample efficiency, (2) target networks — slowly-updated copies — for training stability, (3) batch normalization for feature scaling. The first two are sketched after the list below.
- Replay Buffer: $\mathcal{B}$ stores $(s,a,r,s')$; random mini-batches break correlations
- Target Networks: $\theta^-, w^-$ — updated as $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$ ($\tau \ll 1$)
- Ornstein-Uhlenbeck noise: Temporally correlated exploration noise for continuous actions
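Minimal sketches of the first two ingredients: a uniform replay buffer and the Polyak ("soft") target update. The names and hyperparameters are illustrative, not DDPG's exact implementation.

```python
import random
from collections import deque

import torch

replay_buffer = deque(maxlen=100_000)            # stores (s, a, r, s', done) transitions

def sample_batch(batch_size=64):
    return random.sample(replay_buffer, batch_size)   # uniform mini-batch breaks correlations

def soft_update(target_net, online_net, tau=0.005):
    # theta^-  <-  tau * theta  +  (1 - tau) * theta^-
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```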
12.2 TD3 (Twin Delayed Deep Deterministic)
DDPG suffers from Q-value overestimation. TD3 introduces three fixes (the first and third appear in the target-computation sketch after the list):
- Clipped Double Q-learning: Maintain two critics $Q_{w_1}, Q_{w_2}$, use $\min$ for target: $y = r + \gamma \min_i Q_{w_i^-}(s', \tilde{a})$
- Delayed Policy Updates: Update actor every $d$ critic steps (typically $d=2$) — critic must first stabilize
- Target Policy Smoothing: Add noise to target actions: $\tilde{a} = \mu_{\theta^-}(s') + \text{clip}(\varepsilon, -c, c)$
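A sketch of how the first and third fixes combine when forming the critic target; the function name and shapes are illustrative, and action-bound clipping is omitted for brevity.

```python
import torch

def td3_target(critic1_targ, critic2_targ, actor_targ, rewards, next_states, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q target with target-policy smoothing (a TD3 sketch).

    All networks here are *target* copies; rewards and dones have shape (batch, 1).
    """
    with torch.no_grad():
        a_next = actor_targ(next_states)                                    # mu_{theta^-}(s')
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_tilde = a_next + noise                                            # target-policy smoothing
        sa = torch.cat([next_states, a_tilde], dim=1)
        q_min = torch.min(critic1_targ(sa), critic2_targ(sa))               # clipped double Q
        return rewards + gamma * (1.0 - dones) * q_min                      # y = r + gamma * min_i Q_i
```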
12.3 SAC (Soft Actor–Critic)
SAC (Haarnoja et al., 2018) is the current gold standard for continuous control. Its key innovation: maximize both return and policy entropy simultaneously (maximum entropy RL framework). This naturally encourages exploration and robustness.
Intuition: an agent that maximizes entropy tries to be as random as possible while still achieving high reward. This prevents premature commitment to suboptimal solutions and gives a natural exploration-exploitation balance.
SAC augments the RL objective with an entropy term:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha\, H\big(\pi(\cdot|s_t)\big)\big]$$
The temperature $\alpha$ balances reward vs. entropy. SAC uses automatic entropy tuning: $\alpha$ is optimized to maintain $H(\pi) \geq \bar{H}$ (a target entropy threshold).
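A sketch of the automatic temperature update, assuming the common parameterization via $\log\alpha$ and a fixed `target_entropy` (e.g. $-\dim(\mathcal{A})$); the names and learning rate are illustrative.

```python
import torch

# Alpha is adjusted so that the policy's entropy stays near target_entropy.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs, target_entropy):
    # log_probs: log pi(a|s) for actions freshly sampled from the current policy.
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()      # current temperature alpha
```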
| Method | Policy | Off-policy? | Key Innovation | Best for |
|---|---|---|---|---|
| DDPG | Deterministic | Yes | DPG + NN + replay | Baseline continuous control |
| TD3 | Deterministic | Yes | Clipped double Q, delay | When overestimation is critical |
| SAC | Stochastic | Yes | Max-entropy, auto-α | State-of-the-art robustness |
| PPO | Stochastic | No (nearly on-policy) | Clipped surrogate objective | Simplicity, robustness |
13. Modern Perspectives
Conservative Q-Learning (CQL), IQL: actor–critic with pessimistic value estimates from offline datasets. The critic penalizes OOD actions to prevent overestimation without online exploration.
MPO, VMPO: actor–critic with trust-region constraints. Policy improvement is a KL-constrained optimization. Related to PPO's clipped surrogate and SAC's entropy bonus.
CURL, DrQ, DreamerV3: use auxiliary losses (contrastive, reconstruction) to learn better state representations. The critic benefits most from rich feature representations.
Modern LLM alignment (RLHF, PPO from OpenAI/Anthropic) is actor–critic at scale: the language model is the actor, a reward model is the proxy for the critic signal, and PPO performs policy updates with KL constraints against the reference model. The two-timescale intuition still applies: reward model is trained first, then the policy is optimized.
14. Theoretical Insights & Challenges
14.1 The Deadly Triad
Instability and divergence occur when all three are present simultaneously:
- Function approximation: neural networks for $V$ or $Q$
- Bootstrapping: TD learning (using its own estimates as targets)
- Off-policy training: experience replay, behavior policy $\neq$ target policy
Actor–critic with replay buffers hits all three. Solutions: gradient clipping, target networks, conservative critics, double Q-learning, periodic resets.
14.2 Non-Stationarity
As the actor updates, the data distribution changes — the critic is learning a moving target. This makes off-policy corrections (importance sampling) theoretically necessary but practically unstable. Most algorithms ignore this and rely on near-on-policy experience in the replay buffer.
14.3 Convergence Guarantees (Summary)
| Algorithm | Convergence | Conditions |
|---|---|---|
| TD(0) + Linear FA | Convergent to TD fixed point | On-policy, linear FA, Robbins–Monro |
| Linear AC (2-timescale) | Convergent to local optimum | Linear critic, two-timescale, ergodic MDP |
| DDPG/TD3 (deep) | No formal guarantee | Empirically stable with tricks |
| Natural Policy Gradient | Convergent (monotone) | Exact Fisher, tabular or linear |
| PPO | Monotone improvement (approx.) | Trust region approximation |
The natural policy gradient (Amari, 1998; Kakade, 2002) preconditions the actor update with the Fisher information matrix:
$$\theta \leftarrow \theta + \alpha\, F(\theta)^{-1} \nabla_\theta J(\theta), \qquad F(\theta) = \mathbb{E}_{s,a}\big[\nabla_\theta \log\pi_\theta(a|s)\, \nabla_\theta \log\pi_\theta(a|s)^\top\big]$$
This is gradient ascent in the space of distributions (under KL geometry) rather than parameter space. The compatible function approximation critic (Sutton et al. 2000) gives: $w^* = F^{-1} \nabla J$ when $Q_w(s,a) = w^\top \nabla\log\pi_\theta(a|s)$. TRPO and PPO are practical approximations of NPG.
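A sketch of a natural-gradient step using the empirical Fisher matrix built from per-sample score vectors; the damping term and function name are illustrative choices, and practical implementations (TRPO) solve $Fx = \nabla J$ with conjugate gradients rather than forming $F$ explicitly.

```python
import numpy as np

def natural_gradient_step(theta, score_samples, vanilla_grad, alpha=0.1, damping=1e-3):
    """Precondition a policy-gradient step with the empirical Fisher matrix.

    score_samples: (N, d) array of per-sample scores grad log pi(a|s);
    vanilla_grad: ordinary gradient estimate of shape (d,).  F = E[score score^T].
    """
    d = len(theta)
    F = score_samples.T @ score_samples / len(score_samples) + damping * np.eye(d)
    return theta + alpha * np.linalg.solve(F, vanilla_grad)   # theta + alpha * F^{-1} grad J
```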
15. Summary
The actor:
- Policy $\pi_\theta$ or $\mu_\theta$
- Updated by policy gradient
- Slow timescale ($\alpha_t$)
- Uses critic's advantage signal
The critic:
- Value function $V_w$ or $Q_w$
- Updated by TD learning
- Fast timescale ($\beta_t \gg \alpha_t$)
- Reduces actor gradient variance
| Equation | Role |
|---|---|
| $\nabla_\theta J = \mathbb{E}[\nabla\log\pi_\theta \cdot Q^\pi]$ | Policy gradient theorem |
| $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ | Advantage (variance reduction) |
| $\delta_t = r_t + \gamma V_w(s') - V_w(s)$ | TD error (advantage estimate) |
| $w \mathrel{+}= \beta_t \delta_t \nabla_w V_w(s)$ | Critic update (TD) |
| $\theta \mathrel{+}= \alpha_t \delta_t \nabla_\theta \log\pi_\theta(a|s)$ | Actor update (policy gradient) |
| $\alpha_t / \beta_t \to 0$ | Two-timescale condition |
| $\nabla_\theta J(\mu) = \mathbb{E}[\nabla_\theta\mu \cdot \nabla_a Q]$ | Deterministic policy gradient |
16. Historical Timeline
References: Sutton & Barto (2018) Reinforcement Learning: An Introduction · Konda & Tsitsiklis (2003) On Actor-Critic Algorithms · Silver et al. (2014) Deterministic Policy Gradient · Haarnoja et al. (2018) Soft Actor–Critic · Borkar (2008) Stochastic Approximation: A Dynamical Systems Viewpoint · Schulman et al. (2015) High-Dimensional Continuous Control using GAE