Actor–Critic Methods in Reinforcement Learning

From policy gradients and variance reduction to two-timescale stochastic approximation — a rigorous, intuitive treatment for the RL researcher.


1. Motivation & Intuition

How does a human chess player improve? Not by enumerating all possible games — but by playing, getting feedback, and gradually refining both how they decide (strategy/policy) and how they evaluate positions (value judgment/critic).

💡 The Central Tension

Pure value-based methods (Q-learning) work for discrete actions but struggle with continuous spaces. Pure policy-search methods (REINFORCE) are unbiased but have extremely high variance. Actor–Critic methods combine both, getting the best of each world.

The key idea: maintain two learnable objects — a policy (the actor) that picks actions, and a value function (the critic) that evaluates them. The critic tells the actor how much better or worse its action was than expected, enabling lower-variance gradient estimates.

The RL objective is to maximize expected discounted return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^\infty \gamma^t r_t\right] = \mathbb{E}_{s_0 \sim \rho_0}\!\left[V^{\pi_\theta}(s_0)\right]$$

Two classes of methods: value-based (learn $Q^*$ then extract $\pi^*$ greedily) and policy-based (directly optimize $J(\theta)$ via gradient ascent). Actor–critic occupies the intersection — direct policy optimization guided by a learned value function.

2. Policy Gradient Basics

The core idea of policy gradients: if an action led to high reward, increase its probability; if it led to low reward, decrease it. We want to compute $\nabla_\theta J(\theta)$ — the direction in parameter space that increases expected return.

The magic of the log-derivative trick: $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s)$ (multiply and divide by $\pi_\theta$). It turns the derivative of an integral into an expectation over trajectories, something we can sample.

The Policy Gradient Theorem (Sutton et al., 2000):

Policy Gradient Theorem
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, a \sim \pi_\theta(\cdot|s)}\!\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$

where $d^{\pi_\theta}(s)$ is the (discounted) stationary state distribution under $\pi_\theta$. The gradient is an expectation, so we can estimate it by sampling trajectories.

⚙️ REINFORCE (Williams, 1992)

Monte Carlo estimate of PG — use full-trajectory returns $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}$ as the weight:

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(A_t|S_t) \cdot G_t$$

Unbiased but very high variance. This is the starting point that motivates actor–critic.
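A minimal numerical sketch of this estimator (the two-armed bandit, its reward values, and the sample size are invented for illustration): the single-sample quantity $\nabla_\theta \log \pi_\theta(a) \cdot G$ matches the exact gradient in expectation, but its variance dwarfs the squared gradient norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: arm 1 pays mean reward 1, arm 0 pays 0,
# both with unit Gaussian noise.  Uniform softmax policy (theta = 0).
theta = np.array([0.0, 0.0])
mean_reward = np.array([0.0, 1.0])
p = np.exp(theta) / np.exp(theta).sum()

def reinforce_sample():
    a = rng.choice(2, p=p)
    G = mean_reward[a] + rng.normal(0, 1.0)   # noisy one-step "return"
    grad_log_pi = -p.copy()
    grad_log_pi[a] += 1.0                     # softmax score: e_a - p
    return grad_log_pi * G

# Exact gradient: sum_a p(a) * grad log pi(a) * E[G | a]
exact = sum(p[a] * (np.eye(2)[a] - p) * mean_reward[a] for a in range(2))

samples = np.array([reinforce_sample() for _ in range(100_000)])
assert np.allclose(samples.mean(axis=0), exact, atol=0.02)   # unbiased
assert samples.var(axis=0).sum() > (exact ** 2).sum()        # but noisy
```

Even on this one-step toy problem, the total estimator variance exceeds the squared gradient norm several times over; with long horizons the gap grows dramatically.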

3. The Variance Problem in REINFORCE

Imagine training an agent on a game. If the agent wins, all actions in that game get a credit boost — even the bad ones that happened to precede good luck. The return $G_t$ lumps together hundreds of random events, creating extremely noisy gradient estimates.

🎲 Why Variance is Fatal

High variance means: (1) we need exponentially more samples to get reliable estimates, (2) gradient steps can harm performance, (3) learning collapses in long-horizon or stochastic environments. This is the fundamental weakness of pure REINFORCE.

The return $G_t$ is the sum of many random rewards — its variance grows roughly as $O(T \cdot \sigma_r^2 / (1-\gamma)^2)$. For $\gamma = 0.99$ and long horizons, this is enormous.

The variance of the REINFORCE gradient estimator:

$$\mathrm{Var}\!\left[\nabla_\theta \log \pi_\theta(A_t|S_t) \cdot G_t\right] = \mathbb{E}[\|\nabla_\theta \log \pi_\theta\|^2 \cdot G_t^2] - \|\nabla_\theta J\|^2$$

The first term grows with $\mathrm{Var}[G_t]$, which for a T-step discounted return is bounded by $\frac{r_{\max}^2 \cdot T}{(1-\gamma)^2}$. For large $T$ (or $\gamma \to 1$), this variance explodes.

⚠️

Sample efficiency: to achieve an $\varepsilon$-accurate gradient estimate, REINFORCE requires $O(1/\varepsilon^2)$ samples, the standard Monte Carlo rate; the problem is that the hidden constant is proportional to the estimator's variance, which is enormous for long horizons.

4. Baselines and Advantage Functions

The key insight: we don't need the absolute return — we need to know if an action was better or worse than average. Subtract a baseline $b(s)$ from the return. If $b(s)$ approximates the expected return in state $s$, then actions better than average get a positive signal and worse-than-average get negative — much lower variance.

The optimal baseline (minimizing variance) is approximately $b(s) \approx V^{\pi_\theta}(s)$. This gives us the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$: it measures how much better action $a$ is than the average action in state $s$ under $\pi$, and it is zero-centered, which dramatically reduces variance.

For any baseline $b(s)$ that does not depend on $a$:

$$\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = b(s) \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0$$

Baselines do not introduce bias! So we can subtract any state-dependent $b(s)$ without affecting the expected gradient.

Advantage Function
$$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$$
✅ Policy Gradient with Baseline
$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(A_t|S_t) \cdot A^{\pi_\theta}(S_t, A_t)\right]$$

This is unbiased (same as before) but lower variance. The advantage is zero-centered — it only amplifies policy changes when actions deviate from average.
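A small Monte Carlo check of both claims (the three-action state, its Q-values, and the softmax logits are hypothetical): subtracting $b(s) = V(s)$ leaves the estimated gradient unchanged while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# One state, 3 actions, softmax policy; Q-values are made up.
logits = np.array([0.5, 0.0, -0.5])
p = np.exp(logits - logits.max()); p /= p.sum()
Q = np.array([1.0, 3.0, 2.0])
V = p @ Q                                   # baseline b(s) = V(s)

def grad_log_pi(a):
    g = -p.copy()
    g[a] += 1.0                             # softmax score function
    return g

def estimate(baseline, n=100_000):
    samples = np.array([grad_log_pi(a) * (Q[a] - baseline)
                        for a in rng.choice(3, size=n, p=p)])
    return samples.mean(axis=0), samples.var(axis=0).sum()

mean_nob, var_nob = estimate(0.0)           # no baseline
mean_b,   var_b   = estimate(V)             # baseline V(s)

assert np.allclose(mean_nob, mean_b, atol=0.05)   # same expected gradient
assert var_b < var_nob                            # lower variance
```

The variance reduction here comes entirely from centering $Q$ around $V$; nothing about the expectation changes, exactly as the identity above guarantees.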

The problem: we don't know $V^{\pi_\theta}(s)$ exactly. We need to learn it. And the natural way to learn a value function is TD learning. This necessity gives birth to actor–critic methods.

5. The Birth of Actor–Critic

The logical progression: REINFORCE has high variance → we need a baseline → the best baseline is $V^\pi$ → we need to learn $V^\pi$ → a separate learning component is needed → this component is the critic.

🎭 Actor

The policy $\pi_\theta(a|s)$. Parameterized by $\theta$. Selects actions. Updated slowly via policy gradient. Aims to maximize expected return. Uses critic's feedback.

🧑‍⚖️ Critic

The value function $V_w(s)$ (or $Q_w(s,a)$). Parameterized by $w$. Evaluates states (or state-actions). Updated faster via TD. Provides the baseline or advantage signal.

🔗 Why This Works Better

Instead of waiting until the end of an episode (Monte Carlo), the critic can give immediate feedback after each step via bootstrapping. This dramatically reduces variance at the cost of some bias — a favorable tradeoff in practice.

Architectural variants of actor–critic differ in what the critic learns:

| Critic Type | Learns | Advantage Estimate |
| --- | --- | --- |
| State-value critic | $V_w(s)$ | $\delta_t = r_t + \gamma V_w(S_{t+1}) - V_w(S_t)$ (TD error) |
| Action-value critic | $Q_w(s,a)$ | $Q_w(s,a) - V_w(s)$ or $Q_w(s,a) - \mathbb{E}_{a'}Q_w(s,a')$ |
| Advantage critic | $A_w(s,a)$ directly | $A_w(s,a)$ (dueling architecture) |

6. Basic Actor–Critic Algorithm

At each timestep: (1) The actor picks an action using current policy. (2) Environment returns reward and next state. (3) Critic computes the TD error — how wrong was its value estimate? (4) Critic updates toward the correct value. (5) Actor updates using the TD error as a quality signal — positive TD error means "this was better than expected, do it more."

ℹ️

The TD error $\delta_t = r_t + \gamma V_w(S_{t+1}) - V_w(S_t)$ is an unbiased sample of the advantage $A^\pi(S_t, A_t)$ when the critic is correct — this is why it works as a gradient weight for the actor.

Algorithm 1: One-Step Actor–Critic
Initialize: policy params θ, value params w, state S₀
For t = 0, 1, 2, ... do
  // Actor samples action
  Aₜ ~ πθ(·|Sₜ)
  // Environment step
  Observe Rₜ, Sₜ₊₁
  // Critic: compute TD error
  δₜ ← Rₜ + γ·Vw(Sₜ₊₁) − Vw(Sₜ)
  // Critic update (faster)
  w ← w + βₜ·δₜ·∇w Vw(Sₜ)
  // Actor update (slower)
  θ ← θ + αₜ·δₜ·∇θ log πθ(Aₜ|Sₜ)
  Sₜ ← Sₜ₊₁
Actor–Critic Updates
$$\delta_t = R_t + \gamma V_w(S_{t+1}) - V_w(S_t) \quad \text{(TD error / advantage estimate)}$$ $$w_{t+1} = w_t + \beta_t \cdot \delta_t \cdot \nabla_w V_w(S_t) \quad \text{(critic update)}$$ $$\theta_{t+1} = \theta_t + \alpha_t \cdot \delta_t \cdot \nabla_\theta \log \pi_\theta(A_t|S_t) \quad \text{(actor update)}$$
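Algorithm 1 can be sketched in a few lines of tabular code. The two-state MDP below is invented (action $a$ moves to state $a$; entering state 1 pays reward 1), and the occasional random restart is an exploration convenience added for the sketch, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

gamma, alpha, beta = 0.9, 0.05, 0.2   # actor step << critic step
theta = np.zeros((2, 2))              # per-state action logits (actor)
V = np.zeros(2)                       # tabular critic

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for _ in range(50_000):
    p = softmax(theta[s])
    a = rng.choice(2, p=p)                    # actor samples action
    s_next, r = a, float(a == 1)              # environment step
    delta = r + gamma * V[s_next] - V[s]      # TD error
    V[s] += beta * delta                      # critic update (faster)
    grad_log_pi = -p.copy()
    grad_log_pi[a] += 1.0
    theta[s] += alpha * delta * grad_log_pi   # actor update (slower)
    # occasional restarts (added for the sketch) keep both states visited
    s = rng.integers(2) if rng.random() < 0.05 else s_next

# Both states should come to prefer action 1, with V[1] near 1/(1-gamma) = 10.
```

Positive TD errors push up the log-probability of the action just taken; negative ones push it down, exactly the "better/worse than expected" signal described above.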

7. TD Learning as the Critic

TD(0), temporal-difference learning with a one-step bootstrapped target (update rule $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$, applied online, step by step), is the simplest critic: after each step, update the value estimate by moving it a bit toward the bootstrapped target $r + \gamma V(s')$. The TD error $\delta = r + \gamma V(s') - V(s)$ measures the "surprise": how much better or worse the outcome was than expected.

⚖️ Bias–Variance Tradeoff in Critics
| Method | Bias | Variance | Data needed |
| --- | --- | --- | --- |
| Monte Carlo ($G_t$) | None | Very high | Full episode |
| TD(0) | Some (from bootstrapping) | Low | 1 step |
| n-step TD | Decreasing in $n$ | Increasing in $n$ | $n$ steps |
| TD($\lambda$) | Tunable | Tunable | Online |

A natural objective for the critic with linear function approximation $V_w(s) = w^\top \phi(s)$ is the Mean Squared Bellman Error (MSBE):

$$\text{MSBE}(w) = \mathbb{E}_{s \sim d^\pi}\!\left[\left(r + \gamma V_w(s') - V_w(s)\right)^2\right]$$

The semi-gradient TD(0) update:

$$w \leftarrow w + \beta \cdot \delta_t \cdot \phi(S_t), \quad \delta_t = R_t + \gamma w^\top \phi(S_{t+1}) - w^\top \phi(S_t)$$

Under linear FA and on-policy sampling, TD(0) converges to $w^* = A^{-1}b$ where $A = \mathbb{E}[\phi(\phi - \gamma\phi')^\top]$ and $b = \mathbb{E}[r\phi]$. This is the TD fixed point, not the true MSBE minimizer — it introduces a small bias.
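As a check on this fixed-point formula, here is a small sketch (the three-state cycle, its rewards, and the 2-d features are invented; the cycle's stationary distribution is uniform, so the expectations become plain averages): semi-gradient TD(0) with Robbins–Monro step sizes lands on $w^* = A^{-1}b$.

```python
import numpy as np

gamma = 0.9
# Hypothetical deterministic cycle 0 -> 1 -> 2 -> 0
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # features per state
r = np.array([0.0, 1.0, 2.0])                          # reward on leaving s
nxt = np.array([1, 2, 0])

# Closed-form TD fixed point: A = E[phi (phi - gamma phi')^T], b = E[r phi]
A = np.mean([np.outer(phi[s], phi[s] - gamma * phi[nxt[s]])
             for s in range(3)], axis=0)
b = np.mean([r[s] * phi[s] for s in range(3)], axis=0)
w_star = np.linalg.solve(A, b)

# Semi-gradient TD(0) along the chain with decaying steps
w, s = np.zeros(2), 0
for t in range(30_000):
    beta = 0.5 / (t + 1) ** 0.6
    delta = r[s] + gamma * phi[nxt[s]] @ w - phi[s] @ w
    w += beta * delta * phi[s]
    s = nxt[s]

assert np.allclose(w, w_star, atol=0.05)
```

Note that the individual TD errors at $w^*$ are not zero; only the average update $\mathbb{E}[\delta \phi]$ vanishes, which is precisely what "TD fixed point, not MSBE minimizer" means.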


8. Advantage Actor–Critic (A2C)

A2C (synchronous Advantage Actor-Critic, 2016) makes the advantage estimate explicit and clean. The actor gradient uses the advantage $\hat{A}_t$ rather than the raw TD error or return. Multiple parallel workers collect experience, stabilizing gradients.

Generalized Advantage Estimation (GAE)

Schulman et al. (2015) proposed a $\lambda$-weighted blend of n-step advantages:

$$\hat{A}_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

$\lambda=0$: TD(0), $\lambda=1$: Monte Carlo. $\lambda$ controls bias–variance.
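The GAE sum is computed in practice by a backward recursion over the per-step TD errors. A self-contained sketch for one finite trajectory (the reward and value arrays are placeholders; `values` carries one extra entry, the bootstrap value of the final state):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD errors
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):                # backward recursion
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.3, 0.2])

# lambda = 0 reduces to the one-step TD error delta_t
assert np.allclose(gae(rewards, values, lam=0.0),
                   rewards + 0.99 * values[1:] - values[:-1])
```

At $\lambda = 1$ the sum telescopes to the discounted return minus the baseline, recovering the Monte Carlo advantage, consistent with the formula above.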

Entropy Regularization

Add an entropy bonus to encourage exploration and prevent premature convergence:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \left[\hat{A}_t \log \pi_\theta + c \cdot H(\pi_\theta)\right]$$

$H(\pi) = -\sum_a \pi(a|s)\log\pi(a|s)$. Standard in A3C, SAC, PPO.

A2C Policy Update
$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(A_t^{(i)}|S_t^{(i)}) \cdot \hat{A}_t^{(i)}$$
Critic Update (MSE)
$$w \leftarrow w - \beta \cdot \nabla_w \frac{1}{N}\sum_i \left(V_w(S_t^{(i)}) - \hat{R}_t^{(i)}\right)^2$$

where $\hat{R}_t = \sum_{k=0}^{T-1} \gamma^k r_{t+k} + \gamma^T V_w(S_{t+T})$ is the n-step return target. The critic minimizes MSE toward this target; the actor maximizes advantage-weighted log-probability.

9. The Stochastic Approximation View

Stochastic approximation (Robbins–Monro, 1951) provides the theoretical backbone for all iterative RL algorithms. The general form is:

$$x_{t+1} = x_t + \alpha_t \left[f(x_t) + \text{noise}_t\right]$$

where $f(x) = 0$ defines the target (fixed point) and noise is zero-mean. Under appropriate conditions on step sizes $\alpha_t$ and $f$, $x_t \to x^*$.

Both the actor and critic updates are stochastic approximation recursions — noisy gradient steps. The question is: can both converge simultaneously? The answer requires separation of timescales.

Robbins–Monro conditions for convergence:

$$\sum_{t=0}^\infty \alpha_t = \infty, \qquad \sum_{t=0}^\infty \alpha_t^2 < \infty$$

Typical schedules: $\alpha_t = c/(t+1)$ satisfies both. The first condition ensures enough "movement" to reach the target; the second ensures updates shrink enough to converge (not oscillate).
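The recursion is easy to see in the simplest case, estimating a mean from noisy observations (the target value and noise level below are arbitrary): with $\alpha_t = 1/(t+1)$, the iterate is exactly the running sample average, so Robbins–Monro reduces to the law of large numbers.

```python
import numpy as np

rng = np.random.default_rng(4)

# Robbins-Monro: find the root of f(x) = mu - x (i.e., estimate mu)
# from noisy observations of f, with alpha_t = 1/(t+1).
mu, x = 3.0, 0.0
for t in range(100_000):
    noisy_f = (mu + rng.normal(0, 2.0)) - x    # f(x) + zero-mean noise
    x += (1.0 / (t + 1)) * noisy_f

assert abs(x - mu) < 0.1
```

The first RM condition lets $x$ travel arbitrarily far from its initialization; the second makes the per-step noise contribution summably small, so the iterate settles instead of oscillating.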

Actor–critic involves two coupled stochastic approximation recursions (actor: $\theta_t$, critic: $w_t$). This coupling is the core theoretical challenge, resolved by the two-timescale framework.

10. Two-Timescale Stochastic Approximation

This section contains the rigorous theoretical backbone of actor–critic convergence — arguably the most important mathematical result in the field.

Consider two people adjusting two coupled knobs. If one person adjusts their knob much faster than the other, then from the slow person's perspective, the fast knob has already settled to its best position given the current slow-knob value. From the fast person's perspective, the slow knob barely moves.

⏱️ The Timescale Separation Principle

Critic (fast): Updates rapidly with large step size $\beta_t \gg \alpha_t$. Given the current actor $\theta$, the critic converges to $w^*(\theta)$ — the best value estimate for that policy.


Actor (slow): Updates slowly. From its perspective, the critic is always converged. So the actor gradient uses an accurate (near-optimal) advantage estimate at each step.

This timescale separation means: even though both are updating simultaneously, theoretically we can analyze them as if the critic were solved first. This is what makes the coupled convergence proof tractable.

10.1 Formal Setup

The two coupled update recursions are:

Actor (slow timescale)
$$\theta_{t+1} = \theta_t + \alpha_t \cdot h(\theta_t, w_t, \xi_t)$$
Critic (fast timescale)
$$w_{t+1} = w_t + \beta_t \cdot g(\theta_t, w_t, \zeta_t)$$

where $\xi_t, \zeta_t$ are noise terms, $h = \nabla_\theta J_w(\theta)$ (policy gradient using critic), and $g = \delta_t \nabla_w V_w$ (TD update).

📐 Two-Timescale Condition
$$\frac{\alpha_t}{\beta_t} \to 0 \quad \text{as } t \to \infty$$

Equivalently: $\beta_t \gg \alpha_t$ (critic step size dominates actor step size). Both must satisfy Robbins–Monro individually.

Example schedules satisfying this:

$$\alpha_t = \frac{c_1}{(t+1)^{0.6}}, \quad \beta_t = \frac{c_2}{(t+1)^{0.4}} \qquad \Rightarrow \quad \frac{\alpha_t}{\beta_t} = \frac{c_1}{c_2}(t+1)^{-0.2} \to 0$$

10.2 ODE Interpretation (Borkar–Meyn Framework)

A. Fast Timescale ODE (Critic)

Viewing $w_t$ on its own timescale (stretch time by $1/\beta_t$), the fast iterate tracks:

$$\dot{w}(\tau) = \bar{g}(\theta, w(\tau)), \quad \bar{g}(\theta,w) = \mathbb{E}[g(\theta,w,\zeta)]$$

With $\theta$ frozen (it moves infinitely slower), the critic ODE has a globally attracting equilibrium $w^*(\theta)$ for each $\theta$. The critic converges to this manifold.

B. Slow Timescale ODE (Actor)

On the actor's timescale, the critic appears converged. The actor tracks:

$$\dot{\theta}(\tau) = \bar{h}(\theta, w^*(\theta)), \quad \bar{h}(\theta,w) = \mathbb{E}[h(\theta,w,\xi)]$$

This is the "true" policy gradient ODE, with advantages computed using the converged critic $w^*(\theta)$.

C. Convergence Conclusion

Under technical conditions (Lipschitz dynamics, bounded iterates or stability, ergodicity of the Markov chain), the Borkar–Meyn theorem gives:

$$(\theta_t, w_t) \xrightarrow{a.s.} (\theta^*, w^*(\theta^*))$$

where $\theta^*$ is a local maximum of $J(\theta)$ (a stationary point of the actor ODE). The actor converges to a (local) optimal policy.

Why this matters: Without two-timescale analysis, we cannot guarantee actor–critic convergence — the critic bias would compound. This framework justifies using $\beta_t \gg \alpha_t$ in practice and underpins convergence proofs in A2C, TD-actor-critic, and natural gradient methods.
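The whole argument can be watched on a toy pair of coupled recursions (everything here is invented: the critic's fixed point is $w^*(\theta) = \theta$, and the actor ascends $J(\theta) = -(\theta - 2)^2$ but only "sees" $w$, not $\theta$):

```python
import numpy as np

rng = np.random.default_rng(5)

theta, w = 0.0, 5.0
for t in range(200_000):
    alpha = (t + 2.0) ** -0.9          # slow timescale (actor)
    beta = (t + 2.0) ** -0.6           # fast timescale; alpha/beta -> 0
    # critic recursion: noisy pull toward w*(theta) = theta
    w += beta * ((theta - w) + rng.normal(0, 0.5))
    # actor recursion: noisy gradient of J, evaluated with w in place of theta
    theta += alpha * (2.0 * (2.0 - w) + rng.normal(0, 0.5))

# w should have collapsed onto the manifold w*(theta), and theta onto
# the optimum theta* = 2, despite both updating simultaneously.
```

Because $\beta_t \gg \alpha_t$, the critic's error $w - \theta$ dies out on the fast timescale, and the actor effectively receives the true gradient $2(2 - \theta)$, exactly the two-ODE picture above.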


11. Deterministic Policy Gradient

Stochastic policies work well for discrete actions but are inefficient in high-dimensional continuous action spaces — you'd need to sample many actions to estimate the gradient. Silver et al. (2014) showed that for deterministic policies $\mu_\theta: \mathcal{S} \to \mathcal{A}$, a simpler gradient formula exists.

Intuition: the gradient just points in the direction of action that the Q-function says is best. Instead of averaging over random actions, we follow the Q-gradient in action space.

Deterministic Policy Gradient Theorem
$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\!\left[\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\right]$$

The gradient of $J$ w.r.t. $\theta$ follows from the chain rule: how the action changes with $\theta$, times how the Q-value changes with the action. No averaging over actions is needed. Since a deterministic policy never explores on its own, this requires off-policy training with a separate exploratory behavior policy $\beta \neq \mu_\theta$.

📐 Connection to Stochastic PG (Limit Argument)

DPG is the limit of the stochastic policy gradient as policy variance $\sigma \to 0$. For a Gaussian policy $\pi_{\theta,\sigma}(a|s) = \mathcal{N}(\mu_\theta(s), \sigma^2 I)$:

$$\lim_{\sigma \to 0} \nabla_\theta J(\pi_{\theta,\sigma}) = \nabla_\theta J(\mu_\theta) = \mathbb{E}\!\left[\nabla_\theta \mu_\theta \cdot \nabla_a Q^{\mu_\theta}\right]$$

So DPG is not a fundamentally different theorem — it's a special case of the PG theorem in the deterministic limit.
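The chain-rule identity is easy to verify numerically on a quadratic toy problem (the policy $\mu_\theta(s) = \theta s$, the critic $Q(s,a) = -(a - 3s)^2$, and the state batch are all invented): the DPG estimate matches a finite-difference gradient of $J$.

```python
import numpy as np

rng = np.random.default_rng(6)

states = rng.normal(size=1000)    # fixed batch standing in for d^mu
theta = 1.0

def J(th):
    a = th * states
    return np.mean(-(a - 3.0 * states) ** 2)

# DPG: grad_theta mu = s, grad_a Q = -2(a - 3s), evaluated at a = mu(s)
a = theta * states
dpg = np.mean(states * (-2.0 * (a - 3.0 * states)))

# Finite-difference gradient of J for comparison
eps = 1e-5
fd = (J(theta + eps) - J(theta - eps)) / (2 * eps)

assert abs(dpg - fd) < 1e-4
```

No action sampling occurs anywhere: the gradient flows through the deterministic action, which is exactly what makes DPG attractive in high-dimensional action spaces.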

12. Deep Actor–Critic Methods

12.1 DDPG (Deep Deterministic Policy Gradient)

DPG + neural networks + replay buffer + target networks (Lillicrap et al., 2015). Key innovations: (1) off-policy experience replay for sample efficiency, (2) target networks — slowly-updated copies — for training stability, (3) batch normalization for feature scaling.

🎯 DDPG Key Tricks
  • Replay Buffer: $\mathcal{B}$ stores $(s,a,r,s')$; random mini-batches break correlations
  • Target Networks: $\theta^-, w^-$ — updated as $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$ ($\tau \ll 1$)
  • Ornstein-Uhlenbeck noise: Temporally correlated exploration noise for continuous actions
DDPG Critic Update
$$\mathcal{L}(w) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}}\!\left[\left(Q_w(s,a) - \underbrace{(r + \gamma Q_{w^-}(s', \mu_{\theta^-}(s')))}_{\text{target}}\right)^2\right]$$
DDPG Actor Update
$$\nabla_\theta J \approx \frac{1}{N}\sum_i \nabla_\theta \mu_\theta(s_i) \cdot \nabla_a Q_w(s_i, a)\big|_{a=\mu_\theta(s_i)}$$
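The target-network trick from the list above is just Polyak averaging of parameters. A minimal sketch on plain NumPy arrays (the parameter shapes and $\tau$ value are placeholders):

```python
import numpy as np

tau = 0.005   # typical soft-update rate; target moves slowly toward online

def soft_update(target_params, online_params, tau):
    """theta_minus <- tau * theta + (1 - tau) * theta_minus, per parameter."""
    for k in target_params:
        target_params[k] = tau * online_params[k] + (1 - tau) * target_params[k]
    return target_params

online = {"w1": np.ones((2, 2)), "b1": np.ones(2)}
target = {"w1": np.zeros((2, 2)), "b1": np.zeros(2)}

for _ in range(1000):
    target = soft_update(target, online, tau)

# Residual gap after n updates is (1 - tau)^n: (0.995)^1000 ~ 0.0067
assert np.allclose(target["w1"], 1.0, atol=0.01)
```

The geometric decay $(1-\tau)^n$ is why a small $\tau$ yields a slowly trailing, stable target for the bootstrapped critic loss.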

12.2 TD3 (Twin Delayed Deep Deterministic)

🛡️ TD3 Fixes (Fujimoto et al., 2018)

DDPG suffers from Q-value overestimation. TD3 introduces three fixes:

  1. Clipped Double Q-learning: Maintain two critics $Q_{w_1}, Q_{w_2}$, use $\min$ for target: $y = r + \gamma \min_i Q_{w_i^-}(s', \tilde{a})$
  2. Delayed Policy Updates: Update actor every $d$ critic steps (typically $d=2$) — critic must first stabilize
  3. Target Policy Smoothing: Add noise to target actions: $\tilde{a} = \mu_{\theta^-}(s') + \text{clip}(\varepsilon, -c, c)$
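The three fixes meet in the target computation. A sketch of that one step for a batch (the target actor and the two critics are stand-in lambdas, and all shapes and constants are invented):

```python
import numpy as np

rng = np.random.default_rng(7)

gamma, c, sigma = 0.99, 0.5, 0.2
s_next = rng.normal(size=(4, 3))              # batch of next states
r = np.array([1.0, 0.0, 0.5, 2.0])            # batch of rewards

mu_target = lambda s: np.tanh(s @ np.ones(3)) # placeholder target actor
Q1 = lambda s, a: s.sum(axis=1) + a           # placeholder target critics
Q2 = lambda s, a: s.sum(axis=1) - 0.1 * a

# Target policy smoothing: clipped Gaussian noise on the target action
noise = np.clip(rng.normal(0, sigma, size=4), -c, c)
a_tilde = mu_target(s_next) + noise

# Clipped double Q: pessimistic min over the two target critics
y = r + gamma * np.minimum(Q1(s_next, a_tilde), Q2(s_next, a_tilde))

assert y.shape == (4,)
assert np.all(y <= r + gamma * Q1(s_next, a_tilde) + 1e-12)   # pessimism
```

The delayed-update rule is then simply: regress both critics toward `y` every step, but apply the actor (and target) updates only every $d$-th step.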

12.3 SAC (Soft Actor–Critic)

SAC (Haarnoja et al., 2018) is the current gold standard for continuous control. Its key innovation: maximize both return and policy entropy simultaneously (maximum entropy RL framework). This naturally encourages exploration and robustness.

Intuition: an agent that maximizes entropy tries to be as random as possible while still achieving high reward. This prevents premature commitment to suboptimal solutions and gives a natural exploration-exploitation balance.

SAC augments the RL objective with an entropy term:

SAC Objective
$$J(\theta) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\!\left[r(s_t,a_t) + \alpha H(\pi_\theta(\cdot|s_t))\right]$$
Soft Bellman Backup
$$Q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}\!\left[V^*(s')\right], \quad V^*(s) = \mathbb{E}_{a\sim\pi}\!\left[Q^*(s,a) - \alpha\log\pi(a|s)\right]$$

The temperature $\alpha$ balances reward vs. entropy. SAC uses automatic entropy tuning: $\alpha$ is optimized to maintain $H(\pi) \geq \bar{H}$ (a target entropy threshold).
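For a discrete action set, the soft value definition above can be checked directly (the three Q-values and the temperature are arbitrary). A known identity in maximum-entropy RL: for the soft-optimal policy $\pi(a) \propto \exp(Q(a)/\alpha)$, the soft value equals $\alpha \log \sum_a \exp(Q(a)/\alpha)$.

```python
import numpy as np

alpha = 0.5
Q = np.array([1.0, 2.0, 0.5])     # hypothetical Q-values for one state

logits = Q / alpha
pi = np.exp(logits - logits.max()); pi /= pi.sum()   # soft-optimal policy

# Soft value: V = E_{a~pi}[Q(a) - alpha * log pi(a)]
V_soft = pi @ (Q - alpha * np.log(pi))
# Log-sum-exp form (computed stably)
V_lse = alpha * (np.log(np.sum(np.exp(logits - logits.max()))) + logits.max())

assert abs(V_soft - V_lse) < 1e-9
assert V_soft > Q.max()   # entropy bonus lifts V above the greedy value
```

As $\alpha \to 0$ the log-sum-exp collapses to $\max_a Q(a)$, recovering the ordinary hard Bellman backup; the temperature interpolates between the two.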

| Method | Policy | Off-policy? | Key Innovation | Best for |
| --- | --- | --- | --- | --- |
| DDPG | Deterministic | Yes | DPG + NN + replay | Baseline continuous control |
| TD3 | Deterministic | Yes | Clipped double Q, delayed updates | When overestimation is critical |
| SAC | Stochastic | Yes | Max-entropy, auto-$\alpha$ | State-of-the-art robustness |
| PPO | Stochastic | No (near) | Clipped surrogate objective | Simplicity, robustness |

13. Modern Perspectives

Offline RL

Conservative Q-Learning (CQL), IQL: actor–critic with pessimistic value estimates from offline datasets. The critic penalizes OOD actions to prevent overestimation without online exploration.

KL Regularization

MPO, VMPO: actor–critic with trust-region constraints. Policy improvement is a KL-constrained optimization. Related to PPO's clipped surrogate and SAC's entropy bonus.

Representation Learning

CURL, DrQ, DreamerV3: use auxiliary losses (contrastive, reconstruction) to learn better state representations. The critic benefits most from rich feature representations.

🌐 RLHF and Foundation Models

Modern LLM alignment via RLHF is actor–critic at scale: the language model is the actor, a learned reward model supplies the reward signal, a value network serves as the critic, and PPO performs policy updates under a KL constraint against the reference model. The two-timescale intuition still applies: the reward model is trained first, then the policy is optimized against it.

14. Theoretical Insights & Challenges

14.1 The Deadly Triad

☠️ Sutton's Deadly Triad (2018)

Instability and divergence occur when all three are present simultaneously:

  • Function Approximation: neural networks for V or Q
  • Bootstrapping: TD learning (targets built from the learner's own estimates)
  • Off-Policy Training: experience replay, behavior policy ≠ target policy

Actor–critic with replay buffers hits all three. Solutions: gradient clipping, target networks, conservative critics, double Q-learning, periodic resets.

14.2 Non-Stationarity

As the actor updates, the data distribution changes — the critic is learning a moving target. This makes off-policy corrections (importance sampling) theoretically necessary but practically unstable. Most algorithms ignore this and rely on near-on-policy experience in the replay buffer.

14.3 Convergence Guarantees (Summary)

| Algorithm | Convergence | Conditions |
| --- | --- | --- |
| TD(0) + linear FA | Convergent to TD fixed point | On-policy, linear FA, Robbins–Monro steps |
| Linear AC (two-timescale) | Convergent to local optimum | Linear critic, two-timescale steps, ergodic MDP |
| DDPG/TD3 (deep) | No formal guarantee | Empirically stable with tricks |
| Natural Policy Gradient | Convergent (monotone) | Exact Fisher, tabular or linear |
| PPO | Monotone improvement (approx.) | Trust-region approximation |
📐 Natural Policy Gradient & Fisher Information

The natural policy gradient (Amari, 1998; Kakade, 2002) preconditions the actor update with the Fisher information matrix:

$$\theta \leftarrow \theta + \alpha \cdot F(\theta)^{-1} \nabla_\theta J(\theta), \quad F(\theta) = \mathbb{E}\!\left[\nabla\log\pi \nabla\log\pi^\top\right]$$

This is gradient ascent in the space of distributions (under KL geometry) rather than parameter space. The compatible function approximation critic (Sutton et al. 2000) gives: $w^* = F^{-1} \nabla J$ when $Q_w(s,a) = w^\top \nabla\log\pi_\theta(a|s)$. TRPO and PPO are practical approximations of NPG.
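The Fisher preconditioning is easiest to see in one dimension. A sketch for a Bernoulli policy $\pi(1) = \sigma(\theta)$ in a single state (the reward vector is invented; for this parameterization the Fisher information is the known closed form $p(1-p)$):

```python
import numpy as np

theta = -2.0                       # near-deterministic policy, pi(1) small
R = np.array([0.0, 1.0])           # hypothetical rewards for actions 0, 1

p = 1.0 / (1.0 + np.exp(-theta))   # pi(a = 1)
pi = np.array([1 - p, p])
dlogpi = np.array([-p, 1 - p])     # d log pi(a) / d theta

grad_J = np.sum(pi * dlogpi * R)   # vanilla policy gradient
F = np.sum(pi * dlogpi ** 2)       # Fisher information = p(1-p)
nat_grad = grad_J / F              # natural-gradient direction

assert abs(F - p * (1 - p)) < 1e-12
assert nat_grad > grad_J   # Fisher rescaling inflates the tiny vanilla step
```

Near the deterministic boundary the vanilla gradient $p(1-p)$ vanishes even though the policy is far from optimal; dividing by the Fisher information removes exactly that parameterization-induced shrinkage, which is the point of working in distribution space.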

15. Summary

🎭 Actor
  • Policy $\pi_\theta$ or $\mu_\theta$
  • Updated by policy gradient
  • Slow timescale ($\alpha_t$)
  • Uses critic's advantage signal
🧑‍⚖️ Critic
  • Value function $V_w$ or $Q_w$
  • Updated by TD learning
  • Fast timescale ($\beta_t \gg \alpha_t$)
  • Reduces actor gradient variance
| Equation | Role |
| --- | --- |
| $\nabla_\theta J = \mathbb{E}[\nabla\log\pi_\theta \cdot Q^\pi]$ | Policy gradient theorem |
| $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ | Advantage (variance reduction) |
| $\delta_t = r_t + \gamma V_w(s') - V_w(s)$ | TD error (advantage estimate) |
| $w \mathrel{+}= \beta_t \delta_t \nabla_w V_w(s)$ | Critic update (TD) |
| $\theta \mathrel{+}= \alpha_t \delta_t \nabla_\theta \log\pi_\theta(a\mid s)$ | Actor update (policy gradient) |
| $\alpha_t / \beta_t \to 0$ | Two-timescale condition |
| $\nabla_\theta J(\mu) = \mathbb{E}[\nabla_\theta\mu \cdot \nabla_a Q]$ | Deterministic policy gradient |

16. Historical Timeline


1951
Stochastic Approximation (Robbins–Monro)
The mathematical foundation for all iterative RL algorithms. Proved that noisy gradient steps converge under sum/sum-of-squares conditions on step sizes. Still the backbone of every actor and critic update rule today.
1983
Neuronlike Adaptive Elements (Barto, Sutton & Anderson)
The first explicit actor–critic architecture — an "ASE" (actor) and "ACE" (critic) for the cart-pole problem. Remarkably close to modern A2C. Preceded policy gradient theory by nearly two decades.
1988
TD Learning (Sutton)
Temporal Difference learning formalized. TD(0), TD(λ), eligibility traces. Provides the modern critic machinery. "Learning to predict by the methods of temporal differences" — one of the most-cited RL papers.
1992
REINFORCE (Williams)
Policy gradient algorithm via the REINFORCE trick (log-derivative). Established policy gradients as a field. High variance necessitated baselines, which necessitated critics. The direct precursor to actor–critic theory.
1999–2000
Policy Gradient Theorem & Two-Timescale Theory
Sutton et al. (2000) proved the policy gradient theorem, giving a clean formula for ∇J. Konda & Tsitsiklis (1999/2003) proved convergence of actor–critic under two-timescale stochastic approximation — the first rigorous convergence proof.
2002
Natural Policy Gradient (Kakade)
Applied Amari's information geometry to policy optimization. NPG uses the Fisher information matrix to precondition gradients, giving policy-invariant updates. Ancestor of TRPO, PPO, and the compatible function approximation framework.
2014–2015
DPG & DDPG (Silver et al., Lillicrap et al.)
Deterministic Policy Gradient theorem extended PG to deterministic policies — critical for high-dimensional continuous action spaces. DDPG combined DPG with DQN's replay buffer and target networks, enabling deep actor–critic for robotics.
2016
A3C / A2C (Mnih et al., DeepMind)
Asynchronous Advantage Actor–Critic scaled actor–critic to deep networks with parallel workers on CPU. Used n-step returns and entropy regularization. Became the template for modern deep actor–critic methods.
2017–2018
PPO, TD3, SAC
PPO (Schulman 2017): clipped surrogate objective for stable on-policy updates. TD3 (Fujimoto 2018): fixed DDPG overestimation with clipped double Q. SAC (Haarnoja 2018): maximum entropy actor–critic with automatic temperature tuning. These remain dominant algorithms in 2024.
2020–2024
Offline RL, World Models, RLHF
Offline AC methods (CQL, IQL) brought actor–critic to offline data. DreamerV3 uses latent world models with an AC loop in imagination. RLHF applies PPO to align LLMs — actor–critic at billion-parameter scale. The framework shows no signs of being replaced.

References: Sutton & Barto (2018) Reinforcement Learning: An Introduction · Konda & Tsitsiklis (2003) On Actor-Critic Algorithms · Silver et al. (2014) Deterministic Policy Gradient · Haarnoja et al. (2018) Soft Actor–Critic · Borkar (2008) Stochastic Approximation: A Dynamical Systems Viewpoint · Schulman et al. (2015) High-Dimensional Continuous Control using GAE