Policy Gradient Methods
& RLHF
A complete, derivation-first treatment from first principles to modern LLM alignment.
Motivation & Intuition
Reinforcement learning is the science of sequential decision making under uncertainty. An agent interacts with an environment, takes actions, and receives rewards. The goal is to learn a policy—a mapping from states to actions—that maximizes cumulative reward.
Imagine training a dog. You can't tell the dog exactly what muscle movements to make. You reward good behaviors and ignore or correct bad ones. Over time, the dog learns which behaviors lead to rewards. Policy gradient methods do exactly this—they adjust the probability of actions based on how well they worked.
Why Policy Gradients?
Before policy gradients, RL relied primarily on value-based methods (Q-learning, SARSA). These methods learn the value of state-action pairs and derive a policy implicitly. But they face fundamental limitations:
| Aspect | Value-Based | Policy Gradient |
|---|---|---|
| Action space | Discrete only | Continuous + Discrete |
| Stochastic policies | Awkward | Natural |
| Convergence | Sometimes unstable | Smooth gradient descent |
| Policy representation | Implicit | Explicit, parameterized |
| Sample efficiency | High | Lower (on-policy) |
Policy gradient methods are especially critical for LLM training: a language model's action space (vocabulary tokens) is discrete but enormous (~50,000 tokens), and we need stochastic policies that can represent nuanced, diverse outputs. Value-based methods cannot scale here.
Policy Parameterization
A policy $\pi_\theta$ is a probability distribution over actions given states, parameterized by $\theta$ (e.g., neural network weights):
Deterministic policy: $\mu_\theta: \mathcal{S} \to \mathcal{A}$. Maps states directly to actions.
Stochastic policy: $\pi_\theta: \mathcal{S} \times \mathcal{A} \to [0,1]$. Maps (state, action) to a probability.
Common Parameterizations
Softmax policy (discrete actions):
$$\pi_\theta(a \mid s) = \frac{\exp h_\theta(s, a)}{\sum_{a'} \exp h_\theta(s, a')}$$
where $h_\theta(s, a)$ is the preference (logit) of action $a$ in state $s$.
Gaussian policy (continuous actions):
$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\ \mu_\theta(s),\ \sigma_\theta^2(s)\big)$$
For LLMs: The policy is a transformer with parameters $\theta$. Given context (state) $s = (x_1, \ldots, x_t)$, the policy outputs a distribution over the next token:
$$\pi_\theta(x_{t+1} \mid x_1, \ldots, x_t) = \mathrm{softmax}\big(f_\theta(x_1, \ldots, x_t)\big)$$
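As a concrete sketch (numpy only, with made-up logits), a softmax policy turns preferences into a distribution we can sample actions from:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(h):
    """Map preferences (logits) h[a] to action probabilities pi(a|s)."""
    z = h - h.max()               # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical logits h_theta(s, a) for a 3-action state.
logits = np.array([2.0, 1.0, 0.1])
pi = softmax_policy(logits)
action = rng.choice(len(pi), p=pi)   # sample a ~ pi_theta(.|s)
print(pi.round(3), action)
```

An LLM does the same thing with a ~50,000-way softmax over the vocabulary at every generation step.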
Policy Gradient Intuition
The core idea is strikingly simple: make good actions more probable, make bad actions less probable. We define a performance objective and take gradient steps to improve it.
Suppose an agent takes action $a$ in state $s$ and receives reward $R$. If $R$ was high, we want $\pi_\theta(a|s)$ to increase. If $R$ was low, we want it to decrease. The policy gradient algorithm does exactly this by adjusting $\theta$ in the direction that increases expected reward.
The policy gradient (the gradient of expected return with respect to the policy parameters $\theta$) points in the direction of steepest increase in expected reward, and so tells us how to adjust $\theta$. The key insight is that we can estimate this gradient using samples from the policy—no model of the environment is needed.
The Policy Gradient Theorem
(A) Intuition: Increase Probability of Good Actions
We want to compute $\nabla_\theta J(\theta)$ so we can do gradient ascent. The challenge: $J(\theta)$ depends on the distribution of trajectories induced by $\pi_\theta$, which also depends on $\theta$. The Policy Gradient Theorem gives us a clean, computable form for this gradient.
(B) Formal Statement
Define the performance objective as expected cumulative discounted reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$
The Policy Gradient Theorem states:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]$$
where $Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}\left[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\right]$ is the action-value function (expected future reward from $(s_t, a_t)$).
(C) Detailed Step-by-Step Derivation
The performance objective is the expected return over trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)] = \int p_\theta(\tau)\, R(\tau)\, d\tau$$
where the trajectory probability decomposes as:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$
Note: $\rho_0$ is the initial state distribution and $P$ is the (unknown) transition dynamics. The key observation: the policy $\pi_\theta$ is the only term that depends on $\theta$.
Differentiate with respect to $\theta$:
$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$
We move the gradient inside the integral (valid under mild regularity conditions). Now we need to compute $\nabla_\theta p_\theta(\tau)$.
The log-derivative trick (also called the likelihood ratio trick) uses the identity:
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$
This follows from $\nabla_\theta \log f = \frac{\nabla_\theta f}{f}$, rearranged. Substituting:
$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$
We've converted an integral over a changing distribution into an expectation under $p_\theta$—which we can estimate by sampling trajectories! The environment dynamics vanish in the next step.
Taking the log of the trajectory probability:
$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \big[\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t)\big]$$
Taking the gradient with respect to $\theta$:
$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
The initial state distribution and environment dynamics do not depend on $\theta$, so their gradients are zero. We're left with only policy terms!
Substituting back, we get the REINFORCE policy gradient estimator:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
This is an unbiased estimator of the true gradient. We can estimate it from samples without knowing the environment dynamics.
A key insight: past rewards cannot be influenced by current actions. We can replace $R(\tau)$ with the future reward (reward-to-go):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\right]$$
This reduces variance without introducing bias! The term $\sum_{t'=t}^T \gamma^{t'-t} r_{t'}$ is the sample estimate of $Q^{\pi_\theta}(s_t, a_t)$.
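A minimal numpy sketch of the reward-to-go computation behind this estimator (the helper names here are my own, not from a library):

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, via one backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_gradient(scores, rewards, gamma):
    """REINFORCE estimator: sum_t (grad log pi(a_t|s_t)) * G_t.
    `scores` is a (T, d) array holding one score vector per timestep."""
    G = reward_to_go(np.asarray(rewards, dtype=float), gamma)
    return (np.asarray(scores) * G[:, None]).sum(axis=0)

print(reward_to_go([1.0, 1.0, 1.0], gamma=1.0))   # -> [3. 2. 1.]
```

The backward pass makes the computation O(T) instead of the naive O(T²) double sum.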
The complete, general form of the Policy Gradient Theorem (Sutton et al., 2000):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\big]$$
This holds for both episodic and continuing tasks. The key components are:
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ — the score function (how to increase log-prob of $a_t$)
- $Q^{\pi_\theta}(s_t, a_t)$ — the quality signal (how good was this action?)
The gradient says: move $\theta$ to increase the log-probability of action $a_t$, but scale this increase by how good that action was (its Q-value). High-reward actions get amplified; low-reward actions get suppressed.
REINFORCE Algorithm
REINFORCE (Williams, 1992) is the simplest instantiation of the policy gradient theorem. It estimates $Q^{\pi}(s_t, a_t)$ using Monte Carlo returns—complete trajectory rollouts.
The REINFORCE update rule:
$$\theta \leftarrow \theta + \alpha\, \hat{G}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad \hat{G}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$$
REINFORCE is unbiased but suffers from very high variance. Different trajectories starting from the same state can yield vastly different $\hat{G}_t$ values due to stochasticity. This makes learning slow and unstable. The solution: baselines and advantage functions (next section).
The term $\nabla_\theta \log \pi_\theta(a|s)$ is called the score function or log-derivative. For a Gaussian policy: $\nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \mu_\theta(s))}{\sigma^2} \nabla_\theta \mu_\theta(s)$. For a softmax policy: $\nabla_\theta \log \pi_\theta(a|s) = \phi(s,a) - \sum_{a'} \pi_\theta(a'|s) \phi(s,a')$ where $\phi$ is a feature vector.
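The softmax score formula above can be checked numerically. Here a linear-in-features softmax policy (features and dimensions chosen arbitrarily for illustration) is differentiated by central finite differences and compared against the closed form:

```python
import numpy as np

def softmax_probs(theta, phi):
    """pi(a|s) proportional to exp(theta . phi(s,a)); phi has one row per action."""
    logits = phi @ theta
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score(theta, phi, a):
    """Closed form: phi(s,a) - sum_a' pi(a'|s) phi(s,a')."""
    return phi[a] - softmax_probs(theta, phi) @ phi

rng = np.random.default_rng(1)
theta, phi, a, eps = rng.normal(size=4), rng.normal(size=(3, 4)), 1, 1e-6

# Finite-difference gradient of log pi(a|s) with respect to each theta_i.
num = np.zeros_like(theta)
for i in range(len(theta)):
    e = np.zeros_like(theta); e[i] = eps
    num[i] = (np.log(softmax_probs(theta + e, phi)[a]) -
              np.log(softmax_probs(theta - e, phi)[a])) / (2 * eps)

print(np.allclose(num, score(theta, phi, a), atol=1e-5))  # True
```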
Variance Reduction
Baselines
A baseline $b(s)$ is any function of the current state (not the action) that we can subtract from the return without introducing bias. The modified gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(\hat{G}_t - b(s_t)\big)\right]$$
Why is this unbiased? Because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$.
The variance-minimizing baseline is $b^*(s) = \frac{\mathbb{E}[\|\nabla_\theta \log \pi\|^2 Q]}{\mathbb{E}[\|\nabla_\theta \log \pi\|^2]}$. In practice, we use the value function $V^\pi(s)$ as a convenient, near-optimal baseline.
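The unbiasedness argument is exact for a softmax policy and can be verified directly: the policy-weighted average of the score vectors is the zero vector, so any state-dependent baseline contributes nothing in expectation (the features and baseline value below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(size=4)
phi = rng.normal(size=(5, 4))          # 5 actions, 4 parameters

logits = phi @ theta
pi = np.exp(logits - logits.max()); pi /= pi.sum()
scores = phi - pi @ phi                # row a holds grad log pi(a|s)

b = 3.7                                # any baseline value b(s)
expectation = b * (pi[:, None] * scores).sum(axis=0)
print(np.allclose(expectation, 0.0))   # True: the baseline adds no bias
```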
The Advantage Function
Using the value function $V^\pi(s_t)$ as the baseline gives the advantage function:
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
The advantage function answers the question: "How much better is action $a$ compared to the average action in state $s$?" A positive advantage means better than average; a negative advantage means worse than average.
Generalized Advantage Estimation (GAE)
GAE (Schulman et al., 2015) provides a bias-variance trade-off using temporal differences:
$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual. Parameter $\lambda \in [0,1]$: $\lambda=0$ gives pure TD (low variance, high bias); $\lambda=1$ gives Monte Carlo (high variance, low bias).
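A short numpy implementation of this recursion (the episode is assumed to terminate, so the bootstrap value for the final state is 0):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one episode.
    `values` has length T+1: V(s_0..s_T), with V(s_T)=0 if the episode ended."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # (gamma*lam)-discounted sum
        adv[t] = running
    return adv

r = np.array([1.0, 0.0, 2.0])
V = np.array([0.5, 0.4, 0.3, 0.0])
print(gae(r, V, gamma=0.99, lam=0.0))   # lambda=0: just the TD residuals delta_t
print(gae(r, V, gamma=0.99, lam=1.0))   # lambda=1: Monte Carlo return minus V(s_t)
```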
Actor–Critic Connection
Actor-Critic methods maintain two separate function approximators:
🎭 Actor
Policy $\pi_\theta(a|s)$
Decides which action to take
🧑‍⚖️ Critic
Value $V_w(s)$ or $Q_w(s,a)$
Evaluates how good the action was
🔄 Feedback
Advantage $A = Q - V$
Signal to update actor
Think of an actor on stage and a critic reviewing the performance. The actor tries different actions; the critic evaluates them. The actor uses the critic's feedback to improve. This synergy combines the advantages of both value-based (lower variance via critic) and policy-based (works in continuous spaces) methods.
A2C / A3C Architecture
The actor loss (gradient ascent on the policy):
$$L^{\text{actor}}(\theta) = -\mathbb{E}_t\big[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big]$$
The critic loss (value function regression):
$$L^{\text{critic}}(w) = \mathbb{E}_t\big[(V_w(s_t) - \hat{G}_t)^2\big]$$
These are often combined with an entropy bonus to encourage exploration:
$$L(\theta, w) = L^{\text{actor}}(\theta) + c_v\, L^{\text{critic}}(w) - c_e\, \mathbb{E}_t\big[H[\pi_\theta](s_t)\big]$$
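A numpy sketch of how these three pieces combine in one A2C-style loss (the coefficient values are common defaults, not canonical constants):

```python
import numpy as np

def a2c_losses(log_probs, values, returns, entropies, c_v=0.5, c_e=0.01):
    """Actor, critic, and combined A2C losses over a batch of timesteps.
    The advantage A_t = G_t - V(s_t) is treated as a fixed signal."""
    adv = returns - values
    actor = -(log_probs * adv).mean()          # ascent on E[log pi * A]
    critic = ((values - returns) ** 2).mean()  # value regression to returns
    total = actor + c_v * critic - c_e * entropies.mean()
    return actor, critic, total

actor, critic, total = a2c_losses(
    log_probs=np.log(np.array([0.5, 0.25])),
    values=np.array([1.0, 0.0]),
    returns=np.array([2.0, 1.0]),
    entropies=np.array([0.6, 0.6]))
print(round(actor, 4), round(critic, 4))   # 1.0397 1.0
```

In a real implementation the advantage is detached from the computation graph so the actor term does not backpropagate through the critic.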
Advanced Policy Gradient Methods
Natural Policy Gradient (NPG)
Standard gradient descent treats all parameter directions equally. But policy space has a Riemannian geometry: equal changes in $\theta$ can have wildly different effects on $\pi_\theta$. The natural gradient accounts for this using the Fisher Information Matrix $F(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta\, \nabla_\theta \log \pi_\theta^\top\big]$, which measures the curvature of the KL divergence between nearby policies and encodes how policy-space distances relate to parameter-space distances:
$$\tilde{\nabla}_\theta J = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$
NPG makes updates that are invariant to the parameterization of $\theta$—a critical property for neural networks where the same policy can be represented by many different $\theta$ values.
Trust Region Policy Optimization (TRPO)
TRPO (Schulman et al., 2015) formalizes the intuition that policy updates should not be too large. It solves a constrained optimization problem:
$$\max_\theta\ \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\Big[D_{\text{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\, \|\, \pi_\theta(\cdot \mid s_t)\big)\Big] \le \delta$$
The KL divergence, $D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$, measures how much the new policy deviates from the old one; TRPO requires it to be small ($\delta \approx 0.01$). The constraint ensures the new policy stays close to the old one, maintaining the validity of the importance-weighted objective. TRPO uses conjugate gradients and line search, making it theoretically sound but computationally expensive.
PPO — Derivation & Intuition
Proximal Policy Optimization (Schulman et al., 2017) is arguably the most important modern policy gradient algorithm. It achieves TRPO's stability with a fraction of the computational cost.
(A) Starting Point: Policy Gradient Objective with Importance Sampling
When we collect data under policy $\pi_{\theta_\text{old}}$ but optimize $\theta \neq \theta_\text{old}$, we need importance sampling:
$$J(\theta) \approx \mathbb{E}_{t \sim \pi_{\theta_\text{old}}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$$
Define the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$
The surrogate objective (CPI — Conservative Policy Iteration):
$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\, \hat{A}_t\big]$$
Without constraints, maximizing $L^\text{CPI}$ can lead to destructively large policy updates. If $r_t(\theta)$ becomes very large, the importance-weighted advantage estimate becomes unreliable (high variance, potential bias).
(B) Introducing the KL Constraint (TRPO Approach)
TRPO adds an explicit KL constraint. The theoretical guarantee (from the surrogate objective bound):
$$J(\theta) \ge L_{\theta_\text{old}}(\theta) - C \cdot \max_s D_{\text{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big)$$
This shows the true objective $J(\theta)$ is lower bounded by the surrogate minus a KL penalty. Maximizing this lower bound gives monotonic improvement guarantees.
(C) Trust Region Idea
Imagine you're on a hill (performance landscape) and want to move uphill. But your map (surrogate objective) is only accurate near your current position. A trust region says: "only take steps within a radius where the map is trustworthy." TRPO enforces this via KL constraint; PPO enforces it via clipping.
(D) PPO Clipped Objective — Full Derivation
TRPO requires solving a constrained optimization problem at each step, which involves:
- Computing the Fisher Information Matrix (or its inverse)
- Conjugate gradient to find the natural gradient direction
- Line search to enforce the KL constraint
This is expensive and complex. PPO achieves similar empirical performance by directly clipping the objective, preventing large probability ratios without solving a constrained problem.
For the surrogate $L^{\text{CPI}} = \mathbb{E}_t[r_t(\theta) \hat{A}_t]$, consider two cases:
Case 1: $\hat{A}_t > 0$ (action was better than average)
Gradient ascent increases $r_t(\theta) = \pi_\theta / \pi_{\theta_\text{old}}$, meaning $\pi_\theta(a_t|s_t)$ increases. But if we over-optimize, $r_t$ can become very large—the policy changes too much.
Case 2: $\hat{A}_t < 0$ (action was worse than average)
Gradient ascent decreases $r_t(\theta)$, meaning $\pi_\theta(a_t|s_t)$ decreases. Again, without constraint, $r_t$ can become very small (near zero)—also a large policy change.
In both cases, we want to limit how far $r_t(\theta)$ deviates from 1.
PPO clips the probability ratio to stay within $[1-\epsilon, 1+\epsilon]$ (typically $\epsilon = 0.2$):
$$r_t^{\text{clip}}(\theta) = \text{clip}\big(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)$$
This prevents the policy from changing too much from $\pi_{\theta_\text{old}}$ in a single update step.
PPO takes the minimum of the clipped and unclipped objectives, forming a pessimistic lower bound:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\, \hat{A}_t\big)\Big]$$
Why take the minimum? Analyze by cases:
When $\hat{A}_t > 0$: We want to increase $r_t$. But $\min(\cdot)$ caps the benefit at $r_t = 1+\epsilon$. Beyond this, the gradient becomes zero—no incentive to move the policy further.
When $\hat{A}_t < 0$: We want to decrease $r_t$. The $\min(\cdot)$ caps the penalty at $r_t = 1-\epsilon$. Beyond this, the gradient is again zero.
Result: The gradient vanishes once the ratio has moved outside $[1-\epsilon, 1+\epsilon]$ in the direction the advantage favors, so the policy is never pushed further than the trust region allows. (If the ratio lies outside the interval in the opposite direction, the unclipped term is active and its gradient pulls the ratio back toward the interval.)
The full PPO objective combines the clipped policy loss, value function loss, and entropy bonus:
$$L_t^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\Big[L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, H[\pi_\theta](s_t)\Big]$$
where:
- $L^{\text{CLIP}}$ = clipped policy gradient loss
- $L^{\text{VF}}_t = (V_\theta(s_t) - V_t^\text{target})^2$ = value function squared error
- $H[\pi_\theta](s_t) = -\sum_a \pi_\theta(a|s_t)\log\pi_\theta(a|s_t)$ = entropy bonus for exploration
- $c_1 \approx 0.5$, $c_2 \approx 0.01$ are coefficients
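The clipped surrogate is a few lines of numpy; the two prints illustrate the case analysis above (the ratio and advantage values are chosen purely for illustration):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """L^CLIP = mean of min(r*A, clip(r, 1-eps, 1+eps)*A); maximize this."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()

# A > 0 and r above 1+eps: the gain is capped at (1+eps)*A.
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))    # 1.2
# A < 0 and r below 1-eps: the clipped branch (zero gradient in r) is active.
print(ppo_clip_objective(np.array([0.5]), np.array([-1.0])))   # -0.8
```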
1. Bounded policy change: Clipping ensures $\pi_\theta$ cannot deviate too much from $\pi_{\theta_\text{old}}$ per update.
2. Multiple epochs on same data: Because updates are small, we can safely do K epochs (typically K=4-10) on the same collected data without divergence.
3. No 2nd-order methods: Unlike TRPO, PPO uses standard SGD/Adam, making it scalable to large neural networks (including LLMs).
Stochastic Approximation View
Policy gradient methods are instances of stochastic approximation (Robbins-Monro, 1951). We want to find $\theta^* = \arg\max_\theta J(\theta)$ where $J(\theta)$ cannot be computed exactly, only estimated.
The update takes the form
$$\theta_{k+1} = \theta_k + \alpha_k\, \widehat{\nabla_\theta J}(\theta_k)$$
where $\widehat{\nabla_\theta J}$ is an unbiased or consistent estimator of the true gradient. The Robbins-Monro conditions for convergence:
$$\sum_{k=1}^{\infty} \alpha_k = \infty, \qquad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$
Typical choice: $\alpha_k = \frac{1}{k}$ or $\alpha_k = \frac{c}{k + c'}$. In practice, fixed learning rates with Adam often work better.
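A toy run of this scheme (seeded noise, with a simple quadratic objective chosen for illustration) shows the iterates settling at the optimum despite never seeing an exact gradient:

```python
import numpy as np

# Maximize J(theta) = -(theta - 3)^2 from noisy gradient estimates,
# with Robbins-Monro step sizes alpha_k = c / (k + c').
rng = np.random.default_rng(0)
theta, c, c_prime = 0.0, 1.0, 10.0
for k in range(1, 5001):
    grad_hat = -2.0 * (theta - 3.0) + rng.normal()   # unbiased but noisy
    theta += (c / (k + c_prime)) * grad_hat
print(round(theta, 2))   # settles near the optimum theta* = 3
```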
Gradient estimator bias: REINFORCE is unbiased ($\mathbb{E}[\hat{g}] = \nabla_\theta J$) but high variance. Actor-critic with approximate value function introduces bias (from approximation error) but lower variance. This bias-variance tradeoff is central to all policy gradient methods.
Deep Policy Gradient Methods
Modern deep RL combines policy gradients with deep neural networks as function approximators. Key developments:
Deep Deterministic Policy Gradient (DDPG)
For continuous action spaces, DDPG learns a deterministic policy $\mu_\theta(s)$ using the chain rule:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\Big[\nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\Big]$$
Soft Actor-Critic (SAC)
SAC maximizes an entropy-augmented objective, trading off reward for policy entropy:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t r(s_t, a_t) + \alpha\, H\big(\pi_\theta(\cdot \mid s_t)\big)\right]$$
SAC is off-policy (uses a replay buffer), entropy-regularized, and highly sample efficient—making it one of the most popular algorithms for continuous control.
Practical Tricks for Stability
| Trick | Purpose |
|---|---|
| Gradient clipping | Prevent exploding gradients in deep networks |
| Normalization of advantages | Reduce sensitivity to reward scale |
| Reward scaling/clipping | Stabilize learning across environments |
| Target networks (in critic) | Stabilize TD learning |
| Orthogonal initialization | Better gradient flow in deep policies |
RLHF Connection
(A) What is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is the technique that transforms a pretrained language model into a helpful, harmless, and honest AI assistant. It aligns model behavior with human preferences using RL—specifically policy gradient methods.
In standard RL, the reward signal comes from the environment (e.g., game score). In RLHF, the "environment" is a human evaluator who tells the model which responses are better. Since we can't query humans at every step, we train a reward model to simulate human preferences, then optimize the LLM policy against this reward model using PPO.
(B) The RLHF Pipeline
Step 1
Pretraining
Train LLM on large text corpus (next-token prediction)
Step 2
SFT
Supervised fine-tuning on high-quality demonstrations
Step 3
Reward Model
Train $R_\phi$ on human pairwise preference data
Step 4
PPO
Optimize $\pi_\theta$ to maximize $R_\phi$ with KL constraint
(C) Role of Policy Gradient
The LLM is the policy $\pi_\theta$. Generating a response is a trajectory of token selections. The reward model provides a scalar reward at the end of generation, with a KL penalty toward the reference (SFT) policy:
$$r(x, y) = R_\phi(x, y) - \beta\, \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$$
The policy gradient update then pushes the LLM to generate responses that get higher reward from the reward model.
RLHF for LLMs (In Depth)
How GPT-style Models are Post-Trained
Starting from a pretrained model $\pi_\text{SFT}$, the RLHF loop works as follows:
Reward Modeling
The reward model $R_\phi$ is trained using the Bradley-Terry model for pairwise comparisons:
$$P(y_w \succ y_l \mid x) = \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)$$
Training loss (negative log-likelihood of the preferences):
$$\mathcal{L}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)\Big]$$
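A numpy version of this loss on hypothetical reward-model scores (the score values are made up; only the margins matter):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry NLL: -mean log sigmoid(R(x,y_w) - R(x,y_l))."""
    margin = r_chosen - r_rejected
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))

chosen = np.array([2.0, 0.5, 1.0])     # scores for preferred responses
rejected = np.array([1.0, 0.7, -1.0])  # scores for dispreferred responses
print(round(reward_model_loss(chosen, rejected), 4))   # 0.4128
```

Note that only score differences enter the loss, so the reward model is identified only up to an additive constant per prompt.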
KL Regularization: Why It's Critical
Without regularization, the policy would reward hack: find outputs that exploit weaknesses in the reward model while degenerating into incoherent text. The KL penalty prevents this:
$$\max_\theta\ \mathbb{E}_{x,\ y \sim \pi_\theta}\big[R_\phi(x, y)\big] - \beta\, \mathbb{E}_x\Big[D_{\text{KL}}\big(\pi_\theta(\cdot \mid x)\, \|\, \pi_\text{ref}(\cdot \mid x)\big)\Big]$$
1. Prevents reward hacking: Keeps the policy from exploiting reward model weaknesses.
2. Maintains language quality: Prevents the policy from drifting into incoherent text.
3. Acts as regularization: Ensures the final model retains general capabilities from pretraining.
Per-Token Reward Formulation
In practice, the reward is applied per-token for the PPO advantage computation:
$$r_t = -\beta\, \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_\text{ref}(y_t \mid x, y_{<t})} + \mathbb{1}[t = T]\, R_\phi(x, y)$$
This distributes the terminal reward across tokens, enabling per-token advantage estimation.
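A sketch of the per-token reward assembly (the log-probability values below are hypothetical):

```python
import numpy as np

def per_token_rewards(logp_policy, logp_ref, terminal_reward, beta):
    """Per-token KL penalty, r_t = -beta*(log pi_theta - log pi_ref),
    with the reward-model score R_phi added at the final token."""
    r = -beta * (np.asarray(logp_policy) - np.asarray(logp_ref))
    r[-1] += terminal_reward
    return r

lp_pi = [-1.0, -0.5, -2.0]    # token log-probs under the policy being trained
lp_ref = [-1.2, -0.5, -1.0]   # token log-probs under the frozen reference model
print(per_token_rewards(lp_pi, lp_ref, terminal_reward=4.0, beta=0.1))
```

Tokens where the policy has drifted above the reference get a small negative reward; tokens below it get a small positive one, and only the last token carries the reward-model score.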
DPO: Beyond PPO for LLMs
Direct Preference Optimization (Rafailov et al., 2023) bypasses the explicit reward model by showing that the optimal policy under KL-regularized RLHF satisfies:
$$\pi^*(y \mid x) \propto \pi_\text{ref}(y \mid x)\, \exp\big(R(x, y) / \beta\big)$$
Inverting this gives an implicit reward $R(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$, which plugged into the Bradley-Terry loss yields:
$$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$
DPO implicitly trains the reward model and policy simultaneously, dramatically simplifying the pipeline while achieving comparable results.
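The resulting loss is compact; here is a numpy sketch evaluated on hypothetical whole-sequence log-probabilities for two preference pairs:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-mean log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]).
    Inputs are whole-sequence log-probs of chosen (w) / rejected (l) responses."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-logits))))

loss = dpo_loss(logp_w=np.array([-5.0, -3.0]), logp_l=np.array([-6.0, -2.0]),
                ref_logp_w=np.array([-5.5, -3.0]), ref_logp_l=np.array([-5.5, -2.5]))
print(round(loss, 4))   # 0.6814
```

No sampling, reward model, or value network appears anywhere: the loss is an ordinary supervised objective over a fixed preference dataset, which is exactly why DPO simplifies the pipeline.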
Modern Perspective
From RL to LLM Alignment: The Full Picture
| RL Concept | LLM Equivalent | Notes |
|---|---|---|
| Agent | Language model | Generates tokens autoregressively |
| State $s_t$ | Context $x + y_{1:t}$ | All tokens seen so far |
| Action $a_t$ | Next token $y_t$ | From vocabulary (~50k items) |
| Policy $\pi_\theta$ | LLM parameters $\theta$ | Softmax over vocabulary |
| Reward $r$ | Human preference score | Sparse: only at end of response |
| Episode | One query-response pair | Variable length |
| Environment dynamics | Deterministic (token appending) | No stochastic transitions! |
Current Research Frontiers
Constitutional AI (Anthropic): Self-critique and revision without human feedback at scale.
RLAIF: Replace human preferences with AI preferences (Claude/GPT-4 as judge).
Process Reward Models: Reward each reasoning step, not just final answer (Math, coding).
GRPO (DeepSeek): Group Relative Policy Optimization—removes value network, uses group baselines.
Online DPO: Iteratively generate new preference pairs and update the policy.
KTO: Kahneman-Tversky Optimization—uses unpaired preferences, simpler data collection.
Summary
1. Policy Gradient Theorem: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$. Unbiased, model-free, works in continuous spaces.
2. Log-derivative trick converts an intractable gradient of an expectation into an expectation of a gradient—making Monte Carlo estimation possible.
3. Variance reduction via advantage functions $A^\pi = Q^\pi - V^\pi$ is essential for practical training.
4. PPO achieves TRPO-like stability via clipping, making it the de facto standard for both game AI and LLM alignment.
5. RLHF maps perfectly to the RL framework: LLM = policy, tokens = actions, human preference = reward, KL penalty = trust region.
6. The same mathematics that trained Atari agents in 2013 now powers GPT-4, Claude, and Gemini—a beautiful example of algorithmic leverage.
| Algorithm | Key Idea | Pros | Cons |
|---|---|---|---|
| REINFORCE | Monte Carlo PG | Unbiased | High variance |
| A2C/A3C | Advantage actor-critic | Lower variance | Biased critic |
| TRPO | KL-constrained update | Monotonic improvement | 2nd-order, slow |
| PPO | Clipped surrogate | Stable + scalable | Heuristic clipping |
| SAC | Entropy-augmented | Efficient, continuous | Off-policy complexity |
| PPO+RLHF | Human reward model | Aligns to preferences | Reward hacking risk |
| DPO | Implicit reward | No reward model needed | Static data, may be unstable |
Historical Timeline
REINFORCE (Williams)
Simple statistical gradient following. Introduced the log-derivative trick for policy optimization. Foundational but high-variance.
Policy Gradient Theorem (Sutton, McAllester, Singh, Mansour)
Formal proof connecting policy gradients to Q-values. Natural Policy Gradient introduced by Kakade (2001), connecting to information geometry.
Deep RL Renaissance
DQN (Mnih et al., 2013) demonstrates deep RL at scale. DDPG (2015) extends to continuous actions. A3C (Mnih et al., 2016) enables parallelism.
TRPO → PPO (Schulman et al.)
TRPO provides monotonic improvement guarantees. PPO (2017) simplifies via clipping, becoming the dominant algorithm for both games (OpenAI Five) and LLM alignment.
RLHF Emerges for Language Models
Ziegler et al. (2019) apply RLHF to GPT-2. InstructGPT / ChatGPT (2022) demonstrates RLHF at scale: PPO + reward model trained on human preferences transforms GPT-3 into a helpful assistant.
DPO & LLM Alignment Research
DPO bypasses reward model entirely. Constitutional AI, RLAIF, and process rewards emerge. GPT-4, Claude, Gemini all trained with RLHF variants.
RL for Reasoning (DeepSeek-R1, o1)
GRPO and outcome reward models enable models to learn complex multi-step reasoning. RL from verifiable rewards (math, code) achieves superhuman performance on reasoning benchmarks.
Soumyadeep Roy, MTech (Res), IISc Bangalore