Research-Grade Educational Guide

Policy Gradient Methods
& RLHF

A complete, derivation-first treatment from first principles to modern LLM alignment.

📚 16 Sections 🧮 Full Derivations 🎛 Interactive Visualizations 🔗 RLHF Connection
§01

Motivation & Intuition

Reinforcement learning is the science of sequential decision making under uncertainty. An agent interacts with an environment, takes actions, and receives rewards. The goal is to learn a policy—a mapping from states to actions—that maximizes cumulative reward.

🧠 Core Intuition

Imagine training a dog. You can't tell the dog exactly what muscle movements to make. You reward good behaviors and ignore or correct bad ones. Over time, the dog learns which behaviors lead to rewards. Policy gradient methods do exactly this—they adjust the probability of actions based on how well they worked.

Why Policy Gradients?

Before policy gradients, RL relied primarily on value-based methods (Q-learning, SARSA). These methods learn the value of state-action pairs and derive a policy implicitly. But they face fundamental limitations:

| Aspect | Value-Based | Policy Gradient |
|---|---|---|
| Action space | Discrete only | Continuous + discrete |
| Stochastic policies | Awkward | Natural |
| Convergence | Sometimes unstable | Smooth gradient ascent |
| Policy representation | Implicit | Explicit, parameterized |
| Sample efficiency | High | Lower (on-policy) |

Policy gradient methods are especially critical for LLM training: a language model's action space (vocabulary tokens) is discrete but enormous (~50,000 tokens), and we need stochastic policies that can represent nuanced, diverse outputs. Value-based methods do not scale well in this regime.

§02

Policy Parameterization

A policy $\pi_\theta$ is a probability distribution over actions given states, parameterized by $\theta$ (e.g., neural network weights):

$$\pi_\theta(a \mid s) = P(A_t = a \mid S_t = s, \theta)$$
📖 Definition

Deterministic policy: $\mu_\theta: \mathcal{S} \to \mathcal{A}$. Maps states directly to actions.

Stochastic policy: $\pi_\theta: \mathcal{S} \times \mathcal{A} \to [0,1]$. Maps (state, action) to a probability.

Common Parameterizations

Softmax policy (discrete actions):

$$\pi_\theta(a \mid s) = \frac{\exp(h_\theta(s, a))}{\sum_{a'} \exp(h_\theta(s, a'))}$$

where $h_\theta(s, a)$ is the preference (logit) of action $a$ in state $s$.

Gaussian policy (continuous actions):

$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a;\, \mu_\theta(s),\, \sigma_\theta(s)^2\right)$$

For LLMs: The policy is a transformer with parameters $\theta$. Given context (state) $s = (x_1, \ldots, x_t)$, the policy outputs a distribution over the next token:

$$\pi_\theta(a \mid s) = \text{softmax}(W \cdot \text{Transformer}_\theta(s))$$
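
To ground the notation, here is a minimal PyTorch sketch of a softmax policy exposing the two operations policy gradients need: sampling an action and computing its log-probability. The linear `logits_net` is a stand-in for $h_\theta(s, a)$ (or a transformer's output head); the toy sizes are illustrative assumptions, not from the text.

```python
import torch

vocab_size = 8   # toy action space (a real LLM has ~50k tokens)
state_dim = 4    # toy state features

# Hypothetical preference network h_theta(s, ·): one logit per action.
logits_net = torch.nn.Linear(state_dim, vocab_size)

def sample_action(state: torch.Tensor):
    """Sample a ~ pi_theta(·|s) and return its log-probability."""
    logits = logits_net(state)                              # h_theta(s, a) for every a
    dist = torch.distributions.Categorical(logits=logits)   # softmax over logits
    action = dist.sample()
    return action, dist.log_prob(action)                    # log pi_theta(a|s)

s = torch.randn(state_dim)
a, logp = sample_action(s)
print(a.item(), logp.item())
```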
§03

Policy Gradient Intuition

The core idea is strikingly simple: make good actions more probable, make bad actions less probable. We define a performance objective and take gradient steps to improve it.

🎯 Intuition First

Suppose an agent takes action $a$ in state $s$ and receives reward $R$. If $R$ was high, we want $\pi_\theta(a|s)$ to increase. If $R$ was low, we want it to decrease. The policy gradient algorithm does exactly this by adjusting $\theta$ in the direction that increases expected reward.

🎛 Interactive: Policy Distribution Update

Watch how the action distribution changes as we apply policy gradient updates. Blue = initial, Red = updated after receiving reward signal.

The policy gradient (the gradient of expected return with respect to the policy parameters $\theta$; it points in the direction of steepest increase in expected reward) tells us how to adjust $\theta$ to increase expected return. The key insight is that we can estimate this gradient using samples from the policy; no model of the environment is needed.

§04

The Policy Gradient Theorem

(A) Intuition: Increase Probability of Good Actions

🧠 Core Idea

We want to compute $\nabla_\theta J(\theta)$ so we can do gradient ascent. The challenge: $J(\theta)$ depends on the distribution of trajectories induced by $\pi_\theta$, which also depends on $\theta$. The Policy Gradient Theorem gives us a clean, computable form for this gradient.

(B) Formal Statement

Define the performance objective as expected cumulative discounted reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right] = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

The Policy Gradient Theorem states:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^{\pi_\theta}(s_t, a_t)\right]$$

where $Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\middle|\, s_t, a_t\right]$ is the action-value function (expected future reward from $(s_t, a_t)$).

(C) Detailed Step-by-Step Derivation

1
Start from the Expected Return Objective

The performance objective is the expected return over trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right] = \int p_\theta(\tau) R(\tau)\, d\tau$$

where the trajectory probability decomposes as:

$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$$

Note: $\rho_0$ is the initial state distribution and $P$ is the (unknown) transition dynamics. The key observation: the policy $\pi_\theta$ is the only term that depends on $\theta$.

2
Take the Gradient

Differentiate with respect to $\theta$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau) \cdot R(\tau)\, d\tau$$

We move the gradient inside the integral (valid under mild regularity conditions). Now we need to compute $\nabla_\theta p_\theta(\tau)$.

3
Apply the Log-Derivative (REINFORCE) Trick

The log-derivative trick (also called the likelihood ratio trick) uses the identity:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \cdot \nabla_\theta \log p_\theta(\tau)$$

This follows from $\nabla_\theta \log f = \frac{\nabla_\theta f}{f}$, rearranged. Substituting:

$$\nabla_\theta J(\theta) = \int p_\theta(\tau) \cdot \nabla_\theta \log p_\theta(\tau) \cdot R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)\right]$$
⚡ Why This Is Brilliant

We've converted an integral over a changing distribution into an expectation under $p_\theta$—which we can estimate by sampling trajectories! The environment dynamics vanish in the next step.
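
The identity is easy to verify numerically. The sketch below (NumPy; the 3-outcome distribution and the "returns" $f$ are invented for illustration) compares a finite-difference gradient of $J(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)]$ against the Monte Carlo estimate $\mathbb{E}[f(x)\, \nabla_\theta \log p_\theta(x)]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: categorical p_theta = softmax(theta) over 3 outcomes, f = fixed "returns".
theta = np.array([0.2, -0.5, 1.0])
f = np.array([1.0, 3.0, -2.0])

def probs(th):
    e = np.exp(th - th.max())
    return e / e.sum()

def J(th):                                   # J(theta) = E_{x ~ p_theta}[f(x)]
    return probs(th) @ f

# Ground-truth gradient via central finite differences.
eps = 1e-6
fd = np.array([(J(theta + eps*np.eye(3)[i]) - J(theta - eps*np.eye(3)[i])) / (2*eps)
               for i in range(3)])

# Monte Carlo estimate via the log-derivative trick: E[f(x) * grad log p_theta(x)].
p = probs(theta)
xs = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[xs] - p                    # grad_theta log softmax(theta)[x]
mc = (f[xs][:, None] * score).mean(axis=0)

print(fd)   # exact gradient
print(mc)   # matches up to Monte Carlo noise
```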

4
Expand the Log-Probability of a Trajectory

Taking the log of the trajectory probability:

$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^T \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^T \log P(s_{t+1} \mid s_t, a_t)$$

Taking the gradient with respect to $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \underbrace{\nabla_\theta \log \rho_0(s_0)}_{=0} + \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) + \underbrace{\nabla_\theta \log P(\cdot)}_{=0}$$

The initial state distribution and environment dynamics do not depend on $\theta$, so their gradients are zero. We're left with only policy terms!

5
Arrive at the REINFORCE Gradient

Substituting back, we get the REINFORCE policy gradient estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) \cdot R(\tau)\right]$$

This is an unbiased estimator of the true gradient. We can estimate it from samples without knowing the environment dynamics.

6
Causality: Use Future Rewards Only

A key insight: past rewards cannot be influenced by current actions. We can replace $R(\tau)$ with future reward (reward-to-go):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \sum_{t'=t}^T \gamma^{t'-t} r_{t'}\right]$$

This reduces variance without introducing bias! The term $\sum_{t'=t}^T \gamma^{t'-t} r_{t'}$ is the sample estimate of $Q^{\pi_\theta}(s_t, a_t)$.

7
Final Form: The Policy Gradient Theorem

The complete, general form of the Policy Gradient Theorem (Sutton et al., 2000):

$$\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^{\pi_\theta}(s_t, a_t)\right]}$$

This holds for both episodic and continuing tasks. The key components are:

  • $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ — the score function (how to increase log-prob of $a_t$)
  • $Q^{\pi_\theta}(s_t, a_t)$ — the quality signal (how good was this action?)
✨ Key Insight

The gradient says: move $\theta$ to increase the log-probability of action $a_t$, but scale this increase by how good that action was (its Q-value). High-reward actions get amplified; low-reward actions get suppressed.

§05

REINFORCE Algorithm

REINFORCE (Williams, 1992) is the simplest instantiation of the policy gradient theorem. It estimates $Q^{\pi}(s_t, a_t)$ using Monte Carlo returns—complete trajectory rollouts.

$$\hat{G}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \quad \text{(Monte Carlo return)}$$

The REINFORCE update rule:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{G}_t$$
Algorithm: REINFORCE
Initialize policy parameters θ randomly
for episode = 1, 2, 3, ... do
    // Collect a full trajectory using the current policy
    τ = {s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T} ~ π_θ
    for t = 0, 1, ..., T do
        Ĝ_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}   // Monte Carlo return
    end for
    // Gradient ascent on expected return
    θ ← θ + α · Σ_t ∇_θ log π_θ(a_t|s_t) · Ĝ_t
end for
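
A minimal runnable version of this loop, as a sketch on a 4-armed bandit (one-step episodes, so $\hat{G}_t$ is just the immediate reward); the arm means are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.1, 0.5, 0.8, 0.3])   # hypothetical arm rewards (unknown to agent)
theta = np.zeros(4)                            # softmax preferences
alpha = 0.1                                    # learning rate

def pi(th):
    e = np.exp(th - th.max())
    return e / e.sum()

for episode in range(5000):
    p = pi(theta)
    a = rng.choice(4, p=p)                          # sample action from pi_theta
    r = true_means[a] + 0.1 * rng.standard_normal() # noisy reward = the return G
    score = -p; score[a] += 1.0                     # grad_theta log pi_theta(a)
    theta += alpha * score * r                      # REINFORCE update

print(pi(theta).round(3))   # mass should concentrate on the best arm (index 2)
```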
⚠ High Variance Problem

REINFORCE is unbiased but suffers from very high variance. Different trajectories starting from the same state can yield vastly different $\hat{G}_t$ values due to stochasticity. This makes learning slow and unstable. The solution: baselines and advantage functions (next section).

📐 Score Function / Log-Derivative

The term $\nabla_\theta \log \pi_\theta(a|s)$ is called the score function or log-derivative. For a Gaussian policy: $\nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \mu_\theta(s))}{\sigma^2} \nabla_\theta \mu_\theta(s)$. For a softmax policy: $\nabla_\theta \log \pi_\theta(a|s) = \phi(s,a) - \sum_{a'} \pi_\theta(a'|s) \phi(s,a')$ where $\phi$ is a feature vector.

§06

Variance Reduction

Baselines

A baseline $b(s)$ is any function of the current state (not the action) that we subtract from the return without introducing bias. The modified gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (Q^{\pi}(s_t,a_t) - b(s_t))\right]$$

Why is this unbiased? Because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$.

✨ Optimal Baseline

The variance-minimizing baseline is $b^*(s) = \frac{\mathbb{E}[\|\nabla_\theta \log \pi\|^2 Q]}{\mathbb{E}[\|\nabla_\theta \log \pi\|^2]}$. In practice, we use the value function $V^\pi(s)$ as a convenient, near-optimal baseline.
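
Both claims can be checked numerically. In the NumPy sketch below (toy distribution and returns invented for illustration), subtracting $b \approx \mathbb{E}[f] \approx V(s)$ leaves the mean gradient unchanged but sharply reduces its variance when returns share a large common offset:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.2, -0.5, 1.0])
f = np.array([11.0, 13.0, 8.0])               # returns with a large common offset

p = np.exp(theta - theta.max()); p /= p.sum()
xs = rng.choice(3, size=100_000, p=p)
score = np.eye(3)[xs] - p                     # per-sample score vectors

for b in (0.0, p @ f):                        # no baseline vs. b = E[f] (≈ V(s))
    g = (f[xs] - b)[:, None] * score          # per-sample gradient estimates
    print(f"b={b:6.2f}  mean={g.mean(0).round(4)}  total_var={g.var(0).sum():.3f}")
# Means agree (no bias introduced); total variance drops sharply with the baseline.
```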

The Advantage Function

Using the value function $V^\pi(s_t)$ as baseline gives the advantage function:

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$

The advantage function answers the question: "How much better is action $a$ compared to the average action in state $s$?" A positive advantage means better than average; a negative one means worse.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t)\right]$$

🎛 Visualizing the Advantage Function

The advantage tells you which actions are above (green) or below (red) the state's average value. The gradient only amplifies actions relative to this baseline.

Generalized Advantage Estimation (GAE)

GAE (Schulman et al., 2015) provides a bias-variance trade-off using temporal differences:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual. Parameter $\lambda \in [0,1]$: $\lambda=0$ gives pure TD (low variance, high bias); $\lambda=1$ gives Monte Carlo (high variance, low bias).
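
In code, GAE is a single backward pass over the episode. Below is a minimal NumPy sketch; the reward and value arrays are illustrative assumptions:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one episode.

    values has length T+1: it includes the bootstrap V(s_T) (0 if terminal).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted sum of deltas
        adv[t] = running
    return adv

r = np.array([0.0, 0.0, 1.0])          # sparse reward at the end of the episode
v = np.array([0.5, 0.6, 0.7, 0.0])     # critic estimates, terminal bootstrap = 0
print(gae(r, v))
```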

§07

Actor–Critic Connection

Actor-Critic methods maintain two separate function approximators:

🎭 Actor

Policy $\pi_\theta(a|s)$
Decides which action to take

🧑‍⚖️ Critic

Value $V_w(s)$ or $Q_w(s,a)$
Evaluates how good the action was

🔄 Feedback

Advantage $A = Q - V$
Signal to update actor

🧠 Intuition: Actor and Critic

Think of an actor on stage and a critic reviewing the performance. The actor tries different actions; the critic evaluates them. The actor uses the critic's feedback to improve. This synergy combines the advantages of both value-based (lower variance via critic) and policy-based (works in continuous spaces) methods.

A2C / A3C Architecture

The actor loss (gradient ascent on policy):

$$\mathcal{L}_\text{actor}(\theta) = -\mathbb{E}_t\left[\log \pi_\theta(a_t|s_t) \cdot \hat{A}_t\right]$$

The critic loss (value function regression):

$$\mathcal{L}_\text{critic}(w) = \mathbb{E}_t\left[(V_w(s_t) - \hat{G}_t)^2\right]$$

Often combined with an entropy bonus to encourage exploration:

$$\mathcal{L}(\theta, w) = \mathcal{L}_\text{actor} + c_1 \mathcal{L}_\text{critic} - c_2 H(\pi_\theta(\cdot|s_t))$$
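
As a concrete reference, here is a minimal PyTorch sketch of this combined loss; the tensor names and shapes are illustrative assumptions, not fixed by the text:

```python
import torch

def a2c_loss(logp, entropy, values, returns, advantages, c1=0.5, c2=0.01):
    """Combined A2C loss for a batch of timesteps.

    logp:       log pi_theta(a_t|s_t)                       [B]
    entropy:    H(pi_theta(.|s_t))                          [B]
    values:     V_w(s_t)                                    [B]
    returns:    Monte Carlo / bootstrapped targets G_t      [B]
    advantages: A_hat_t (treated as constants, so detached) [B]
    """
    actor = -(logp * advantages.detach()).mean()       # policy gradient (ascent -> minimize negative)
    critic = (values - returns).pow(2).mean()          # value function regression
    return actor + c1 * critic - c2 * entropy.mean()   # entropy bonus encourages exploration
```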
§08

Advanced Policy Gradient Methods

Natural Policy Gradient (NPG)

Standard gradient descent treats all parameter directions equally. But the policy space has a Riemannian geometry: equal changes in $\theta$ can have wildly different effects on $\pi_\theta$. The natural gradient accounts for this using the Fisher Information Matrix, which measures the curvature of the KL divergence between nearby policies and encodes how policy-space distances relate to parameter-space distances:

$$F(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)^\top\right]$$
$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta) \quad \text{(Natural Policy Gradient)}$$

NPG makes updates that are invariant to the parameterization of $\theta$—a critical property for neural networks where the same policy can be represented by many different $\theta$ values.
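
To make this concrete, here is a toy NumPy sketch for a 3-action softmax policy with assumed Q-values. A small damping term is added before inverting $F$ because the softmax Fisher matrix is singular (logits are shift-invariant):

```python
import numpy as np

theta = np.array([0.2, -0.5, 1.0])
q = np.array([1.0, 3.0, -2.0])                  # assumed Q-values per action

p = np.exp(theta - theta.max()); p /= p.sum()
scores = np.eye(3) - p                          # row a = grad_theta log pi(a)

g = (p[:, None] * scores * q[:, None]).sum(0)   # vanilla PG: E_a[score(a) * Q(a)]
F = (p[:, None, None] * scores[:, :, None] * scores[:, None, :]).sum(0)  # Fisher matrix

# F is singular for softmax (score vectors sum to zero), so use a damped solve.
nat_g = np.linalg.solve(F + 1e-3 * np.eye(3), g)
print(g.round(3), nat_g.round(3))               # same ascent direction, rescaled by geometry
```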

Trust Region Policy Optimization (TRPO)

TRPO (Schulman et al., 2015) formalizes the intuition that policy updates should not be too large. It solves a constrained optimization problem:

$$\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_\text{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} A^{\pi_{\theta_\text{old}}}(s,a)\right]$$ $$\text{subject to } \; \mathbb{E}_s\left[D_\text{KL}(\pi_{\theta_\text{old}} \| \pi_\theta)[s]\right] \leq \delta$$

The KL divergence constraint, with $D_\text{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ measuring how much the new policy deviates from the old one (TRPO requires $\leq \delta \approx 0.01$), ensures the new policy stays close to the old one, maintaining the validity of the importance-weighted objective. TRPO uses conjugate gradients and line search, making it theoretically sound but computationally expensive.

§09

PPO — Derivation & Intuition

Proximal Policy Optimization (Schulman et al., 2017) is arguably the most important modern policy gradient algorithm. It achieves TRPO's stability with a fraction of the computational cost.

(A) Starting Point: Policy Gradient Objective with Importance Sampling

When we collect data under policy $\pi_{\theta_\text{old}}$ but optimize $\theta \neq \theta_\text{old}$, we need importance sampling:

$$\mathbb{E}_{a \sim \pi_\theta}[f(a)] = \mathbb{E}_{a \sim \pi_{\theta_\text{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} f(a)\right]$$

Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$$

The surrogate objective (CPI — Conservative Policy Iteration):

$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \cdot \hat{A}_t\right]$$
⚠ Problem with Unrestricted CPI

Without constraints, maximizing $L^\text{CPI}$ can lead to destructively large policy updates. If $r_t(\theta)$ becomes very large, the importance-weighted advantage estimate becomes unreliable (high variance, potential bias).

(B) Introducing the KL Constraint (TRPO Approach)

TRPO adds an explicit KL constraint. The theoretical guarantee (from the surrogate objective bound):

$$J(\theta) \geq L^{\text{CPI}}(\theta) - C \cdot \max_s D_\text{KL}(\pi_{\theta_\text{old}} \| \pi_\theta)[s]$$

This shows the true objective $J(\theta)$ is lower bounded by the surrogate minus a KL penalty. Maximizing this lower bound gives monotonic improvement guarantees.

(C) Trust Region Idea

🧠 Intuition: Trust Region

Imagine you're on a hill (performance landscape) and want to move uphill. But your map (surrogate objective) is only accurate near your current position. A trust region says: "only take steps within a radius where the map is trustworthy." TRPO enforces this via KL constraint; PPO enforces it via clipping.

(D) PPO Clipped Objective — Full Derivation

1
Motivation: Why Clip Instead of Constrain?

TRPO requires solving a constrained optimization problem at each step, which involves:

  • Computing the Fisher Information Matrix (or its inverse)
  • Conjugate gradient to find the natural gradient direction
  • Line search to enforce the KL constraint

This is expensive and complex. PPO achieves similar empirical performance by directly clipping the objective, preventing large probability ratios without solving a constrained problem.

2
Analyze the Unclipped Objective Behavior

For the surrogate $L^{\text{CPI}} = \mathbb{E}_t[r_t(\theta) \hat{A}_t]$, consider two cases:

Case 1: $\hat{A}_t > 0$ (action was better than average)

Gradient ascent increases $r_t(\theta) = \pi_\theta / \pi_{\theta_\text{old}}$, meaning $\pi_\theta(a_t|s_t)$ increases. But if we over-optimize, $r_t$ can become very large—the policy changes too much.

Case 2: $\hat{A}_t < 0$ (action was worse than average)

Gradient ascent decreases $r_t(\theta)$, meaning $\pi_\theta(a_t|s_t)$ decreases. Again, without constraint, $r_t$ can become very small (near zero)—also a large policy change.

In both cases, we want to limit how far $r_t(\theta)$ deviates from 1.

3
Define the Clipped Ratio

PPO clips the probability ratio to stay within $[1-\epsilon, 1+\epsilon]$ (typically $\epsilon = 0.2$):

$$\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) = \begin{cases} 1-\epsilon & \text{if } r_t(\theta) < 1-\epsilon \\ r_t(\theta) & \text{if } 1-\epsilon \leq r_t(\theta) \leq 1+\epsilon \\ 1+\epsilon & \text{if } r_t(\theta) > 1+\epsilon \end{cases}$$

This prevents the policy from changing too much from $\pi_{\theta_\text{old}}$ in a single update step.

4
Derive the PPO-Clip Objective

PPO takes the minimum of the clipped and unclipped objectives, forming a pessimistic lower bound:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

Why take the minimum? Analyze by cases:

When $\hat{A}_t > 0$: We want to increase $r_t$. But $\min(\cdot)$ caps the benefit at $r_t = 1+\epsilon$. Beyond this, the gradient becomes zero—no incentive to move the policy further.

When $\hat{A}_t < 0$: We want to decrease $r_t$. The $\min(\cdot)$ caps the penalty at $r_t = 1-\epsilon$. Beyond this, the gradient is again zero.

Result: The gradient vanishes whenever pushing the ratio further outside $[1-\epsilon, 1+\epsilon]$ would improve the surrogate; the policy is never pushed beyond the trust region, though the unclipped term can still pull a ratio that drifted the wrong way back toward it.

5
Full PPO Objective with All Components

The full PPO objective combines the clipped policy loss, value function loss, and entropy bonus:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[L^{\text{CLIP}}_t(\theta) - c_1 L^{\text{VF}}_t(\theta) + c_2 H[\pi_\theta](s_t)\right]$$

where:

  • $L^{\text{CLIP}}$ = clipped policy gradient loss
  • $L^{\text{VF}}_t = (V_\theta(s_t) - V_t^\text{target})^2$ = value function squared error
  • $H[\pi_\theta](s_t) = -\sum_a \pi_\theta(a|s_t)\log\pi_\theta(a|s_t)$ = entropy bonus for exploration
  • $c_1 \approx 0.5$, $c_2 \approx 0.01$ are coefficients
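
Putting the pieces together, here is a minimal PyTorch sketch of the clipped term $L^{\text{CLIP}}$, the heart of the objective above; the batch tensors are assumed to have been precomputed during rollout collection:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate (negated, to be minimized) for a batch of timesteps.

    logp_new: log pi_theta(a_t|s_t) under current parameters    [B]
    logp_old: log pi_theta_old(a_t|s_t), fixed at collection    [B]
    adv:      advantage estimates A_hat_t                       [B]
    """
    ratio = torch.exp(logp_new - logp_old.detach())          # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()             # pessimistic lower bound
```

Because the update is bounded, this loss can safely be minimized for several epochs on the same batch of rollouts.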

🎛 PPO Clipping — Interactive Visualization

The PPO objective (orange) vs unclipped CPI (blue). Drag the advantage slider to see how clipping prevents large updates.

✨ Why PPO Stabilizes Learning

1. Bounded policy change: Clipping ensures $\pi_\theta$ cannot deviate too much from $\pi_{\theta_\text{old}}$ per update.
2. Multiple epochs on same data: Because updates are small, we can safely do K epochs (typically K=4-10) on the same collected data without divergence.
3. No 2nd-order methods: Unlike TRPO, PPO uses standard SGD/Adam, making it scalable to large neural networks (including LLMs).

§10

Stochastic Approximation View

Policy gradient methods are instances of stochastic approximation (Robbins-Monro, 1951). We want to find $\theta^* = \arg\max_\theta J(\theta)$ where $J(\theta)$ cannot be computed exactly, only estimated.

$$\theta_{k+1} = \theta_k + \alpha_k \widehat{\nabla_\theta J}(\theta_k)$$

where $\widehat{\nabla_\theta J}$ is an unbiased or consistent estimator of the true gradient. The Robbins-Monro conditions for convergence:

$$\sum_{k=1}^\infty \alpha_k = \infty \qquad \text{and} \qquad \sum_{k=1}^\infty \alpha_k^2 < \infty$$

Typical choice: $\alpha_k = \frac{1}{k}$ or $\alpha_k = \frac{c}{k + c'}$. In practice, fixed learning rates with Adam often work better.

📖 Key Property

Gradient estimator bias: REINFORCE is unbiased ($\mathbb{E}[\hat{g}] = \nabla_\theta J$) but high variance. Actor-critic with approximate value function introduces bias (from approximation error) but lower variance. This bias-variance tradeoff is central to all policy gradient methods.

§11

Deep Policy Gradient Methods

Modern deep RL combines policy gradients with deep neural networks as function approximators. Key developments:

Deep Deterministic Policy Gradient (DDPG)

For continuous action spaces, DDPG learns a deterministic policy $\mu_\theta(s)$ using the chain rule:

$$\nabla_\theta J \approx \mathbb{E}_s\left[\nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)} \cdot \nabla_\theta \mu_\theta(s)\right]$$

Soft Actor-Critic (SAC)

SAC maximizes an entropy-augmented objective, trading off reward for policy entropy:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\pi}\left[r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\right]$$

SAC is off-policy (uses a replay buffer), entropy-regularized, and highly sample efficient—making it one of the most popular algorithms for continuous control.

Practical Tricks for Stability

| Trick | Purpose |
|---|---|
| Gradient clipping | Prevent exploding gradients in deep networks |
| Advantage normalization | Reduce sensitivity to reward scale |
| Reward scaling/clipping | Stabilize learning across environments |
| Target networks (in critic) | Stabilize TD learning |
| Orthogonal initialization | Better gradient flow in deep policies |
§12

RLHF Connection

(A) What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is the technique that transforms a pretrained language model into a helpful, harmless, and honest AI assistant. It aligns model behavior with human preferences using RL—specifically policy gradient methods.

🧠 Core Analogy

In standard RL, the reward signal comes from the environment (e.g., game score). In RLHF, the "environment" is a human evaluator who tells the model which responses are better. Since we can't query humans at every step, we train a reward model to simulate human preferences, then optimize the LLM policy against this reward model using PPO.

(B) The RLHF Pipeline

Step 1

Pretraining
Train LLM on large text corpus (next-token prediction)

Step 2

SFT
Supervised fine-tuning on high-quality demonstrations

Step 3

Reward Model
Train $R_\phi$ on human pairwise preference data

Step 4

PPO
Optimize $\pi_\theta$ to maximize $R_\phi$ with KL constraint

(C) Role of Policy Gradient

The LLM is the policy $\pi_\theta$. Generating a response is a trajectory of token selections. The reward model provides a scalar reward at the end of generation:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\left[R_\phi(x, y) - \beta \cdot D_\text{KL}(\pi_\theta \| \pi_\text{ref})\right]$$

The policy gradient update then pushes the LLM to generate responses that get higher reward from the reward model.

§13

RLHF for LLMs (In Depth)

How GPT-style Models are Post-Trained

Starting from a pretrained model $\pi_\text{SFT}$, the RLHF loop works as follows:

RLHF Training Loop (InstructGPT / ChatGPT style)
// Phase 1: Collect comparison data
for each prompt x do
    Sample k responses: y₁, ..., yₖ ~ π_SFT(·|x)
    Human annotators rank: y_{i₁} ≻ y_{i₂} ≻ ... ≻ y_{iₖ}
end for
// Phase 2: Train reward model
Train R_φ using the Bradley-Terry preference model:
    L(φ) = -E_{(x,y_w,y_l)}[log σ(R_φ(x,y_w) - R_φ(x,y_l))]
// Phase 3: PPO fine-tuning
for each PPO iteration do
    y ~ π_θ(·|x)                        // Generate response
    r = R_φ(x,y) - β·KL(π_θ‖π_ref)      // Penalized reward
    Update θ via PPO using advantage estimates from r
end for

Reward Modeling

The reward model $R_\phi$ is trained using the Bradley-Terry model for pairwise comparisons:

$$P(y_w \succ y_l \mid x) = \frac{e^{R_\phi(x, y_w)}}{e^{R_\phi(x, y_w)} + e^{R_\phi(x, y_l)}} = \sigma(R_\phi(x, y_w) - R_\phi(x, y_l))$$

Training loss (negative log-likelihood of preferences):

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma(R_\phi(x, y_w) - R_\phi(x, y_l))\right]$$
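
A minimal PyTorch sketch of this loss, assuming the reward model has already produced scalar scores for each (chosen, rejected) pair in a batch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise preference loss.

    r_chosen:   R_phi(x, y_w) for preferred responses     [B]
    r_rejected: R_phi(x, y_l) for dispreferred responses  [B]
    """
    # -log sigma(R_w - R_l), via logsigmoid for numerical stability
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```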

KL Regularization: Why It's Critical

Without regularization, the policy would reward hack—find adversarial inputs that exploit weaknesses in the reward model while generating incoherent text. The KL penalty prevents this:

$$r(x, y) = R_\phi(x, y) - \beta \cdot \underbrace{D_\text{KL}(\pi_\theta(\cdot|x) \| \pi_\text{ref}(\cdot|x))}_{\text{KL penalty from reference model}}$$
✨ Three Roles of the KL Penalty

1. Prevents reward hacking: Keeps the policy from exploiting reward model weaknesses.
2. Maintains language quality: Prevents the policy from drifting into incoherent text.
3. Acts as regularization: Ensures the final model retains general capabilities from pretraining.

Per-Token Reward Formulation

In practice, the reward is applied per-token for the PPO advantage computation:

$$r_t = \begin{cases} -\beta \cdot \log \dfrac{\pi_\theta(a_t|s_t)}{\pi_\text{ref}(a_t|s_t)} & t < T \\[6pt] R_\phi(x,y) - \beta \cdot \log \dfrac{\pi_\theta(a_T|s_T)}{\pi_\text{ref}(a_T|s_T)} & t = T \end{cases}$$

This spreads the KL penalty over tokens and attaches the reward model's score to the final token, enabling per-token advantage estimation.
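
In code, the per-token reward computation might look like the following sketch (PyTorch; `beta` and the tensor layout are illustrative assumptions, and log-probs are assumed detached since rewards are treated as constants for advantage estimation):

```python
import torch

def per_token_rewards(logp_policy, logp_ref, terminal_reward, beta=0.1):
    """Per-token rewards for PPO in RLHF, for one response of length T.

    logp_policy:     log pi_theta(a_t|s_t) per token  [T]
    logp_ref:        log pi_ref(a_t|s_t) per token    [T]
    terminal_reward: scalar R_phi(x, y) from the reward model
    """
    r = -beta * (logp_policy - logp_ref)   # per-token KL penalty
    r[-1] += terminal_reward               # reward model score lands on the last token
    return r
```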

DPO: Beyond PPO for LLMs

Direct Preference Optimization (Rafailov et al., 2023) bypasses the reward model entirely by showing that the optimal policy under KL-regularized RLHF satisfies:

$$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]$$

DPO implicitly trains the reward model and policy simultaneously, dramatically simplifying the pipeline while achieving comparable results.
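
A minimal PyTorch sketch of the DPO loss, assuming sequence-level log-probabilities (summed over tokens) have been computed under both the trained policy and the frozen reference policy:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    logp_w / logp_l:         sum of token log-probs of y_w / y_l under pi_theta  [B]
    ref_logp_w / ref_logp_l: the same sums under the frozen reference policy     [B]
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()   # push the implicit reward margin positive
```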

§14

Modern Perspective

From RL to LLM Alignment: The Full Picture

| RL Concept | LLM Equivalent | Notes |
|---|---|---|
| Agent | Language model | Generates tokens autoregressively |
| State $s_t$ | Context $x + y_{1:t}$ | All tokens seen so far |
| Action $a_t$ | Next token $y_t$ | From vocabulary (~50k items) |
| Policy $\pi_\theta$ | LLM parameters $\theta$ | Softmax over vocabulary |
| Reward $r$ | Human preference score | Sparse: only at end of response |
| Episode | One query-response pair | Variable length |
| Environment dynamics | Deterministic (token appending) | No stochastic transitions! |

Current Research Frontiers

🔬 Active Research Areas

Constitutional AI (Anthropic): Self-critique and revision without human feedback at scale.
RLAIF: Replace human preferences with AI preferences (Claude/GPT-4 as judge).
Process Reward Models: Reward each reasoning step, not just final answer (Math, coding).
GRPO (DeepSeek): Group Relative Policy Optimization—removes value network, uses group baselines.
Online DPO: Iteratively generate new preference pairs and update the policy.
KTO: Kahneman-Tversky Optimization—uses unpaired preferences, simpler data collection.

§15

Summary

🎓 Key Takeaways

1. Policy Gradient Theorem: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$. Unbiased, model-free, works in continuous spaces.

2. Log-derivative trick converts an intractable gradient of an expectation into an expectation of a gradient—making Monte Carlo estimation possible.

3. Variance reduction via advantage functions $A^\pi = Q^\pi - V^\pi$ is essential for practical training.

4. PPO achieves TRPO-like stability via clipping, making it the de facto standard for both game AI and LLM alignment.

5. RLHF maps perfectly to the RL framework: LLM = policy, tokens = actions, human preference = reward, KL penalty = trust region.

6. The same mathematics that trained Atari agents in 2013 now powers GPT-4, Claude, and Gemini—a beautiful example of algorithmic leverage.

| Algorithm | Key Idea | Pros | Cons |
|---|---|---|---|
| REINFORCE | Monte Carlo PG | Unbiased | High variance |
| A2C/A3C | Advantage actor-critic | Lower variance | Biased critic |
| TRPO | KL-constrained update | Monotonic improvement | 2nd-order, slow |
| PPO | Clipped surrogate | Stable + scalable | Heuristic clipping |
| SAC | Entropy-augmented | Efficient, continuous | Off-policy complexity |
| PPO+RLHF | Human reward model | Aligns to preferences | Reward hacking risk |
| DPO | Implicit reward | No reward model needed | Static data, may be unstable |
§16

Historical Timeline

1992

REINFORCE (Williams)

Simple statistical gradient following. Introduced the log-derivative trick for policy optimization. Foundational but high-variance.

1999–2000

Policy Gradient Theorem (Sutton, McAllester, Singh, Mansour)

Formal proof connecting policy gradients to Q-values. Natural Policy Gradient introduced by Kakade (2001), connecting to information geometry.

2013–2015

Deep RL Renaissance

DQN (Mnih et al., 2013) demonstrates deep RL at scale. DDPG (2015) extends to continuous actions. A3C (Mnih et al., 2016) enables parallelism.

2015–2017

TRPO → PPO (Schulman et al.)

TRPO provides monotonic improvement guarantees. PPO (2017) simplifies via clipping, becoming the dominant algorithm for both games (OpenAI Five) and LLM alignment.

2019–2022

RLHF Emerges for Language Models

Ziegler et al. (2019) apply RLHF to GPT-2. InstructGPT / ChatGPT (2022) demonstrates RLHF at scale: PPO + reward model trained on human preferences transforms GPT-3 into a helpful assistant.

2023

DPO & LLM Alignment Research

DPO bypasses reward model entirely. Constitutional AI, RLAIF, and process rewards emerge. GPT-4, Claude, Gemini all trained with RLHF variants.

2024–Present

RL for Reasoning (DeepSeek-R1, o1)

GRPO and outcome reward models enable models to learn complex multi-step reasoning. RL from verifiable rewards (math, code) achieves superhuman performance on reasoning benchmarks.


Soumyadeep Roy, MTech (Res), IISc Bangalore