Offline Reinforcement Learning

Learn optimal policies purely from pre-collected datasets — without any interaction with the environment. The paradigm shift that makes RL practical for the real world.

No Environment Interaction · Dataset-Driven · Safety-Critical Domains · Also known as: Batch RL

What is Offline RL?

Offline RL (also called Batch RL) learns a policy from a fixed, static dataset of previously logged transitions (s, a, r, s') collected by some behavior policy.

No trial-and-error. No exploring the environment. Just pure supervised-like learning from historical data — with the full complexity of sequential decision-making.

Why It Matters

In the real world, exploration is expensive, dangerous, or impossible. You can't let an RL agent crash a robot 10,000 times to learn locomotion, or administer dangerous drug combinations to patients.

Offline RL unlocks RL for healthcare, autonomous driving, robotics, and recommendation systems using historical logs.

🔄 Online RL

  • Interacts with environment in real-time
  • Can explore to gather new data
  • Policy improves iteratively
  • Requires a simulator or live system
  • Unsafe in high-stakes domains
  • Can recover from suboptimal behavior

📦 Offline RL

  • Fixed dataset — no env interaction
  • Cannot explore — bounded by data
  • Must extract the best policy from data
  • Works with historical logs
  • Safe, deployable in critical domains
  • Bottlenecked by dataset quality

A Brief History

Offline RL didn't emerge overnight — it evolved from decades of research in batch learning, approximate dynamic programming, and eventually deep RL.

1990s

Fitted Value Iteration & Batch RL Origins

Gordon's fitted value iteration and Bradtke & Barto's Least-Squares TD (LSTD) show that value functions can be learned by supervised regression on fixed batches of transitions: the seeds of batch RL.

2000s

FQI, LSPI & Neural Fitted Q-Iteration

Lagoudakis & Parr (LSPI) and Ernst et al. (Fitted Q-Iteration, FQI) formalize Batch Reinforcement Learning as repeated supervised regression onto Bellman targets; Riedmiller's Neural Fitted Q-Iteration replaces linear function approximators with neural networks. The field begins to recognize the distribution shift problem more concretely.

2018–2019

Deep Offline RL — The Problem Becomes Clear

Fujimoto et al. demonstrate that naively applying off-policy deep RL algorithms (such as TD3 or SAC) to static datasets fails catastrophically due to extrapolation error. BCQ (2019) is the first deep offline RL algorithm to explicitly address this.

2020–2021

Modern Offline RL — CQL, BEAR, IQL

A wave of principled algorithms: Conservative Q-Learning (CQL), BEAR, TD3+BC, and Implicit Q-Learning (IQL). Levine et al. publish a comprehensive survey formalizing the offline RL problem. D4RL benchmark introduced.

2022–Present

Transformers, Foundation Models & RLHF

Decision Transformer treats offline RL as sequence modeling. RLHF uses offline RL ideas to align LLMs. Foundation models for control emerge. Offline-to-online fine-tuning becomes a key research direction.

Problem Setup

Offline RL is grounded in the standard MDP framework, with one crucial difference: the agent has no access to the environment at training time.

Markov Decision Process (MDP)

𝒮

State Space

Set of all possible world states. Sensor readings, positions, inventory levels.

𝒜

Action Space

Set of agent actions. Continuous (torques) or discrete (move left/right).

T, R

Dynamics & Reward

T(s'|s,a) transition probability. R(s,a,s') immediate reward signal.

Goal: Maximize Expected Return

π* = argmax_π 𝔼_{τ∼π} [ Σ_{t=0}^{∞} γ^t · r(s_t, a_t) ]

where γ ∈ [0,1) is the discount factor and τ is a trajectory sampled under policy π.
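To make the objective concrete, here is a minimal plain-Python sketch (the reward list is made up) that evaluates the discounted return of a single logged trajectory, i.e. the quantity whose expectation the policy maximizes:

# Sketch: discounted return of one trajectory (rewards are illustrative).
def discounted_return(rewards, gamma=0.99):
    # sum_t gamma^t * r_t for a single trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.5, 1.0]))   # the expectation above averages this over trajectories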

The Offline Dataset

Instead of interacting with the environment, we have access only to a fixed dataset collected by some behavior policy β(a|s):

𝒟 = { (s_i, a_i, r_i, s'_i) }_{i=1}^{N}    where   a_i ∼ β(·|s_i)
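For intuition, a minimal sketch of the logging phase, assuming a hypothetical env with a reset/step interface and a hypothetical behavior_policy function; everything after this loop is frozen:

# Sketch: logging transitions under a behavior policy β (env and behavior_policy are hypothetical).
def collect_dataset(env, behavior_policy, num_steps):
    dataset = []                           # will hold (s, a, r, s') tuples
    s = env.reset()
    for _ in range(num_steps):
        a = behavior_policy(s)             # a ~ β(·|s)
        s_next, r, done = env.step(a)
        dataset.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return dataset                         # the offline algorithm only ever sees this list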

Data Collection → Offline Training → Deployment

Environment (real or simulator) → Behavior Policy β (human, scripted, or any controller) → Dataset 𝒟 = {(s, a, r, s')} × N → Offline RL Algorithm (CQL / BCQ / IQL / ...) → Learned Policy π → Deployment in the real environment

⚠️ The learned policy π must never query the environment during training

Dataset Types in Practice

🎯
Expert Data
High quality, near-optimal. Behavior cloning works well. Narrow state-action coverage.
🔀
Mixed Data
Expert + random. Challenging but realistic. Wide coverage.
🎲
Random / Suboptimal
Wide but low quality. Requires strong credit assignment.

Core Challenges

Offline RL is deceptively difficult. The same algorithms that work online can catastrophically fail offline — and here's exactly why.

🌊

Distributional Shift

Critical

The learned policy π may induce a different state-action distribution than the behavior policy β that collected the data. The Q-function was trained on (s,a) pairs from β's distribution, but at test time we query it at points π might visit — which could be far from the training distribution.

d_π(s,a) ≠ d_β(s,a)  →  the trained Q may be unreliable on π's trajectories
📈

Extrapolation Error & OOD Actions

Severe

Q-networks trained with Bellman backups can overestimate Q-values for out-of-distribution (OOD) actions — actions not seen in the dataset. Policy improvement greedily selects high-Q actions, which may be exactly the OOD actions with erroneously high estimated values.

This creates a deadly feedback loop: high Q-estimates → policy takes OOD actions → Bellman targets use these overestimated values → Q values diverge.

🔬 OOD Extrapolation Demo

[Interactive demo omitted: a 1D action-space grid in which blue cells mark actions present in the dataset (well-estimated Q) and red cells mark OOD actions with overestimated Q-values. The greedy policy ★ selects an OOD action whose estimated Q far exceeds its true Q.]
🔁

Bootstrapping Error Accumulation

Compounding

Temporal Difference learning bootstraps — it uses Q̂(s',a') to update Q̂(s,a). In offline settings, the target Q̂(s',a') may be evaluated at an OOD (s',a') pair, producing garbage targets. These errors accumulate across multi-step backups and can diverge.

Bellman Backup:
Q(s,a) ← r + γ · max_{a'} Q(s', a')    ← the maximizing a' might be OOD!
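A toy numpy illustration of that failure mode, with made-up numbers: the dataset only contains actions 0–2, the fitted Q-estimate wildly overestimates the unseen action 3, and the max in the Bellman target latches onto it:

import numpy as np

true_q     = np.array([1.0, 1.2, 0.8, 0.9])    # ground-truth Q(s', ·), unknown to the learner
est_q      = np.array([1.1, 1.3, 0.7, 4.5])    # fitted Q: action 3 is OOD and badly overestimated
in_dataset = np.array([True, True, True, False])

r, gamma = 0.5, 0.99
naive_target     = r + gamma * est_q.max()               # picks the OOD action 3 -> inflated target
in_sample_target = r + gamma * est_q[in_dataset].max()   # restricting the max to seen actions avoids it
print(naive_target, in_sample_target)                    # the inflated target then propagates via bootstrapping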

Algorithms

Offline RL algorithms differ in how they address extrapolation error — through behavior cloning, conservative value estimation, policy constraints, or a mix.

A

Behavior Cloning (BC)

Baseline

The simplest approach: treat it as supervised learning. Clone the behavior policy by maximizing log-likelihood of actions in the dataset. Ignores the reward signal entirely.

min_π 𝔼_{(s,a)∼𝒟} [ −log π(a|s) ]

Strengths: Simple, stable, no OOD issue (by design). Weaknesses: Cannot improve beyond the behavior policy; errors compound in sequential settings (the covariate-shift problem DAgger addresses).

# Behavior cloning: plain supervised learning on the fixed dataset
for (s, a, r, s') in 𝒟:
    loss = cross_entropy(π(s), a)    # or MSE for continuous actions
    gradient_step(loss)              # note: the reward r is completely ignored!
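A hedged PyTorch-style version of the same loop for continuous actions (the network size, dimensions, and batch tensors are placeholders, not anything prescribed above):

import torch
import torch.nn as nn

obs_dim, act_dim = 11, 3                       # placeholder dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(states, actions):                  # a minibatch sampled from the fixed dataset 𝒟
    loss = ((policy(states) - actions) ** 2).mean()   # MSE regression onto logged actions; rewards unused
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()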
B

Value-Based Methods

Conservative Q-Learning (CQL)
CQL

Penalizes Q-values on OOD (state, action) pairs by adding a regularization term that lowers Q-values for actions not in the dataset, while raising them for in-distribution actions.

Key idea: Learn a conservative Q-function such that Q(s,a) is a lower-bound on the true Qπ(s,a) for π, preventing overestimation.
ℒ_CQL = α · ( 𝔼_{a∼μ(·|s)}[Q(s,a)] − 𝔼_{a∼β(·|s)}[Q(s,a)] ) + ℒ_TD

Q-value estimates: Standard vs CQL

[Chart omitted: estimated Q-values for two in-distribution actions (a₁, a₂) and two OOD actions (a₃, a₄). Standard TD ❌ overestimates the OOD actions; CQL ✓ keeps their estimates pessimistically low while leaving the in-distribution estimates intact.]
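A hedged sketch of the discrete-action form of this regularizer, where a log-sum-exp over all actions stands in for 𝔼_{a∼μ} (the q_net, batch tensors, and α value are placeholders):

import torch
import torch.nn.functional as F

def cql_loss(q_net, states, actions, td_targets, alpha=1.0):
    q_all = q_net(states)                                        # [batch, num_actions]
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a) at the logged actions
    # Push Q down on all actions (soft maximum via logsumexp) and up on dataset actions.
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    td_loss = F.mse_loss(q_data, td_targets)                     # the usual Bellman regression term
    return td_loss + alpha * cql_penalty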
Fitted Q-Iteration (FQI)
Classic

Repeatedly regresses a Q-function onto Bellman targets computed from the fixed dataset. A foundational algorithm that modern methods build upon.

Key idea: Treat each Bellman backup as a supervised regression problem over the fixed dataset. Iterate until convergence.
# Initialize Q_0 arbitrarily
for k = 1 to K:
    y_i     = r_i + γ · max_{a'} Q_k(s'_i, a')     for every (s_i, a_i, r_i, s'_i) in 𝒟
    Q_{k+1} = regress( {(s_i, a_i) → y_i} )
# Policy: π(s) = argmax_a Q_K(s, a)
C

Policy Constraint Methods

BCQ
Batch-Constrained Q

Fujimoto et al. (2019). Restricts the policy to only select actions similar to those in the dataset, using a generative model (VAE) to model the behavior policy distribution.

Key idea: Generate candidate actions from a VAE trained on 𝒟, then perturb slightly, then select the highest-Q candidate. Never queries OOD actions by construction.
π(s) = argmax_{a ∈ G(s)} Q(s,a)   where G(s) ⊆ support(β(·|s))
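A hedged sketch of BCQ's action selection at a single state; vae_decode, perturb_net, and q_net are hypothetical stand-ins for the trained generative model, perturbation network, and critic:

import torch

def bcq_select_action(state, vae_decode, perturb_net, q_net, n_candidates=10, phi=0.05):
    s = state.unsqueeze(0).repeat(n_candidates, 1)   # evaluate several candidates in parallel
    a = vae_decode(s)                                # candidates sampled near the data support
    a = a + phi * perturb_net(s, a)                  # small bounded perturbation around them
    q = q_net(s, a).squeeze(-1)
    return a[q.argmax()]                             # greedy, but only over in-support candidates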
BEAR
MMD Constraint

Kumar et al. (2019). Instead of hard support constraints, BEAR uses Maximum Mean Discrepancy (MMD) to softly constrain the learned policy to stay close to the behavior policy in distribution.

Key idea: Add a kernel-based MMD penalty between the learned policy π and behavior policy β. More flexible than BCQ — allows some deviation from data distribution.
min_π  −J(π) + λ · MMD( π(·|s), β(·|s) )
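A hedged numpy sketch of the Gaussian-kernel MMD estimate such a penalty relies on, computed between action samples from π and from β (the bandwidth sigma is a free choice, and this simple estimator is slightly biased):

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(pi_actions, beta_actions, sigma=1.0):
    # MMD^2 ≈ E[k(x,x')] + E[k(y,y')] - 2·E[k(x,y)]
    kxx = gaussian_kernel(pi_actions, pi_actions, sigma).mean()
    kyy = gaussian_kernel(beta_actions, beta_actions, sigma).mean()
    kxy = gaussian_kernel(pi_actions, beta_actions, sigma).mean()
    return kxx + kyy - 2 * kxy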
D

Actor-Critic Offline Methods

TD3+BC
Simple & Strong

Fujimoto & Gu (2021). Surprisingly simple: add a behavior cloning term to the TD3 actor loss. The BC term prevents the policy from deviating too far from the data distribution.

Key idea: Normalize Q-values and add a weighted BC term. The α hyperparameter balances exploitation (Q) vs conservatism (BC).
π = argmax_π 𝔼_{(s,a)∼𝒟} [ λ·Q(s, π(s)) − (π(s) − a)² ]

where λ = α / ( (1/N) · Σ_{(s_i,a_i)} |Q(s_i, a_i)| ) normalizes the Q-value magnitudes.
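A hedged sketch of the resulting actor update (actor, critic, and the batch tensors are placeholders; α = 2.5 is a commonly used default):

import torch

def td3_bc_actor_loss(actor, critic, states, actions, alpha=2.5):
    pi = actor(states)
    q = critic(states, pi)
    lam = alpha / q.abs().mean().detach()            # the λ normalization described above
    # Maximize λ·Q while staying close to the logged actions (the BC term).
    return -(lam * q).mean() + ((pi - actions) ** 2).mean()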

IQL
Implicit Q-Learning

Kostrikov et al. (2021). Avoids querying OOD actions entirely during training by using expectile regression to implicitly optimize for in-sample actions only.

Key idea: Learn a value function V(s) and advantage A(s,a) without ever evaluating the policy on OOD actions. Extract the policy via advantage-weighted regression.
ℒ_V = 𝔼_{(s,a)∼𝒟} [ L₂^τ( Q(s,a) − V(s) ) ]

π = argmax_π 𝔼_{(s,a)∼𝒟} [ exp(β·A(s,a)) · log π(a|s) ]

L₂^τ is the asymmetric expectile loss (τ > 0.5 weights the upper tail, i.e. actions whose Q exceeds V).
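A hedged sketch of that asymmetric loss, written as |τ − 1(u<0)|·u² so that τ > 0.5 penalizes positive residuals (Q above V) more heavily, pushing V toward an upper expectile of the in-sample Q-values:

import torch

def expectile_loss(q_values, v_values, tau=0.7):
    u = q_values - v_values                      # positive when Q(s,a) exceeds V(s)
    weight = torch.abs(tau - (u < 0).float())    # τ for positive residuals, 1-τ for negative
    return (weight * u ** 2).mean()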

Theory

Understanding offline RL theoretically requires analyzing how distributional mismatch propagates through Bellman backups and corrupts policy evaluation and improvement.

📐 Error Decomposition in Offline Policy Evaluation


The policy evaluation error for offline RL can be decomposed into two terms:

|V^π(s) − V̂^π(s)|  ≤  ε_approx/(1−γ)  +  γ · C_{π/β} · ε_data/(1−γ)²

where:

  • ε_approx — function approximation error of Q̂
  • C_{π/β} — concentration coefficient: how much π can deviate from β
  • ε_data — estimation error from finite dataset samples

The concentration coefficient Cπ/β is defined as the maximum density ratio:

C_{π/β} = max_{s,a}  d_π(s,a) / d_β(s,a)

This can be unbounded if π visits states not covered by β — this is why offline RL is fundamentally hard. Policy constraint methods aim to keep Cπ/β bounded.

Error contribution breakdown: approximation error, distribution shift, and finite-sample noise.

🎯 Concentration Coefficients & Coverage


A dataset 𝒟 has good coverage of a policy π if the density ratio is bounded:

C_{π/β} = ‖ d_π / d_β ‖_∞ < ∞

In practice, we can only guarantee this if β has full support over the state-action space (every (s,a) has positive probability under β). This is rarely true — hence offline RL is hard.

Stronger algorithms (like CQL) achieve data-dependent bounds that degrade gracefully with partial coverage:

V^{π*}(s) − V̂^{π̂}(s)  ≤  O( C_{π*/β} · √(1/|𝒟|) )

where π* is the optimal policy and π̂ is CQL's learned policy.
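A toy numpy illustration of the density ratio on a four-element discrete state-action space (the visitation distributions are made up): the coefficient blows up as soon as π puts mass where β puts almost none:

import numpy as np

d_beta = np.array([0.40, 0.35, 0.20, 0.05])   # behavior policy's visitation distribution
d_pi   = np.array([0.10, 0.10, 0.30, 0.50])   # π concentrates on the pair β rarely visits

C = np.max(d_pi / d_beta)                     # concentration coefficient C_{π/β}
print(C)                                      # 10.0 here; unbounded if d_β → 0 where d_π > 0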

🔄 Bellman Backup Error Propagation


In standard (online) RL, the Bellman operator 𝒯π is a contraction:

‖ 𝒯^π Q − 𝒯^π Q' ‖_∞  ≤  γ ‖ Q − Q' ‖_∞

But in offline RL, we approximate 𝒯π using the dataset, introducing error. After k Bellman backups, errors accumulate:

‖ Q_k − Q^π ‖  ≤  γ^k ‖ Q_0 − Q^π ‖  +  Σ_{t=0}^{k−1} γ^t · ε_t

where ε_t is the approximation error introduced at step t. These errors compound across backups; for OOD actions ε_t can be large, leading to divergence in practice.

Key insight: Single-step methods (like BC) have no bootstrapping errors but cannot generalize beyond the data. Multi-step methods can improve but risk compounding errors.

📊 Pessimism Principle & Lower-Bound Guarantees


The pessimism principle (Jin et al., 2021; Rashidinejad et al., 2021) provides a principled solution: be pessimistic about uncertain (OOD) state-action pairs.

Q̂(s,a) = Q̂_MLE(s,a) − Γ(s,a)

where Γ(s,a) is an uncertainty penalty (large for OOD pairs). The pessimistic policy is:

π̂ = argmax_π V̂^π   (computed with the pessimistic Q̂)

Under single-policy concentrability, this achieves near-optimal performance without requiring full coverage:

V^{π*} − V^{π̂}  ≤  O( √( C_{π*/β} / |𝒟| ) )

CQL and IQL can be seen as practical realizations of this pessimism principle.
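One common practical stand-in for Γ(s,a), though not the only one, is disagreement across an ensemble of Q-networks trained on the same dataset; a hedged sketch:

import torch

def pessimistic_q(q_ensemble, states, actions, kappa=1.0):
    # q_ensemble: a list of independently trained Q-networks (placeholders here).
    qs = torch.stack([q(states, actions) for q in q_ensemble], dim=0)
    # High ensemble disagreement is used as a proxy for epistemic uncertainty on OOD pairs.
    return qs.mean(dim=0) - kappa * qs.std(dim=0)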

Practical Considerations

Getting offline RL to work in practice requires more than just picking a good algorithm. Dataset quality, evaluation methodology, and hyperparameter sensitivity all matter enormously.

📊

Dataset Quality

The quality of 𝒟 directly caps what any offline RL algorithm can learn. Key properties:

  • Coverage: Are relevant state-action pairs represented?
  • Quality: What fraction are near-optimal transitions?
  • Diversity: Multiple behavior policies → richer data
  • Size: Larger datasets → tighter bounds
🔍

Off-Policy Evaluation (OPE)

How do you evaluate a learned policy without deploying it? This is a fundamental research problem in its own right, known as off-policy evaluation (OPE); a minimal importance-sampling sketch follows the list below.

  • IS / WIS: Importance sampling methods
  • DM: Direct model-based estimation
  • DR: Doubly-robust methods (combine DM with importance sampling)
  • FQE: Fitted Q-Evaluation
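A hedged sketch of the simplest estimator above, ordinary per-trajectory importance sampling, assuming the behavior policy's action probabilities were logged with each transition (trajectories and pi_prob are placeholders):

import numpy as np

def is_estimate(trajectories, pi_prob, gamma=0.99):
    # Each trajectory is a list of (s, a, r, beta_prob) tuples; pi_prob(s, a) gives π's probability of a.
    values = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (s, a, r, beta_prob) in enumerate(traj):
            ratio *= pi_prob(s, a) / beta_prob      # cumulative importance weight Π_t π/β
            ret += (gamma ** t) * r
        values.append(ratio * ret)                  # one high-variance per-trajectory estimate
    return float(np.mean(values))                   # estimate of V^π from logged data only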
⚙️

Hyperparameter Sensitivity

Offline RL algorithms can be brittle to hyperparameter choices:

  • α in CQL: Too large → over-conservative; too small → OOD issues
  • τ in IQL: Expectile quantile controls optimism level
  • Normalization: Q-value normalization in TD3+BC is critical
🏋️

Benchmarks (D4RL)

The D4RL benchmark (Fu et al., 2020) is the standard evaluation suite:

  • Locomotion: HalfCheetah, Hopper, Walker2d
  • Antmaze: navigation with sparse rewards
  • Adroit: dexterous manipulation tasks
  • Kitchen: multi-task manipulation

Algorithm Comparison at a Glance

| Algorithm | OOD Handling     | Complexity | Continuous | Convergence |
|-----------|------------------|------------|------------|-------------|
| BC        | ✓ By design      | Simple     | ✓          | Stable      |
| BCQ       | ✓ VAE constraint | Medium     | ✓          | Good        |
| CQL       | ✓ Q-penalty      | Medium     | ✓          | Strong      |
| TD3+BC    | ✓ BC term        | Simple     | ✓          | Strong      |
| IQL       | ✓ In-sample      | Medium     | ✓          | SOTA        |

Limitations

Offline RL is powerful but not a silver bullet. Understanding these limitations is crucial for knowing when to use it and what to expect.

Modern Research Directions

Offline RL is an active research area with strong connections to language models, foundation models, and real-world deployment. Here are the most exciting current directions.

Offline → Online

Offline-to-Online RL

Use offline pre-training to initialize a policy, then fine-tune with online interaction. Combining the safety of offline training with the adaptability of online exploration. Methods like IQL, Cal-QL, and PEX tackle this transition.

Key challenge: Avoiding catastrophic forgetting and distribution shift during the offline→online transition.
Language Models

RLHF & Offline RL

Reinforcement Learning from Human Feedback (RLHF) — used to align GPT-4, Claude, and Gemini — is fundamentally an offline RL problem. Preference data forms a static dataset; the reward model and policy optimization mirror offline RL concepts.

Connection: DPO (Direct Preference Optimization) can be viewed as offline RL with implicit rewards from human comparisons.
Sequence Modeling

Decision Transformer

Chen et al. (2021) reframe offline RL as a conditional sequence modeling problem. Instead of value functions, a GPT-like transformer is trained to predict actions conditioned on return-to-go targets. Surprisingly competitive with value-based methods.

Variants: Gato, Trajectory Transformer, Q-Transformer — scaling this idea to multi-task and multi-domain settings.
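A hedged numpy sketch of the return-to-go targets such a model conditions on (the transformer itself is out of scope here); Decision Transformer typically uses undiscounted returns:

import numpy as np

def returns_to_go(rewards, gamma=1.0):
    # R_t = sum over t' >= t of gamma^(t'-t) * r_t', computed backwards over one trajectory.
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(returns_to_go([0.0, 0.0, 1.0]))   # -> [1. 1. 1.]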
Foundation Models

Generalist Agents

Large pre-trained models (RT-2, Gato, VPT) are trained on massive multi-task offline datasets and fine-tuned for specific tasks. The bet: enough data diversity enables generalization to novel tasks at deployment.

Challenge: Scaling laws for offline RL are less understood than for language models. Data quality vs quantity tradeoffs remain open.

🔭 Open Problems Worth Your PhD

Partial Observability

Offline RL in POMDPs — the behavior policy may have had access to information not captured in the logged observations.

Data Augmentation

Model-based data augmentation to artificially expand offline datasets (MOPO, MOReL). Tradeoff between model bias and coverage.

Multi-Task & Few-Shot

Meta-offline RL — pre-train on diverse tasks offline, then adapt to new tasks with minimal online interaction.

Scalable OPE

Reliable off-policy evaluation at scale. Current methods have high variance or require strong assumptions.

Reward Learning

Offline RL from unlabeled data or human preferences, without a well-defined reward signal in the dataset.

Safety Guarantees

Constrained offline RL — learning policies that satisfy safety constraints provably, without environment interaction.

Built for researchers & students exploring Offline RL

Key papers: BCQ (Fujimoto 2019) · CQL (Kumar 2020) · IQL (Kostrikov 2021) · DT (Chen 2021) · D4RL (Fu 2020)