Intro: REINFORCE as Importance Sampling

Consider the standard RL setup for LLMs: a policy \(\pi_{\theta}\) (the language model) generates outputs \(y\), and a reward function \(f(y)\) scores each output. We want to find parameters \(\theta\) that maximize the expected reward: \[\begin{equation} J(\theta) = \mathbb{E}_{y\sim \pi_{\theta}}\bigl[f(y)\bigr]. \end{equation}\] For simplicity we treat \(y\) as a single random variable; the extension to sequences \(y=(y_1,\dots,y_T)\) is straightforward.

How do we compute \(\nabla_{\theta} J(\theta)\)? Taking \(\nabla_{\theta}\) through an expectation looks problematic because sampling \(y\sim\pi_{\theta}\) is not differentiable via autodiff. But all random variables here are discrete and finite (with a finite horizon for token sequences), so “expectation” is just a sum.

Let \(\mathcal{Y}\) denote the finite support of \(y\). Then \[\begin{equation} J(\theta)=\sum_{y\in\mathcal{Y}} \pi_{\theta}(y)\,f(y). \end{equation}\] Differentiating term-by-term gives \[\begin{equation} \nabla_{\theta} J(\theta) = \sum_{y\in\mathcal{Y}} \nabla_{\theta}\pi_{\theta}(y)\,f(y). \end{equation}\] This is mathematically correct but computationally useless: \(\mathcal{Y}\) can be prohibitively large. Since \(J(\theta)\) itself is cheap to approximate by Monte Carlo, the trick is to rewrite the gradient as an expectation under \(\pi_{\theta}\) as well.

Multiply and divide by \(\pi_{\theta}(y)\): \[\begin{align} \nabla_{\theta} J(\theta) &= \sum_{y\in\mathcal{Y}} \pi_{\theta}(y)\,f(y)\,\frac{\nabla_{\theta}\pi_{\theta}(y)}{\pi_{\theta}(y)}\\ &= \mathbb{E}_{y\sim \pi_{\theta}}\left[f(y)\,\frac{\nabla_{\theta}\pi_{\theta}(y)}{\pi_{\theta}(y)}\right]\\ &= \mathbb{E}_{y\sim \pi_{\theta}}\bigl[f(y)\,\nabla_{\theta}\log \pi_{\theta}(y)\bigr]. \tag{REINFORCE} \end{align}\] The right-hand side is now an expectation we can estimate from samples — this is the classic REINFORCE policy-gradient estimator.
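As a sanity check, the REINFORCE identity can be verified numerically. The sketch below uses an assumed toy setup (a softmax policy over a three-element support with illustrative rewards, not from the text): the Monte Carlo score-function estimate should agree with the exact gradient sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions): softmax policy over Y = {0, 1, 2}
# with one logit per outcome, and a fixed reward f(y) per outcome.
theta = np.array([0.2, -0.5, 1.0])
f = np.array([1.0, 3.0, 0.5])

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = pi(theta)

# Exact gradient: sum_y f(y) * grad pi_theta(y). For softmax,
# grad_theta pi(y) = pi(y) * (e_y - pi), so the sum collapses to
# pi * (f - E[f]).
exact_grad = p * (f - p @ f)

# REINFORCE estimate: average f(y) * grad log pi_theta(y) over samples,
# where grad_theta log pi(y) = e_y - pi for a softmax policy.
N = 200_000
ys = rng.choice(3, size=N, p=p)
onehot = np.eye(3)[ys]
mc_grad = np.mean(f[ys][:, None] * (onehot - p), axis=0)
```

With a large sample the Monte Carlo estimate lands close to the exact gradient; the agreement is what the REINFORCE derivation promises.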

While REINFORCE uses \(\nabla_{\theta}\log\pi_{\theta}(y)\), off-policy algorithms like PPO and GRPO work with the likelihood ratio \(\pi_{\theta}(y)/\pi_{\mathrm{old}}(y)\) directly. Both viewpoints connect through importance sampling.

To implement REINFORCE with autodiff, multiply the integrand by \(\pi_{\theta}(y)/\mathrm{StopGrad}(\pi_{\theta}(y))\), which evaluates to \(1\): \[\begin{equation} \mathbb{E}_{y\sim \pi_{\theta}}\bigl[f(y)\bigr] = \mathbb{E}_{y\sim \pi_{\theta}}\left[\frac{\pi_{\theta}(y)}{\mathrm{StopGrad}(\pi_{\theta}(y))}\,f(y)\right]. \end{equation}\] The expectation is unchanged, but the gradient now flows through the numerator, producing the \(\nabla_{\theta}\log\pi_{\theta}(y)\) term automatically.
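A minimal numerical check of this trick, again on an assumed toy softmax policy: hold the denominator fixed at \(\theta_0\) (playing the role of StopGrad) and differentiate the surrogate by finite differences. The result should reproduce \(f(y)\,\nabla_{\theta}\log\pi_{\theta}(y)\) at \(\theta=\theta_0\).

```python
import numpy as np

# Assumed toy setup: three-outcome softmax policy, one sampled outcome y.
theta0 = np.array([0.3, -0.2, 0.7])
f_y, y = 2.0, 1

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def surrogate(theta):
    # The denominator is evaluated at the fixed theta0: this is the
    # StopGrad in the text, so only the numerator depends on theta.
    return pi(theta)[y] / pi(theta0)[y] * f_y

# Central finite differences for grad_theta of the surrogate at theta0.
eps = 1e-6
fd_grad = np.array([
    (surrogate(theta0 + eps * e) - surrogate(theta0 - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

# Analytic target: f(y) * grad log pi(y) = f(y) * (e_y - pi).
score_grad = f_y * (np.eye(3)[y] - pi(theta0))
```

The two gradients match: differentiating \(\pi_{\theta}(y)/\mathrm{StopGrad}(\pi_{\theta}(y))\,f(y)\) really does produce the score-function term.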

Importance sampling perspective.

Suppose we want to estimate \(\mathbb{E}_{y\sim p}[f(y)]\) but can only sample from \(q\). As long as \(q(y)>0\) whenever \(p(y)>0\): \[\begin{equation} \mathbb{E}_{y\sim p}\bigl[f(y)\bigr] = \mathbb{E}_{y\sim q}\!\left[\frac{p(y)}{q(y)}\,f(y)\right]. \end{equation}\] Each sample from \(q\) is reweighted by the likelihood ratio \(p(y)/q(y)\).
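The identity is easy to check on a small discrete example (the distributions and \(f\) below are illustrative assumptions): sampling from \(q\) and reweighting by \(p(y)/q(y)\) recovers the expectation under \(p\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy distributions on a three-element support, with q > 0
# wherever p > 0, as the identity requires.
p = np.array([0.1, 0.6, 0.3])   # target distribution
q = np.array([0.4, 0.3, 0.3])   # sampling distribution
f = np.array([1.0, -2.0, 5.0])

exact = p @ f                    # the quantity being estimated: E_p[f]

N = 100_000
ys = rng.choice(3, size=N, p=q)  # samples come from q, not p
r = p[ys] / q[ys]                # likelihood ratios
is_estimate = np.mean(r * f[ys])
```

Here `exact` is \(0.4\), and the importance-sampled average converges to it as \(N\) grows even though no sample was ever drawn from \(p\).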

Apply this to the policy-gradient setting. When samples come from an old policy \(\pi_{\mathrm{old}}\), we estimate an expectation under \(\pi_{\theta}\) as \[\begin{equation} \mathbb{E}_{y\sim \pi_{\mathrm{old}}}\left[\frac{\pi_{\theta}(y)}{\pi_{\mathrm{old}}(y)}\,f(y)\right]. \end{equation}\] This is genuine importance sampling: reweighting samples from \(\pi_{\mathrm{old}}\) to evaluate \(\pi_{\theta}\). The likelihood ratio \(\pi_{\theta}(y)/\pi_{\mathrm{old}}(y)\) is precisely the term that replaces \(\nabla_{\theta}\log\pi_{\theta}(y)\) in off-policy methods like TRPO, PPO, and GRPO. The denominator \(\pi_{\mathrm{old}}(y)\) is already effectively stop-grad since we do not backpropagate through the old policy.

The on-policy case is a special instance where \(\pi_{\mathrm{old}} = \mathrm{StopGrad}(\pi_{\theta})\): \[\begin{equation} \mathbb{E}_{y\sim \pi_{\theta}}\bigl[f(y)\bigr] = \mathbb{E}_{y\sim \mathrm{StopGrad}(\pi_{\theta})}\left[\frac{\pi_{\theta}(y)}{\mathrm{StopGrad}(\pi_{\theta}(y))}\,f(y)\right]. \end{equation}\] In both cases, gradients flow through the numerator \(\pi_{\theta}(y)\), producing the correct \(\nabla_{\theta}\log\pi_{\theta}(y)\) term.

Variance of the importance-sampled estimator.

The direct Monte Carlo estimator with \(y_i \sim p\) has variance \(\operatorname{Var}_p[f(y)] / N\). The importance-sampled version, with \(y_i \sim q\) and likelihood ratios \(r(y) = p(y)/q(y)\), has variance \[\begin{equation} \operatorname{Var}_q\!\bigl[r(y)\,f(y)\bigr] / N. \end{equation}\] The extra factor \(r(y)\) is what inflates variance. When \(p\) and \(q\) are close, \(r \approx 1\) and the two estimators behave similarly. But when they diverge, samples that land in regions where \(q\) assigns low probability but \(p\) assigns high probability receive huge weights. Most samples contribute little (small \(r\)), while a few dominate the sum (large \(r\)).

To make this concrete, decompose the variance: \[\begin{equation} \operatorname{Var}_q\!\bigl[r(y)\,f(y)\bigr] = \mathbb{E}_q\!\bigl[r(y)^2\,f(y)^2\bigr] - \bigl(\mathbb{E}_q\!\bigl[r(y)\,f(y)\bigr]\bigr)^2. \end{equation}\] The first term involves \(r(y)^2\), so even a single sample with a large ratio can make this blow up. In the policy-gradient setting, \(r(y) = \pi_{\theta}(y)/\pi_{\mathrm{old}}(y)\): if the new policy upweights a sequence that the old policy considered unlikely, that sample’s contribution to the gradient estimate is amplified quadratically in the variance.

Effective sample size.

Computing this variance requires the expectation \(\mathbb{E}_q[r(y)^2 f(y)^2]\) — itself an intractable sum over \(\mathcal{Y}\), no easier than the original problem. A practical proxy that sidesteps this is the effective sample size (ESS). In Monte Carlo statistics, given \(N\) weighted samples, ESS answers “how many i.i.d. samples from the target distribution would yield an estimator with the same variance?” It reduces to a simple function of the ratios \(r_i\).

Given \(N\) samples with likelihood ratios \(r_i = \pi_{\theta}(y_i)/\pi_{\mathrm{old}}(y_i)\), the standard estimator is \[\begin{equation} \widehat{\mathrm{ESS}} = \frac{\left(\sum_i r_i\right)^2}{\sum_i r_i^2}. \end{equation}\] When all ratios are similar, \(\widehat{\mathrm{ESS}} \approx N\); when a few dominate, it collapses toward \(1\). In the on-policy case all ratios are exactly \(1\), so \(\widehat{\mathrm{ESS}} = N\) — every sample contributes equally. ESS becomes interesting only when reusing samples from an older policy.

Since \(\mathbb{E}[r] = 1\) in standard importance sampling, \(\sum_i r_i \approx N\) in large batches, giving \[\begin{equation} \widehat{\mathrm{ESS}} \approx \frac{N^2}{\sum_i r_i^2}. \end{equation}\] Under the same approximation, rewriting the second moment as \(1 + \operatorname{Var}(r)\) makes the dependence on ratio dispersion explicit: \[\begin{equation} \widehat{\mathrm{ESS}} \approx \frac{N}{1 + \operatorname{Var}(r)}. \end{equation}\] As the ratios become more dispersed, ESS shrinks — the batch effectively gets smaller.
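Both forms are one-liners to implement and check (the ratio distributions below are assumed for illustration): all-equal ratios give \(\widehat{\mathrm{ESS}} = N\), one dominant weight collapses it toward \(1\), and the variance form agrees with the direct estimator when the mean ratio is near \(1\).

```python
import numpy as np

def ess(r):
    # Standard effective-sample-size estimator: (sum r)^2 / sum r^2.
    r = np.asarray(r, dtype=float)
    return r.sum() ** 2 / (r ** 2).sum()

N = 1000
uniform = np.ones(N)                     # on-policy: every ratio is 1

rng = np.random.default_rng(2)
# Mildly dispersed ratios with mean ~1 (assumed lognormal, illustrative).
mild = rng.lognormal(mean=-0.02, sigma=0.2, size=N)

# One dominant weight: ESS collapses toward 1.
skewed = np.ones(N)
skewed[0] = 10_000.0

# Variance form. With 1/N-normalized sample moments this identity is exact:
#   ESS = N * mean(r)^2 / (var(r) + mean(r)^2),
# and it reduces to N / (1 + var(r)) when mean(r) is close to 1.
alt = N * mild.mean() ** 2 / (mild.var() + mild.mean() ** 2)
```

For the `mild` ratios, ESS comes out a little below \(N\) (roughly \(N/(1+\operatorname{Var}(r))\)); for `skewed` it is barely above \(1\) despite the batch containing \(1000\) samples.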

ESS depends only on the ratios and ignores the rewards and gradients, so it measures policy mismatch rather than estimator quality. It can therefore be pessimistic: a sample with a large weight but near-zero advantage barely affects the gradient, yet ESS would still flag it even though the estimator is fine.

Clipping as importance weight truncation.

Large importance weights dominating the estimate is a well-known problem in statistics. Common remedies include truncated importance sampling (capping weights at a threshold) and masked importance sampling (zeroing out samples whose weights exceed a bound). Both prevent ESS collapse at the cost of some bias.
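Both remedies are a few lines of numpy. The sketch below uses an assumed batch of weights with one outlier, and reuses the ESS estimator from earlier to show that either truncation or masking restores a reasonable effective sample size.

```python
import numpy as np

def ess(r):
    # Effective sample size: (sum r)^2 / sum r^2.
    r = np.asarray(r, dtype=float)
    return r.sum() ** 2 / (r ** 2).sum()

# Assumed toy batch: four well-behaved weights and one huge outlier.
raw = np.array([0.5, 0.8, 1.0, 1.2, 50.0])
c = 2.0                                      # truncation threshold (assumed)

truncated = np.minimum(raw, c)               # truncated IS: cap weights at c
masked = np.where(raw <= c, raw, 0.0)        # masked IS: zero weights above c

print(ess(raw), ess(truncated), ess(masked))
```

The raw batch has an ESS barely above \(1\) because the outlier dominates; after truncation or masking, ESS climbs back toward the nominal batch size, at the cost of biasing the estimator away from the exact importance-sampling identity.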

PPO’s clipping fits this framework. Writing \(A\) for the advantage (the role \(f(y)\) plays above), the clipped surrogate \(\min\bigl(r\,A,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\,A\bigr)\) zeroes the gradient contribution of any sample whose ratio has left \([1-\epsilon, 1+\epsilon]\) in the direction its advantage would push it further, keeping ESS high and the gradient estimate stable.
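The per-sample mechanics can be sketched directly (function names and the advantage variable `A` are assumptions for illustration). The key detail is that clipping binds only when the ratio has left \([1-\epsilon, 1+\epsilon]\) in the direction the advantage would amplify; a ratio outside the interval with an advantage pulling it back still receives gradient.

```python
import numpy as np

def ppo_objective(r, A, eps=0.2):
    # Clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A), elementwise.
    return np.minimum(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

def grad_flows(r, A, eps=0.2):
    # The gradient w.r.t. r is A when the unclipped term is the active
    # minimum, and 0 when the clipped term is active and clipping binds.
    clipped = np.clip(r, 1 - eps, 1 + eps)
    return ~((clipped * A < r * A) & ((r < 1 - eps) | (r > 1 + eps)))

# Assumed toy batch of ratios and advantages.
r = np.array([1.0, 1.5, 1.5, 0.5, 0.5])
A = np.array([1.0, 1.0, -1.0, 1.0, -1.0])
print(grad_flows(r, A))
```

Here the second sample (ratio \(1.5\), positive advantage) and the last (ratio \(0.5\), negative advantage) are silenced: the update would push their already-drifted ratios even further. The third and fourth, equally far from \(1\) but being pulled back toward \(\pi_{\mathrm{old}}\), keep their gradients.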