Sequence-level vs. Token-level Importance Sampling in RL for LLMs
There is a recurring discussion about sequence-level vs. token-level importance sampling in RL for LLMs (PPO / TRPO / GRPO). One view treats the entire completion as a single action; the other treats each token as a separate action. The choice determines the importance weights and the bias-variance tradeoff, and much of the confusion comes from conflating the two formulations. I initially found the sequence-level view more convincing, but have come around to the token-level one.
States, actions, and TRPO/PPO
When we apply RL to LLMs, we need to decide what counts as a “state” and what counts as an “action.” There are two natural choices. In the sequence-level view, the state is the prompt and the action is the entire completion—the model generates a full response in one shot. In the token-level view, the state is the prompt plus all tokens generated so far (the prefix), and the action is the next single token. The two formulations lead to different objectives with different bias-variance characteristics.
In either formulation, the objective we ultimately care about is the expected reward under the current policy: \[\begin{equation} \mathbb{E}_{s \sim d_\pi}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[R(s,a)\bigr], \end{equation}\] where \(s\) is a state, \(a\) is an action, \(d_\pi\) is the state distribution induced by \(\pi\), and \(R(s,a)\) is the immediate reward.
This formulation is natural in the bandit setting (one state, one action, collect reward), but in a multi-step setting like token-by-token generation, the immediate reward at step \(t\) does not capture the consequences of the action on future steps. What we actually want to maximize is the expected future reward. This is what the Q-function provides: \[\begin{equation} Q^\pi(s_t, a_t) = \mathbb{E}\!\left[R(y) \;\middle|\; s_t,\, a_t,\, a_{t'} \sim \pi \text{ for } t' > t \right], \end{equation}\] the expected terminal reward given that we take action \(a_t\) in state \(s_t\) and follow \(\pi\) for all subsequent steps. (We write \(R(y)\) for the reward at the end of the complete sequence \(y\), which is the typical setup in RL for LLMs—e.g., a verifier score.) The proper objective becomes \[\begin{equation} \mathbb{E}_{s \sim d_\pi}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[Q^\pi(s,a)\bigr]. \end{equation}\] Crucially, \(Q^\pi\) depends on the current policy—it reflects \(\pi\)’s future behavior, not just the immediate reward.
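To make the policy-dependence of \(Q^\pi\) concrete, here is a toy Monte Carlo sketch (all names and numbers are hypothetical, not from any real training setup): a two-token vocabulary, length-3 sequences, a terminal reward, and a policy that emits token 1 with fixed probability. \(Q^\pi(s_t, a_t)\) is estimated by fixing the prefix and the chosen action, then completing the sequence under \(\pi\).

```python
import random

random.seed(0)

# Toy setup (hypothetical): 2-token vocabulary, length-3 sequences,
# terminal reward 1.0 iff the sequence contains at least two 1-tokens.
# The "policy" samples token 1 with fixed probability P_ONE at each step.
P_ONE = 0.6

def sample_token():
    return 1 if random.random() < P_ONE else 0

def terminal_reward(seq):
    return 1.0 if sum(seq) >= 2 else 0.0

def mc_q_estimate(prefix, action, horizon=3, n_rollouts=20_000):
    """Monte Carlo estimate of Q^pi(s_t, a_t): fix the prefix (state)
    and the chosen action, then complete the sequence by following pi."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix) + [action]
        while len(seq) < horizon:
            seq.append(sample_token())  # a_{t'} ~ pi for t' > t
        total += terminal_reward(seq)
    return total / n_rollouts

# Q reflects pi's future behavior: after prefix [0], taking action 1
# still needs one more 1-token from the policy, so Q([0], 1) ≈ P_ONE.
q = mc_q_estimate([0], 1)
```

Changing `P_ONE` changes `q` even though the prefix, action, and reward function are fixed—exactly the sense in which \(Q^\pi\) depends on the current policy.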
Optimizing this directly requires fresh rollouts from \(\pi\) at every update—expensive when each rollout means running a large language model. In practice, rollout generation and training are decoupled, so we almost always train on data from a slightly stale policy \(\pi_{\mathrm{old}}\).
TRPO and PPO do not optimize this objective directly. Instead, they accept that the states come from rollouts under \(\pi_{\mathrm{old}}\) and correct only the action distribution via importance sampling. The surrogate takes the form: \[\begin{equation} \mathbb{E}_{s \sim d_{\pi_{\mathrm{old}}}}\,\mathbb{E}_{a \sim \pi_{\mathrm{old}}}\!\left[ \frac{\pi(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)}\,Q^{\pi_{\mathrm{old}}}(s,a) \right] - \mathrm{regularization}\!\left(\pi,\, \pi_{\mathrm{old}}\right). \end{equation}\] This introduces three kinds of mismatch relative to the true objective:
Action mismatch: the actions were chosen by \(\pi_{\mathrm{old}}\), not \(\pi\). This is corrected explicitly by the importance weight \(\pi(a \mid s)/\pi_{\mathrm{old}}(a \mid s)\).
State mismatch: the states were visited under \(\pi_{\mathrm{old}}\), not \(\pi\).
Value mismatch: the value estimates use \(Q^{\pi_{\mathrm{old}}}\) rather than \(Q^\pi\). In other words, \(Q^{\pi_{\mathrm{old}}}\) answers “how good is this action if we follow \(\pi_{\mathrm{old}}\) afterwards?” when we really want “how good is this action if we follow \(\pi\) afterwards?” If \(\pi\) has improved since \(\pi_{\mathrm{old}}\), the Q-values are pessimistic; if \(\pi\) has gotten worse in some region, they are optimistic.
The state and value mismatches are not corrected explicitly—instead, the regularization term constrains \(\pi\) to stay close to \(\pi_{\mathrm{old}}\) (via KL penalties or clipping), keeping these mismatches small. The TRPO paper bounds the gap between the surrogate and the true objective in terms of the KL divergence between the two policies, showing that all three mismatches are controlled simultaneously by the trust region.
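As a numerical sketch of the surrogate (all policies, Q-values, and the penalty coefficient below are hypothetical), consider a single state with a categorical action distribution: actions are drawn from \(\pi_{\mathrm{old}}\), the action mismatch is corrected by the importance weight, and a KL term regularizes \(\pi\) toward \(\pi_{\mathrm{old}}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-state example: categorical policies over 4 actions.
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
pi_new = np.array([0.40, 0.30, 0.20, 0.10])
q_old  = np.array([1.0, 0.5, 0.0, -0.5])   # Q^{pi_old}(s, a)
beta   = 0.1                               # KL penalty coefficient

# Actions come from pi_old, as in the off-policy surrogate.
actions = rng.choice(4, size=200_000, p=pi_old)
weights = pi_new[actions] / pi_old[actions]     # pi(a|s) / pi_old(a|s)
surrogate = np.mean(weights * q_old[actions])

kl = np.sum(pi_old * np.log(pi_old / pi_new))   # KL(pi_old || pi)
objective = surrogate - beta * kl

# Sanity check: the IS estimate converges to the exact reweighted value
# sum_a pi(a|s) * Q^{pi_old}(s, a).
exact = np.sum(pi_new * q_old)
```

The importance weight fully corrects the action mismatch here; what it cannot correct is that `q_old` still answers the \(\pi_{\mathrm{old}}\)-continuation question—that is the value mismatch the KL term keeps small.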
This is where the choice of state/action split matters.
In the sequence-level formulation, the only “state” is the prompt, and prompts come from the dataset—they do not depend on \(\pi\) and do not change when the policy updates. The problem reduces to a contextual bandit: one state (the prompt), one action (the full completion), no intermediate steps whose distribution could drift. Both the state-distribution mismatch and the value mismatch vanish: there are no intermediate states to drift, and the reward \(R(y)\) is a fixed function of the completed sequence rather than a policy-dependent \(Q^\pi\).
That said, PPO-style regularization is still useful here, just for a different reason. As \(\pi\) diverges from \(\pi_{\mathrm{old}}\), the importance weights \(\pi(y)/\pi_{\mathrm{old}}(y)\) become extreme: many are near zero and a few are huge. The effective batch size shrinks and rare large weights dominate the gradient (see the previous post for a detailed treatment of IS variance and effective sample size). Keeping \(\pi\) close to \(\pi_{\mathrm{old}}\) via KL penalties or clipping keeps these weights well-behaved and training stable.
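A small simulation (with made-up per-token ratio noise) shows how fast the sequence-level weights degenerate. We draw per-token log-ratios with a modest spread, multiply them over the sequence, and measure the effective sample size \(\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2\) as a fraction of the batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2: how many samples effectively remain."""
    return weights.sum() ** 2 / (weights ** 2).sum()

n_sequences = 1024
sigma = 0.1  # hypothetical per-token log-ratio spread

ess_fraction = {}
for seq_len in (1, 32, 512):
    # log r_t ~ N(-sigma^2/2, sigma^2), chosen so E[r_t] = 1 per token.
    log_ratios = rng.normal(-sigma**2 / 2, sigma,
                            size=(n_sequences, seq_len))
    seq_weights = np.exp(log_ratios.sum(axis=1))  # prod_t r_t
    ess_fraction[seq_len] = effective_sample_size(seq_weights) / n_sequences
```

Even though each per-token ratio is individually tame, the product's log-variance grows linearly with length, so the ESS fraction collapses: for long sequences a handful of weights carry essentially the whole batch.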
In the token-level formulation, the “states” are prefixes—and prefixes are generated by the policy itself. Change \(\pi\) and you change which prefixes appear during rollouts. This is the setting PPO/TRPO were designed for: we sample prefixes from \(\pi_{\mathrm{old}}\), correct only the single-token action at each position via per-token importance ratios, and rely on KL constraints or clipping to ensure \(\pi\) stays close enough to \(\pi_{\mathrm{old}}\) that the old prefixes remain representative. All three mismatches are present here—state distribution, action distribution, and value estimates—and the trust region is what keeps them under control.
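The per-token correction with clipping can be sketched as follows (a minimal version of the PPO-style clipped surrogate; the tensors and numbers are illustrative, and the shared terminal reward stands in for a per-token advantage):

```python
import numpy as np

def clipped_token_loss(logp_new, logp_old, advantage, eps=0.2):
    """Token-level clipped surrogate: one importance ratio per position,
    clipped to [1 - eps, 1 + eps], pessimistic min, mean over tokens."""
    ratios = np.exp(logp_new - logp_old)              # r_t, one per token
    unclipped = ratios * advantage
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))   # negate for a loss

# Hypothetical per-token probabilities for one sampled completion.
logp_old = np.log(np.array([0.5, 0.4, 0.9, 0.2]))
logp_new = np.log(np.array([0.6, 0.4, 0.7, 0.3]))
loss = clipped_token_loss(logp_new, logp_old, advantage=1.0)
```

Note what the clip does per position: the last token's ratio of 1.5 is capped at 1.2, removing its incentive to move further, while ratios inside the band pass through untouched—this is the mechanism that keeps \(\pi\) near \(\pi_{\mathrm{old}}\) so the old prefixes stay representative.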
Both formulations are valid. The sequence-level objective is cleaner but suffers from high importance-sampling variance; the token-level surrogate is biased but more stable, and aligns with how PPO/TRPO/GRPO are designed to operate.
REINFORCE: on-policy vs. off-policy
When training is fully on-policy (every sequence is freshly sampled from \(\pi\)), the split does not matter. There are no importance weights, so there is no product-vs-sum question. The log-probability decomposes as \(\log \pi(y) = \sum_{t} \log \pi(y_t \mid y_{<t})\), and differentiating gives the same REINFORCE gradient whether you view generation as one big action or many small ones.
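A quick numerical check of the decomposition (with a toy stand-in model: independent per-position softmax logits rather than a real autoregressive LLM):

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 5, 4                         # toy vocabulary size, sequence length
logits = rng.normal(size=(T, V))    # hypothetical per-position logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = np.array([softmax(logits[t]) for t in range(T)])
y = [rng.choice(V, p=probs[t]) for t in range(T)]

# log pi(y) computed as one big action vs. as a sum of per-token terms.
seq_logprob = np.log(np.prod([probs[t][y[t]] for t in range(T)]))
tok_logprob = sum(np.log(probs[t][y[t]]) for t in range(T))

# The score function splits the same way: for a softmax,
# d log p(y_t) / d logits_t = onehot(y_t) - probs[t], and the
# sequence-level gradient is the sum of these per-token terms.
score = np.array([np.eye(V)[y[t]] - probs[t] for t in range(T)])
```

Since the REINFORCE estimator is just this score times \(R(y)\), the two views produce identical on-policy gradients sample by sample.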
The distinction shows up once we go off-policy—training on sequences from a stale \(\pi_{\mathrm{old}}\) rather than the current \(\pi\). Now we need importance weights, and the two views diverge. The sequence-level objective uses a single weight for the whole sequence—a product of per-token ratios. The token-level objective applies a separate weight at each position: \[\begin{equation} L_{\mathrm{seq}}(\theta) = \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[ \prod_{t} \frac{\pi(y_t \mid y_{<t})}{\pi_{\mathrm{old}}(y_t \mid y_{<t})}\, R(y) \right], \qquad L_{\mathrm{tok}}(\theta) = \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[ \sum_{t} \frac{\pi(y_t \mid y_{<t})}{\pi_{\mathrm{old}}(y_t \mid y_{<t})}\, R(y) \right]. \end{equation}\] Here both objectives use the same terminal reward \(R(y)\). (With a learned value function one could substitute per-token advantages, but the common RLHF setup scores the full sequence.) \(L_{\mathrm{seq}}\) is the “correct” unbiased objective, but it has a practical problem: the product \(\prod_t r_t\) of many importance ratios has variance that grows exponentially with sequence length. A single unusual token can make the entire product huge or tiny, making the gradient estimate useless. For long generations this makes \(L_{\mathrm{seq}}\) very difficult to use in practice.
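The two objectives can be compared on a toy problem where the on-policy value is known in closed form (everything below is hypothetical: tokens are i.i.d. Bernoulli under each policy, and the reward is the count of 1-tokens, so \(\mathbb{E}_\pi[R] = T \cdot p_{\mathrm{new}}\) exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8
p_old, p_new = 0.5, 0.6             # P(token = 1) under pi_old and pi

# Sample sequences under pi_old; reward is the number of 1-tokens.
y = (rng.random((100_000, T)) < p_old).astype(float)
R = y.sum(axis=1)                   # terminal reward R(y)

# Per-token ratios r_t = pi(y_t) / pi_old(y_t).
ratios = np.where(y == 1.0, p_new / p_old, (1 - p_new) / (1 - p_old))

L_seq = np.mean(ratios.prod(axis=1) * R)   # product of ratios
L_tok = np.mean(ratios.sum(axis=1) * R)    # sum of ratios

true_value = T * p_new              # E_pi[R], available in closed form
```

`L_seq` converges to `true_value`—it is the unbiased estimate of the on-policy objective—while `L_tok` converges to something else entirely (a different surrogate on a different scale). The point of the next paragraph is that their *gradients* nevertheless agree to first order near \(\pi = \pi_{\mathrm{old}}\).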
\(L_{\mathrm{tok}}\) is not an unbiased estimator of the same objective—it is a different surrogate. It works well in practice for the following reason. When \(\pi\) stays close to \(\pi_{\mathrm{old}}\) (as enforced by clipping or KL penalties), each per-token ratio \(r_t = \pi(y_t \mid y_{<t}) / \pi_{\mathrm{old}}(y_t \mid y_{<t})\) is close to \(1\). In that regime, the product of ratios is well approximated by a sum: \[\begin{equation} \prod_{t} r_t = \exp\!\biggl(\sum_{t} \log r_t\biggr) \;\approx\; 1 + \sum_{t}(r_t - 1), \end{equation}\] using \(\log r_t \approx r_t - 1\) and \(\exp(\text{small}) \approx 1 + \text{small}\). And since \(L_{\mathrm{tok}}\)’s weight \(\sum_t r_t\) differs from \(1 + \sum_t (r_t - 1)\) only by the additive constant \(T - 1\), which has zero gradient, \(L_{\mathrm{tok}}\) matches the gradient of the sequence-level objective to first order. This linearization is accurate precisely when the trust-region constraint holds.
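The approximation is easy to check numerically (the ratio values below are made up for illustration): it is tight when every \(r_t\) sits near \(1\), and falls apart once ratios wander.

```python
import math

def product_vs_linear(ratios):
    """Compare prod_t r_t against its linearization 1 + sum_t (r_t - 1)."""
    prod = math.prod(ratios)
    linear = 1 + sum(r - 1 for r in ratios)
    return prod, linear

# Inside the trust region: per-token ratios within a couple percent of 1.
p_near, l_near = product_vs_linear([1.01, 0.99, 1.02, 0.98, 1.015])

# Outside it: ratios that a clip at eps = 0.2 would never allow.
p_far, l_far = product_vs_linear([1.5, 0.6, 1.4, 0.7, 1.3])
```

Near 1 the two agree to roughly the square of the per-token deviations; with large ratios the linearization misses badly, which is the regime the trust region exists to rule out.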
Token-level objectives are not “wrong.” They are local surrogates that degrade gracefully as \(\pi\) drifts from \(\pi_{\mathrm{old}}\), and they are valid in the regime that PPO/TRPO/GRPO-style algorithms are designed to maintain.