<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Zoey Li's Personal Webpage</title>
    <description>PhD Candidate @UIUC</description>
    <link>https://zoeyli.com</link>
    <atom:link href="https://zoeyli.com/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>Off-Policy Corrections in LLM RL Training</title>
        <description>&lt;blockquote&gt;
  &lt;p&gt;A unified view of the five sources of distribution mismatch in LLM reinforcement learning and their corrections.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-unified-problem&quot;&gt;The Unified Problem&lt;/h2&gt;

&lt;p&gt;Every off-policy issue in LLM RL reduces to the same fundamental problem:&lt;/p&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;\(\text{We sample from } \pi_{\text{actual}} \text{ but optimize as if samples came from } \pi_{\text{assumed}}\)&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;When $\pi_{\text{actual}} \neq \pi_{\text{assumed}}$, the policy gradient becomes biased. The general correction is importance sampling:&lt;br /&gt;&lt;/p&gt;

\[\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_{\text{actual}}} \left[ \frac{\pi_{\text{assumed}}(y)}{\pi_{\text{actual}}(y)} \cdot f(y) \cdot \nabla_\theta \log \pi_\theta(y) \right]\]

&lt;p&gt;&lt;strong&gt;Theoretical grounding&lt;/strong&gt; [1]: The token-level optimization objective used by REINFORCE, GRPO, and related algorithms is a &lt;strong&gt;first-order approximation&lt;/strong&gt; to the true sequence-level reward objective. This approximation is valid only when each token’s IS-ratio deviation $\delta_t = \frac{\pi_\theta(y_t)}{\mu_{\theta_{old}}(y_t)} - 1$ is small, so that second-order terms ($\delta_i \delta_j$) can be neglected. The token-level IS weight is therefore &lt;strong&gt;inherent&lt;/strong&gt; to the approximation — not an optional correction bolted on. Removing it invalidates the surrogate objective entirely.&lt;/p&gt;
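&lt;p&gt;A quick numerical check of the first-order claim (a self-contained sketch; the helper names are ours, not from any paper or framework):&lt;/p&gt;

```python
def product_ratio(deltas):
    """Exact sequence-level IS weight: the product of per-token ratios (1 + delta_t)."""
    out = 1.0
    for d in deltas:
        out *= 1.0 + d
    return out

def first_order_ratio(deltas):
    """The token-level surrogate's implicit weight: 1 + sum of delta_t."""
    return 1.0 + sum(deltas)

# Near on-policy (fifty tokens, 0.1% per-token drift): the two agree closely.
print(product_ratio([0.001] * 50), first_order_ratio([0.001] * 50))

# Far off-policy (5% per-token drift): second-order terms dominate and
# the first-order surrogate badly underestimates the true weight.
print(product_ratio([0.05] * 50), first_order_ratio([0.05] * 50))
```

&lt;p&gt;At 0.1% drift the two weights agree to within about 0.1%; at 5% drift the exact product is roughly 11.5 while the first-order surrogate reads 3.5, which is exactly the regime where the token-level objective stops being a valid surrogate.&lt;/p&gt;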

&lt;p&gt;In LLM RL, five distinct sources create this mismatch, each with different causes, magnitudes, and corrections:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;#&lt;/th&gt;
      &lt;th&gt;Source&lt;/th&gt;
      &lt;th&gt;$\pi_{\text{actual}}$&lt;/th&gt;
      &lt;th&gt;$\pi_{\text{assumed}}$&lt;/th&gt;
      &lt;th&gt;Magnitude&lt;/th&gt;
      &lt;th&gt;When it arises&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;Multi-epoch policy drift&lt;/td&gt;
      &lt;td&gt;$\pi_{\theta_{old}}$ (start of epoch)&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ (current params)&lt;/td&gt;
      &lt;td&gt;Small per epoch&lt;/td&gt;
      &lt;td&gt;PPO/GRPO multi-epoch training&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;Backend mismatch&lt;/td&gt;
      &lt;td&gt;$\pi_{\text{sampler}}(\theta)$ (vLLM/SGLang)&lt;/td&gt;
      &lt;td&gt;$\pi_{\text{learner}}(\theta)$ (FSDP/Megatron)&lt;/td&gt;
      &lt;td&gt;Small but systematic under fp16&lt;/td&gt;
      &lt;td&gt;Different engines for rollout vs training&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;Async staleness&lt;/td&gt;
      &lt;td&gt;$\pi_{\theta_k}$ ($k$ steps old)&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ (current)&lt;/td&gt;
      &lt;td&gt;Can be large&lt;/td&gt;
      &lt;td&gt;Async RL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;MoE routing&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ with route_old&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ with route_new&lt;/td&gt;
      &lt;td&gt;Variable, depending on model routing stability&lt;/td&gt;
      &lt;td&gt;MoE architectures after gradient updates&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;Tool-call trajectories&lt;/td&gt;
      &lt;td&gt;$\pi_\theta \times P_{\text{env}}$ (joint)&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ (LM only)&lt;/td&gt;
      &lt;td&gt;Variable, can be large&lt;/td&gt;
      &lt;td&gt;Agentic RL with tool use&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Sources 1–3 are independent and their corrections compose cleanly. Source 4 (MoE routing) is qualitatively different — it’s a discrete structural change that &lt;strong&gt;amplifies&lt;/strong&gt; Sources 2 and 3 rather than being fully independent. Source 5 (tool-call trajectories) is also qualitatively different — the mismatch comes from non-policy tokens in the conditioning context rather than from policy weight differences.&lt;/p&gt;

&lt;h2 id=&quot;the-is-aggregation-problem-why-no-practical-method-is-exact&quot;&gt;The IS Aggregation Problem: Why No Practical Method Is Exact&lt;/h2&gt;

&lt;p&gt;Before diving into individual sources, it’s worth understanding a fundamental limitation that underlies all of them.&lt;/p&gt;

&lt;p&gt;For a sequence-level reward $R(x,y)$, the &lt;strong&gt;exact&lt;/strong&gt; IS gradient requires the full product of per-token ratios:&lt;/p&gt;

\[\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \mu_{\theta_{old}}}\left[\prod_{t=1}^{\lvert y \rvert} \frac{\pi_\theta(y_t \mid y_{&amp;lt;t})}{\mu_{\theta_{old}}(y_t \mid y_{&amp;lt;t})} \cdot R(x,y) \cdot \sum_{t} \nabla_\theta \log \pi_\theta(y_t \mid y_{&amp;lt;t})\right]\]

&lt;p&gt;The per-token ratio $\frac{\pi_\theta(y_t \mid y_{&amp;lt;t})}{\mu_{\theta_{old}}(y_t \mid y_{&amp;lt;t})}$ is the correct building block. But the &lt;strong&gt;product&lt;/strong&gt; of these ratios over a full sequence is intractable — for $\lvert y \rvert = 1000$ tokens, even with each ratio in $[0.99, 1.01]$, the product ranges from $\approx 4 \times 10^{-5}$ to $\approx 2.1 \times 10^{4}$. Variance grows exponentially with sequence length, making gradient estimates useless.&lt;/p&gt;
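&lt;p&gt;The variance blow-up is easy to reproduce with a toy Monte Carlo experiment (a standalone sketch: per-token ratios are drawn i.i.d. lognormal with mean exactly 1, and the noise scale is illustrative):&lt;/p&gt;

```python
import math, random

random.seed(0)

def product_weight(T, sigma=0.05):
    """Product of T per-token IS ratios, each lognormal with mean exactly 1."""
    log_w = sum(random.gauss(-0.5 * sigma ** 2, sigma) for _ in range(T))
    return math.exp(log_w)

# The sequence-level weight stays unbiased (mean near 1) while its variance
# explodes with sequence length T, roughly like exp(T * sigma^2) - 1.
stats = {}
for T in (10, 100, 1000):
    ws = [product_weight(T) for _ in range(4000)]
    mean = sum(ws) / len(ws)
    var = sum((w - mean) ** 2 for w in ws) / len(ws)
    stats[T] = (mean, var)
    print(T, round(mean, 3), round(var, 3))
```

&lt;p&gt;The estimator is correct in expectation at every length; it is the exponentially growing variance, not bias, that makes the exact product unusable.&lt;/p&gt;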

&lt;p&gt;Every practical algorithm approximates this product differently:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;IS weight used&lt;/th&gt;
      &lt;th&gt;Relation to exact $\prod_t r_t$&lt;/th&gt;
      &lt;th&gt;Tradeoff&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Exact&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;$\prod_t r_t$ (full product)&lt;/td&gt;
      &lt;td&gt;Correct&lt;/td&gt;
      &lt;td&gt;Intractable variance&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;GRPO&lt;/strong&gt; (token-level)&lt;/td&gt;
      &lt;td&gt;Per-token $r_t$ in PPO surrogate (gradients flow through)&lt;/td&gt;
      &lt;td&gt;Same first-order token-level approx: $\prod(1+\delta_t) \approx 1 + \sum \delta_t$&lt;/td&gt;
      &lt;td&gt;Low bias when ratios $\approx$ 1; breaks down far from on-policy&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;CISPO&lt;/strong&gt; (token-level, detached)&lt;/td&gt;
      &lt;td&gt;Per-token $\text{clip}(r_t)$ as detached weight on log-prob&lt;/td&gt;
      &lt;td&gt;Same first-order token-level approx, but ratio is clipped and stop-gradient&lt;/td&gt;
      &lt;td&gt;Preserves all tokens; slight bias from weight clipping&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;GSPO&lt;/strong&gt; (sequence geometric mean)&lt;/td&gt;
      &lt;td&gt;$(\prod_t r_t)^{1/\vert y\vert}$ applied uniformly&lt;/td&gt;
      &lt;td&gt;$\vert y\vert$-th root of exact product — a heuristic, not a principled approximation&lt;/td&gt;
      &lt;td&gt;Low variance; unclear what bias it introduces&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
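&lt;p&gt;The table’s aggregation schemes can be compared directly on a toy ratio vector (the numbers and clip bounds are illustrative, not any paper’s hyperparameters):&lt;/p&gt;

```python
import math

# Per-token IS ratios pi_theta / mu_old for one sampled response (toy values).
r = [1.03, 0.97, 1.10, 0.97, 1.02]

exact = math.prod(r)                                 # full product: correct but high-variance
grpo_weights = r                                     # per-token ratios; gradients flow through each
cispo_weights = [min(max(x, 0.95), 1.05) for x in r] # clipped, detached per-token weights
gspo = exact ** (1.0 / len(r))                       # geometric mean, applied uniformly to all tokens

print(round(exact, 4), round(gspo, 4), cispo_weights)
```

&lt;p&gt;Note how the GSPO weight is a single scalar per sequence (its $\lvert y \rvert$-th power recovers the exact product), while GRPO and CISPO keep one weight per token.&lt;/p&gt;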

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;The fundamental tension&lt;/strong&gt;: sequence-level rewards demand sequence-level IS, but sequence-level IS has exponential variance in autoregressive models. There is no free lunch — every method trades bias for variance differently. The practical question is which approximation degrades most gracefully as the five mismatch sources push ratios away from 1.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;source-1-multi-epoch-policy-drift&quot;&gt;Source 1: Multi-Epoch Policy Drift&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: PPO and GRPO reuse the same batch of rollouts for multiple gradient steps (epochs). After the first update, the policy $\pi_\theta$ has drifted from the sampling policy $\pi_{\theta_{old}}$.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correction&lt;/strong&gt;: The standard IS ratio with clipping, built into the algorithm:&lt;/p&gt;

\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, \quad L = \min\left(r_t A_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\]

&lt;p&gt;This is the most well-studied mismatch and is already handled by the algorithm’s own clipping mechanism.&lt;/p&gt;
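&lt;p&gt;For concreteness, the clipped surrogate for a single token can be sketched as follows (standalone toy code, not a framework excerpt):&lt;/p&gt;

```python
def ppo_token_loss(r, adv, eps=0.2):
    """PPO clipped surrogate for a single token (minimized, hence the sign flip).

    r   -- importance ratio pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    adv -- advantage estimate A_t
    """
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)   # clamp into [1-eps, 1+eps]
    return -min(r * adv, r_clipped * adv)

# Inside the trust region: clipping is inactive.
print(ppo_token_loss(1.05, 2.0))   # -2.1
# Ratio well above 1+eps with positive advantage: the objective is capped at (1+eps)*A.
print(ppo_token_loss(1.8, 2.0))    # -2.4
```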

&lt;p&gt;&lt;strong&gt;Interaction with KL&lt;/strong&gt;: When using “KL in reward” (combined form), PPO’s clipping automatically provides IS correction for the KL term. When using “KL as loss” (decoupled form), the KL term needs its own explicit IS correction across epochs — applying the same importance ratio $\rho_k(\theta) = \pi_\theta / \pi_{\theta_k}$ to the KL loss, with clipping: $\min(\rho_k \cdot k_n',\; \text{clip}(\rho_k, 1-\epsilon, 1+\epsilon) \cdot k_n')$. See [2] for the full analysis.&lt;/p&gt;

&lt;h2 id=&quot;source-2-backend-mismatch-tis&quot;&gt;Source 2: Backend Mismatch (TIS)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In disaggregated RL systems, the rollout engine (vLLM/SGLang) and training engine (FSDP/Megatron) use different backends. Even with &lt;strong&gt;identical weights&lt;/strong&gt;, these backends produce different log-probabilities due to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Different attention kernels (e.g., FlashAttention variants)&lt;/li&gt;
  &lt;li&gt;Different operator fusion patterns&lt;/li&gt;
  &lt;li&gt;Quantization differences (FP8/INT8 rollout vs BF16 training)&lt;/li&gt;
  &lt;li&gt;Numerical precision handling (accumulated floating-point differences)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates an unintentional off-policy gap: $\pi_{\text{sampler}}(\theta) \neq \pi_{\text{learner}}(\theta)$.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it arises&lt;/strong&gt;: Whenever rollout and training use different software backends — which is the common case in both colocated and disaggregated setups. Note that “colocated” means the same GPUs are time-shared between rollout and training phases, but most colocated frameworks still use different engines (e.g., vLLM for inference, FSDP for training) on those GPUs. Backend mismatch is eliminated only when both phases use the &lt;em&gt;same engine&lt;/em&gt; (e.g., the training framework’s own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate()&lt;/code&gt;), which sacrifices inference throughput significantly and is rare in practice.&lt;/p&gt;

&lt;h3 id=&quot;three-log-probabilities&quot;&gt;Three Log-Probabilities&lt;/h3&gt;

&lt;p&gt;When backend mismatch exists, we need &lt;strong&gt;three&lt;/strong&gt; distinct log-probabilities:&lt;/p&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart LR
    subgraph rollout[&quot;Rollout Phase&quot;]
        direction TB
        P[&quot;Prompt&quot;] --&amp;gt; SAM[&quot;Sampler (vLLM)&quot;]
        SAM --&amp;gt; LSAM[&quot;① log π_sam(θ_old)&quot;]
    end

    subgraph training[&quot;Training Phase — Learner (FSDP)&quot;]
        direction TB
        FP1[&quot;③ FP#1 (θ, grads) → log π_learn(θ)&quot;]
        FP2[&quot;② FP#2 (θ_old, no grad) → log π_learn(θ_old)&quot;]
    end

    LSAM -- &quot;cached&quot; --&amp;gt; training
&lt;/pre&gt;

&lt;ol&gt;
  &lt;li&gt;$\log \pi_{\text{sampler}}(a, \theta_{\mathrm{old}})$ — from rollout backend at time of sampling (&lt;strong&gt;cached&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;$\log \pi_{\text{learner}}(a, \theta_{\mathrm{old}})$ — from training backend with $\theta_{\mathrm{old}}$ weights (&lt;strong&gt;FP#2, detached&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;$\log \pi_{\text{learner}}(a, \theta)$ — from training backend with current weights (&lt;strong&gt;FP#1, with gradients&lt;/strong&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;correction-approach-1-resampling&quot;&gt;Correction Approach 1: Resampling&lt;/h3&gt;

&lt;p&gt;Recompute log-probs using the training backend, discarding the sampler’s log-probs entirely:&lt;/p&gt;

\[\frac{\pi_{\text{learner}}(a, \theta)}{\pi_{\text{learner}}(a, \theta_{\mathrm{old}})} \quad \text{(both from same backend)}\]

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pro&lt;/strong&gt;: Same-backend ratio; no mismatch in the PPO ratio itself&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Con&lt;/strong&gt;: Requires an extra forward pass (FP#2); and since the expectation is still taken over $\pi_{\text{sampler}}$, the mismatch persists in the sampling distribution itself&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Used by&lt;/strong&gt;: VeRL (built-in implementation)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;correction-approach-2-truncated-importance-sampling-tis&quot;&gt;Correction Approach 2: Truncated Importance Sampling (TIS)&lt;/h3&gt;

&lt;p&gt;Explicitly correct for the backend gap with a truncated importance ratio:&lt;/p&gt;

\[r_{\text{TIS}} = \min\left(\frac{\pi_{\text{learner}}(a, \theta_{\mathrm{old}})}{\pi_{\text{sampler}}(a, \theta_{\mathrm{old}})}, C\right)\]

&lt;p&gt;where $C$ is a loose cap (typically 2–10, much looser than PPO’s $1+\epsilon$). The full loss becomes:&lt;/p&gt;

\[\mathcal{L} = -\frac{1}{B \times G} \sum_{i} r_{\text{TIS}} \cdot \min\left(r_{\text{PPO}} \cdot A_i, \text{clip}(r_{\text{PPO}}, 1-\epsilon, 1+\epsilon) \cdot A_i\right)\]

&lt;p&gt;with two separate ratios serving distinct roles:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$r_{\text{TIS}}$ corrects for &lt;strong&gt;backend mismatch&lt;/strong&gt; (should be $\approx 1.0$ if backends are close)&lt;/li&gt;
  &lt;li&gt;$r_{\text{PPO}} = \frac{\pi_{\text{learner}}(a, \theta)}{\pi_{\text{learner}}(a, \theta_{\mathrm{old}})}$ handles &lt;strong&gt;policy drift&lt;/strong&gt; (same backend, no mismatch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gradient flows through $r_{\text{PPO}}$ but $r_{\text{TIS}}$ is detached (no gradient through old policy).&lt;/p&gt;
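&lt;p&gt;Putting the two ratios together for a single token (a sketch with our own function names; in a real trainer the log-probs are tensors and $r_{\text{TIS}}$ sits under no-grad):&lt;/p&gt;

```python
import math

def tis_token_loss(logp_sampler_old, logp_learner_old, logp_learner_new,
                   adv, eps=0.2, cap=10.0):
    """Two-ratio TIS loss for one token (sketch, not a framework implementation).

    r_tis corrects the backend mismatch and acts as a constant weight
    (it would be computed under torch.no_grad() / .detach() in practice);
    r_ppo carries the gradient and keeps PPO's trust-region semantics.
    """
    r_tis = min(math.exp(logp_learner_old - logp_sampler_old), cap)
    r_ppo = math.exp(logp_learner_new - logp_learner_old)
    r_ppo_clipped = min(max(r_ppo, 1.0 - eps), 1.0 + eps)
    return -r_tis * min(r_ppo * adv, r_ppo_clipped * adv)

# Identical backends and no drift: reduces exactly to the PPO surrogate.
print(tis_token_loss(-1.0, -1.0, -1.0, 1.0))   # -1.0
```

&lt;p&gt;When the backends disagree, $r_{\text{TIS}}$ reweights the token without disturbing the trust region; the cap $C$ only bites on extreme discrepancies.&lt;/p&gt;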

&lt;h3 id=&quot;why-not-combine-into-a-single-ratio&quot;&gt;Why Not Combine Into a Single Ratio?&lt;/h3&gt;

&lt;p&gt;A natural idea: skip the separation and use $\frac{\pi_{\text{learner}}(a, \theta)}{\pi_{\text{sampler}}(a, \theta_{\mathrm{old}})}$ directly as the PPO ratio. This doesn’t work well in practice.&lt;/p&gt;

&lt;p&gt;From the Flash-RL team [3]:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Even when $\theta = \theta_{\mathrm{old}}$, the probability ratio $\frac{\pi_{\mathrm{learner}}(a, \theta)}{\pi_{\mathrm{sampler}}(a, \theta_{\mathrm{old}})}$ is already not equal to 1 due to the mismatch — this makes the clipping happen with high possibility and the training much less informative. Furthermore, in our TIS method, we separately clip $\frac{\pi_{\mathrm{learner}}(a, \theta_{\mathrm{old}})}{\pi_{\mathrm{sampler}}(a, \theta_{\mathrm{old}})}$ and $\frac{\pi_{\mathrm{learner}}(a, \theta)}{\pi_{\mathrm{learner}}(a, \theta_{\mathrm{old}})}$ and thus much more mild; notice $\frac{\pi_{\mathrm{learner}}(a, \theta)}{\pi_{\mathrm{learner}}(a, \theta_{\mathrm{old}})}$ equals to 1 when $\theta = \theta_{\mathrm{old}}$ which is suitable for the trust region constraint.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: PPO’s trust region is designed around the assumption that the ratio starts at 1.0. Backend mismatch violates this assumption, causing excessive clipping and uninformative gradients. Separating the two ratios preserves the trust region semantics.&lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;aside-sequence-level-tolerance&quot;&gt;Aside: Sequence-Level Tolerance&lt;/h3&gt;

&lt;p&gt;The GSPO paper [4] claims that sequence-level likelihoods are more tolerant of backend precision differences, since small per-token numerical errors may cancel out when aggregated via the geometric mean. If true, this would let GSPO skip TIS entirely — reducing the training node to a single forward pass. However, the geometric mean is not a principled approximation to the exact sequence-level IS product (it’s the $\vert y\vert $-th root, a fundamentally different quantity). We mention it for completeness but would not rely on it as a general-purpose correction strategy.&lt;/p&gt;

&lt;h2 id=&quot;source-3-async-staleness&quot;&gt;Source 3: Async Staleness&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In asynchronous RL systems, the rollout engine generates data using policy weights from $k$ training steps ago ($\pi_{\theta_k}$), while the training engine optimizes the current policy ($\pi_\theta$). The staleness $k$ depends on system throughput and scheduling.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-03-01-Off-Policy-Corrections-LLM-RL/async-staleness-timeline.drawio.png&quot; alt=&quot;Async staleness timeline showing rollout and training engines operating concurrently, with Batch B generated under θ₀ but trained after the θ₀→θ₁ update (staleness k=1).&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it arises&lt;/strong&gt;: Any system where rollout and training overlap in time:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Fully async systems (e.g., AReaL [5], Slime async mode)&lt;/li&gt;
  &lt;li&gt;Double-buffered / pipelined systems (generating batch N+1 while training on batch N)&lt;/li&gt;
  &lt;li&gt;Systems with slow weight synchronization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it’s worse than multi-epoch drift&lt;/strong&gt;: Multi-epoch drift is bounded (typically 1–4 epochs, each a small gradient step). Async staleness is unbounded without explicit control — the policy can drift arbitrarily far between when data was generated and when it’s consumed for training.&lt;/p&gt;

&lt;h3 id=&quot;prevention-staleness-rate-limiting&quot;&gt;Prevention: Staleness Rate-Limiting&lt;/h3&gt;

&lt;p&gt;Rather than correcting stale data after the fact, &lt;strong&gt;bound how stale data can get&lt;/strong&gt;:&lt;/p&gt;

\[\lfloor (N_r - 1) / B \rfloor \leq i + \eta\]

&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$N_r$ = total generated trajectories so far&lt;/li&gt;
  &lt;li&gt;$B$ = training batch size&lt;/li&gt;
  &lt;li&gt;$i$ = current policy version (increments each training step)&lt;/li&gt;
  &lt;li&gt;$\eta$ = maximum staleness budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rollout controller blocks new generation requests when this bound is violated, unblocking when training completes a step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended values&lt;/strong&gt; (from AReaL): $\eta = 4$ for coding tasks, $\eta = 8$ for math tasks. Tighter bounds mean less staleness but more pipeline bubbles; looser bounds improve throughput but degrade convergence.&lt;/p&gt;
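&lt;p&gt;The gate itself is a few lines (a sketch; the function name and the convention that $N_r$ counts the trajectory about to be generated are ours):&lt;/p&gt;

```python
def generation_allowed(n_generated, batch_size, policy_version, eta):
    """AReaL-style staleness gate (sketch): permit a new rollout only while
    floor((N_r - 1) / B) stays at most i + eta. n_generated counts the
    trajectory about to be produced."""
    staleness = (n_generated - 1) // batch_size - policy_version
    return max(0, staleness - eta) == 0   # clamp-style "at most eta" check

# eta = 4, batch size 128, policy version 0: trajectory 640 is the last one
# allowed; trajectory 641 would imply staleness 5 and blocks until the
# trainer finishes a step (incrementing the policy version).
print(generation_allowed(640, 128, 0, 4))   # True
print(generation_allowed(641, 128, 0, 4))   # False
print(generation_allowed(641, 128, 1, 4))   # True
```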

&lt;h3 id=&quot;correction-decoupled-ppo&quot;&gt;Correction: Decoupled PPO&lt;/h3&gt;

&lt;p&gt;When data is stale, standard PPO’s trust region is misplaced — it clips around $\pi_{\theta_k}$ (the stale behavior policy) rather than the current policy. AReaL’s &lt;strong&gt;decoupled PPO&lt;/strong&gt; separates two roles:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Behavior policy&lt;/strong&gt; $\pi_{\text{behav}}$: the (possibly stale) policy that generated the data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Proximal policy&lt;/strong&gt; $\pi_{\text{prox}}$: current policy snapshot, used as the trust region center&lt;/li&gt;
&lt;/ul&gt;

\[J(\theta) = \mathbb{E}\left[ \frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \cdot \min\left(u_t^{\text{prox}}(\theta) \cdot \hat{A}_t, \text{clip}(u_t^{\text{prox}}(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]\]

&lt;p&gt;where $u_t^{\text{prox}}(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}$.&lt;/p&gt;

&lt;p&gt;The importance weight $\pi_{\text{prox}} / \pi_{\text{behav}}$ corrects for distribution shift. The clip operates around $\pi_{\text{prox}}$ (a recent, high-quality policy) rather than $\pi_{\text{behav}}$ (potentially stale).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;: Store &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_prob_behav&lt;/code&gt; during generation, tagged with the policy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight_version&lt;/code&gt;. Before training, compute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_prob_prox&lt;/code&gt; via a forward pass with the current policy snapshot.&lt;/p&gt;
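&lt;p&gt;The resulting per-token objective can be sketched as follows (scalar toy code; in practice the three log-probs are tensors and only the current-policy term carries gradients):&lt;/p&gt;

```python
import math

def decoupled_ppo_token_loss(logp_behav, logp_prox, logp_cur, adv, eps=0.2):
    """Decoupled PPO objective for one token (sketch of the AReaL-style form).

    logp_behav -- behavior policy, cached at generation time (possibly stale)
    logp_prox  -- proximal policy snapshot, the trust-region center
    logp_cur   -- current policy (the only term carrying gradients in practice)
    """
    w = math.exp(logp_prox - logp_behav)           # staleness correction, detached
    u = math.exp(logp_cur - logp_prox)             # ratio clipped around pi_prox
    u_clipped = min(max(u, 1.0 - eps), 1.0 + eps)
    return -w * min(u * adv, u_clipped * adv)

# Synchronous case (behavior == proximal policy): collapses to standard PPO.
print(decoupled_ppo_token_loss(-1.0, -1.0, -1.0, 2.0))   # -2.0
```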

&lt;h3 id=&quot;heuristic-corrections&quot;&gt;Heuristic Corrections&lt;/h3&gt;

&lt;p&gt;Two simpler (but less principled) alternatives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off-Policy Sequence Masking (OPSM)&lt;/strong&gt;: Discard entire sequences whose importance ratio $\pi_\theta / \pi_{\theta_k}$ exceeds a threshold. Simple but wastes gradient signal from masked sequences.&lt;/p&gt;
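&lt;p&gt;A minimal OPSM check might look like this (a sketch only; the masking statistic and threshold are our illustrative choices, and implementations differ):&lt;/p&gt;

```python
def opsm_keep(logp_new, logp_old, tau=0.5):
    """Off-policy sequence masking (sketch): keep a sequence only while the
    magnitude of its mean per-token log importance ratio stays within tau.
    The exact statistic and threshold vary across implementations."""
    mean_log_ratio = sum(a - b for a, b in zip(logp_new, logp_old)) / len(logp_new)
    return max(0.0, abs(mean_log_ratio) - tau) == 0.0   # clamp-style "at most tau" test

# Mild drift: the sequence keeps contributing gradient signal.
print(opsm_keep([-1.0, -1.0], [-1.1, -1.05]))   # True
# Severe drift: the whole sequence is dropped from the loss.
print(opsm_keep([-0.1, -0.1], [-1.5, -1.5]))    # False
```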

&lt;p&gt;&lt;strong&gt;TIS for staleness&lt;/strong&gt;: Apply truncated importance sampling (same mechanism as Source 2) to the staleness ratio. Clips extreme ratios but doesn’t relocate the trust region.&lt;/p&gt;

&lt;h3 id=&quot;why-both-prevention-and-correction-are-needed&quot;&gt;Why Both Prevention and Correction Are Needed&lt;/h3&gt;

&lt;p&gt;AReaL’s ablations demonstrate that &lt;strong&gt;neither rate-limiting nor decoupled PPO alone is sufficient&lt;/strong&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Setup&lt;/th&gt;
      &lt;th&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Naive PPO, $\eta$=1&lt;/td&gt;
      &lt;td&gt;Degraded vs synchronous&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Naive PPO, $\eta$=4&lt;/td&gt;
      &lt;td&gt;Collapsed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Decoupled PPO, $\eta$=$\infty$ (unbounded)&lt;/td&gt;
      &lt;td&gt;Degraded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Decoupled PPO, $\eta \leq 8$&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Matches synchronous oracle&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Prevention ($\eta$) bounds worst-case staleness; correction (decoupled PPO) handles the residual drift within that bound. The combination is what makes fully async training viable.&lt;/p&gt;

&lt;h2 id=&quot;source-4-moe-routing-mismatch&quot;&gt;Source 4: MoE Routing Mismatch&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In Mixture-of-Experts models, the router selects which experts process each token. After a gradient update, the router’s decisions might change. This is more prominent in deeper MoE architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s listed separately&lt;/strong&gt;: Unlike Sources 1–3, this mismatch involves a &lt;strong&gt;discrete structural change&lt;/strong&gt; (different experts activated) rather than continuous numerical drift. However, mechanistically, MoE routing is &lt;strong&gt;entangled with Sources 2 and 3&lt;/strong&gt; — it amplifies both backend mismatch and policy staleness.&lt;/p&gt;

&lt;h3 id=&quot;how-routing-amplifies-sources-2-and-3&quot;&gt;How Routing Amplifies Sources 2 and 3&lt;/h3&gt;

&lt;p&gt;The Qwen team’s analysis [1] shows that for MoE models, the token-level IS weight decomposes as:&lt;/p&gt;

\[\frac{\pi_\theta(y_t \mid x, y_{&amp;lt;t})}{\mu_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t})} = \underbrace{\frac{\pi_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t}, e^{\pi}_{old,t})}{\mu_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t}, e^{\mu}_{old,t})}}_{\text{training-inference discrepancy}} \times \underbrace{\frac{\pi_\theta(y_t \mid x, y_{&amp;lt;t}, e^{\pi}_t)}{\pi_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t}, e^{\pi}_{old,t})}}_{\text{policy staleness}}\]

&lt;p&gt;where $e^{\pi}$ and $e^{\mu}$ denote the routed experts in the training and inference engines respectively. Expert routing now appears &lt;strong&gt;inside both factors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Training-inference discrepancy&lt;/strong&gt;: Even with identical weights, the training engine and inference engine may route to different experts ($e^{\pi}_{old,t} \neq e^{\mu}_{old,t}$), amplifying the numerical differences that already exist from different kernels/precision.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Policy staleness&lt;/strong&gt;: After gradient updates, not only do the model parameters change, but the routed experts also shift $e^{\pi}_t \neq e^{\pi}_{old,t}$, compounding the distribution shift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This entanglement is why MoE RL training is fundamentally harder to stabilize than dense model training — routing noise makes the first-order approximation break down faster.&lt;/p&gt;

&lt;h3 id=&quot;correction-routing-replay&quot;&gt;Correction: Routing Replay&lt;/h3&gt;

&lt;p&gt;The core idea: fix the routed experts during policy optimization so the model behaves like a dense one for IS computation purposes. Two variants exist [1]:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-03-01-Off-Policy-Corrections-LLM-RL/routing-replay-r2-vs-r3.drawio.png&quot; alt=&quot;R2 vs R3 Routing Replay comparison: R2 replays the training engine&apos;s own routes (eπ_old), while R3 passes the inference engine&apos;s routing decisions (eμ_old) to training.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R2 (Vanilla Routing Replay)&lt;/strong&gt;: Replay the &lt;strong&gt;training engine’s&lt;/strong&gt; routing ($e^{\pi}_{old,t}$). Reduces the policy staleness component. For the first mini-batch of a global step, the target policy is unaltered; for subsequent mini-batches, the forced routing deviates from the model’s natural routing, biasing the optimization target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R3 (Rollout Routing Replay)&lt;/strong&gt;: Replay the &lt;strong&gt;inference engine’s&lt;/strong&gt; routing ($e^{\mu}_{old,t}$). Reduces the training-inference discrepancy by forcing the training engine to use the same experts the inference engine chose. Always alters the target policy (even in the first mini-batch), since the training engine is forced to use the inference engine’s routing decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical finding — R2 vs R3 depends on off-policiness&lt;/strong&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Off-policiness $N$ (global batch / micro-batch)&lt;/th&gt;
      &lt;th&gt;Better variant&lt;/th&gt;
      &lt;th&gt;Why&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Small ($N$=2)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;R2&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;R2 preserves target policy in first mini-batch; R3’s bias outweighs its discrepancy reduction&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Large ($N \geq 4$)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;R3&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Training-inference discrepancy dominates; R3’s reduction of this factor outweighs its target-policy bias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Very large ($N$=8)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;R3 essential&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;R2 fails to sustain stable training; only R3 + clipping remains viable&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;On-policy training ($N$=1)&lt;/strong&gt;: Neither R2 nor R3 provides benefit — Routing Replay introduces bias without compensating gain. The basic algorithm with IS correction (no Routing Replay) achieves the best performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared drawbacks of both variants&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Additional memory overhead (must store routing decisions per token)&lt;/li&gt;
  &lt;li&gt;Communication overhead in distributed settings&lt;/li&gt;
  &lt;li&gt;Introduces optimization bias by forcing non-natural expert assignments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;alternative-sequence-level-is-gspo&quot;&gt;Alternative: Sequence-Level IS (GSPO)&lt;/h3&gt;

&lt;p&gt;GSPO sidesteps routing replay entirely by operating at the &lt;strong&gt;sequence level&lt;/strong&gt; using a geometric mean of per-token ratios: $s_i(\theta) = (\pi_\theta(y_i \mid x) / \pi_{\theta_{old}}(y_i \mid x))^{1/\lvert y_i \rvert}$. The intuition is that aggregating across the full sequence dilutes individual routing fluctuations, avoiding the bias that Routing Replay introduces by forcing expert assignments.&lt;/p&gt;
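&lt;p&gt;The sequence weight is computed stably in log space (a sketch; inputs are per-token log-probs under the new and old policies):&lt;/p&gt;

```python
import math

def gspo_sequence_ratio(logp_new, logp_old):
    """GSPO-style sequence weight: the geometric mean of per-token ratios,
    computed stably as exp(mean per-token log-ratio)."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

# One volatile token (ratio 0.37, e.g. from a routing flip) in a 4-token
# response: the sequence-level weight is diluted toward 1.
print(round(gspo_sequence_ratio([-1.0, -1.0, -2.0, -1.0],
                                [-1.0, -1.0, -1.0, -1.0]), 4))   # 0.7788
```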

&lt;p&gt;&lt;strong&gt;Caveats&lt;/strong&gt;: The geometric mean is not a principled approximation to the exact sequence-level IS product — it’s a heuristic that happens to have low variance. The GSPO paper [4] demonstrates stability on Qwen3’s MoE architecture, but Qwen3 appears to have relatively stable routing behavior. Whether this generalizes to architectures with more volatile routing (deeper models, different load-balancing) is unclear. The practical benefits (no Routing Replay memory/communication overhead, no target-policy bias) are real, but they come from a theoretically unmotivated transformation of the IS ratio.&lt;/p&gt;

&lt;h2 id=&quot;source-5-tool-call-trajectory-mismatch-agentic-rl--emerging&quot;&gt;Source 5: Tool-Call Trajectory Mismatch (Agentic RL) — Emerging&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This source is &lt;strong&gt;relatively under-studied&lt;/strong&gt; compared to Sources 1–4. Most of the analysis below is based on theoretical reasoning and early observations rather than large-scale empirical validation. We include it because it will become increasingly important as agentic training scales, but readers should calibrate their confidence accordingly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In agentic RL training, trajectories are multi-turn with interleaved LM output and tool/environment output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Turn 1:  LM generates action a₁ ~ π_θ(·|x)
         Tool returns observation o₁  ← NOT from π_θ
Turn 2:  LM generates action a₂ ~ π_θ(·|x, a₁, o₁)
         Tool returns observation o₂  ← NOT from π_θ
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tool output tokens (code execution results, search snippets, API responses) are &lt;strong&gt;not drawn from the LM policy&lt;/strong&gt; — they come from the environment. While loss is not computed over tool output tokens (they are masked), they appear in the conditioning context for subsequent LM generations.&lt;/p&gt;
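&lt;p&gt;The standard masking setup can be sketched as follows (token ids and segment layout are purely illustrative):&lt;/p&gt;

```python
# Interleaved agentic trajectory: LM tokens carry loss, tool tokens are
# masked out of the loss but stay in the conditioning context.
segments = [
    ("lm",   [101, 102, 103]),   # action a1, sampled from pi_theta
    ("tool", [900, 901]),        # observation o1, produced by the environment
    ("lm",   [104, 105]),        # action a2, conditioned on a1 and o1
]

token_ids, loss_mask = [], []
for source, ids in segments:
    token_ids.extend(ids)
    loss_mask.extend([1 if source == "lm" else 0] * len(ids))

print(token_ids)   # [101, 102, 103, 900, 901, 104, 105]
print(loss_mask)   # [1, 1, 1, 0, 0, 1, 1]
```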

&lt;p&gt;&lt;strong&gt;The off-policy effect&lt;/strong&gt;: Unlike Sources 1–4, where the mismatch is between different versions or implementations of the &lt;em&gt;same policy&lt;/em&gt;, Source 5 involves tokens from a fundamentally &lt;strong&gt;different generative process&lt;/strong&gt; (the environment) appearing in the LM’s context. This has three consequences:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Distribution shift in conditioning&lt;/strong&gt;: Tool outputs push the LM into distribution regions it would rarely visit through autoregressive generation alone. This is by design — tool use is valuable precisely because it gives the LM access to information it couldn’t generate — but it creates a challenging optimization landscape.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;IS ratio instability&lt;/strong&gt;: When the policy updates (Sources 1/3), the LM’s response to tool outputs can shift disproportionately. Tool-conditioned continuations sit in a high-sensitivity region of the distribution: small weight changes $\theta \to \theta'$ can cause large changes in $\pi_{\theta'}(a_t \mid \text{context with tool output})$. This amplifies IS ratios for post-tool tokens, leading to more clipping and less gradient signal from the most informationally rich parts of the trajectory.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Aggravated backend mismatch (Source 2)&lt;/strong&gt;: Low-probability tokens are where floating-point precision differences between backends matter most — relative numerical error is proportionally larger in the tail of the distribution. Since tool-conditioned context pushes the LM to generate tokens it otherwise wouldn’t, the log-prob discrepancy between sampler and learner backends is systematically worse on post-tool tokens.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why standard IS doesn’t help&lt;/strong&gt;: For Sources 1–4, the correction is conceptually clear — ratio the generating policy against the assumed policy. For Source 5, there is no “tool output policy” to ratio against. The tool outputs are fixed observations from the environment; the issue is that conditioning on them creates a more volatile optimization target, not that they were sampled from the wrong distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What practitioners currently do&lt;/strong&gt;: Masking tool output tokens from loss is standard practice. Beyond that, some apply TIS-style ratio clipping to post-tool tokens, and some frameworks compute advantages at the turn level rather than the full trajectory level.&lt;/p&gt;

&lt;h2 id=&quot;how-corrections-compose&quot;&gt;How Corrections Compose&lt;/h2&gt;

&lt;p&gt;In production systems, multiple sources of mismatch coexist. Understanding how their corrections interact is critical.&lt;/p&gt;

&lt;h3 id=&quot;ratio-decomposition&quot;&gt;Ratio Decomposition&lt;/h3&gt;

&lt;p&gt;For a fully async, disaggregated MoE system, the complete correction at the token level decomposes into four independent factors:&lt;/p&gt;

\[\underbrace{\frac{\pi_{\text{learner}}(\theta_k)}{\pi_{\text{sampler}}(\theta_k)}}_{\substack{\textbf{backend (Source 2)} \\ \text{Detached, cap } C \in [2,10] \\ \text{Eliminated if same engine}}} \;\times\; \underbrace{\frac{\pi_{\text{prox}}}{\pi_{\theta_k}}}_{\substack{\textbf{staleness (Source 3)} \\ \text{Detached (weight)} \\ \text{Eliminated if sync}}} \;\times\; \underbrace{\frac{\pi_\theta}{\pi_{\text{prox}}}}_{\substack{\textbf{trust region (Source 1)} \\ \text{Clipped } [1\!-\!\epsilon,\; 1\!+\!\epsilon] \\ \text{Always present}}} \;\times\; \underbrace{[\text{routing fix}]}_{\substack{\textbf{MoE (Source 4)} \\ \text{R2/R3 replay} \\ \text{Eliminated if dense}}}\]

&lt;p&gt;This is why &lt;strong&gt;simplifying assumptions matter&lt;/strong&gt;: using the same engine for rollout and training eliminates the backend factor (rare in practice), synchronous training eliminates the staleness factor, and when async training is required, tight staleness bounds ($\eta$) keep that factor small.&lt;/p&gt;

&lt;p&gt;GSPO claims to dilute factor 4 via sequence-level aggregation, though the theoretical basis for this is weak (see Source 4 caveats).&lt;/p&gt;

&lt;h3 id=&quot;tis--multi-epoch-disaggregated-ppogrpo&quot;&gt;TIS × Multi-Epoch (Disaggregated PPO/GRPO)&lt;/h3&gt;

&lt;p&gt;The most common combination. The full loss:&lt;/p&gt;

\[\mathcal{L} = r_{\text{TIS}} \cdot \min\left(r_{\text{PPO}} \cdot A, \text{clip}(r_{\text{PPO}}, 1-\epsilon, 1+\epsilon) \cdot A\right)\]

&lt;p&gt;These are &lt;strong&gt;orthogonal&lt;/strong&gt;: TIS corrects for numerical differences at fixed weights; the PPO ratio corrects for weight changes in the same backend. They can be applied independently.&lt;/p&gt;
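As a numeric sketch of this loss (plain floats here; in a real implementation the TIS ratio is computed from detached log-probs and everything is batched, and all names are illustrative):

```python
import math

def token_objective(lp_cur, lp_old_learner, lp_old_sampler,
                    advantage, eps=0.2, tis_cap=2.0):
    # TIS factor: same (old) weights, learner vs sampler backend; capped.
    r_tis = min(math.exp(lp_old_learner - lp_old_sampler), tis_cap)
    # PPO trust-region factor: current vs old weights, same (learner) backend.
    r_ppo = math.exp(lp_cur - lp_old_learner)
    r_clip = max(1 - eps, min(r_ppo, 1 + eps))
    return r_tis * min(r_ppo * advantage, r_clip * advantage)
```

With identical log-probs everywhere the objective reduces to the raw advantage, and the clip bounds the trust-region factor exactly as in standard PPO.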

&lt;h3 id=&quot;staleness--tis-fully-async-disaggregated&quot;&gt;Staleness × TIS (Fully Async Disaggregated)&lt;/h3&gt;

&lt;p&gt;When both async staleness and backend mismatch exist, three ratios are in play:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Backend correction&lt;/strong&gt;: $\frac{\pi_{\text{learner}}(\theta_k)}{\pi_{\text{sampler}}(\theta_k)}$ — same stale weights, different backends&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Staleness correction&lt;/strong&gt;: $\frac{\pi_{\text{prox}}}{\pi_{\text{behav}}}$ — different policy versions (decoupled PPO)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Trust region ratio&lt;/strong&gt;: $\frac{\pi_\theta}{\pi_{\text{prox}}}$ — current vs proximal policy (clipped)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, corrections #1 and #2 can be folded together. The key principle: &lt;strong&gt;apply TIS ratio first&lt;/strong&gt; (to bring sampler log-probs to learner space), &lt;strong&gt;then apply staleness/PPO corrections&lt;/strong&gt; on the learner-space log-probs.&lt;/p&gt;
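A minimal sketch of this ordering, under the assumption that each factor is a plain scalar (detaching, capping policy, and batching omitted; all names illustrative):

```python
import math

def corrected_weight(lp_cur, lp_prox, lp_behav_sampler, lp_behav_learner,
                     eps=0.2, tis_cap=2.0):
    # 1) Backend (TIS) correction: same stale weights, different backends.
    #    Brings the sampler's behavior log-prob into learner space first.
    r_backend = min(math.exp(lp_behav_learner - lp_behav_sampler), tis_cap)
    # 2) Staleness correction (decoupled-PPO weight), in learner space.
    r_stale = math.exp(lp_prox - lp_behav_learner)
    # 3) Clipped trust-region ratio: current vs proximal policy.
    r_tr = max(1 - eps, min(math.exp(lp_cur - lp_prox), 1 + eps))
    return r_backend * r_stale * r_tr
```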

&lt;h3 id=&quot;moe--everything-else&quot;&gt;MoE × Everything Else&lt;/h3&gt;

&lt;p&gt;MoE routing is &lt;strong&gt;entangled with&lt;/strong&gt; Sources 2 and 3 — it amplifies both backend mismatch and policy staleness:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Higher off-policiness (more mini-batches per global step) makes routing instability worse, which is why R3 becomes necessary at $N \geq 4$&lt;/li&gt;
  &lt;li&gt;Backend mismatch can cause inconsistent routing decisions even before any policy drift occurs&lt;/li&gt;
  &lt;li&gt;Both Routing Replay and TIS/clipping are needed for stable off-policy MoE training. GSPO avoids Routing Replay but substitutes a theoretically unmotivated sequence-level aggregation (see caveats in Source 4)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;tool-call-trajectories--everything-else&quot;&gt;Tool-Call Trajectories × Everything Else&lt;/h3&gt;

&lt;p&gt;Source 5 does not introduce its own IS ratio — there is no “tool output policy” to correct against. Instead, it &lt;strong&gt;amplifies the volatility&lt;/strong&gt; of all other IS ratios. Post-tool tokens sit in high-sensitivity distribution regions, so Sources 1–4 all produce larger and more variable IS ratios on these tokens. The practical implication: systems with agentic trajectories should expect more aggressive clipping and may benefit from turn-level ratio isolation to prevent cross-turn IS instability.&lt;/p&gt;

&lt;h2 id=&quot;practical-decision-tree&quot;&gt;Practical Decision Tree&lt;/h2&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart TD
    START{&quot;Which corrections do you
    need for your RL system?&quot;}

    START -- &quot;Same engine
    (rare in practice)&quot; --&amp;gt; SAME[&quot;No TIS needed
    1 FP on train node&quot;]
    START -- &quot;Different engines
    (most setups, incl. colocated)&quot; --&amp;gt; DIFF[&quot;Need TIS or resampling
    2 FPs on train node&quot;]

    SAME --&amp;gt; Q2{&quot;Training mode?&quot;}
    DIFF --&amp;gt; Q2

    Q2 -- &quot;Synchronous&quot; --&amp;gt; SYNC[&quot;No staleness
    correction&quot;]
    Q2 -- &quot;Asynchronous
    (rollout overlaps train)&quot; --&amp;gt; ASYNC[&quot;Need:
    • Rate-limiting (η)
    • Decoupled PPO&quot;]

    SYNC --&amp;gt; Q3{&quot;Model type?&quot;}
    ASYNC --&amp;gt; Q3

    Q3 -- &quot;Dense&quot; --&amp;gt; DENSE[&quot;Standard
    token-level IS&quot;]
    Q3 -- &quot;MoE&quot; --&amp;gt; MOE[&quot;Routing entangled w/ Sources 2 &amp;amp; 3:
    N=1: IS only, no replay
    N=2: R2 + clipping
    N≥4: R3 + clipping
    Alt: GSPO (heuristic)&quot;]

    DENSE --&amp;gt; Q4{&quot;Trajectory type?&quot;}
    MOE --&amp;gt; Q4

    Q4 -- &quot;Single-turn&quot; --&amp;gt; SINGLE[&quot;Standard handling&quot;]
    Q4 -- &quot;Agentic / tool-use&quot; --&amp;gt; AGENT[&quot;• Mask tool tokens
    • Consider turn-level credit assignment
    • Source 5 is emerging; expect more clipping&quot;]
&lt;/pre&gt;

&lt;h3 id=&quot;common-configurations&quot;&gt;Common Configurations&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Setup&lt;/th&gt;
      &lt;th&gt;Corrections needed&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Same engine, sync, dense, 1 epoch&lt;/td&gt;
      &lt;td&gt;None (simplest possible, but rare — sacrifices inference throughput)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Colocated, sync, dense, multi-epoch&lt;/td&gt;
      &lt;td&gt;TIS + PPO clipping (colocated still typically uses different engines)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, sync, dense&lt;/td&gt;
      &lt;td&gt;TIS + PPO clipping&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, async, dense&lt;/td&gt;
      &lt;td&gt;TIS + rate-limiting ($\eta$) + decoupled PPO&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, sync, MoE&lt;/td&gt;
      &lt;td&gt;TIS + Routing Replay (R2 or R3 depending on $N$)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, async, MoE&lt;/td&gt;
      &lt;td&gt;TIS + rate-limiting + decoupled PPO + Routing Replay&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Zheng, C., Dang, K., Yu, B., et al. “Stabilizing Reinforcement Learning with LLMs: Formulation and Practices.” arXiv preprint arXiv:2512.01374 (2025).&lt;/p&gt;

&lt;p&gt;[2] Liu, K., Liu, J. K., Chen, M., &amp;amp; Liu, Y. “Rethinking KL Regularization in RLHF.” arXiv preprint arXiv:2503.01491 (2025).&lt;/p&gt;

&lt;p&gt;[3] Yao, F., Liu, L., Zhang, D., et al. “Your Efficient RL Framework Secretly Brings You Off-Policy RL Training.” &lt;a href=&quot;https://fengyao.notion.site/off-policy-rl&quot;&gt;Blog post&lt;/a&gt;. See also &lt;a href=&quot;https://github.com/yaof20/Flash-RL&quot;&gt;Flash-RL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[4] Zheng, C., Liu, S., et al. “Group Sequence Policy Optimization.” arXiv preprint arXiv:2507.18071 (2025).&lt;/p&gt;

&lt;p&gt;[5] Mei, J. et al. “AReaL: An End-to-End Reinforcement Learning Framework for LLM Reasoning.” arXiv preprint arXiv:2505.24298 (2025).&lt;/p&gt;

&lt;hr /&gt;
</description>
        <pubDate>Sun, 01 Mar 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/Off-Policy-Corrections-LLM-RL/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/Off-Policy-Corrections-LLM-RL/</guid>
      </item>
    
      <item>
        <title>What's in Pass@K?</title>
        <description>&lt;div class=&quot;notice--info&quot;&gt;
  &lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Computing pass@k&lt;/strong&gt; requires sampling N &amp;gt;= k responses and applying an unbiased combinatorial estimator — not simply sampling k times and counting.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pass@k vs. pass@1:&lt;/strong&gt; pass@k is a nonlinear function of the pass rate p, which saturates on easy problems. Ranking models by pass@k effectively upweights hard problems.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Evaluation vs. checkpoint selection:&lt;/strong&gt; For evaluation, use large N and moderate k for stability. For checkpoint selection, you only need rankings — estimate $\hat{p} = c/N$ from moderate N and extrapolate to large k via the Bernoulli formula.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Efficient estimation for large k:&lt;/strong&gt; Use dynamic N (sample more on hard problems) and fit a Beta distribution to the difficulty distribution for stable extrapolation from limited samples.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pass@k as RL reward:&lt;/strong&gt; It works, but is compute-inefficient — you spend equal compute on all problems then downweight easy ones. Better to upsample hard questions or allocate larger group sizes to them directly.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;h2 id=&quot;computing-passk&quot;&gt;Computing Pass@k&lt;/h2&gt;

&lt;p&gt;Pass@k measures the probability that at least one of k sampled responses is correct. The standard way to estimate it [1] is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Sample N responses from the model (e.g. at temperature 0.6).&lt;/li&gt;
  &lt;li&gt;Count the number of correct responses c.&lt;/li&gt;
  &lt;li&gt;Compute pass@k using the unbiased estimator:&lt;/li&gt;
&lt;/ol&gt;

\[\text{pass@k} = 1 - \frac{\binom{N-c}{k}}{\binom{N}{k}}\]

&lt;p&gt;This estimator works for any N &amp;gt;= k. It answers the question: out of all ways to choose k of the N responses, what fraction of subsets contains at least one correct response?&lt;/p&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Common mistake:&lt;/strong&gt; Sample only k responses and report c/k as pass@k. That quantity estimates the per-sample pass rate p, not pass@k, so it is biased for any k &amp;gt; 1 and has high variance. The unbiased estimator requires N &amp;gt;= k, and using larger N reduces variance, giving a more stable estimate of pass@k.&lt;/p&gt;
&lt;/div&gt;
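The estimator is usually computed in the product form below (the same rearrangement used in the Codex paper [1]) so that large binomial coefficients never need to be formed:

```python
import math

def pass_at_k(N, c, k):
    """Unbiased pass@k given c correct responses out of N >= k samples."""
    if N - c < k:
        return 1.0  # fewer than k incorrect samples: every subset has a hit
    # C(N-c, k) / C(N, k) rearranged as prod_{i = N-c+1 .. N} (i - k) / i
    return 1.0 - math.prod((i - k) / i for i in range(N - c + 1, N + 1))
```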

&lt;h2 id=&quot;pass1-vs-passk&quot;&gt;Pass@1 vs. Pass@k&lt;/h2&gt;

&lt;p&gt;At first glance, the two metrics seem to carry the same information. For a single problem, both pass@1 and pass@k are monotonically increasing in c (the number of correct responses out of N). And for reasonably large N, both give stable estimates of the model’s performance under random sampling.&lt;/p&gt;

&lt;p&gt;The key difference is in how they aggregate across problems. Let p be the per-problem pass rate (i.e. the probability of a single sample being correct). Then:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pass@1&lt;/strong&gt; is linear in p: $\text{pass@1} = p$&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pass@k&lt;/strong&gt; is nonlinear in p: $\text{pass@k} = 1 - (1-p)^k$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As k grows, problems with high p see their pass@k saturate toward 1. A problem with p = 0.8 and a problem with p = 0.95 both have pass@k ≈ 1 for large k — the difference between them is effectively erased.&lt;/p&gt;

&lt;p&gt;Meanwhile, problems with low p remain far from saturation. A problem with p = 0.05 has pass@10 ≈ 0.40, while p = 0.15 gives pass@10 ≈ 0.80. The gap is amplified.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-01-30-Whats-in-Pass-at-K/pass1-vs-passk.svg&quot; alt=&quot;pass@1 vs pass@10&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure: Pass@1 (blue) and pass@10 (orange) for 10 problems sorted by difficulty. On easy problems (left), pass@10 saturates near 1 while pass@1 still varies. On hard problems (right), the gap between the two metrics widens. The dashed lines show the benchmark-level averages — pass@10 is dominated by hard problems where the model still has room to improve.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The consequence: &lt;strong&gt;ranking models by pass@k effectively upweights hard problems.&lt;/strong&gt; Improvements on easy problems (where both models already have high p) barely move the pass@k number, while improvements on hard problems (where p is low) show up clearly. This makes pass@k a useful complement to pass@1 when you care about a model’s ability to solve difficult tasks given multiple attempts.&lt;/p&gt;

&lt;h2 id=&quot;passk-for-model-evaluation-vs-checkpoint-selection&quot;&gt;Pass@k for Model Evaluation vs. Checkpoint Selection&lt;/h2&gt;

&lt;p&gt;The choice of N and k depends on what you’re using pass@k for. Two common use cases have quite different requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model evaluation.&lt;/strong&gt; The goal is to report a stable number that reflects how the model performs in practice. Users typically sample once (or a handful of times), so what matters is pass@1 or pass@k for small k. The main concern is &lt;strong&gt;stability&lt;/strong&gt;: you want large N so that the estimate of the pass rate p is precise, but k itself can stay moderate. Asymptotic correctness — what the model &lt;em&gt;could&lt;/em&gt; do given many attempts — is less important, because it doesn’t match how the model is actually used.&lt;/p&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;A common misconception:&lt;/strong&gt; Some papers justify using pass@k (with large k) as an evaluation metric because it “reflects generation diversity.” This conflates two things. Generation diversity is a property of the model’s output distribution, which is captured by the pass rate p estimated from large N. Large k doesn’t help you measure diversity — it just applies a nonlinear transform that compresses differences at the top of the distribution. To evaluate distributional properties, you need large N, not large k.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint selection.&lt;/strong&gt; The goal is different: you want to pick the best pretrain checkpoint for SFT, or the best SFT checkpoint for RL. Here you care about the &lt;strong&gt;potential&lt;/strong&gt; of a checkpoint — not its single-shot performance, but whether it &lt;em&gt;can&lt;/em&gt; solve the problem at all. This calls for large k, because pass@k with large k measures “does the model have this capability somewhere in its distribution?”&lt;/p&gt;

&lt;p&gt;The cost structure also differs. For checkpoint selection, you only need the &lt;em&gt;ranking&lt;/em&gt; between checkpoints, not the absolute score. This means we can go further than just using moderate N — we can drop the unbiased combinatorial estimator entirely and use the Bernoulli formula directly:&lt;/p&gt;

\[\text{pass@k} = 1 - (1 - \hat{p})^k, \quad \hat{p} = c / N\]

&lt;p&gt;This sidesteps the N &amp;gt;= k requirement altogether. We estimate the pass rate $\hat{p}$ from N samples, and then extrapolate to any k we want. The estimate of $\hat{p}$ from moderate N (e.g. N = 32) is noisy, so the absolute pass@k values won’t be precise — but that’s fine, because we only need the ranking between checkpoints.&lt;/p&gt;
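A small sketch of this shortcut, with made-up per-problem correct counts for two hypothetical checkpoints:

```python
def extrapolated_pass_at_k(counts, N, k):
    """counts: per-problem correct counts c, each out of N samples.
    Estimates p_hat = c/N per problem, extrapolates via 1 - (1 - p_hat)^k,
    and averages over problems."""
    scores = [1.0 - (1.0 - c / N) ** k for c in counts]
    return sum(scores) / len(scores)

# Ranking two hypothetical checkpoints at k = 256 from only N = 32 samples:
ckpt_a = extrapolated_pass_at_k([0, 1, 4, 30], N=32, k=256)
ckpt_b = extrapolated_pass_at_k([0, 0, 2, 32], N=32, k=256)
```

The absolute numbers are noisy; only the comparison between the two checkpoints is consumed.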

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Model Evaluation&lt;/th&gt;
      &lt;th&gt;Checkpoint Selection&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;What matters&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Stability of the score&lt;/td&gt;
      &lt;td&gt;Ranking between checkpoints&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;k&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Small to moderate&lt;/td&gt;
      &lt;td&gt;Large&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;N&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Large (for precise p)&lt;/td&gt;
      &lt;td&gt;Moderate (e.g. 32) is often sufficient&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Why&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Reflects real usage (few samples)&lt;/td&gt;
      &lt;td&gt;Measures potential / capability&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;efficient-estimation-of-passk-for-large-k&quot;&gt;Efficient estimation of pass@k for large k&lt;/h3&gt;

&lt;p&gt;When k is large, estimating pass@k per problem is wasteful if done uniformly. Two ideas can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic N per problem.&lt;/strong&gt; Not all problems need the same sampling budget. For easy problems, a small N already gives a confident estimate of $\hat{p}$. For hard problems — where $\hat{p}$ is close to 0 — the estimate is dominated by whether you observe &lt;em&gt;any&lt;/em&gt; correct response at all. A practical strategy is to keep expanding N for hard problems until you observe 1–2 correct generations, then stop [6]. This concentrates compute where it matters most: on the hard tail of the difficulty distribution, which is exactly the region that governs pass@k for large k.&lt;/p&gt;
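A toy version of this stopping rule, with `sample_once` as a hypothetical stand-in for one rollout plus verification:

```python
import random

def dynamic_n(sample_once, target_correct=2, batch=8, max_n=512):
    """Keep sampling until `target_correct` successes are observed or the
    budget `max_n` is exhausted; returns (correct, total) counts."""
    n, c = 0, 0
    while c < target_correct and n < max_n:
        c += sum(sample_once() for _ in range(batch))
        n += batch
    return c, n  # c / n is the pass-rate estimate p_hat

random.seed(0)
easy = lambda: random.random() < 0.5    # easy problem: stops quickly
hard = lambda: random.random() < 0.01   # hard problem: consumes more budget
c_easy, n_easy = dynamic_n(easy)
c_hard, n_hard = dynamic_n(hard)
```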

&lt;p&gt;&lt;strong&gt;Fitting a difficulty distribution.&lt;/strong&gt; Rather than estimating pass@k per problem independently, we can model the &lt;em&gt;distribution&lt;/em&gt; of pass rates across problems. Kazdan et al. [5] propose fitting a Beta distribution to the problem-level pass rates, then computing the expected pass@k under this distribution analytically. The Beta-Binomial model lets you estimate pass@k scaling from limited samples — you fit the Beta parameters $(\alpha, \beta)$ to the observed (successes, trials) counts across problems, and extrapolate to large k without ever sampling that many times. Combined with dynamic sampling (allocating more budget to hard problems), this gives reliable pass@k estimates at a fraction of the uniform-sampling cost.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-01-30-Whats-in-Pass-at-K/beta-passk-estimation.svg&quot; alt=&quot;Beta distribution and pass@k estimation&quot; /&gt;
&lt;em&gt;Left: Beta distributions fitted to true pass rates (blue) and noisy N=16 estimates (orange dashed). Right: pass@k scaling estimated four ways — ground truth (black), Beta fit from true p (blue dashed), Beta fit from N=16 (orange dashed), and per-problem Bernoulli from N=16 (grey dotted).&lt;/em&gt;&lt;/p&gt;
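The expected pass@k under a fitted Beta(alpha, beta) has a closed form, which is what makes the extrapolation cheap: E[1 - (1 - p)^k] = 1 - B(alpha, beta + k) / B(alpha, beta). A small sketch using log-gamma for numerical stability (the fitting step itself is omitted):

```python
import math

def beta_pass_at_k(alpha, beta, k):
    """Expected pass@k when pass rates p ~ Beta(alpha, beta):
    1 - B(alpha, beta + k) / B(alpha, beta), computed in log space."""
    log_ratio = (math.lgamma(beta + k) + math.lgamma(alpha + beta)
                 - math.lgamma(beta) - math.lgamma(alpha + beta + k))
    return 1.0 - math.exp(log_ratio)
```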

&lt;h2 id=&quot;analysis-of-using-passk-as-an-rl-reward&quot;&gt;Analysis of Using Pass@k as an RL Reward&lt;/h2&gt;

&lt;p&gt;Even if the end goal is pass@1 performance, there’s a reason to care about hard problems during RL training. Training on easy (nearly solved) problems sharpens an already confident distribution further, reducing the model’s entropy. This impairs the model’s ability to explore new solutions and hurts its performance on hard problems. Downweighting easy examples helps preserve exploration capacity, especially in the early stages of RL.&lt;/p&gt;

&lt;p&gt;Given that pass@k upweights hard problems, a natural idea is to use pass@k as an RL reward. The mechanism [2] works as follows: divide the N rollouts for a problem into groups of k, and assign each group a reward equal to the maximum reward within that group (i.e. 1 if any response in the group is correct, 0 otherwise). All responses in the same group receive the same reward. This is effectively a Monte Carlo estimate of pass@k — and it inherits the same nonlinearity that upweights hard problems.&lt;/p&gt;
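The grouping mechanism can be sketched in a few lines (the outcome rewards and grouping order are illustrative):

```python
def passk_group_rewards(rewards, k):
    """rewards: per-rollout 0/1 outcome rewards; len(rewards) divisible by k.
    Every response in a group of k receives the group's max (best-of-k)."""
    out = []
    for start in range(0, len(rewards), k):
        group = rewards[start:start + k]
        out.extend([max(group)] * k)
    return out

# N = 8 rollouts, groups of k = 4; only the first group contains a success:
group_rewards = passk_group_rewards([0, 1, 0, 0, 0, 0, 0, 0], k=4)
```

On an easy problem nearly every group would receive reward 1, so after baseline normalization the advantages shrink, which is exactly the downweighting described above.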

&lt;p&gt;This can help in the early stages of training by encouraging exploration on hard problems. But there are two issues worth noting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s an inefficient way to encourage exploration.&lt;/strong&gt; You still spend roughly the same compute on every problem — generating the same number of rollouts regardless of difficulty. The pass@k reward then downweights easy problems after the fact: on easy problems, most groups contain at least one correct response, so the reward is 1 for nearly all groups and the advantages after baseline normalization are small. You’ve paid for those rollouts but get little learning signal from them. A more direct approach is to allocate resources differently upfront:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Upsample hard questions.&lt;/strong&gt; If you know which problems are hard (from pass rate estimates), sample them more frequently in training batches.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Allocate more compute to hard problems.&lt;/strong&gt; For example, use a larger group size for hard questions — generating more rollouts per problem gives the model more chances to find a correct solution and produces a richer advantage signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both achieve the same goal of focusing learning on hard problems, but by directing compute where it matters rather than spending it uniformly and discounting it later. For recent work along these lines, see Knapsack RL [3] and AR3PO [4].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You eventually need to optimize pass@1.&lt;/strong&gt; The end goal is single-shot performance — the model should reliably produce correct answers on the first try. Optimizing pass@k encourages the model to place &lt;em&gt;some&lt;/em&gt; probability mass on correct solutions, but it doesn’t pressure the model to make the correct solution the &lt;em&gt;most likely&lt;/em&gt; one. At some point, training must shift back to a pass@1-aligned objective, or the model may plateau with a spread-out distribution that solves problems occasionally but not consistently.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Chen, Mark, et al. “Evaluating Large Language Models Trained on Code.” arXiv preprint arXiv:2107.03374 (2021).&lt;/p&gt;

&lt;p&gt;[2] Chen, Zhipeng, et al. “Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models.” arXiv preprint arXiv:2508.10751 (2025).&lt;/p&gt;

&lt;p&gt;[3] Li, Ziniu, et al. “Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation.” arXiv preprint arXiv:2509.25849 (2025).&lt;/p&gt;

&lt;p&gt;[4] Zhang, Yuheng, et al. “Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse.” arXiv preprint arXiv:2509.25808 (2025).&lt;/p&gt;

&lt;p&gt;[5] Kazdan, Joshua, et al. “Efficient Prediction of Pass@k Scaling in Large Language Models.” arXiv preprint arXiv:2510.05197 (2025).&lt;/p&gt;

&lt;p&gt;[6] Hu, Shengding, et al. “Predicting Emergent Abilities with Infinite Resolution Evaluation.” arXiv preprint arXiv:2310.03262 (2023).&lt;/p&gt;

</description>
        <pubDate>Fri, 30 Jan 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/Whats-in-Pass-at-K/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/Whats-in-Pass-at-K/</guid>
      </item>
    
      <item>
        <title>Implementing Training-Free Process Rewards in VeRL</title>
        <description>&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;Current RLVR (RL with Verifiable Rewards) frameworks typically assign a single reward at the end of a response—correct or incorrect. But not all reasoning steps contribute equally. Some are critical insights, others are routine algebra, and some may be wasteful exploration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process rewards&lt;/strong&gt; assign credit to intermediate steps, enabling step-level credit assignment. This unlocks several applications:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Better training efficiency&lt;/strong&gt;: Denser reward signal provides more gradient information per sample&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Breaking out of zero pass rate&lt;/strong&gt;: When a problem is too hard for any complete solution, partial progress can still be rewarded&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Less verbosity&lt;/strong&gt;: Penalize unproductive reasoning loops and overthinking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge&lt;/h3&gt;

&lt;p&gt;How do you obtain step-level reward signals? Two main approaches:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Train a Process Reward Model (PRM)&lt;/strong&gt;: Requires labeled data for intermediate steps [1] or a strong model judge&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Training-free signals&lt;/strong&gt;: Derive rewards from the policy itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that if you have a well-trained value network, that naturally provides token-level process rewards via TD-error. Value network training is out of scope for this post—we focus on training-free approaches that work with critic-free algorithms like RLOO/GRPO.&lt;/p&gt;

&lt;h2 id=&quot;training-free-process-rewards&quot;&gt;Training-Free Process Rewards&lt;/h2&gt;

&lt;h3 id=&quot;monte-carlo-estimation&quot;&gt;Monte Carlo Estimation&lt;/h3&gt;
&lt;p&gt;A key insight from VinePPO [2]: language environments are naturally “resettable”—you can return to any intermediate state simply by prompting with that prefix. This enables estimating V(prefix) at any point in a reasoning trace.&lt;/p&gt;

&lt;p&gt;VinePPO uses Monte Carlo estimation: sample K complete rollouts from each prefix and average their outcomes. This is expensive—for N steps and K samples, you need N×K rollouts per training example.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-01-10-process-rewards-in-verl/vinePPO.png&quot; alt=&quot;VinePPO illustration&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;log-probability-approximation&quot;&gt;Log-Probability Approximation&lt;/h3&gt;

&lt;p&gt;We approximate V(prefix) more efficiently using a single forward pass:&lt;/p&gt;

\[V(\text{prefix}) \approx \frac{1}{n}\sum_{i=1}^{n} \log P(a_i \mid \text{prefix} + \text{force\_prompt} + a_{&amp;lt;i})\]

&lt;p&gt;Instead of sampling K complete rollouts, we:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Truncate the response at an episode boundary&lt;/li&gt;
  &lt;li&gt;Append a “force answer” prompt (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;/think&amp;gt;\n\nThe answer is &lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;Compute the mean log-probability of the ground-truth answer tokens&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We use log-probability directly (not converted to probability) for numerical stability. This estimates “if the model were forced to answer now, how likely would it produce the correct answer?”&lt;/p&gt;
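Given the per-token log-probs of the ground-truth answer tokens from the single scoring pass, the estimate is just their mean; the numbers below are made up for illustration:

```python
import math

def value_of_prefix(answer_token_logprobs):
    """Mean log P(a_i | prefix + force_prompt + a_<i) over the ground-truth
    answer tokens, as returned by one scoring forward pass."""
    return sum(answer_token_logprobs) / len(answer_token_logprobs)

# e.g. a 3-token answer whose tokens the model scores at p = 0.5, 0.8, 0.9:
v = value_of_prefix([math.log(0.5), math.log(0.8), math.log(0.9)])
```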

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Assumptions and limitations&lt;/strong&gt;:&lt;/p&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;strong&gt;A small set of correct answers&lt;/strong&gt; (e.g., math problems with a single numerical answer). If the space of correct answers is large (e.g., open-ended instruction following), this approach won’t work.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;A compatible force-answer prompt&lt;/strong&gt;. The prompt (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;/think&amp;gt;\n\nThe answer is &lt;/code&gt;) must be consistent with the model’s chat template and training format.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Biased estimate&lt;/strong&gt;. Unlike Monte Carlo, this does not provide an unbiased estimate of V(prefix). We are effectively using the base model with a force-answer prompt as a &lt;em&gt;prover policy&lt;/em&gt; [4][5]—a separate policy that completes the solution from an intermediate state. The quality of V(prefix) depends on how well this prover correlates with the true probability of success.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;

&lt;h3 id=&quot;episode-segmentation&quot;&gt;Episode Segmentation&lt;/h3&gt;

&lt;p&gt;To identify intermediate states, we segment reasoning traces into “episodes” using discourse markers:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;EPISODE_MARKERS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;Wait,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Alternatively,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Actually,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Hmm,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;Let me &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;I need to &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;So &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;But &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# ... more markers
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each marker indicates a potential state boundary where we can evaluate V(prefix).&lt;/p&gt;

&lt;p&gt;We also use a &lt;strong&gt;token length fallback&lt;/strong&gt;: if no markers are found within a maximum token limit (e.g., 256 tokens), we split at sentence boundaries. This prevents issues where the model produces no markers at all, which would result in a single giant episode.&lt;/p&gt;
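A minimal segmentation sketch consistent with this description (the length fallback is character-based here for simplicity, where the post uses tokens; the marker list is abbreviated):

```python
import re

EPISODE_MARKERS = ["Wait,", "Alternatively,", "Actually,", "Hmm,"]

def segment_episodes(text, max_len=256):
    pattern = "(" + "|".join(re.escape(m) for m in EPISODE_MARKERS) + ")"
    parts = re.split(pattern, text)
    # Re-attach each marker to the episode it begins.
    episodes, cur = [], ""
    for part in parts:
        if part in EPISODE_MARKERS:
            if cur.strip():
                episodes.append(cur)
            cur = part
        else:
            cur += part
    if cur.strip():
        episodes.append(cur)
    # Fallback: split oversized episodes at sentence boundaries.
    final = []
    for ep in episodes:
        if len(ep) <= max_len:
            final.append(ep)
        else:
            final.extend(s for s in re.split(r"(?<=[.!?])\s+", ep) if s)
    return final

episodes = segment_episodes(
    "First I set up the problem. Wait, that was wrong. Actually, try again.")
```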

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart LR
    A[Full Response] --&amp;gt; B[Segment by markers]
    B --&amp;gt; C[Episode 1: Problem setup]
    B --&amp;gt; D[Episode 2: Initial approach]
    B --&amp;gt; E[Episode 3: Wait, let me reconsider...]
    B --&amp;gt; F[Episode N: Final answer]

    C --&amp;gt; G[&quot;V(prefix₀)&quot;]
    D --&amp;gt; H[&quot;V(prefix₁)&quot;]
    E --&amp;gt; I[&quot;V(prefix₂)&quot;]
    F --&amp;gt; J[Final reward]
&lt;/pre&gt;

&lt;h3 id=&quot;marginal-utility&quot;&gt;Marginal Utility&lt;/h3&gt;

&lt;p&gt;With V(prefix) at each episode boundary, we compute &lt;strong&gt;marginal utility&lt;/strong&gt;:&lt;/p&gt;

\[U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})\]

&lt;p&gt;Since V(prefix) lives in log-probability space, the difference measures the log-likelihood improvement contributed by each episode.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U_i &amp;gt; 0&lt;/code&gt;: Episode i made progress&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U_i &amp;lt; 0&lt;/code&gt;: Episode i was counterproductive&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U_i ≈ 0&lt;/code&gt;: Episode i didn’t change much (possibly wasteful)&lt;/li&gt;
&lt;/ul&gt;
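&lt;p&gt;Given the cached V(prefix) values, the computation is a one-liner; the numbers below are illustrative log-probabilities, not real measurements:&lt;/p&gt;

```python
def marginal_utilities(prefix_values):
    """Per-episode credit: U_i = V(prefix_i) - V(prefix_{i-1})."""
    return [prefix_values[i] - prefix_values[i - 1]
            for i in range(1, len(prefix_values))]

# Illustrative values for V(prefix_0..3):
V = [-2.8, -1.9, -2.3, -0.4]
U = marginal_utilities(V)  # approximately [0.9, -0.4, 1.9]
```

Here episode 2 (U = -0.4) was counterproductive, while episodes 1 and 3 made progress.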

&lt;h2 id=&quot;implementing-in-verl&quot;&gt;Implementing in VeRL&lt;/h2&gt;

&lt;p&gt;The natural approach is:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt; that returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reward_tensor: [batch_size, seq_len]&lt;/code&gt; with non-zero values throughout the sequence&lt;/li&gt;
  &lt;li&gt;Let VeRL’s training loop handle the rest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But there are subtle issues at every layer of the stack.&lt;/p&gt;

&lt;h2 id=&quot;reward-manager-architecture&quot;&gt;Reward Manager Architecture&lt;/h2&gt;

&lt;p&gt;VeRL provides two architectures for reward computation:&lt;/p&gt;

&lt;h3 id=&quot;legacy-rewardmanager-synchronous&quot;&gt;Legacy RewardManager (Synchronous)&lt;/h3&gt;

&lt;p&gt;The original approach: implement a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt; class that’s called synchronously during training.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StepProgressRewardManager&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__call__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataProto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Compute rewards synchronously
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_tensor&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# [batch_size, seq_len]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Has access to the actor model&lt;/li&gt;
  &lt;li&gt;Blocks the training loop during computation&lt;/li&gt;
  &lt;li&gt;Simple to implement and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: To use a custom &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt;, you must set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_reward_loop=False&lt;/code&gt; in your config. Otherwise, VeRL defaults to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardLoopManager&lt;/code&gt; and silently bypasses your custom reward manager—a subtle source of bugs.&lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;rewardloopmanager-async--now-default&quot;&gt;RewardLoopManager (Async) — Now Default&lt;/h3&gt;

&lt;p&gt;The default code path in VeRL when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reward_model.enable=True&lt;/code&gt;. Runs reward computation asynchronously via a separate model server.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CustomRewardLoopManager&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reward_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_model&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Separate model instance (e.g., vLLM)
&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;compute_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Async reward computation
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_tensor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Async processing—doesn’t block training&lt;/li&gt;
  &lt;li&gt;Well-suited for &lt;strong&gt;external reward models&lt;/strong&gt;: trained RMs, LLM-as-judge, rule-based verifiers&lt;/li&gt;
  &lt;li&gt;Does not have access to the actor model’s current weights&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;when-to-use-which&quot;&gt;When to Use Which&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Use Case&lt;/th&gt;
      &lt;th&gt;Recommended&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;External RM (trained reward model)&lt;/td&gt;
      &lt;td&gt;RewardLoopManager ✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LLM-as-judge&lt;/td&gt;
      &lt;td&gt;RewardLoopManager ✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Rule-based verification&lt;/td&gt;
      &lt;td&gt;Either works&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Rewards derived from actor model (e.g., V(prefix))&lt;/td&gt;
      &lt;td&gt;Legacy RewardManager ✅&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Key consideration&lt;/strong&gt;: If your reward computation requires the &lt;em&gt;current policy&lt;/em&gt; (e.g., estimating V(prefix) for process rewards), the RewardLoopManager creates synchronization issues—the reward model copy can diverge from the actor during training. In this case, the legacy RewardManager is more appropriate.&lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;optimization-pre-compute-during-generation&quot;&gt;Optimization: Pre-compute During Generation&lt;/h3&gt;

&lt;p&gt;Computing V(prefix) requires forward passes through the actor model. If done during reward computation, this blocks the training loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Pre-compute during generation, when the actor is already loaded.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# In generation phase:
# 1. Generate response
# 2. Segment into episodes
# 3. For each episode boundary:
#    - Construct prefix + force_answer_prompt + ground_truth
#    - Compute log_probs using actor (already loaded!)
#    - Store V(prefix) in batch.meta_info[&quot;prefix_value_cache&quot;]
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# In training phase:
# 1. Retrieve pre-computed V(prefix) from cache
# 2. Compute marginal utilities (cheap math)
# 3. Compute process rewards (cheap math)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This gave us &lt;strong&gt;60-70% speedup&lt;/strong&gt; compared to computing everything during training.&lt;/p&gt;
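&lt;p&gt;A minimal sketch of the two phases (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score_prefix&lt;/code&gt; hook and the cache layout are illustrative, not VeRL’s actual interfaces):&lt;/p&gt;

```python
def cache_prefix_values(batch_meta_info, responses, score_prefix):
    """Generation phase: score every episode-boundary prefix while the
    actor is already loaded, and stash the results on the batch."""
    cache = []
    for resp in responses:
        boundaries = resp["episode_boundaries"]  # token indices of boundaries
        values = [score_prefix(resp["tokens"][:b]) for b in boundaries]
        cache.append(values)
    batch_meta_info["prefix_value_cache"] = cache
    return batch_meta_info

def lookup_utilities(batch_meta_info, sample_idx):
    """Training phase: marginal utilities are cheap arithmetic on the cache."""
    v = batch_meta_info["prefix_value_cache"][sample_idx]
    return [v[i] - v[i - 1] for i in range(1, len(v))]
```

In the real pipeline, `score_prefix` would run the actor's forward pass over prefix + force-answer prompt + ground truth; only the cheap lookup happens on the training critical path.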

&lt;h2 id=&quot;advantage-estimation&quot;&gt;Advantage Estimation&lt;/h2&gt;

&lt;p&gt;Even once correct process rewards are computed, most advantage estimators destroy their fine-grained structure.&lt;/p&gt;

&lt;h3 id=&quot;the-data-flow&quot;&gt;The Data Flow&lt;/h3&gt;

&lt;p&gt;Here’s how rewards flow through VeRL’s training pipeline:&lt;/p&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart TD
    A[RewardManager returns reward_tensor]
    B[compute_reward in trainer/ppo/reward.py]
    C[Store as token_level_scores]
    D[apply_kl_penalty optional]
    E[Store as token_level_rewards]
    F[compute_advantage in trainer/ppo/core_algos.py]
    G[GAE: Uses per-token rewards]
    H[GRPO: Sums to scalar]
    I[Token-level advantages]
    J[Scalar advantages]

    A --&amp;gt; B
    B --&amp;gt; C
    C --&amp;gt; D
    D --&amp;gt; E
    E --&amp;gt; F
    F --&amp;gt; G
    F --&amp;gt; H
    G --&amp;gt; I
    H --&amp;gt; J

    classDef good fill:#d4edda,stroke:#28a745,stroke-width:2px
    classDef bad fill:#f8d7da,stroke:#dc3545,stroke-width:2px

    class G good
    class H bad
&lt;/pre&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;The subtle issue&lt;/strong&gt;: Most advantage estimators immediately collapse token-level rewards to scalars:&lt;/p&gt;

  &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Standard GRPO in VeRL (core_algos.py:301)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# [batch, seq_len] → [batch]
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_std&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# scalar per sequence
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This &lt;strong&gt;destroys the fine-grained credit assignment&lt;/strong&gt; you carefully designed! Your process rewards are summed into a single number before computing advantages.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The critical question&lt;/strong&gt;: Does your advantage estimator use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token_level_rewards[:, t]&lt;/code&gt; at each timestep, or does it sum first?&lt;/p&gt;

&lt;h3 id=&quot;token-level-preserving-estimators&quot;&gt;Token-Level Preserving Estimators&lt;/h3&gt;

&lt;p&gt;You need an advantage estimator that operates on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token_level_rewards[:, t]&lt;/code&gt; at each timestep. For example, the GAE estimator function preserves this structure:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;reversed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;delta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gamma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextvalues&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lastgaelam&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gamma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lam&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lastgaelam&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
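&lt;p&gt;Filled out with initialization and bookkeeping, a self-contained version of that recursion could look like the following (the gamma and lam defaults are typical choices, not VeRL’s exact code):&lt;/p&gt;

```python
import torch

def gae_advantages(token_level_rewards, values, gamma=1.0, lam=0.95):
    """Backward GAE recursion over per-token rewards.
    Shapes: token_level_rewards and values are [batch, seq_len]."""
    batch, seq_len = token_level_rewards.shape
    advantages = torch.zeros_like(token_level_rewards)
    lastgaelam = torch.zeros(batch)
    nextvalues = torch.zeros(batch)  # value after the final token is zero
    for t in reversed(range(seq_len)):
        delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[:, t] = lastgaelam
        nextvalues = values[:, t]
    return advantages
```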

&lt;p&gt;If you want to keep using GRPO, you will need to implement your own estimator:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Collect all rewards from group
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_rewards&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Normalize each token by group statistics
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean_R&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std_R&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rewards_normalized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rewards&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean_R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std_R&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Compute advantages as cumulative sum
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rewards_normalized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cumsum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
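&lt;p&gt;A self-contained version of this sketch, assuming a float validity mask and group statistics taken over all valid tokens:&lt;/p&gt;

```python
import torch

def token_level_grpo_advantages(rewards, mask, eps=1e-6):
    """Normalize per-token rewards by the statistics of all valid tokens in
    the group, then reverse-cumsum so each token receives its normalized
    reward-to-go. rewards, mask: [group_size, seq_len]."""
    valid = rewards[mask.bool()]
    normalized = (rewards - valid.mean()) / (valid.std() + eps) * mask
    # Reward-to-go: advantage at token t sums normalized rewards from t onward.
    return normalized.flip(-1).cumsum(-1).flip(-1)
```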

&lt;h3 id=&quot;normalization-pitfall-mixed-reward-scales&quot;&gt;Normalization Pitfall: Mixed Reward Scales&lt;/h3&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;When combining outcome rewards with process rewards, normalize them separately.&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;A subtle bug: normalizing outcome rewards (scale: 0-1) together with process rewards (scale: ~±0.03) makes the process reward signal negligible.&lt;/p&gt;

  &lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Identify and normalize each reward type separately:&lt;/p&gt;

  &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Outcome rewards at last token (like standard GRPO)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outcome_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# last valid token of each response
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id2outcome&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outcome_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Process rewards at episode boundaries (separate normalization)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;process_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nonzero_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outcome_mask&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;id2process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;process_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
&lt;/div&gt;
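&lt;p&gt;A runnable sketch of the fix, assuming boolean masks marking the outcome token and the episode-boundary tokens of each sequence:&lt;/p&gt;

```python
import torch

def normalize_separately(token_level_rewards, outcome_mask, process_mask, eps=1e-6):
    """Z-normalize outcome and process rewards independently, so the
    small-scale process signal is not drowned out by outcome rewards."""
    out = torch.zeros_like(token_level_rewards)
    for mask in (outcome_mask, process_mask):
        vals = token_level_rewards[mask]
        if vals.numel() > 1:
            out[mask] = (vals - vals.mean()) / (vals.std() + eps)
    return out
```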

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;In this post, we discussed possible ways of obtaining a training-free process reward (one that does not rely on an external reward model) and walked through the implementation details in VeRL.&lt;/p&gt;

&lt;p&gt;A few takeaways:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Log-probability approximation is efficient&lt;/strong&gt;: Estimating V(prefix) via $\log P(\text{answer} \mid \text{prefix})$ requires one forward pass, vs K×N rollouts for MC estimation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Marginal utility captures step-level progress&lt;/strong&gt;: $U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})$ measures how much each episode helps or hurts&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Choose the right reward manager architecture&lt;/strong&gt;: Legacy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt; works better when rewards depend on the current policy; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardLoopManager&lt;/code&gt; (now default) is designed for external reward models&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Pre-compute during generation&lt;/strong&gt;: Move V(prefix) computation out of the training critical path for 60-70% speedup&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use token-level advantage estimators&lt;/strong&gt;: Standard GRPO collapses to scalars—use GAE or token-preserving GRPO&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Normalize reward types separately&lt;/strong&gt;: Mixed-scale rewards (outcome + process) need separate normalization to preserve signal strength&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Lightman et al. (2023). &lt;em&gt;Let’s Verify Step by Step&lt;/em&gt;. The Twelfth International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2305.20050&lt;/p&gt;

&lt;p&gt;[2] Kazemnejad et al. (2024). &lt;em&gt;VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment&lt;/em&gt;. https://arxiv.org/abs/2410.01679&lt;/p&gt;

&lt;p&gt;[3] VeRL: Volcano Engine Reinforcement Learning for LLMs. https://github.com/volcengine/verl&lt;/p&gt;

&lt;p&gt;[4] Setlur et al. (2024). &lt;em&gt;Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning&lt;/em&gt;. https://arxiv.org/abs/2410.08146&lt;/p&gt;

&lt;p&gt;[5] Qu et al. (2025). &lt;em&gt;Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning&lt;/em&gt;. https://arxiv.org/abs/2503.07572&lt;/p&gt;

</description>
        <pubDate>Sat, 10 Jan 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/process-rewards-in-verl/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/process-rewards-in-verl/</guid>
      </item>
    
      <item>
        <title>Implementing On-Policy Distillation: Lessons from Building OPD in VeRL</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The standard recipe for distilling a strong teacher into a smaller student is straightforward: generate reasoning traces from the teacher, fine-tune the student on those traces (SFT), then run RL.
Because the teacher traces are off-policy, the student may forget its own knowledge while fitting the teacher’s outputs, and a distribution shift appears at inference time: the student can wander into states the teacher never visited and has no way to recover.
A further side effect is that the student inherits the teacher’s verbosity patterns from the static data and must then unlearn them during RL.&lt;/p&gt;

&lt;p&gt;On-Policy Distillation (OPD) [1] is promising because it operates on the student’s own outputs, removing the train-test mismatch. A natural idea is to integrate OPD into regular RL training, since it can reuse most of the RL infrastructure: conceptually, OPD amounts to replacing the reference policy in the KL divergence term with the teacher policy.
The teacher then evaluates the student’s own rollouts and provides a KL divergence signal that pushes the student toward better reasoning—selectively, on prompts where the student’s pass rate is low. This idea has been explored concurrently in KDRL [2], which proposes a unified framework combining knowledge distillation with reinforcement learning but does not release an implementation. The idea is simple. The implementation is a bit more than that.&lt;/p&gt;

&lt;p&gt;This post is the engineering story. We built OPD on top of the &lt;a href=&quot;https://github.com/volcengine/verl&quot;&gt;VeRL&lt;/a&gt; training framework and document the architecture, the key implementation pitfalls, the design choice between implementing the KL as a loss term vs. as an advantage replacement, and the constraints we discovered along the way.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;architecture-overview&quot;&gt;Architecture Overview&lt;/h2&gt;

&lt;h3 id=&quot;why-not-just-swap-the-reference-policy&quot;&gt;Why Not Just Swap the Reference Policy?&lt;/h3&gt;

&lt;p&gt;As mentioned in the introduction, the conceptual pitch for OPD is simple: standard RL training already computes KL divergence against a reference policy (a frozen snapshot of the initial model) to prevent the student from drifting too far. Just point that KL term at the teacher instead, and the student gets pulled toward the teacher’s distribution rather than its own starting point.&lt;/p&gt;

&lt;p&gt;The primary obstacle is hardware. The reference policy is the same architecture as the student, so it can share the same GPUs (or be offloaded cheaply). A 72B teacher serving a 4B student cannot. The teacher must run on separate hardware, which means network communication, serialization, and a client-server protocol that the reference policy path was never designed for. This is the main architectural requirement that drives everything else.&lt;/p&gt;

&lt;p&gt;The teacher can also be a closed-source model queried through an API, or a shared service used by multiple RL workloads.&lt;/p&gt;

&lt;p&gt;Given these requirements, OPD needs to be a separate subsystem. The teacher runs on dedicated hardware behind a vLLM server, communicating via ZeroMQ:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Student Training                        Teacher Server 
┌──────────────────────────┐            ┌──────────────────────┐
│ 1. Generate rollouts     │            │  vLLM Engine         │
│ 2. Compute rewards       │            │  - Teacher model     │
│ 3. Compute pass rates    │            │  - Independent TP    │
│ 4. Create eligibility    │  ZeroMQ    │  - Logprob compute   │
│    masks                 │───────────►│                      │
│                          │◄───────────│                      │
│ 5. Fetch teacher logprobs│            └──────────────────────┘
│ 6. RL loss + KD loss     │
└──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;eligibility mask&lt;/strong&gt; determines which rollouts receive teacher guidance. For each prompt, we compute the pass rate—the fraction of rollouts that receive a positive reward. Prompts with low pass rates are “hard”: the student struggles with them. Only failed rollouts for hard prompts are eligible for OPD, focusing the teacher signal where it is most needed.&lt;/p&gt;
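&lt;p&gt;A sketch of the mask construction (the function name and the 0.25 default threshold are illustrative, not the actual config values):&lt;/p&gt;

```python
import torch

def opd_eligibility(rewards, group_size, pass_rate_threshold=0.25):
    # rewards: flat [num_prompts * group_size] outcome rewards, grouped by prompt.
    r = rewards.view(-1, group_size)
    passed = (r > 0).float()
    pass_rate = passed.mean(dim=-1, keepdim=True)    # per-prompt pass rate
    hard_prompt = pass_rate.lt(pass_rate_threshold)  # the student struggles here
    failed = passed.eq(0.0)
    # Only failed rollouts on hard prompts receive teacher guidance.
    return torch.logical_and(hard_prompt, failed).view(-1)
```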

&lt;hr /&gt;

&lt;h2 id=&quot;pitfall-1-legacy-vs-new-workerknow-your-code-path&quot;&gt;Pitfall 1: Legacy vs. New Worker—Know Your Code Path&lt;/h2&gt;

&lt;h3 id=&quot;the-problem&quot;&gt;The Problem&lt;/h3&gt;

&lt;p&gt;We implemented OPD, launched training, and saw encouraging metrics in W&amp;amp;B: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_opd_samples&lt;/code&gt; at ~5%, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_underperforming_prompts&lt;/code&gt; at ~5-10%. These are computed in the trainer before the update step. They confirmed the masking logic worked.&lt;/p&gt;

&lt;p&gt;But four worker-level metrics were missing: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/kl_loss&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/kd_coef&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/num_eligible_samples&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_tokens_with_kd&lt;/code&gt;. No errors. No crashes. Training continued normally as pure RL—no teacher guidance was applied at all.&lt;/p&gt;

&lt;p&gt;The root cause: VeRL has &lt;strong&gt;two worker implementations&lt;/strong&gt; with different loss computation paths.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Implementation&lt;/th&gt;
      &lt;th&gt;Config Setting&lt;/th&gt;
      &lt;th&gt;Loss Location&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Legacy worker&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_legacy_worker_impl = &quot;auto&quot;&lt;/code&gt; (default)&lt;/td&gt;
      &lt;td&gt;Hardcoded inline in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp_actor.py:update_policy()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;New worker&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_legacy_worker_impl = &quot;disable&quot;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Pluggable via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;losses.py:ppo_loss()&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_loss_fn()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We implemented OPD loss in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;losses.py&lt;/code&gt;—the new worker’s pluggable loss path. But the default configuration uses the legacy worker, where loss computation is hardcoded in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt;: policy gradient + entropy + KL penalty, with no OPD term. Our code was never called.&lt;/p&gt;

&lt;h3 id=&quot;why-it-was-hard-to-detect&quot;&gt;Why It Was Hard to Detect&lt;/h3&gt;

&lt;p&gt;Several factors conspired to make this silent:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Partial metrics created false confidence.&lt;/strong&gt; The trainer-level metrics (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_opd_samples&lt;/code&gt;) are computed &lt;em&gt;before&lt;/em&gt; the worker step and appeared correctly, suggesting the pipeline was active.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;No crashes or errors.&lt;/strong&gt; The legacy worker’s key selection (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select_keys&lt;/code&gt;) didn’t include OPD keys, so teacher logprobs were silently dropped from the batch. The loss function in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;losses.py&lt;/code&gt; was never called, so its conditions were never evaluated.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Training progressed normally.&lt;/strong&gt; Without OPD loss, training just ran pure RL. The model still learned, accuracy improved, everything looked reasonable.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
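
&lt;p&gt;One inexpensive defense against this failure class is a canary check in the training script: after the first optimization step, assert that the worker-level metrics actually exist. A minimal sketch (the metric names match ours; the flat metrics-dict structure and helper name are assumptions):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical canary: fail fast if worker-level OPD metrics never materialize.
REQUIRED_WORKER_METRICS = {&quot;opd/kl_loss&quot;, &quot;opd/kd_coef&quot;}

def check_opd_metrics(metrics, step):
    missing = REQUIRED_WORKER_METRICS - metrics.keys()
    if step &gt; 0 and missing:
        raise RuntimeError(f&quot;OPD loss path inactive; missing metrics: {sorted(missing)}&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Such a check would have turned weeks of silent pure-RL training into an immediate, loud failure.&lt;/p&gt;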

&lt;h3 id=&quot;the-fix&quot;&gt;The Fix&lt;/h3&gt;

&lt;p&gt;Implement OPD loss directly in the legacy worker’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp_actor.py&lt;/code&gt;, after the existing KL penalty section.&lt;/p&gt;

&lt;p&gt;Also add OPD keys to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select_keys&lt;/code&gt; so the data survives batch serialization:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# In dp_actor.py: Include OPD keys in worker data
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;teacher_log_probs&quot;&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;select_keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;teacher_log_probs&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;opd_eligibility_mask&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;opd_horizon_mask&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;prompts&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;pitfall-2-scalar-logprobsfull-distributions-are-not-needed&quot;&gt;Pitfall 2: Scalar Logprobs—Full Distributions Are Not Needed&lt;/h2&gt;

&lt;h3 id=&quot;the-math-dictates-the-implementation&quot;&gt;The Math Dictates the Implementation&lt;/h3&gt;

&lt;p&gt;OPD adds a KL divergence term between student and teacher distributions. Conceptually, computing full KL divergence requires the complete vocabulary distribution from both models:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KL(p || q) = sum_x p(x) * log(p(x) / q(x))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For Qwen models with a vocabulary of ~152k tokens, this means transferring a 152k-dimensional vector per token position from the teacher server. At thousands of positions per sequence and hundreds of sequences per batch, this is impractical.&lt;/p&gt;
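
&lt;p&gt;To make &amp;ldquo;impractical&amp;rdquo; concrete, here is a back-of-the-envelope estimate (fp32 logprobs; the positions-per-sequence and sequences-per-batch counts are illustrative assumptions):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative assumptions: fp32 logprobs, 2,000 positions/sequence, 256 sequences/batch.
vocab_size = 152_000
bytes_per_float = 4
positions_per_seq = 2_000
seqs_per_batch = 256

full_dist_bytes = vocab_size * bytes_per_float * positions_per_seq * seqs_per_batch
scalar_bytes = bytes_per_float * positions_per_seq * seqs_per_batch

print(full_dist_bytes / 1e9)  # hundreds of GB per batch
print(scalar_bytes / 1e6)     # a couple of MB per batch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The ratio between the two transfers is exactly the vocabulary size: a factor of ~152,000.&lt;/p&gt;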

&lt;p&gt;Both KL estimators we use sidestep this entirely. They approximate KL divergence using only the scalar logprobs of the &lt;em&gt;actual generated tokens&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_t&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;K1:  KL ≈ log p_student(x_t) - log p_teacher(x_t)
K2:  KL ≈ 0.5 * (log p_student(x_t) - log p_teacher(x_t))^2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Whichever estimator you choose (we discuss K1 vs. K2 in the loss-vs.-advantage section below), you need exactly one scalar per position from the teacher—not a 152k-dimensional distribution.&lt;/p&gt;
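
&lt;p&gt;Given the two scalar-logprob sequences, the estimators above are one line each. A pure-Python sketch with illustrative values:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Scalar logprobs of the generated tokens under each model (illustrative values).
logp_student = [-0.5, -1.2, -0.3]
logp_teacher = [-0.4, -0.2, -0.9]

k1 = [s - t for s, t in zip(logp_student, logp_teacher)]               # signed; usable as per-token reward
k2 = [0.5 * (s - t) ** 2 for s, t in zip(logp_student, logp_teacher)]  # non-negative; usable as loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;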

&lt;h3 id=&quot;the-mistake&quot;&gt;The Mistake&lt;/h3&gt;

&lt;p&gt;Our initial teacher server code missed this and requested full vocabulary logprobs from vLLM:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# BROKEN: Requesting full vocab
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;prompt_logprobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 151,669 tokens!
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This hit vLLM’s limit: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Requested prompt logprobs of 151669, which is greater than max allowed: 20&quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix requests only the top-1 logprob per position, then extracts the actual generated token’s probability:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# FIXED: Request only scalar logprobs
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;prompt_logprobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Just the actual next token
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# returned prompt_logprobs is a dictionary 
# Extract scalar logprob for each position
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;actual_token_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;logprob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prompt_logprobs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actual_token_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logprob&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;token_logprobs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logprob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A note on shape alignment.&lt;/strong&gt; Once teacher logprobs reached the worker, we hit a shape mismatch: the student worker returns response-only logprobs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[batch, 8192]&lt;/code&gt;, but the teacher server returns full-sequence logprobs such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[batch, 9216]&lt;/code&gt; (prompt + response). Teacher logprobs need to be sliced to the response portion (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;teacher_log_probs[:, prompt_len:]&lt;/code&gt;) before computing KL.&lt;/p&gt;
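
&lt;p&gt;A minimal sketch of the slice (pure Python; the shapes and fixed prompt length are illustrative, and left-padded prompts would change the offset):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Teacher returns full-sequence logprobs (prompt + response); the student worker
# uses response-only logprobs. Drop the prompt portion before computing KL.
prompt_len = 3  # illustrative; in practice read from the batch
teacher_full = [
    [-0.1, -0.2, -0.3, -0.4, -0.5],  # 3 prompt positions + 2 response positions
    [-0.9, -0.8, -0.7, -0.6, -0.5],
]

teacher_response = [row[prompt_len:] for row in teacher_full]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;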

&lt;h3 id=&quot;takeaway&quot;&gt;Takeaway&lt;/h3&gt;

&lt;p&gt;Always review the mathematical requirements of your objective before implementing the data pipeline. Both K1 and K2 only need scalar logprobs—but the server code was written as if full KL were needed. Understanding the estimator formulas before writing the server would have avoided both the efficiency problem and the vLLM limit entirely.&lt;/p&gt;

&lt;h2 id=&quot;design-choice-kl-as-loss-vs-kl-as-advantage&quot;&gt;Design Choice: KL as Loss vs. KL as Advantage&lt;/h2&gt;

&lt;p&gt;Once the infrastructure works—teacher server responds, logprobs reach the worker, shapes align—you still face a design question: how should the teacher’s KL signal enter the training objective?&lt;/p&gt;

&lt;p&gt;We implemented two approaches. Both are viable, but they use different KL estimators and have different balancing challenges.&lt;/p&gt;

&lt;h3 id=&quot;approach-1-kl-as-a-separate-loss-term&quot;&gt;Approach 1: KL as a Separate Loss Term&lt;/h3&gt;

&lt;p&gt;Add a KD loss term alongside the policy gradient loss:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;L = L_PG + beta * L_KD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L_KD = mean(KL(student || teacher) * mask)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This approach should use the &lt;strong&gt;K2 estimator&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KL_K2 = 0.5 * (log p_student - log p_teacher)^2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;K2 is appropriate here because it functions as a loss: always non-negative, with gradients that vanish when the student matches the teacher. The squared term gives it the right properties for minimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The balancing problem.&lt;/strong&gt; K2 values are large when the student-teacher gap is significant. With a logprob difference of 2–3 nats (common early in training), K2 produces values of 2–4.5 per token. Meanwhile, the policy gradient loss is typically around 0.01. A coefficient of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beta = 0.1&lt;/code&gt; means the KD term contributes ~0.2–0.45, dominating the PG loss by an order of magnitude.&lt;/p&gt;

&lt;p&gt;This makes coefficient tuning fragile. Too high and teacher guidance overwhelms the RL signal. Too low and it has no effect. The right value depends on training stage, problem difficulty, and student-teacher gap—all of which change over time.&lt;/p&gt;

&lt;p&gt;We also found that ratio clipping (clamping the teacher/student probability ratio, similar to PPO’s clip) helps stabilize this approach, but adds another hyperparameter.&lt;/p&gt;
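
&lt;p&gt;The scale mismatch is easy to see numerically. A masked-mean sketch, with the K2 values, PG loss scale, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beta&lt;/code&gt; chosen from the ranges discussed above (all values illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Sketch of L = L_PG + beta * L_KD with a token mask (illustrative values).
def masked_mean(values, mask):
    denom = max(sum(mask), 1)
    return sum(v * m for v, m in zip(values, mask)) / denom

k2 = [2.0, 4.5, 0.1, 3.0]   # per-token K2, large early in training
mask = [1, 1, 0, 1]         # OPD-eligible tokens only
pg_loss = 0.01              # typical policy-gradient loss scale
beta = 0.1

kd_loss = masked_mean(k2, mask)
total_loss = pg_loss + beta * kd_loss  # KD term dominates PG by ~30x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;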

&lt;h3 id=&quot;approach-2-kl-as-advantage-replacement&quot;&gt;Approach 2: KL as Advantage Replacement&lt;/h3&gt;

&lt;p&gt;Instead of adding a separate loss, replace the RL advantages for hard prompts with teacher-derived advantages:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# For hard prompts: replace RL advantages with teacher signal
# Negate K1 so that tokens the teacher favors get positive advantage
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opd_advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_student&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_teacher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;horizon_mask&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;eligible&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;          &lt;span class=&quot;c1&quot;&gt;# hard prompt, failed rollout
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;opd_advantages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# teacher guidance
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;rl_advantages&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# standard RL
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;horizon_mask&lt;/code&gt; limits OPD to the first K tokens of each response, since early tokens have more influence on the reasoning trajectory.&lt;/p&gt;
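
&lt;p&gt;The masked advantage construction takes only a few lines. A pure-Python sketch (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K&lt;/code&gt; and the logprobs are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;K = 4  # horizon: apply OPD only to the first K response tokens
logp_student = [-0.5, -1.2, -0.3, -2.0, -0.8, -0.9]
logp_teacher = [-0.4, -0.2, -0.9, -0.5, -0.8, -0.1]

horizon_mask = [1.0 if t &lt; K else 0.0 for t in range(len(logp_student))]

# Negated K1, zeroed beyond the horizon
opd_advantages = [-(s - t) * m for s, t, m in zip(logp_student, logp_teacher, horizon_mask)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;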

&lt;p&gt;This approach should use the &lt;strong&gt;K1 estimator&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KL_K1 = log p_student - log p_teacher
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;K1 is the right choice here because the KL signal is being used as a &lt;em&gt;reward&lt;/em&gt;, not a loss. The advantage is the negation of K1: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-(log p_student - log p_teacher) = log p_teacher - log p_student&lt;/code&gt;. This is directly interpretable as a per-token reward: positive when the teacher assigns higher probability to the generated token (the teacher “approves”), negative when it assigns lower probability. Squaring this (K2) would destroy the sign, losing the directional information that makes advantages useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No coefficient to balance.&lt;/strong&gt; Since teacher advantages replace RL advantages (rather than being added to a different loss), the policy gradient loss operates on a single unified set of advantages. There’s no beta to tune between competing loss terms.&lt;/p&gt;

&lt;h3 id=&quot;why-both-require-careful-normalization&quot;&gt;Why Both Require Careful Normalization&lt;/h3&gt;

&lt;p&gt;In standard GRPO, advantages are normalized within a batch (zero mean, unit variance). This normalization is critical for stable policy gradient updates.&lt;/p&gt;

&lt;p&gt;When OPD advantages enter the picture, &lt;strong&gt;normalization must be handled carefully regardless of approach&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the loss approach, the KD loss has a fundamentally different scale from the PG loss. The coefficient beta attempts to bridge this, but it’s a static scalar applied to a dynamic quantity.&lt;/p&gt;

&lt;p&gt;In the advantage approach, the issue is subtler. OPD advantages (raw KL differences) and RL advantages (normalized outcome rewards) have different distributions. If GRPO normalizes the combined set of advantages—some from RL, some from teacher KL—the two populations contaminate each other’s statistics. A batch dominated by hard prompts (many OPD advantages) shifts the normalization in ways that distort the RL signal for easy prompts, and vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalize OPD advantages separately.&lt;/strong&gt; Compute mean and variance over OPD-eligible tokens only, normalize those, then combine with separately-normalized RL advantages. This ensures each population is properly scaled before they enter the same policy gradient computation.&lt;/p&gt;
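
&lt;p&gt;A sketch of per-population normalization (pure Python; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eps&lt;/code&gt; and the advantage values are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def normalize_masked(values, mask, eps=1e-6):
    # Zero-mean, unit-variance over masked positions only; others pass through.
    sel = [v for v, m in zip(values, mask) if m]
    if not sel:
        return list(values)
    mean = sum(sel) / len(sel)
    std = (sum((v - mean) ** 2 for v in sel) / len(sel)) ** 0.5
    return [(v - mean) / (std + eps) if m else v for v, m in zip(values, mask)]

advantages = [0.8, -0.3, 2.1, 0.4]  # mixed batch (illustrative)
opd_mask   = [1, 1, 0, 0]           # teacher-derived advantages
rl_mask    = [0, 0, 1, 1]           # outcome-reward advantages

advantages = normalize_masked(advantages, opd_mask)
advantages = normalize_masked(advantages, rl_mask)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each population is standardized against its own statistics, so a batch skewed toward hard prompts cannot distort the RL advantages, and vice versa.&lt;/p&gt;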

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;KL as Loss&lt;/th&gt;
      &lt;th&gt;KL as Advantage&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;KL estimator&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;K2: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.5 * (log p_s - log p_t)^2&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;K1: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log p_s - log p_t&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Why this estimator&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Non-negative, proper loss for minimization&lt;/td&gt;
      &lt;td&gt;Preserves sign, interpretable as per-token reward&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Balancing&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Coefficient beta between PG and KD losses&lt;/td&gt;
      &lt;td&gt;No coefficient; separate normalization instead&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Integration point&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Worker loss function (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp_actor.py&lt;/code&gt;)&lt;/td&gt;
      &lt;td&gt;Trainer, after advantage computation (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ray_trainer.py&lt;/code&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Hard/easy separation&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Masking within loss&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.where&lt;/code&gt; on advantages&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;constraint-shared-tokenizer-and-chat-template&quot;&gt;Constraint: Shared Tokenizer and Chat Template&lt;/h2&gt;

&lt;h3 id=&quot;the-requirement&quot;&gt;The Requirement&lt;/h3&gt;

&lt;p&gt;Our current OPD implementation requires that the student and teacher share the same tokenizer. Both KL estimators compute a difference of logprobs at each position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; for the generated token &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_t&lt;/code&gt;. This only makes sense if both models agree on what token ID &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_t&lt;/code&gt; represents.&lt;/p&gt;

&lt;p&gt;But sharing a tokenizer is necessary but not sufficient. The student and teacher must also share the &lt;strong&gt;same chat template&lt;/strong&gt;.&lt;/p&gt;
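
&lt;p&gt;A cheap preflight check makes both requirements explicit. A sketch (the helper is hypothetical; it assumes Hugging Face-style tokenizer objects exposing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_vocab()&lt;/code&gt; and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chat_template&lt;/code&gt; attribute):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def assert_opd_compatible(student_tok, teacher_tok):
    # Same vocabulary: token-level KL compares logprobs at shared token IDs.
    assert student_tok.get_vocab() == teacher_tok.get_vocab(), &quot;tokenizer vocab mismatch&quot;
    # Same chat template: otherwise the distributions misalign on template tokens.
    s_tmpl = getattr(student_tok, &quot;chat_template&quot;, None)
    t_tmpl = getattr(teacher_tok, &quot;chat_template&quot;, None)
    assert s_tmpl == t_tmpl, &quot;chat template mismatch&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;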

&lt;h3 id=&quot;why-chat-template-matters&quot;&gt;Why Chat Template Matters&lt;/h3&gt;

&lt;p&gt;Even when two models use the same tokenizer (same vocabulary, same BPE merges), they may use different chat templates. Consider Qwen3-4B-Base and Qwen3-32B-Instruct:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Same tokenizer&lt;/strong&gt;: Both use the Qwen3 tokenizer with identical vocabulary&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Different templates&lt;/strong&gt;: The instruct model wraps generations in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt;, and may use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags; the base model produces raw text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When computing KL divergence, the teacher assigns high probability to template tokens like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; at positions where the student has never learned to produce them. The token IDs match (same tokenizer), but the probability distributions are fundamentally misaligned. KL divergence explodes, gradients become unstable, and training diverges.&lt;/p&gt;

&lt;h3 id=&quot;practical-solutions&quot;&gt;Practical Solutions&lt;/h3&gt;

&lt;p&gt;We identified two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Use a format-compatible teacher.&lt;/strong&gt; Run RL directly from the same base model checkpoint (without chat template SFT), so both student and teacher produce raw text. This works but requires training an additional large model, defeating the purpose of efficient training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Pre-align the student via Rejection Fine-Tuning (RFT).&lt;/strong&gt; Before OPD training, generate rollouts from the base student, filter to correct solutions, and fine-tune the student on these correct outputs formatted with the teacher’s template. The student learns to produce &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags and chat formatting, making its output structure compatible with the teacher’s distribution. This is the approach we use—we discuss the RFT procedure and its interaction with RL training in an upcoming post.&lt;/p&gt;

&lt;h3 id=&quot;current-limitation&quot;&gt;Current Limitation&lt;/h3&gt;

&lt;p&gt;Supporting different tokenizers between student and teacher would require token-level alignment (mapping between vocabularies), which we have not implemented. For now, OPD is restricted to student-teacher pairs that share a tokenizer and—after any necessary pre-alignment—produce structurally compatible outputs.&lt;/p&gt;

&lt;h2 id=&quot;relation-to-verls-gkd-recipe&quot;&gt;Relation to VeRL’s GKD Recipe&lt;/h2&gt;

&lt;p&gt;VeRL ships a &lt;a href=&quot;https://github.com/volcengine/verl/tree/main/recipe/gkd&quot;&gt;GKD (Generalized Knowledge Distillation) recipe&lt;/a&gt; that performs on-policy distillation with a similar teacher server architecture (vLLM + ZeroMQ). Our implementation diverges in three ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synchronous vs. off-policy scheduling.&lt;/strong&gt; Our OPD runs synchronously: the teacher scores the current step’s rollouts before the actor update begins. GKD overlaps computation phases using async schedulers (one-step-off, two-step-off): the actor update for step N runs concurrently with rollout generation for step N+1, and teacher inference is pipelined across steps. This delivers significant throughput gains but means the rollout policy is one or two steps behind the current parameters—slightly off-policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top-k distributions vs. scalar logprobs.&lt;/strong&gt; GKD fetches top-k token distributions from the teacher (typically k=10), storing both logprobs and token indices per position. This enables computing KL divergence over the teacher’s probability mass instead of a single token. However, the KDRL paper [2] reported this setup to be unstable in training, so we did not implement it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KL computed in Megatron engine vs. in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt;.&lt;/strong&gt; GKD computes KL divergence inside Megatron’s forward pass via a custom &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;logits_processor&lt;/code&gt; and a hand-written TP-aware autograd function that handles all-reduce across tensor-parallel ranks. This is efficient but ties the implementation to the Megatron backend. Our OPD operates on logprobs after the forward pass, in the actor’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt; (or in the trainer for the advantage approach). This works with VeRL’s FSDP backend and doesn’t require custom autograd, at the cost of not leveraging Megatron’s pipeline overlap or native TP-aware KL computation.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The OPD implementation touched 10 files across a distributed system spanning trainer, workers, and an external teacher server. The dominant failure mode was &lt;strong&gt;silent degradation&lt;/strong&gt;: every issue resulted in a condition quietly evaluating to false, with no errors and partial metrics suggesting everything was working.&lt;/p&gt;

&lt;p&gt;Four lessons stand out:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Trace the code path from config to gradient.&lt;/strong&gt; In a framework with multiple worker implementations, the clean pluggable interface may not be the one that’s active.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Let the math dictate the data pipeline.&lt;/strong&gt; Both K1 and K2 estimators only need scalar logprobs, not full vocabulary distributions. Understanding this upfront avoids both efficiency problems and API limits.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Match the KL estimator to the integration point.&lt;/strong&gt; K2 (squared) is a proper loss for minimization. K1 (linear) preserves sign information needed for advantages. Using the wrong estimator for the wrong approach either destroys useful signal or creates unstable optimization.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Format compatibility is a hard constraint.&lt;/strong&gt; Shared tokenizer is necessary but not sufficient—student and teacher must produce structurally compatible outputs for KL divergence to be meaningful.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two broader challenges remain. First, balancing the teacher signal with RL rewards is difficult regardless of design approach. Whether it’s a loss coefficient or advantage normalization, getting this wrong means either the teacher overwhelms RL or has no effect at all.&lt;/p&gt;

&lt;p&gt;Second, OPD fundamentally assumes that the teacher and student policies are similar enough for the KL signal to be useful. Format compatibility (shared tokenizer and chat template) is the most visible aspect of this, but the issue runs deeper. Since we evaluate the teacher’s logprob on the student’s sampled tokens, tokens that the student generates but the teacher would rarely produce yield extreme logprob differences and noisy KL estimates. The more the student’s distribution diverges from the teacher’s—whether due to different training stages, different capabilities, or different reasoning styles—the noisier the OPD signal becomes. This is a fundamental limitation of on-policy distillation with scalar KL estimators: the signal quality depends on the overlap between the two policies.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Agarwal et al. (2024). &lt;em&gt;On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes&lt;/em&gt;. ICLR 2024.&lt;/p&gt;

&lt;p&gt;[2] Xu et al. (2025). &lt;em&gt;KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning&lt;/em&gt;. https://arxiv.org/abs/2506.08946&lt;/p&gt;

&lt;p&gt;[3] VeRL: Volcano Engine Reinforcement Learning for LLMs. https://github.com/volcengine/verl&lt;/p&gt;
</description>
        <pubDate>Tue, 06 Jan 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/implementing-on-policy-distillation/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/implementing-on-policy-distillation/</guid>
      </item>
    
      <item>
        <title>Understanding Length Dynamics in RL Training</title>
        <description>&lt;h2 id=&quot;mystery-of-length-growth-from-rl&quot;&gt;Mystery of Length Growth from RL&lt;/h2&gt;

&lt;p&gt;DeepSeek-R1 [1] reported that Reinforcement Learning (RL) can drive models to “think” for thousands of steps to solve complex problems. This “thinking” process—manifested as extended Chain-of-Thought (CoT)—is widely believed to be the key to breakthrough reasoning performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/deepseek-r1.png&quot; alt=&quot;Deepseek R1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, replicating this length explosion has proven elusive. While some replication efforts report a U-shaped growth pattern, others observe lengths declining or stagnating over training.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two early replication efforts of DeepSeek-R1, DeepScaleR [2] and Simple-RL-Zero [3], both reported output length following a U-shaped curve over training steps&lt;/em&gt;&lt;/p&gt;
&lt;div style=&quot;display: flex; gap: 10px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/deepscaleR.png&quot; alt=&quot;DeepScaleR&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/simple-rl-zero.png&quot; alt=&quot;Simple-RL-Zero&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;At the same time, the paper “When More is Less: Understanding Chain-of-Thought Length in LLMs” [4] and the Concise Reasoning paper [9] both report declining output length over training:&lt;/p&gt;
&lt;div style=&quot;display: flex; gap: 20px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/more-is-less-length.png&quot; alt=&quot;more-is-less&quot; style=&quot;width: 60%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/concise-reasoning.png&quot; alt=&quot;concise-reasoning&quot; style=&quot;width: 40%;&quot; /&gt;
&lt;/div&gt;

&lt;!-- This brings us to our first question: can we reliably reproduce length growth only under correctness rewards?  --&gt;
&lt;p&gt;This brings us to a fundamental question: What actually drives a model to think longer? Is it an inherent property of RL, a bias in the loss function design, or something else entirely? In this post, we systematically isolate the drivers of length dynamics.&lt;/p&gt;

&lt;h2 id=&quot;experimental-setup&quot;&gt;Experimental Setup&lt;/h2&gt;

&lt;p&gt;To cleanly reproduce DeepSeek-R1-Zero behavior, all of the experiments mentioned in this post use the following configuration unless otherwise specified:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Model initialization&lt;/strong&gt;: We use base model checkpoints (not instruction-tuned), following the R1-Zero approach of starting from pretrained weights&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RL algorithm&lt;/strong&gt;: We use the default GRPO implementation in VeRL with a clip ratio of 0.2.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reward design&lt;/strong&gt;: We employ only a binary correctness reward based on answer parsing (with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math_verify&lt;/code&gt; for math datasets), with no format reward and no KL regularization term&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generation parameters&lt;/strong&gt;: The maximum output length is set to 16,384 tokens to minimize truncation effects. The sampling temperature is set to 1.0 and top-p to 1.0.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hyperparameters&lt;/strong&gt;: Initial learning rate of 1e-6, batch size of 32 with 8 rollouts per prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This minimal setup allows us to isolate the effects of pure correctness-based RL on length dynamics, without confounding factors from auxiliary rewards or regularization terms.&lt;/p&gt;
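&lt;p&gt;As a concrete illustration of the reward design above, here is a minimal sketch of a binary correctness reward. The boxed-answer regex and exact-string comparison are simplified stand-ins: our actual runs use &lt;code&gt;math_verify&lt;/code&gt; to check answer equivalence, and the function name here is purely illustrative.&lt;/p&gt;

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    """Binary correctness reward: 1.0 if the final boxed answer matches the
    gold answer, else 0.0. (In our runs, answer equivalence is checked with
    math_verify; exact string match is a simplified stand-in here.)"""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # unparseable output earns no reward (and no format reward)
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(correctness_reward(r"... so the answer is \boxed{5}.", "5"))  # 1.0
print(correctness_reward("no final answer given", "5"))             # 0.0
```

&lt;p&gt;Note there is no partial credit and no format reward: a response with no parseable answer is scored identically to a wrong answer.&lt;/p&gt;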

&lt;h2 id=&quot;initial-experiments-length-decrease-and-stabilization&quot;&gt;Initial Experiments: Length Decrease and Stabilization&lt;/h2&gt;

&lt;p&gt;Our pilot run with Qwen3-4B on MATH to replicate the length growth effect was unsuccessful: the output length decreased and then stabilized. The same held for the Qwen2.5-3B and Qwen2.5-1.5B models.
Note that we set the maximum output length to 16k to minimize truncation. Although the Qwen2.5-1.5B model takes longer to learn the correct format, &lt;em&gt;all&lt;/em&gt; models converged to an output length of ~600 tokens by the end of training.&lt;/p&gt;

&lt;p&gt;Even after we switched to the Nemotron Science dataset [5], we were still consistently observing this pattern of decrease and stabilization (bottom right figure).&lt;/p&gt;

&lt;div style=&quot;display: grid; grid-template-columns: 1fr 1fr; gap: 10px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-math-length.svg&quot; alt=&quot;Qwen3-4B on Math&quot; style=&quot;width: 100%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen2.5-3b-math-length.svg&quot; alt=&quot;Qwen2.5-3B on Math&quot; style=&quot;width: 100%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen2.5-1.5b-math-length.svg&quot; alt=&quot;Qwen2.5-1.5B on Math&quot; style=&quot;width: 100%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-science-length.svg&quot; alt=&quot;Qwen3-4B on Science&quot; style=&quot;width: 100%;&quot; /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;hypothesis-1-is-kl-divergence-the-key&quot;&gt;Hypothesis 1: Is KL Divergence the Key?&lt;/h3&gt;

&lt;p&gt;In the Demystifying CoT paper [4], the authors attempt to reproduce the curve from Simple-RL-Zero [3] but fail to observe substantial length growth. They attribute what mild growth they do see to KL regularization, which pulls the model back toward the long-output distribution it started from.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/demystifying-cot-full.png&quot; alt=&quot;demystifying-cot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Following this hypothesis, we added KL divergence back to the objective with a weight of 0.01 (same as Simple-RL-Zero):&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 20px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-kl-math-length.svg&quot; alt=&quot;qwen3-math-with-kl&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-kl-math-loss.svg&quot; alt=&quot;qwen3-math-with-kl-loss&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Adding this KL term does not recover length growth; in fact, the KL loss increases while lengths still shrink. This suggests that KL regularization alone does not explain the discrepancy with Simple-RL-Zero and Demystifying CoT under our experiment settings.&lt;/p&gt;

&lt;h3 id=&quot;hypothesis-2-is-grpo-length-bias-in-effect&quot;&gt;Hypothesis 2: Is GRPO Length Bias in Effect?&lt;/h3&gt;

&lt;p&gt;The Dr.GRPO paper [6] points out an inherent bias in GRPO’s loss function that creates asymmetric incentives based on sequence length. Since we were using the GRPO algorithm out of the box, this seemed a plausible culprit.&lt;/p&gt;

&lt;h4 id=&quot;length-bias-statement-from-drgrpo&quot;&gt;Length Bias Statement (from Dr.GRPO)&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Original GRPO&lt;/strong&gt; computes the loss per sequence by averaging the advantage over tokens, then averaging across sequences:&lt;/p&gt;

\[\mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathbf{o}_i|} \sum_{t=1}^{|\mathbf{o}_i|} \min \left[ \frac{\pi_\theta(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})} \hat{A}_{y_i}, \text{clip}\left(\frac{\pi_\theta(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})}, 1-\epsilon, 1+\epsilon\right) \hat{A}_{y_i} \right]\]

&lt;p&gt;where the advantage $\hat{A}_{y_i}$ is computed per sequence, and $G$ is the group size. The critical issue is that &lt;strong&gt;the per-sequence loss averages over token count&lt;/strong&gt;, while &lt;strong&gt;the advantage is the same for all tokens in a sequence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This creates asymmetric incentives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For correct answers&lt;/strong&gt; ($\hat{A}_{y_i} &amp;gt; 0$):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Short sequence (length $L_{\text{short}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{short}}}$ → &lt;strong&gt;larger per-token update&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Long sequence (length $L_{\text{long}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{long}}}$ → &lt;strong&gt;smaller per-token update&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Favors shorter responses among correct ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For incorrect answers&lt;/strong&gt; ($\hat{A}_{y_i} &amp;lt; 0$):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Short sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{short}}}$ → &lt;strong&gt;larger per-token penalty&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Long sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{long}}}$ → &lt;strong&gt;smaller per-token penalty&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Favors longer responses among incorrect ones&lt;/li&gt;
&lt;/ul&gt;
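&lt;p&gt;The asymmetry above can be made concrete with a toy calculation. Under per-sequence averaging, each token in sequence $i$ contributes $\hat{A}_{y_i}/L_i$ to the loss, so the per-token gradient scale is $|\hat{A}_{y_i}|/L_i$ (illustrative numbers; unit advantages assumed):&lt;/p&gt;

```python
# Per-token update magnitude under original GRPO's per-sequence averaging.
def per_token_scale(advantage, length):
    return abs(advantage) / length

A_pos, A_neg = 1.0, -1.0   # advantage for a correct / incorrect answer
short, long_ = 200, 2000   # token counts of a short and a long response

# Correct answers: the short sequence gets a 10x larger per-token reinforcement.
print(per_token_scale(A_pos, short))  # 0.005
print(per_token_scale(A_pos, long_))  # 0.0005

# Incorrect answers: the long sequence gets a 10x smaller per-token penalty.
print(per_token_scale(A_neg, short))  # 0.005
print(per_token_scale(A_neg, long_))  # 0.0005
```

&lt;p&gt;Both effects point the same way: short-and-correct is reinforced hardest, long-and-incorrect is punished least.&lt;/p&gt;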

&lt;p&gt;Now consider the setting of our pilot run with Qwen3-4B on MATH: the training reward approaches 0.8 and the validation accuracy climbs to pass@1 = 75%.
Following Dr.GRPO’s reasoning, this would imply that the shortening effect among correct responses dominates and drives the average length down.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-math-rewards.svg&quot; alt=&quot;qwen3-math-reward&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-math-accuracy.svg&quot; alt=&quot;qwen3-math-validation-acc&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;However, applying the Dr.GRPO fix to the loss calculation gives us the figure below, which is still very similar to what we observed before:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-grpo-fix-math-length.svg&quot; alt=&quot;Qwen3-4b on Math with GRPO fix&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;verls-loss-aggregation-three-aggregation-methods&quot;&gt;VeRL’s Loss Aggregation: Three Aggregation Methods&lt;/h4&gt;

&lt;p&gt;A closer look at VeRL’s default implementation reveals that &lt;strong&gt;the default aggregation is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token-mean&lt;/code&gt;&lt;/strong&gt;, following DAPO:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_agg_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;token-mean&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# DAPO (default)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;verl_F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;masked_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_agg_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;seq-mean-token-sum&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Dr. GRPO
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# token-sum
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# seq-mean
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_agg_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;seq-mean-token-mean&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# original GRPO
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# token-mean
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# seq-mean
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Code snippet taken from VeRL codebase&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;our runs already use token-mean aggregation&lt;/strong&gt;, which eliminates sequence normalization length bias, so this particular GRPO length bias is not present in our setup, and cannot explain our initial observations.&lt;/p&gt;

&lt;p&gt;Let $L_i$ denote the length of sequence $i$, and $L_{\max}$ denote the maximum length (16k in our experiments). The three methods differ in how they aggregate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Original GRPO&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seq-mean-token-mean&lt;/code&gt;):
\(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \left( \frac{1}{L_i} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right)\)&lt;/p&gt;

&lt;p&gt;This normalizes by actual sequence length $L_i$, creating the length bias.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dr.GRPO&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seq-mean-token-sum&lt;/code&gt;):
\(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \left( \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right) = \frac{1}{G} \sum_{i=1}^{G} \left( \frac{1}{L_{\max}} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right) \cdot L_{\max}\)&lt;/p&gt;

&lt;p&gt;By summing (or equivalently, dividing by the same $L_{\max}$ for all sequences), this removes the asymmetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Token-mean (DAPO [7])&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token-mean&lt;/code&gt;):
\(\mathcal{L} = \frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \mathcal{L}_{i,t}\)&lt;/p&gt;

&lt;p&gt;This treats all tokens equally regardless of which sequence they belong to, eliminating length-based bias from aggregation.&lt;/p&gt;
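&lt;p&gt;The three aggregation schemes can be sketched in a few lines of plain Python on a toy ragged batch. This is a simplified, mask-free version of the VeRL logic shown earlier: each inner list already has its true length, so no &lt;code&gt;loss_mask&lt;/code&gt; is needed.&lt;/p&gt;

```python
def seq_mean_token_mean(batch):   # original GRPO
    return sum(sum(seq) / len(seq) for seq in batch) / len(batch)

def seq_mean_token_sum(batch):    # Dr.GRPO (up to the constant 1/L_max)
    return sum(sum(seq) for seq in batch) / len(batch)

def token_mean(batch):            # DAPO / VeRL default
    return sum(sum(seq) for seq in batch) / sum(len(seq) for seq in batch)

# One short (2-token) and one long (8-token) sequence, unit loss per token.
batch = [[1.0, 1.0], [1.0] * 8]
print(seq_mean_token_mean(batch))  # 1.0 -- but each short-seq token weighs 4x more
print(seq_mean_token_sum(batch))   # 5.0
print(token_mean(batch))           # 1.0 -- every token weighted equally
```

&lt;p&gt;Under &lt;code&gt;seq-mean-token-mean&lt;/code&gt;, a token in the short sequence contributes 1/(2·2) = 0.25 to the loss while a token in the long one contributes 1/(8·2) = 0.0625; under &lt;code&gt;token-mean&lt;/code&gt;, every token contributes 1/10.&lt;/p&gt;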

&lt;h2 id=&quot;difficulty-imbalance-drives-length-growth&quot;&gt;Difficulty Imbalance Drives Length Growth&lt;/h2&gt;

&lt;p&gt;As we were unable to reproduce length growth on the MATH or Nemotron Science datasets, we turned to other datasets.&lt;/p&gt;

&lt;p&gt;We then experimented with a hard subset of BigMath [8], which is a cleaned version of 250k math problems for RL training. &lt;strong&gt;With extended training up to 1000 steps, we were able to see a clear U shape in the output length.&lt;/strong&gt;&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;BigMath Dataset Composition Details&lt;/b&gt;&lt;/summary&gt;

To ensure the dataset contains sufficiently challenging problems, we applied the following filtering to the original BigMath dataset:

&lt;strong&gt;Source filtering:&lt;/strong&gt; We kept only challenging sources (olympiads, omnimath, harp, aops_forum, amc_aime, big_math, openmath), and removed sources that overlap with previous experiments or are too easy (math, gsm8k, cn_k12, orca_math).

&lt;strong&gt;Difficulty filtering:&lt;/strong&gt; BigMath provides per-problem pass rates measured with Llama3.1-8B as a reference model, so we kept only problems with a pass rate below 30%.

&lt;/details&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-bigmath-length.svg&quot; alt=&quot;Qwen3-4b on BigMath&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As a control treatment, if we remove the hard questions from the training set (pass@1 &amp;lt; 0.3, measured on the checkpoint at step 100), the output length fluctuates but does not show a consistent upward trend.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-difficulty-distribution.png&quot; alt=&quot;early-ckpt-difficulty-distribution&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-bigmath-easy-comparison.svg&quot; alt=&quot;easy-comparison&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Here ‘hard’ is defined relative to the early checkpoint, so this is a model-dependent notion of difficulty, and removing those questions also changes the content distribution. Nonetheless, under this intervention we no longer observe consistent length growth, which is consistent with the idea that the hard tail is a key driver.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;a-moving-reward-distribution&quot;&gt;A Moving Reward Distribution&lt;/h3&gt;
&lt;p&gt;To understand why a dataset of mixed difficulty causes length growth, we create an easy split and a hard split of the validation set based on pass rates measured on an early checkpoint (at 100 steps).
The easy split contains questions with pass@1 &amp;gt; 90%, and the hard split contains questions with pass@1 &amp;lt; 10%.
We then examine the reward distribution and output length distribution of each category.&lt;/p&gt;
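&lt;p&gt;The bucketing step can be sketched as follows. Here &lt;code&gt;rollouts&lt;/code&gt;, the function name, and the thresholds are illustrative; pass@1 is estimated from repeated samples per question.&lt;/p&gt;

```python
# Split validation questions into easy and hard buckets by pass@1 on an
# early checkpoint. `rollouts` maps a question id to a list of 0/1
# correctness rewards from repeated sampling (names are illustrative).
def split_by_pass_rate(rollouts, easy_threshold=0.9, hard_threshold=0.1):
    easy, hard = [], []
    for qid, rewards in rollouts.items():
        pass_rate = sum(rewards) / len(rewards)
        if pass_rate > easy_threshold:
            easy.append(qid)
        elif hard_threshold > pass_rate:
            hard.append(qid)
    return easy, hard

rollouts = {"q1": [1, 1, 1, 1], "q2": [0, 0, 0, 0], "q3": [1, 0, 1, 0]}
easy, hard = split_by_pass_rate(rollouts)
print(easy, hard)  # ['q1'] ['q2']
```

&lt;p&gt;Questions with intermediate pass rates (like &lt;code&gt;q3&lt;/code&gt; above) fall into neither split and are excluded from this analysis.&lt;/p&gt;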

&lt;p&gt;What we expect: hard questions should require longer responses to solve than easy questions do.&lt;/p&gt;

&lt;p&gt;On the hard question subset, our observations confirm this intuition: the model earns higher reward when the output is longer (the flat tail appears because the model never generates outputs that long).&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-hard-reward.png&quot; alt=&quot;early-ckpt-hard-reward&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-hard-reward.png&quot; alt=&quot;late-ckpt-hard-reward&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-hard-length.png&quot; alt=&quot;early-ckpt-hard-length&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-hard-length.png&quot; alt=&quot;late-ckpt-hard-length&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The median output length significantly shifts over training, demonstrating a mode-seeking behavior towards higher reward.&lt;/p&gt;

&lt;p&gt;For the easy questions, since the model already solves them reliably, we would expect the length to stay unchanged, as there is little room for improvement.&lt;/p&gt;

&lt;p&gt;However, when we examine the empirical reward distribution, we observe a rightward shift of the reward peak, indicating that the model can now maintain correctness at longer response lengths. The overall inverted-U shape remains—extremely long responses still decrease accuracy—but the curve has become flatter and the optimal length has shifted from approximately 500 to 700 tokens. This means the model has learned to generate longer responses while preserving accuracy on easy problems.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-easy-run-2.png&quot; alt=&quot;early-ckpt-easy-reward&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-easy-run-2.png&quot; alt=&quot;late-ckpt-easy-reward&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;!-- 
![early-ckpt-easy-reward](/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-easy-run-2.png)
![late-ckpt-easy-reward](/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-easy-run-2.png) --&gt;

&lt;p&gt;The output length distribution also shifts in accordance with the reward distribution.&lt;/p&gt;
&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-easy-length.png&quot; alt=&quot;early-ckpt-easy-length&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-easy-length.png&quot; alt=&quot;late-ckpt-easy-length&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;!-- This reward distribution shift is much more mild when we remove the hard questions, the distribution stays similar to what we observe at 100 steps. 
![late-ckpt-easy-only-easy-reward](/assets/images/posts/2025-12-21-understanding-length-dynamics/a10-v2-step1000-easy-v2.png)
 --&gt;

&lt;p&gt;Combining our observations on easy problems and hard problems, we can quantify the shift in both output length and performance:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Checkpoint&lt;/th&gt;
      &lt;th&gt;Easy Problems (Pass Rate)&lt;/th&gt;
      &lt;th&gt;Easy Problems (Avg Length)&lt;/th&gt;
      &lt;th&gt;Hard Problems (Pass Rate)&lt;/th&gt;
      &lt;th&gt;Hard Problems (Avg Length)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 100 (Early)&lt;/td&gt;
      &lt;td&gt;95.1%&lt;/td&gt;
      &lt;td&gt;530&lt;/td&gt;
      &lt;td&gt;2.3%&lt;/td&gt;
      &lt;td&gt;1039&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000 (Late)&lt;/td&gt;
      &lt;td&gt;97.0%&lt;/td&gt;
      &lt;td&gt;661&lt;/td&gt;
      &lt;td&gt;5.5%&lt;/td&gt;
      &lt;td&gt;1301&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The table shows that both easy and hard problems experience length increases during training, with the magnitude of the shift differing by problem difficulty.&lt;/p&gt;

&lt;p&gt;This reveals the mechanism behind length growth on mixed-difficulty datasets. The model learns to generate longer outputs to solve hard problems (step 100→1000: +262 tokens, pass rate 2.3%→5.5%). However, this policy shift affects all problems broadly—easy problems also become longer (+131 tokens). Crucially, the model learns to maintain correctness on easy problems even at these longer lengths, which is reflected in the flattening of their empirical reward distribution. Without hard problems present (easy-only training), the model has no incentive to grow longer and the distribution remains stable.&lt;/p&gt;

&lt;div style=&quot;background-color: #fffbeb; border-left: 4px solid #f59e0b; padding: 20px; margin: 15px 0; border-radius: 4px;&quot;&gt;

&lt;strong&gt;The Principle of Asymmetric Risk&lt;/strong&gt;
&lt;br /&gt;
💡 &lt;strong&gt;Key Insight:&lt;/strong&gt; We hypothesize this behavior is driven by &lt;strong&gt;asymmetric risk&lt;/strong&gt;. For a mixed-difficulty dataset, the model faces a trade-off:
&lt;br /&gt;
&lt;strong&gt;On Hard Problems:&lt;/strong&gt; &quot;Thinking short&quot; carries a high risk of failure (low reward). &quot;Thinking long&quot; is necessary to solve them.
&lt;br /&gt;
&lt;strong&gt;On Easy Problems:&lt;/strong&gt; &quot;Thinking long&quot; carries little risk (as long as the length stays within the flattened band). As the reward distribution flattens, the model can spend extra tokens without hurting its accuracy.
&lt;br /&gt;
&lt;strong&gt;Consequently,&lt;/strong&gt; the model adopts a &quot;better safe than sorry&quot; global policy. It learns to extend its reasoning to cover the hard tail of the distribution. Since this extended length doesn&apos;t penalize performance on easy problems, the length drift bleeds into the easy subset—resulting in the &quot;overthinking&quot; we observe.

&lt;/div&gt;

&lt;h3 id=&quot;case-studies&quot;&gt;Case Studies&lt;/h3&gt;

&lt;p&gt;To understand how length growth manifests in individual outputs, we examine two problems at different difficulty levels.&lt;/p&gt;

&lt;h4 id=&quot;easy-problem-coin-counting&quot;&gt;Easy Problem: Coin Counting&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Petya has 25 coins, each with a denomination of 1, 2, 5, or 10 rubles. Among these coins, 19 are not two-ruble coins, 20 are not ten-ruble coins, and 16 are not one-ruble coins. How many five-ruble coins does Petya have?&lt;/p&gt;

&lt;p&gt;For this problem, the output length grew from &lt;strong&gt;493 tokens&lt;/strong&gt; (step 100) to &lt;strong&gt;602 tokens&lt;/strong&gt; (step 1000) on average, a &lt;strong&gt;22% increase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A simple, short solution to the problem produced by the model looks like the following, at 338 tokens:&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;📝 Example: Concise solution (338 tokens)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #f8f9fa; border-left: 4px solid #007bff; padding: 15px; margin: 10px 0;&quot;&gt;

To determine the number of five-ruble coins Petya has, we will use the information given and set up a system of equations. Let&apos;s denote the number of one-ruble coins by $a$, the number of two-ruble coins by $b$, the number of five-ruble coins by $c$, and the number of ten-ruble coins by $d$.

**We know the following:**
1. The total number of coins is 25.
2. There are 19 coins that are not two-ruble coins, which means there are $25 - 19 = 6$ two-ruble coins.
3. There are 20 coins that are not ten-ruble coins, which means there are $25 - 20 = 5$ ten-ruble coins.
4. There are 16 coins that are not one-ruble coins, which means there are $25 - 16 = 9$ one-ruble coins.

**From the above information, we can write the following equations:**

$$a + b + c + d = 25$$

$$b = 6, \quad d = 5, \quad a = 9$$

**Now, we substitute** $b = 6$, $d = 5$, and $a = 9$ into the first equation:

$$9 + 6 + c + 5 = 25$$

$$20 + c = 25$$

$$c = 5$$

Thus, the number of five-ruble coins Petya has is $\boxed{5}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;&lt;strong&gt;After training (step 1000)&lt;/strong&gt;, outputs became 22% longer. Here’s a representative example (595 tokens):&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;📝 Example: Step 1000 solution with verification (595 tokens)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #fff8f0; border-left: 4px solid #ff9800; padding: 15px; margin: 10px 0;&quot;&gt;

Let&apos;s denote the number of 1-ruble coins, 2-ruble coins, 5-ruble coins, and 10-ruble coins by $a$, $b$, $c$, and $d$ respectively. We know the following:

1. The total number of coins is 25: $a + b + c + d = 25$
2. There are 19 coins that are not 2-ruble coins: $a + c + d = 19$
3. There are 20 coins that are not 10-ruble coins: $a + b + c = 20$
4. There are 16 coins that are not 1-ruble coins: $b + c + d = 16$

**We need to find the value of** $c$, the number of 5-ruble coins.

**First,** we subtract the second equation from the first equation:

$$(a + b + c + d) - (a + c + d) = 25 - 19$$

$$b = 6$$

**Next,** we subtract the third equation from the first equation:

$$(a + b + c + d) - (a + b + c) = 25 - 20$$

$$d = 5$$

**Then,** we subtract the fourth equation from the first equation:

$$(a + b + c + d) - (b + c + d) = 25 - 16$$

$$a = 9$$

**Now we substitute** $a = 9$, $b = 6$, and $d = 5$ into the first equation:

$$9 + 6 + c + 5 = 25$$

$$20 + c = 25$$

$$c = 5$$

**To verify, we check the consistency of the values with the given conditions:**

1. The total number of coins is: $a + b + c + d = 9 + 6 + 5 + 5 = 25$ ✓
2. The number of coins that are not 2-ruble coins is: $a + c + d = 9 + 5 + 5 = 19$ ✓
3. The number of coins that are not 10-ruble coins is: $a + b + c = 9 + 6 + 5 = 20$ ✓
4. The number of coins that are not 1-ruble coins is: $b + c + d = 6 + 5 + 5 = 16$ ✓

All conditions are satisfied, so the number of 5-ruble coins is $\boxed{5}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt; The solution remains correct but becomes more elaborate. The model now includes explicit verification steps, checking each condition after deriving the answer. While this adds length, it doesn’t provide computational advantage for such a straightforward problem.&lt;/p&gt;

&lt;h4 id=&quot;harder-problem-circle-and-right-triangle-geometry&quot;&gt;Harder Problem: Circle and Right Triangle Geometry&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; On the leg BC of right triangle ABC, a circle is constructed with BC as the diameter. The circle intersects the hypotenuse at point D such that AD:BD = 1:3. The altitude from vertex C (the right angle) to the hypotenuse is 3. Find the leg BC.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/geometry_problem_visualization.png&quot; alt=&quot;Geometry Problem Visualization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concise solution approach&lt;/strong&gt; (using geometric mean theorem):&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Since BC is diameter, ∠BDC = 90° by Thales’ theorem → D is the altitude foot&lt;/li&gt;
  &lt;li&gt;Apply altitude geometric mean: h² = AD · BD
    &lt;ul&gt;
      &lt;li&gt;3² = x · 3x (where AD = x, BD = 3x)&lt;/li&gt;
      &lt;li&gt;x = √3, so AB = 4√3&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Apply leg-segment relationship: BC² = BD · AB = 3√3 · 4√3 = 36
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;BC = 6&lt;/strong&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
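&lt;p&gt;The arithmetic in the concise approach checks out numerically:&lt;/p&gt;

```python
import math

h = 3.0                      # altitude from the right angle to the hypotenuse
x = math.sqrt(h * h / 3)     # h^2 = AD * BD = x * 3x  =>  x = sqrt(3)
AD, BD = x, 3 * x
AB = AD + BD                 # hypotenuse = 4 * sqrt(3)
BC = math.sqrt(BD * AB)      # leg-segment relation: BC^2 = BD * AB = 36
print(round(BC, 10))         # 6.0
```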

&lt;p&gt;For this problem, outputs from step 1000 averaged &lt;strong&gt;2021 tokens&lt;/strong&gt; compared to &lt;strong&gt;1544 tokens&lt;/strong&gt; at step 100—a &lt;strong&gt;31% increase&lt;/strong&gt;. Solutions from later training tend to use algebraic approaches with systems of equations rather than leveraging geometric relationships directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative solution from step 100&lt;/strong&gt; (clean geometric approach):&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;✅ Example: Clean geometric solution from step 100&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #f0f9ff; border-left: 4px solid #0ea5e9; padding: 15px; margin: 10px 0;&quot;&gt;

To solve the problem, we need to use the properties of the right triangle and the circle constructed on the hypotenuse. Let&apos;s denote the right triangle $ABC$ with $\angle ACB = 90^\circ$, and let $BC = a$, $AC = b$, and $AB = c$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. This altitude splits the hypotenuse into two segments, $AD$ and $BD$, such that $AD : BD = 1 : 3$.

**Step 1: Apply geometric mean theorem**

In a right triangle, the altitude to the hypotenuse creates two smaller right triangles that are similar to the original triangle and to each other. Therefore, the altitude $CD$ is the geometric mean of the segments of the hypotenuse it creates.

Let $AD = x$ and $BD = 3x$. Then, $AB = AD + BD = x + 3x = 4x$. Since the altitude $CD$ is 3, we have:

$$CD^2 = AD \cdot BD$$

$$3^2 = x \cdot 3x$$

$$9 = 3x^2$$

$$x = \sqrt{3}$$

Thus, $AD = \sqrt{3}$ and $BD = 3\sqrt{3}$, and the hypotenuse $AB = 4\sqrt{3}$.

**Step 2: Use Thales&apos; theorem**

Since the circle is constructed on $BC$ as the diameter and intersects the hypotenuse at point $D$, we know that $\angle BDC = 90^\circ$ (Thales&apos; theorem). This means that $D$ is the foot of the altitude from $C$ to $AB$.

**Step 3: Apply Pythagorean theorem**

Since $D$ is the foot of the altitude, we can find the length of $BC$ using the Pythagorean theorem in the smaller right triangle $BDC$:

$$BC^2 = BD^2 + CD^2$$

$$BC^2 = (3\sqrt{3})^2 + 3^2$$

$$BC^2 = 27 + 9 = 36$$

$$BC = 6$$

Thus, the length of $BC$ is $\boxed{6}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;This solution efficiently combines the geometric mean theorem, Thales’ theorem, and the Pythagorean theorem to reach the answer in a clear logical sequence.&lt;/p&gt;

&lt;p&gt;However, not all step 100 outputs are this clean. Some start correctly but make algebraic errors mid-solution, yet still arrive at the correct answer:&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;⚠️ Example: Flawed solution from step 100 (correct answer despite errors)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #fff5f5; border-left: 4px solid #ef4444; padding: 15px; margin: 10px 0;&quot;&gt;

To solve for the length of the leg $BC$ in the right triangle $ABC$ with the given conditions, we will use geometric properties and relationships. Let&apos;s denote the lengths of the sides as follows: $AB = c$, $BC = a$, and $CA = b$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. The circle with diameter $BC$ intersects the hypotenuse $AB$ at point $D$ such that $AD : BD = 1 : 3$.

First, we use the property of the right triangle and the altitude to the hypotenuse. The altitude $CH$ from $C$ to $AB$ can be expressed as:

$$CH = \frac{ab}{c} = 3$$

This gives us the relationship: $ab = 3c$.

Next, we use the fact that $AD : BD = 1 : 3$. Let $AD = x$ and $BD = 3x$. Therefore, the length of the hypotenuse $AB$ is:

$$AB = AD + BD = x + 3x = 4x$$

Since $AB = c$, we have $c = 4x$.

**❌ Error: Fabricated formula.** Now, we use the fact that $D$ lies on the circle with diameter $BC$. This means that $\angle BDC = 90^\circ$. By the power of a point theorem (also known as the intersecting chords theorem), we have:

$$AD \cdot BD = CD \cdot HD$$

&lt;span style=&quot;color: #ef4444; font-weight: bold;&quot;&gt;This formula is incorrect! The power of point theorem doesn&apos;t apply this way.&lt;/span&gt;

Since $D$ is on the circle with diameter $BC$, $CD$ is the altitude from $C$ to $AB$, and $HD$ is the segment of the altitude from $D$ to $H$. However, since $D$ is on the circle, we can use the fact that $CD = CH = 3$ and $HD = 0$ (since $D$ is on the hypotenuse and the altitude is perpendicular to the hypotenuse). Thus, we have:

$$AD \cdot BD = 3 \cdot (c - 3)$$

Substituting $AD = x$ and $BD = 3x$, we get:

$$x \cdot 3x = 3 \cdot (4x - 3)$$

$$3x^2 = 12x - 9$$

$$x^2 - 4x + 3 = 0$$

$$(x - 1)(x - 3) = 0$$

Thus, the solutions are $x = 1$ or $x = 3$. Since $x = 3$ would make $AB = 12$, which is not possible given the altitude constraint, we have $x = 1$. Therefore, $AB = 4x = 4$.

Now, we need to find the length of $BC$. Using the relationship $ab = 3c$ and knowing $c = 4$, we have:

$$ab = 3 \cdot 4 = 12$$

We also know that in a right triangle, the legs $a$ and $b$ satisfy the Pythagorean theorem:

$$a^2 + b^2 = c^2 = 16$$

We can solve this system by expressing $b$ in terms of $a$:

$$b = \frac{12}{a}$$

Substituting into the Pythagorean equation:

$$a^4 - 16a^2 + 144 = 0$$

Let $y = a^2$:

$$y^2 - 16y + 144 = 0$$

$$y = \frac{16 \pm \sqrt{256 - 576}}{2} = \frac{16 \pm \sqrt{-320}}{2}$$

**❌ Error: Negative discriminant.** &lt;span style=&quot;color: #ef4444; font-weight: bold;&quot;&gt;This gives a negative discriminant, indicating the approach has gone wrong!&lt;/span&gt;

Since the discriminant is negative, we need to re-evaluate the possible values of $a$ and $b$. Given the constraints, we can solve for $a$ and $b$ directly by trial and error or by solving the quadratic equation correctly. The correct values are $a = 3$ and $b = 4$ or vice versa. Therefore, the length of $BC$ is:

$$\boxed{6}$$

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;This example shows a common pattern at step 100: the solution makes mathematical errors (fabricated power-of-point formula, negative discriminant) but somehow recovers to produce the correct final answer—suggesting the model may be leveraging memorized answers rather than purely following its derivation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative solution from step 1000&lt;/strong&gt; (algebraic brute-force approach):&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;📐 Example: Long algebraic solution from step 1000 (~100 lines)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #faf5ff; border-left: 4px solid #9333ea; padding: 15px; margin: 10px 0;&quot;&gt;

To solve the problem, we start with the right triangle $ABC$ where $\angle C = 90^\circ$. Let $BC = a$ and $AC = b$. The hypotenuse $AB = c$ can be found using the Pythagorean theorem:

$$c = \sqrt{a^2 + b^2}$$

A circle is constructed with $BC$ as the diameter, so the center of the circle is the midpoint of $BC$, and the radius is $\frac{a}{2}$. Since $BC$ is the diameter, any angle subtended by $BC$ on the circle is a right angle. Therefore, $\angle BDC = 90^\circ$, making $D$ the foot of the perpendicular from $C$ to the hypotenuse $AB$.

Given that $AD : BD = 1 : 3$, we can let $AD = x$ and $BD = 3x$. Thus, the length of the hypotenuse $AB = c = AD + BD = x + 3x = 4x$.

The altitude from $C$ to the hypotenuse $AB$ is given as 3. Let the foot of the perpendicular from $C$ to $AB$ be $D$. The area of the triangle $ABC$ can be calculated in two ways. First, using the legs $a$ and $b$:

$$\text{Area} = \frac{1}{2} \times a \times b$$

Second, using the hypotenuse $c$ and the altitude from $C$ to $AB$:

$$\text{Area} = \frac{1}{2} \times AB \times CD = \frac{1}{2} \times 4x \times 3 = 6x$$

Equating the two expressions for the area, we get:

$$\frac{1}{2} \times a \times b = 6x \implies ab = 12x$$

Since $AB = c = 4x$, we use the Pythagorean theorem:

$$c = \sqrt{a^2 + b^2} = 4x \implies a^2 + b^2 = (4x)^2 = 16x^2$$

We now have two key equations:

1. $ab = 12x$
2. $a^2 + b^2 = 16x^2$

We can express $b$ in terms of $a$ and $x$ from the first equation:

$$b = \frac{12x}{a}$$

Substitute $b = \frac{12x}{a}$ into the second equation:

$$a^2 + \left(\frac{12x}{a}\right)^2 = 16x^2 \implies a^2 + \frac{144x^2}{a^2} = 16x^2$$

Multiply through by $a^2$ to clear the fraction:

$$a^4 + 144x^2 = 16x^2a^2 \implies a^4 - 16x^2a^2 + 144x^2 = 0$$

Let $y = a^2$. Then the equation becomes a quadratic in $y$:

$$y^2 - 16x^2y + 144x^2 = 0$$

We solve this quadratic equation using the quadratic formula $y = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$:

$$y = \frac{16x^2 \pm \sqrt{(16x^2)^2 - 4 \cdot 1 \cdot 144x^2}}{2 \cdot 1} = \frac{16x^2 \pm \sqrt{256x^4 - 576x^2}}{2} = \frac{16x^2 \pm \sqrt{64x^2(4x^2 - 9)}}{2} = \frac{16x^2 \pm 8x\sqrt{4x^2 - 9}}{2} = 8x^2 \pm 4x\sqrt{4x^2 - 9}$$

So, we have two solutions for $y$:

$$y = 8x^2 + 4x\sqrt{4x^2 - 9} \quad \text{or} \quad y = 8x^2 - 4x\sqrt{4x^2 - 9}$$

Since $y = a^2$, we need to determine which solution is valid. We know $a$ and $b$ are positive, so we need to check the feasibility of both solutions.

If $y = 8x^2 + 4x\sqrt{4x^2 - 9}$, then:

$$a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$$

If $y = 8x^2 - 4x\sqrt{4x^2 - 9}$, then:

$$a^2 = 8x^2 - 4x\sqrt{4x^2 - 9}$$

We need to find $x$ such that both $a$ and $b$ are real and positive. Let&apos;s test the feasible solution:

Since $b = \frac{12x}{a}$, we substitute $a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$:

$$b^2 = \left(\frac{12x}{a}\right)^2 = \frac{144x^2}{a^2} = \frac{144x^2}{8x^2 + 4x\sqrt{4x^2 - 9}} = \frac{144x^2}{4x(2x + \sqrt{4x^2 - 9})} = \frac{36x}{2x + \sqrt{4x^2 - 9}}$$

Rationalizing the denominator:

$$b^2 = \frac{36x(2x - \sqrt{4x^2 - 9})}{(2x + \sqrt{4x^2 - 9})(2x - \sqrt{4x^2 - 9})} = \frac{36x(2x - \sqrt{4x^2 - 9})}{4x^2 - (4x^2 - 9)} = \frac{36x(2x - \sqrt{4x^2 - 9})}{9} = 4x(2x - \sqrt{4x^2 - 9})$$

We need to check if $a^2 + b^2 = 16x^2$:

$$a^2 + b^2 = (8x^2 + 4x\sqrt{4x^2 - 9}) + 4x(2x - \sqrt{4x^2 - 9}) = 8x^2 + 4x\sqrt{4x^2 - 9} + 8x^2 - 4x\sqrt{4x^2 - 9} = 16x^2$$

This confirms our solution is correct. Now we need to find $a$:

Since $a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$, we need to find $x$. Using the altitude property:

The area of the triangle is also given by:

$$\text{Area} = \frac{1}{2} \times AB \times CD = \frac{1}{2} \times 4x \times 3 = 6x$$

We already have $ab = 12x$. To find $x$, we use the relationship involving the segments of the hypotenuse:

Since $AD = x$ and $BD = 3x$, we use the property of the altitude in a right triangle:

$$CD^2 = AD \cdot BD = x \cdot 3x = 3x^2 \implies CD = \sqrt{3x^2} = x\sqrt{3}$$

But we know $CD = 3$, so:

$$x\sqrt{3} = 3 \implies x = \sqrt{3}$$

Now we find $a$:

$$a^2 = 8x^2 + 4x\sqrt{4x^2 - 9} = 8(\sqrt{3})^2 + 4\sqrt{3}\sqrt{4(\sqrt{3})^2 - 9} = 8 \cdot 3 + 4\sqrt{3}\sqrt{12 - 9} = 24 + 4\sqrt{3} \cdot \sqrt{3} = 24 + 4 \cdot 3 = 24 + 12 = 36$$

So, $a = \sqrt{36} = 6$.

Thus, the length of $BC$ is $\boxed{6}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt; This step 1000 solution is mathematically complete and rigorous—every algebraic step is justified and the logic is sound. However, it takes an algebraic brute-force approach, setting up a quartic equation and using the quadratic formula with denominator rationalization. The solution doesn’t leverage the elegant geometric insight upfront (that D = H, making the geometric mean theorem directly applicable).&lt;/p&gt;

&lt;p&gt;What could be solved in three key steps (geometric mean → find $x$ → apply the leg-segment relationship) instead becomes a ~100-line algebraic derivation. The model only applies the geometric mean theorem near the end, after extensive algebraic machinery. While this demonstrates strong algebraic skill, it represents a shift away from geometric insight toward computational thoroughness.&lt;/p&gt;

&lt;h4 id=&quot;key-takeaways-from-case-studies&quot;&gt;Key Takeaways from Case Studies&lt;/h4&gt;

&lt;p&gt;These examples illustrate two manifestations of length growth:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Added verification steps&lt;/strong&gt; (coin problem): The model learns to explicitly verify solutions even when unnecessary, adding ~100 tokens without improving accuracy on simple problems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Shift toward algebraic thoroughness&lt;/strong&gt; (geometry problem): Later checkpoints favor algebraic brute-force methods over geometric insights, even when the geometric approach is more elegant and efficient. This increases token count by ~500 tokens while maintaining correctness.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both patterns demonstrate how training on mixed-difficulty datasets causes the model to apply more elaborate reasoning strategies across all problems—a form of “overthinking” on easy questions that emerges as a side effect of learning to solve harder ones.&lt;/p&gt;

&lt;h2 id=&quot;mitigating-overthinking-length-penalty-interventions&quot;&gt;Mitigating Overthinking: Length Penalty Interventions&lt;/h2&gt;

&lt;p&gt;Overthinking has been widely recognized as computationally wasteful, prompting growing interest in mitigation strategies. The most common approach is reward shaping through length penalties, which penalize longer outputs. Truncation can be viewed as an extreme length penalty in which outputs exceeding a limit receive zero advantage.&lt;/p&gt;

&lt;p&gt;We test several penalty variants and connect them to our findings on the mechanism behind length growth.&lt;/p&gt;

&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge&lt;/h3&gt;

&lt;p&gt;The asymmetric risk principle creates a dilemma: the model needs longer reasoning chains for hard problems, but this drives wasteful length growth on easy problems where correctness can be maintained at shorter lengths. An ideal intervention would:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Preserve hard problem performance&lt;/strong&gt;: Allow the model to “think long” when necessary&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prevent easy problem waste&lt;/strong&gt;: Discourage unnecessarily long responses on simple problems&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Adapt to problem difficulty&lt;/strong&gt;: Different problems have different “natural” solution lengths&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;length-penalty-variants&quot;&gt;Length Penalty Variants&lt;/h3&gt;

&lt;p&gt;We implement and evaluate three penalty mechanisms, each with different strategies for balancing efficiency and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variable definitions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$R \in \{0, 1\}$ is the binary correctness reward&lt;/li&gt;
  &lt;li&gt;$L$ is the response length in tokens&lt;/li&gt;
  &lt;li&gt;$\alpha, \beta, \gamma$ are penalty coefficients (hyperparameters)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;1-uniform-length-penalty&quot;&gt;1. Uniform Length Penalty&lt;/h4&gt;

\[\text{reward} = R - \alpha \times L\]

&lt;p&gt;The simplest approach applies a constant penalty per token across all problems, regardless of difficulty.&lt;/p&gt;
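&lt;p&gt;As a minimal sketch (function and variable names are our own, not from any particular codebase), the uniform shaping is a single subtraction:&lt;/p&gt;

```python
def uniform_length_penalty(R: float, L: int, alpha: float = 1e-4) -> float:
    """Shaped reward: binary correctness minus a constant per-token tax.

    R     -- binary correctness reward (0 or 1)
    L     -- response length in tokens
    alpha -- penalty coefficient, applied identically to every problem
    """
    return R - alpha * L
```

Because the tax is the same regardless of difficulty, a correct 1000-token solution to a hard problem loses as much reward as 1000 wasted tokens on an easy one.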

&lt;h4 id=&quot;2-group-based-adaptive-penalty-kimi-k15-10&quot;&gt;2. Group-Based Adaptive Penalty (Kimi K1.5 [10])&lt;/h4&gt;

\[\lambda = 0.5 - \frac{L - \text{min\_len}}{\text{max\_len} - \text{min\_len}}\]

\[\text{reward} = \begin{cases}
R + \beta \times \lambda &amp;amp; \text{(correct responses)} \\
R + \beta \times \min(0, \lambda) &amp;amp; \text{(incorrect responses)}
\end{cases}\]

&lt;p&gt;Inspired by Kimi K1.5 [10], this approach tracks the min/max response length per prompt and applies difficulty-aware penalties. It automatically adapts to each problem’s natural length range, rewarding brevity for correct responses while penalizing long incorrect ones. Note that $\lambda \in [-0.5, 0.5]$ normalizes the penalty based on the group’s length distribution.&lt;/p&gt;
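&lt;p&gt;A sketch of this shaping (our paraphrase of the rule from [10]; names are ours), computed per prompt over the group of sampled responses:&lt;/p&gt;

```python
def group_length_reward(R, L, group_lengths, beta=0.1):
    """Group-based adaptive length shaping in the style of Kimi K1.5 [10].

    lam is +0.5 for the shortest response in the group and -0.5 for the
    longest. Correct answers earn a brevity bonus; incorrect answers are
    only ever penalized for length, never rewarded.
    """
    min_len, max_len = min(group_lengths), max(group_lengths)
    if max_len == min_len:   # degenerate group: no length signal
        return R
    lam = 0.5 - (L - min_len) / (max_len - min_len)
    if R == 1:
        return R + beta * lam
    return R + beta * min(0.0, lam)
```

Note that min_len and max_len move with the policy: as the group gets shorter, the threshold for "long" shifts down with it, which is the moving-target dynamic discussed in the results below.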

&lt;h4 id=&quot;3-anchor-length-penalty&quot;&gt;3. Anchor Length Penalty&lt;/h4&gt;

\[r = \max\left(0, \frac{L - \text{anchor\_length}}{\text{anchor\_length}}\right)\]

\[\text{reward} = R - \gamma \times r\]

&lt;p&gt;where $\text{anchor\_length}$ is set when the model first achieves reliable competence (pass@1 &amp;gt; 90%).&lt;/p&gt;

&lt;p&gt;This penalty is also adaptive per prompt, but uses a reference length from when the model first demonstrated competence on that problem, rather than the current group’s responses.&lt;/p&gt;
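&lt;p&gt;A minimal sketch of the anchor mechanism as described above (the anchors dict and helper names are our illustrative choices):&lt;/p&gt;

```python
def anchor_length_reward(R, L, anchor_length, gamma=0.01):
    """Anchor penalty: only length beyond the per-prompt anchor is taxed."""
    r = max(0.0, (L - anchor_length) / anchor_length)
    return R - gamma * r

def maybe_set_anchor(anchors, prompt_id, pass_at_1, mean_length):
    """Freeze the anchor the first time pass@1 exceeds 90% on this prompt.

    Once set, the anchor never moves, so the reference length is fixed at
    the model's early-training efficiency rather than chasing the current
    group -- the key difference from the group-based penalty.
    """
    if prompt_id not in anchors and pass_at_1 > 0.9:
        anchors[prompt_id] = mean_length
    return anchors
```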

&lt;h3 id=&quot;experimental-results&quot;&gt;Experimental Results&lt;/h3&gt;

&lt;p&gt;In the experiments below, we set $\alpha=10^{-4}, \beta=0.1, \gamma=0.01$. These values were chosen so the penalty magnitude roughly matches the binary reward (given initial output lengths of $\sim 1000$ tokens).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: These are initial hyperparameter settings; further tuning may improve performance.&lt;/em&gt;&lt;/p&gt;
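&lt;p&gt;A quick back-of-envelope check, using only numbers quoted in this post, shows the scale each penalty reaches at a typical response length:&lt;/p&gt;

```python
# Rough magnitude check for the coefficients used in the experiments.
L_typical = 1000   # typical initial output length (tokens)

uniform = 1e-4 * L_typical       # alpha * L: 0.1 per 1000 tokens, on every token
group_max = 0.1 * 0.5            # beta * |lambda|_max: at most 0.05
anchor = 0.01 * (1300 - L_typical) / L_typical   # gamma * r at a 30% overshoot

# All three are small fractions of the binary reward R in {0, 1}, with the
# uniform penalty the largest of the three.
```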

&lt;p&gt;The figures below compare the four approaches across training:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control (Pink) - No Penalty Baseline&lt;/strong&gt;: Shows continued length growth from ~1000 to ~1300 tokens as the model learns to solve harder problems, with easy problem lengths also increasing (the “overthinking” phenomenon).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uniform Penalty (Orange) - The Exploration Tax&lt;/strong&gt;: Because the penalty is applied indiscriminately, the model cannot afford the computational cost required to solve hard problems at the weight $\alpha$ we use. The gradient signal is dominated by the length penalty, forcing the model to collapse to empty responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Group-Based Penalty (Green) - The Race to the Bottom&lt;/strong&gt;: While initially stable, the relative nature of this penalty creates a moving target. As the batch becomes more efficient, the definition of “long” shifts downwards. The model chases this moving target until it optimizes away the reasoning chain entirely, leading to a total collapse in capability (score drops to 0). This may be why the Kimi K1.5 authors note that the length reward can slow down training initially, and why they choose to gradually warm up the length penalty during training [10].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor Penalty (Teal) - Stable Efficiency&lt;/strong&gt;: By contrast, the Anchor Penalty successfully stabilizes the length. It arrests the heavy-tail drift seen in the control run but does not force the model below its necessary reasoning baseline. It effectively locks in the efficiency of the early training stages without suffocating exploration.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/penalty-comparison-length.svg&quot; alt=&quot;length-penalty-comparison-length&quot; /&gt;
&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/penalty-comparison-score.svg&quot; alt=&quot;length-penalty-comparison-reward&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantitative comparison on easy vs. hard problems:&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Checkpoint&lt;/th&gt;
      &lt;th&gt;Penalty&lt;/th&gt;
      &lt;th&gt;Easy Pass Rate&lt;/th&gt;
      &lt;th&gt;Easy Avg Length&lt;/th&gt;
      &lt;th&gt;Hard Pass Rate&lt;/th&gt;
      &lt;th&gt;Hard Avg Length&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 100&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;95.1%&lt;/td&gt;
      &lt;td&gt;530&lt;/td&gt;
      &lt;td&gt;2.3%&lt;/td&gt;
      &lt;td&gt;1039&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;97.0%&lt;/td&gt;
      &lt;td&gt;661&lt;/td&gt;
      &lt;td&gt;5.2%&lt;/td&gt;
      &lt;td&gt;1288&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000&lt;/td&gt;
      &lt;td&gt;Anchor ($\gamma=0.01$)&lt;/td&gt;
      &lt;td&gt;97.0%&lt;/td&gt;
      &lt;td&gt;563&lt;/td&gt;
      &lt;td&gt;4.6%&lt;/td&gt;
      &lt;td&gt;958&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000&lt;/td&gt;
      &lt;td&gt;Anchor ($\gamma=0.1$)&lt;/td&gt;
      &lt;td&gt;96.5%&lt;/td&gt;
      &lt;td&gt;533&lt;/td&gt;
      &lt;td&gt;4.5%&lt;/td&gt;
      &lt;td&gt;1067&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;All results are computed as the average of 64 samples per prompt.&lt;/em&gt;&lt;/p&gt;
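&lt;p&gt;As we report them, the table’s metrics reduce to simple means over the 64 samples per prompt; a sketch (our helper, assuming per-prompt lists of (correct, length) pairs):&lt;/p&gt;

```python
def summarize(samples_by_prompt):
    """Compute pass rate and average length over all samples.

    samples_by_prompt: {prompt_id: [(correct: bool, length: int), ...]}
    Pools every sample (e.g. 64 per prompt) and averages across the pool.
    """
    flat = [s for samples in samples_by_prompt.values() for s in samples]
    pass_rate = sum(c for c, _ in flat) / len(flat)   # bool sums as 0/1
    avg_len = sum(l for _, l in flat) / len(flat)
    return pass_rate, avg_len
```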

&lt;p&gt;The Anchor Penalty with $\gamma=0.01$ successfully reduces easy problem length by 15% (661→563 tokens) and hard problem length by 26% (1288→958 tokens) while maintaining comparable accuracy to the no-penalty baseline.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Our empirical investigation reveals &lt;strong&gt;dataset difficulty composition as the primary factor determining length dynamics&lt;/strong&gt; under our Zero-RL setup.&lt;/p&gt;

&lt;h3 id=&quot;findings&quot;&gt;Findings&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Datasets with narrow difficulty range (MATH, Science) → Length convergence&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Models trained on datasets whose difficulty is well matched to the model’s ability converge to stable output lengths&lt;/li&gt;
      &lt;li&gt;Neither KL divergence nor GRPO length bias explains this behavior&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mixed-difficulty datasets with heavy hard-tail (BigMath) → Length growth&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Datasets with difficulty imbalance produce U-shaped length dynamics&lt;/li&gt;
      &lt;li&gt;Removing hard problems eliminates consistent length growth&lt;/li&gt;
      &lt;li&gt;Hard problems drive the overall length increase&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The mechanism: Moving reward distributions cause asymmetric risk&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Hard problems get reward gains from longer reasoning (1039→1288 tokens, +249 tokens)&lt;/li&gt;
      &lt;li&gt;This policy shift affects all problems broadly (easy problems: 530→661 tokens, +131 tokens)&lt;/li&gt;
      &lt;li&gt;Models learn to maintain correctness on easy problems at longer lengths&lt;/li&gt;
      &lt;li&gt;The empirical reward distribution for easy problems flattens to accommodate this shift&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Difficulty-aware length penalties can mitigate overthinking without harming performance&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Uniform penalties cause exploration collapse by taxing all reasoning equally&lt;/li&gt;
      &lt;li&gt;Group-based adaptive penalties create unstable moving targets that can lead to capability collapse&lt;/li&gt;
      &lt;li&gt;Anchor-based penalties successfully stabilize length by locking to early-training efficiency&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;limitations-and-future-work&quot;&gt;Limitations and Future Work&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model families and contamination:&lt;/strong&gt; Our study uses the Qwen model family with Qwen3-4B as the base model in the majority of the experiments. Recent work has noted that some Qwen variants may exhibit training-data leakage on math benchmarks, and our own case studies show examples where the model reaches the correct final answer with flawed intermediate reasoning. Our analysis therefore focuses on answer-level correctness and length dynamics for this particular model under binary verifier rewards. We do not claim that the observed behavior necessarily reflects faithful mathematical reasoning, nor that it generalizes to all model families.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Length penalty applicability:&lt;/strong&gt; The anchor penalty approach assumes that models start with reasonable output lengths and grow longer during training. It may not be suitable for models that are initially too verbose—for example, models initialized from checkpoints distilled with long chain-of-thought traces. In such cases, the anchor would be set at an already-inflated length, failing to provide efficiency gains. In those cases, a pre-set reference length like that proposed in the LASER paper [11] would be a good substitute. Our experiments use base models trained with RL from scratch (R1-Zero paradigm). A natural next step is evaluating these penalty mechanisms on supervised fine-tuned (SFT) models that may exhibit different initial length distributions. We plan to investigate this in the next phase of our experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theoretical understanding:&lt;/strong&gt; While we’ve established the empirical patterns of length dynamics, the underlying mechanisms require further theoretical investigation. Key open questions include understanding why reward distributions exhibit inverted-U shapes across different lengths, how GRPO’s mode-matching behavior interacts with these reward curves, and whether we can develop predictive models for length dynamics on mixed-difficulty datasets.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] DeepSeek-AI. (2025). &lt;em&gt;DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning&lt;/em&gt;. Technical Report. https://arxiv.org/abs/2501.12948&lt;/p&gt;

&lt;p&gt;[2] DeepScaleR Team. (2025). &lt;em&gt;DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL&lt;/em&gt;. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2&lt;/p&gt;

&lt;p&gt;[3] HKUST NLP Group. (2025). &lt;em&gt;Simple-RL-Zero: Replicating DeepSeek-R1-Zero’s Reasoning with Reinforcement Learning&lt;/em&gt;. https://hkust-nlp.notion.site/simplerl-reason&lt;/p&gt;

&lt;p&gt;[4] Zhang, Y., et al. (2025). &lt;em&gt;When More is Less: Understanding Chain-of-Thought Length in LLMs&lt;/em&gt;. arXiv preprint.&lt;/p&gt;

&lt;p&gt;[5] NVIDIA. (2024). &lt;em&gt;Nemotron-4 340B Technical Report&lt;/em&gt;. https://arxiv.org/abs/2406.11704&lt;/p&gt;

&lt;p&gt;[6] Shao, Z., et al. (2024). &lt;em&gt;DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models&lt;/em&gt;. arXiv:2402.03300. https://arxiv.org/abs/2402.03300&lt;/p&gt;

&lt;p&gt;[7] Xiong, W., et al. (2025). &lt;em&gt;DAPO: An Open-Source LLM Reinforcement Learning System at Scale&lt;/em&gt;. arXiv:2503.14476. https://arxiv.org/abs/2503.14476&lt;/p&gt;

&lt;p&gt;[8] Luo, H., et al. (2025). &lt;em&gt;BigMath: A Large-Scale High-Quality Mathematical Dataset for Enhanced Reasoning&lt;/em&gt;. arXiv:2502.17387. https://arxiv.org/abs/2502.17387&lt;/p&gt;

&lt;p&gt;[9] Wei, Z., et al. (2025). &lt;em&gt;Concise Reasoning via Reinforcement Learning&lt;/em&gt;. arXiv:2504.05185. https://arxiv.org/abs/2504.05185&lt;/p&gt;

&lt;p&gt;[10] Moonshot AI. (2025). &lt;em&gt;Kimi k1.5: Scaling Reinforcement Learning with LLMs&lt;/em&gt;. Technical Report. https://github.com/MoonshotAI/Kimi-k1.5&lt;/p&gt;

&lt;p&gt;[11] Liu, W., Zhou, R., Deng, Y., Huang, Y., Liu, J., Deng, Y., Zhang, Y., and He, J. (2025). &lt;em&gt;Learn to Reason Efficiently with Adaptive Length-based Reward Shaping&lt;/em&gt;. arXiv:2505.15612. https://arxiv.org/abs/2505.15612&lt;/p&gt;

</description>
        <pubDate>Sun, 21 Dec 2025 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/research/understanding-length-dynamics/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/research/understanding-length-dynamics/</guid>
      </item>
    
  </channel>
</rss>
