Off-Policy Corrections in LLM RL Training
Published:
A unified treatment of the five sources of distribution mismatch in LLM reinforcement learning and their corrections.
Published:
Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
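The unbiased estimator mentioned here is the standard one from Chen et al. (2021): with n samples drawn and c passing, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch (the function name `pass_at_k` is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k
    samples passes, given c of n drawn samples passed.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that naively computing `1 - (1 - c/n) ** k` is biased; the combinatorial form corrects for sampling without replacement from the n draws.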
Published:
A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.
Published:
On-policy distillation integrates teacher guidance into RL training, but the implementation is full of silent failures. This post documents the architecture, pitfalls, and design choices from building OPD in VeRL.
Published:
An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.