Understanding Length Dynamics in RL Training
Published:
An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.
Published:
A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.
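As a rough illustration of that summary (not the post's actual method or VeRL code), one plausible reading is: approximate V(prefix) by the log-probability a frozen policy assigns to a reference answer conditioned on that prefix, then take the per-step process reward as the marginal change in value. The helper `logprob_fn` below is hypothetical.

```python
def prefix_values(logprob_fn, prefixes, answer):
    """Hypothetical sketch: score each reasoning prefix by the log-probability
    the frozen policy assigns to the reference answer given that prefix.
    `logprob_fn(prompt, target)` is an assumed helper returning sum log p(target | prompt)."""
    return [logprob_fn(p, answer) for p in prefixes]

def marginal_utilities(values):
    """Process reward for step t as the marginal change in estimated value:
    r_t = V(prefix_t) - V(prefix_{t-1})."""
    return [v_next - v for v, v_next in zip(values[:-1], values[1:])]
```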
Published:
Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
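For reference, the unbiased estimator mentioned here (introduced in the Codex paper, Chen et al. 2021) is pass@k = 1 - C(n-c, k)/C(n, k) for n samples with c correct, usually computed in a numerically stable product form. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), in stable product form."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 16 samples per problem, 5 correct:
print(pass_at_k(16, 5, 4))
```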
Published:
On-policy distillation integrates teacher guidance into RL training, but the implementation is full of silent failures. This post documents the architecture, pitfalls, and design choices from building OPD in VeRL.
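As a generic illustration of the idea (not the post's VeRL implementation), on-policy distillation typically adds a per-token reverse-KL term between student and teacher distributions, evaluated on student-sampled response tokens. A minimal sketch, assuming both models share a vocabulary:

```python
import torch
import torch.nn.functional as F

def on_policy_kl_loss(student_logits, teacher_logits, response_mask):
    """Illustrative per-token KL(student || teacher) on student-sampled tokens.
    Shapes: [batch, seq_len, vocab]; response_mask marks response positions."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = sum_v p_s * (log p_s - log p_t)
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)
```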