Posts by Tag

RL

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

advantage-estimation

Implementing Training-Free Process Rewards in VeRL

9 minute read

A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.

chain-of-thought

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

evaluation

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.

knowledge-distillation

length-dynamics

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

on-policy-distillation

process-rewards

Implementing Training-Free Process Rewards in VeRL

9 minute read

A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.

reasoning

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

reinforcement-learning

Implementing Training-Free Process Rewards in VeRL

9 minute read

A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.

verl

Implementing Training-Free Process Rewards in VeRL

9 minute read

A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.