Off-Policy Corrections in LLM RL Training
Published:
A unified treatment of the five sources of distribution mismatch in LLM reinforcement learning and their corrections.
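For orientation, the canonical correction for this kind of mismatch is an importance-sampling reweighting of the policy-gradient loss, usually with PPO-style clipping to bound the variance. The sketch below is illustrative only — the function name and interface are assumptions, not the post's implementation:

```python
import math

def clipped_is_loss(logp_new, logp_old, advantages, clip=0.2):
    """PPO-style clipped importance-sampling loss over a token sequence.

    A minimal sketch of correcting for off-policy data: each token's
    advantage is reweighted by the ratio pi_new / pi_old between the
    training policy and the (stale) behavior policy that generated the
    rollout, with the ratio clipped to [1 - clip, 1 + clip].
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)          # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        total += min(ratio * adv, clipped * adv)   # pessimistic of the two
    return -total / len(advantages)                # negated: minimize
```

When the data is exactly on-policy (`logp_new == logp_old`), every ratio is 1 and this reduces to the vanilla policy-gradient loss.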
Published:
An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.
Published:
A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.
Published:
Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
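For reference, the standard unbiased estimator is: given n samples of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch, computed as a running product for numerical stability:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples
    drawn without replacement from n (of which c are correct) passes.
    Equivalent to 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must pass
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

The naive estimate — is any of the first k samples correct? — is biased for k < n; this estimator averages over all size-k subsets of the n samples.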
Published:
On-policy distillation integrates teacher guidance into RL training, but the implementation is full of silent failures. This post documents the architecture, pitfalls, and design choices from building OPD in VeRL.