Off-Policy Corrections in LLM RL Training
Published:
A unified treatment of the five sources of distribution mismatch in LLM reinforcement learning and their corrections.
Published:
Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
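The unbiased estimator mentioned here is the standard one from Chen et al. (2021): with n samples drawn and c passing, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch (the function name `pass_at_k` is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k
    samples passes, given c of n drawn samples passed.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that naively computing `1 - (1 - c/n) ** k` is biased; the combinatorial form corrects for sampling without replacement from the n draws.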
Published:
A training-free approach to process rewards: estimate V(prefix) via log-probability, compute marginal utility across episodes. Plus VeRL implementation pitfalls to avoid.
Published:
On-policy distillation integrates teacher guidance into RL training, but the implementation is full of silent failures. This post documents the architecture, pitfalls, and design choices from building OPD in VeRL.
Published:
An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.