Blog posts

2026

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data makes up 97% of the generation payload. This post traces the bottleneck to the single-threaded manager pipeline rather than bandwidth, and follows the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern: assign personas, then let the agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
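The unbiased estimator referred to here is the standard one: draw n samples, count the c that pass, and compute pass@k = 1 − C(n−c, k)/C(n, k) rather than naively averaging over subsets. A minimal sketch (function name and signature are illustrative, not from the post):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn per problem
    c: number of samples that passed
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this per-problem estimate over a benchmark gives the reported pass@k; because the estimate saturates quickly on easy problems, the aggregate is dominated by the hard ones, which is the upweighting effect the post discusses.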

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

2025

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.