Sitemap
A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.
Pages
Page Not Found
Page not found. Your pixels are in another canvas.
Page not in menu
This is a page not in the main menu
Posts
The Infrastructure Cost of MoE Routing Replay
Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.
A Reflection on Multi-Agent Role-Playing
Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.
Context Management for LLM Agents: A Memory Hierarchy View
How LLM agents learn to manage their own context — from harness-driven compaction to memory tools and sub-agents — and why this may be the key bottleneck for long-horizon reasoning.
Off-Policy Corrections in LLM RL Training
A unified treatment of the five sources of distribution mismatch in LLM reinforcement learning and their corrections.
What’s in Pass@K?
Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
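As a quick illustration of the estimator this summary refers to, here is a minimal sketch of the standard unbiased pass@k computation (the function name is mine; n is the number of samples drawn, c the number that passed):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k
    samples (drawn without replacement from n) is correct,
    given c of the n samples passed."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The naive alternative, averaging over random k-subsets or computing 1 - (1 - c/n)^k, is biased; the combinatorial form above is exact for sampling without replacement.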
Training-Free Process Rewards for LLM RL
A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.
Implementing On-Policy Distillation: Lessons from Building OPD in VeRL
On-policy distillation integrates teacher guidance into RL training, but the implementation is full of silent failures. This post documents the architecture, pitfalls, and design choices from building OPD in VeRL.
Understanding Length Dynamics in RL Training
An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.
Portfolio
Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2
