Posts by Tags

MoE

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

PPO

RL

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

advantage-estimation

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

agents

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

chain-of-thought

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

context management

data transfer

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

distributed systems

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

evaluation

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
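The unbiased estimator the excerpt refers to is the standard combinatorial form (from n total attempts with c correct, the chance that a draw of k attempts contains at least one success). A minimal sketch, assuming that formulation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note how the nonlinearity shows up: a problem solved 2 times out of 4 already gives pass@2 ≈ 0.83, so averaging pass@k across problems weights the rarely-solved ones far more than pass@1 does.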

importance sampling

knowledge-distillation

length-dynamics

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

memory

multi-agent

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

off-policy

on-policy-distillation

post-training

process-rewards

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

reasoning

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

reinforcement-learning

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

role-playing

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

routing replay

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

simulation

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

verl