Posts by Tags

MoE

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

PPO

RL

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

advantage-estimation

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

agents

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

chain-of-thought

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

context management

data transfer

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

distributed systems

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

evaluation

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.
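The unbiased estimator the excerpt refers to is the standard combinatorial form (from n total attempts with c correct, the chance that a draw of k attempts contains at least one success). A minimal sketch, assuming that formulation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note how the nonlinearity shows up: a problem solved 2 times out of 4 already gives pass@2 ≈ 0.83, so averaging pass@k across problems weights the rarely-solved ones far more than pass@1 does.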

importance sampling

knowledge-distillation

length-dynamics

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

memory

multi-agent

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

off-policy

on-policy-distillation

post-training

process-rewards

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

reasoning

What’s in Pass@K?

11 minute read

Pass@k is ubiquitous in evaluating reasoning models, but the metric is more subtle than it appears. Computing it correctly requires the unbiased estimator, and the nonlinearity of pass@k means it effectively upweights hard problems compared to pass@1.

Understanding Length Dynamics in RL Training

37 minute read

An empirical investigation into what drives output length growth during RL training, revealing that dataset difficulty composition is the primary driver behind the ‘overthinking’ phenomenon.

reinforcement-learning

Training-Free Process Rewards for LLM RL

11 minute read

A training-free approach to step-level credit assignment: estimate V(prefix) via log-probability, compute marginal utility across episodes — plus the implementation pitfalls that silently destroy the signal.

role-playing

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

routing replay

The Infrastructure Cost of MoE Routing Replay

15 minute read

Routing replay (R3) stabilizes MoE RL training, but the routing data is 97% of the generation payload. This post traces the bottleneck — the single-threaded manager pipeline, not bandwidth — and the failed ‘obvious’ fix that revealed a fundamental constraint of mixing NCCL with inference.

simulation

A Reflection on Multi-Agent Role-Playing

23 minute read

Role-playing was the first multi-agent pattern — assign personas, let agents debate or collaborate. But it was largely a product of 2023-2024 model capabilities. As models improve, the real value of multi-agent systems turns out to be structural.

verl