Implementing Process Rewards in VeRL
TL;DR
Using process rewards in VeRL requires two components: (1) a RewardManager that computes token-level rewards, and (2) an advantage estimator that preserves token-level structure. Most standard algorithms (GRPO, RLOO) collapse rewards to scalars, defeating the purpose of process rewards.
The Problem
Process rewards assign credit to intermediate steps, not just final outcomes. For example, in math problem solving, you might want to reward partial progress through reasoning steps.
The natural approach is:
- Create a RewardManager that returns reward_tensor: [batch_size, seq_len] with non-zero values throughout the sequence (sketched below)
- Let VeRL's training loop handle the rest
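For concreteness, here is a minimal sketch of such a manager. The class, its call signature, and the score_step / find_step_ends helpers are hypothetical simplifications rather than VeRL's actual RewardManager interface; the point is only the shape of the output: a [batch_size, seq_len] tensor with rewards placed at intermediate tokens.
import torch

class ProcessRewardManager:
    """Minimal sketch (simplified; not VeRL's exact RewardManager interface)."""

    def __init__(self, tokenizer, score_step, find_step_ends):
        self.tokenizer = tokenizer
        self.score_step = score_step          # hypothetical: scores one reasoning step
        self.find_step_ends = find_step_ends  # hypothetical: maps each step to its last token index

    def __call__(self, responses: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = responses.shape
        reward_tensor = torch.zeros(batch_size, seq_len, dtype=torch.float32)
        for i in range(batch_size):
            valid_ids = responses[i][response_mask[i].bool()]
            text = self.tokenizer.decode(valid_ids)
            # Place a reward at the final token of every reasoning step,
            # so partial progress earns partial credit.
            for step_text, end_pos in self.find_step_ends(text, valid_ids):
                reward_tensor[i, end_pos] = self.score_step(step_text)
        return reward_tensor  # [batch_size, seq_len], non-zero at step boundaries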
But there’s a subtle issue: most advantage estimators in VeRL immediately collapse token-level rewards to scalars:
# Standard GRPO in VeRL (core_algos.py:301)
scores = token_level_rewards.sum(dim=-1) # [batch, seq_len] → [batch]
advantages = (scores - group_mean) / group_std # scalar per sequence
This destroys the fine-grained credit assignment you carefully designed! Your process rewards are summed into a single number before computing advantages.
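A tiny example makes the information loss concrete: two responses with very different per-token credit but the same total reward become indistinguishable once the sum is taken.
import torch

# Two responses: one earns credit early, the other only at the final token.
token_level_rewards = torch.tensor([[0.9, 0.0, 0.0, 0.1],
                                    [0.0, 0.0, 0.0, 1.0]])

scores = token_level_rewards.sum(dim=-1)                       # tensor([1., 1.])
advantages = (scores - scores.mean()) / (scores.std() + 1e-6)
print(advantages)  # tensor([0., 0.]): both sequences receive the same advantage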
Understanding the Data Flow
Here’s how rewards flow through VeRL’s training pipeline:
flowchart TD
A[RewardManager returns reward_tensor]
B[compute_reward in trainer/ppo/reward.py]
C[Store as token_level_scores]
D[apply_kl_penalty optional]
E[Store as token_level_rewards]
F[compute_advantage in trainer/ppo/core_algos.py]
G[GAE: Uses per-token rewards]
H[GRPO: Sums to scalar]
I[Token-level advantages]
J[Scalar advantages]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
F --> H
G --> I
H --> J
classDef good fill:#d4edda,stroke:#28a745,stroke-width:2px
classDef bad fill:#f8d7da,stroke:#dc3545,stroke-width:2px
class G good
class H bad
The critical question: Does your advantage estimator use token_level_rewards[:, t] at each timestep, or does it sum first?
The Solution: Token-Level Preserving Estimators
You need an advantage estimator that operates on token_level_rewards[:, t] at each timestep.
Built-in: GAE ✅
File: verl/trainer/ppo/core_algos.py:213
for t in reversed(range(seq_len)):
    nextvalues = values[:, t + 1] if t < seq_len - 1 else 0.0
    delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
    lastgaelam = delta + gamma * lam * lastgaelam
✅ Uses token_level_rewards[:, t] at every timestep
✅ Computes TD-error with your process rewards
✅ Most tested and stable
❌ Requires training a critic
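As a quick sanity check, here is a self-contained sketch of the same recursion (simplified relative to VeRL's full implementation, with the critic values set to zero). It shows that a process reward placed mid-sequence shifts the advantages of that token and every token before it:
import torch

def gae_advantages(token_level_rewards, values, response_mask, gamma=1.0, lam=1.0):
    # Sketch of the GAE recursion above; simplified relative to VeRL's full implementation.
    batch, seq_len = token_level_rewards.shape
    advantages = torch.zeros_like(token_level_rewards)
    lastgaelam = torch.zeros(batch)
    for t in reversed(range(seq_len)):
        nextvalues = values[:, t + 1] if t < seq_len - 1 else torch.zeros(batch)
        delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[:, t] = lastgaelam
    return advantages * response_mask

# A process reward at token 1 raises the advantages at tokens 0 and 1,
# not just at the end of the sequence.
rewards = torch.tensor([[0.0, 0.5, 0.0, 1.0]])
print(gae_advantages(rewards, torch.zeros(1, 4), torch.ones(1, 4)))
# tensor([[1.5000, 1.5000, 1.0000, 1.0000]])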
Built-in: REINFORCE++ ✅
File: verl/trainer/ppo/core_algos.py:591
running_return = 0
for t in reversed(range(seq_len)):
    running_return = token_level_rewards[:, t] + gamma * running_return
    returns[:, t] = running_return
✅ Uses token_level_rewards[:, t] at each timestep
✅ Simple, no critic needed
❌ Higher variance than GAE
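The same kind of check works here. The sketch below wraps the loop in a standalone function (any normalization VeRL applies on top of these returns is omitted) to show that each token is credited with all rewards from its position onward:
import torch

def reinforce_pp_returns(token_level_rewards, response_mask, gamma=1.0):
    # Return-to-go from the loop above; any further normalization is omitted here.
    batch, seq_len = token_level_rewards.shape
    returns = torch.zeros_like(token_level_rewards)
    running_return = torch.zeros(batch)
    for t in reversed(range(seq_len)):
        running_return = token_level_rewards[:, t] + gamma * running_return
        returns[:, t] = running_return
    return returns * response_mask

rewards = torch.tensor([[0.2, 0.0, 0.8]])
print(reinforce_pp_returns(rewards, torch.ones(1, 3)))
# tensor([[1.0000, 0.8000, 0.8000]]): token 0 is credited for its own 0.2 plus what follows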
DeepSeekMath’s GRPO ✅
Standard GRPO collapses to scalars, but DeepSeekMath’s paper describes a token-preserving variant.
Key insight: Normalize before aggregating, not after.
# Collect all rewards from group
group_rewards = torch.cat([rewards[i, mask[i]] for i in group])
# Normalize each token by group statistics
mean_R = group_rewards.mean()
std_R = group_rewards.std()
rewards_normalized = (rewards - mean_R) / std_R
# Compute advantages as cumulative sum
advantages = rewards_normalized.flip(-1).cumsum(-1).flip(-1)
The order matters:
VeRL's GRPO: rewards → sum → normalize → scalar ❌
DeepSeekMath's GRPO: rewards → normalize → cumsum → token-level ✅
✅ No critic needed
✅ Group-relative normalization
✅ Simple (3 core operations)
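A toy group of two equal-length responses (so the group statistics reduce to a plain mean/std over all tokens) shows why the ordering matters:
import torch

# Two responses to the same prompt: same total reward, different per-token credit.
rewards = torch.tensor([[0.9, 0.0, 0.1],
                        [0.0, 0.0, 1.0]])

# VeRL-style GRPO: sum first, then normalize. One identical scalar per response.
scores = rewards.sum(-1)
scalar_adv = (scores - scores.mean()) / (scores.std() + 1e-6)
print(scalar_adv)   # tensor([0., 0.])

# DeepSeekMath-style: normalize every token by group statistics, then reverse cumsum.
normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
token_adv = normalized.flip(-1).cumsum(-1).flip(-1)
print(token_adv)    # distinct token-level advantages survive for the two responses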
Alternative: PRIME’s RLOO ✅
PRIME provides a custom token-level RLOO in their recipe.
# Per-sample mean first
sample_means = [rewards[i, mask[i]].mean() for i in samples]
# Shared leave-one-out term: combined with the n/(n-1) scaling below,
# each sample's own mean drops out of its baseline
baseline = sum(sample_means) / (n - 1)
# Apply RLOO transform, then cumsum
rewards_rloo = rewards * (n / (n - 1)) - baseline
returns = rewards_rloo.flip(-1).cumsum(-1).flip(-1)
✅ Token-level structure preserved
✅ Leave-one-out variance reduction
❌ Not in VeRL core (requires PRIME recipe)
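Here is a sketch of that recipe on a toy group with ragged lengths, using plain lists of tensors instead of PRIME's batched implementation:
import torch

# n = 3 responses to the same prompt, with per-token process rewards of different lengths.
group = [torch.tensor([0.1, 0.2, 0.3]),
         torch.tensor([0.4, 0.5]),
         torch.tensor([0.2, 0.1, 0.2, 0.1])]
n = len(group)

sample_means = torch.stack([r.mean() for r in group])  # one mean per sample
baseline = sample_means.sum() / (n - 1)                # shared term from the snippet above

token_level_returns = []
for r in group:
    # With the n/(n-1) scaling, each sample's own mean is left out of its baseline.
    rloo = r * (n / (n - 1)) - baseline
    token_level_returns.append(rloo.flip(-1).cumsum(-1).flip(-1))

print(token_level_returns[0])  # token-level returns for sample 0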
Comparison: DeepSeekMath GRPO vs PRIME RLOO
Both preserve token-level structure but differ in how they compute baselines:
| Aspect | DeepSeekMath GRPO | PRIME RLOO |
|---|---|---|
| Baseline granularity | Token-level (all tokens pooled) | Sample-level (sample means first) |
| Normalization | Global z-score | RLOO transform + whitening |
Example
Given 4 samples with different lengths:
Sample 0: [0.1, 0.2, 0.3]      (mean = 0.2)
Sample 1: [0.4, 0.5]           (mean = 0.45)
Sample 2: [0.2, 0.1, 0.2, 0.1] (mean = 0.15)
Sample 3: [0.3, 0.4, 0.3]      (mean = 0.333)
DeepSeekMath: Pools all 12 tokens → mean = 3.1 / 12 ≈ 0.258 (token-weighted)
PRIME: Sums the 4 sample means and divides by n - 1 = 3 → baseline = 1.133 / 3 ≈ 0.378 (sample-weighted)
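Both numbers follow directly from those definitions; DeepSeekMath averages over the pooled tokens, while PRIME sums the per-sample means and divides by n - 1:
import torch

samples = [torch.tensor([0.1, 0.2, 0.3]),
           torch.tensor([0.4, 0.5]),
           torch.tensor([0.2, 0.1, 0.2, 0.1]),
           torch.tensor([0.3, 0.4, 0.3])]

token_mean = torch.cat(samples).mean()                    # 3.1 / 12  ≈ 0.258
sample_means = torch.stack([s.mean() for s in samples])
rloo_baseline = sample_means.sum() / (len(samples) - 1)   # 1.133 / 3 ≈ 0.378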
Implementation: DeepSeekMath GRPO
DeepSeekMath’s token-preserving GRPO can be implemented as a new advantage estimator:
from collections import defaultdict
import torch

@register_adv_est(AdvantageEstimator.GRPO_PROCESS_REWARD)  # new enum value you add for this estimator
def compute_grpo_with_process_reward_advantage(
    token_level_rewards, response_mask, index, epsilon=1e-6, **kwargs  # kwargs absorbs any extra args the trainer may pass
):
    bsz = token_level_rewards.shape[0]

    # 1. Group samples by prompt and collect each sample's masked token rewards
    id2rewards = defaultdict(list)
    for i in range(bsz):
        sample_rewards = token_level_rewards[i][response_mask[i].bool()]
        id2rewards[index[i]].append(sample_rewards)

    # 2. Compute group-level mean/std over all pooled tokens
    id2mean, id2std = {}, {}
    for idx in id2rewards:
        group_rewards = torch.cat(id2rewards[idx])
        id2mean[idx] = group_rewards.mean()
        id2std[idx] = group_rewards.std()

    # 3. Normalize each token by its group's statistics (masked so padding stays zero)
    rewards_normalized = torch.zeros_like(token_level_rewards)
    for i in range(bsz):
        idx = index[i]
        rewards_normalized[i] = (token_level_rewards[i] - id2mean[idx]) / (id2std[idx] + epsilon) * response_mask[i]

    # 4. Compute advantages as the reversed cumulative sum
    advantages = rewards_normalized.flip(-1).cumsum(-1).flip(-1)
    return advantages, advantages
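A quick way to sanity-check the estimator is to call it directly on a toy batch. The call below assumes the simplified signature used in the sketch above; the real signature has to match whatever VeRL's register_adv_est registry passes in your version:
import torch

token_level_rewards = torch.tensor([[0.0, 0.5, 0.0, 1.0],
                                    [0.0, 0.0, 0.0, 1.0]])
response_mask = torch.ones(2, 4)
index = [0, 0]  # both samples belong to the same prompt group

advantages, returns = compute_grpo_with_process_reward_advantage(
    token_level_rewards=token_level_rewards,
    response_mask=response_mask,
    index=index,
)
print(advantages.shape)  # torch.Size([2, 4]): token-level, not one scalar per sequence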
Key Takeaways
Process rewards require token-level advantage estimators - standard GRPO/RLOO in VeRL collapse to scalars
Check the data flow: verify your estimator uses token_level_rewards[:, t] at each timestep, not sum(token_level_rewards)
Four viable options:
- GAE (most tested, requires critic)
- REINFORCE++ (simplest, no critic)
- DeepSeekMath GRPO (simple, no critic, group normalization)
- PRIME RLOO (sample fairness, more complex)