Implementing Process Rewards in VeRL


TL;DR

Using process rewards in VeRL requires two components: (1) a RewardManager that computes token-level rewards, and (2) an advantage estimator that preserves token-level structure. Most standard algorithms (GRPO, RLOO) collapse rewards to scalars, defeating the purpose of process rewards.

The Problem

Process rewards assign credit to intermediate steps, not just final outcomes. For example, in math problem solving, you might want to reward partial progress through reasoning steps.

The natural approach is:

  1. Create a RewardManager that returns a reward_tensor of shape [batch_size, seq_len] with non-zero values throughout the sequence (a minimal sketch follows below)
  2. Let VeRL’s training loop handle the rest
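For concreteness, here is a minimal sketch of what step 1 can produce. This is not VeRL's actual RewardManager interface; it only illustrates the shape contract, using a hypothetical helper that places each reasoning step's score on that step's final token:

import torch

def build_process_reward_tensor(step_scores, step_end_positions, batch_size, seq_len):
    # Hypothetical helper, not a VeRL API: each step's score goes on the
    # last token of that step; every other position stays at zero.
    reward_tensor = torch.zeros(batch_size, seq_len)
    for i in range(batch_size):
        for score, pos in zip(step_scores[i], step_end_positions[i]):
            reward_tensor[i, pos] = score
    return reward_tensor

# Toy usage: 2 samples, 8 response tokens, one score per reasoning step
reward_tensor = build_process_reward_tensor(
    step_scores=[[0.3, 0.5, 1.0], [0.2, -0.1]],
    step_end_positions=[[2, 5, 7], [3, 7]],
    batch_size=2,
    seq_len=8,
)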

But there’s a subtle issue: most advantage estimators in VeRL immediately collapse token-level rewards to scalars:

# Standard GRPO in VeRL (core_algos.py:301)
scores = token_level_rewards.sum(dim=-1)  # [batch, seq_len] → [batch]
advantages = (scores - group_mean) / group_std  # scalar per sequence

This destroys the fine-grained credit assignment you carefully designed! Your process rewards are summed into a single number before computing advantages.
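A tiny self-contained example (plain PyTorch, not VeRL code) makes the loss of information concrete: two rollouts with very different per-step credit become indistinguishable the moment their rewards are summed.

import torch

# Rollout 0 earns its reward in the first two reasoning steps;
# rollout 1 earns everything at the final token.
token_level_rewards = torch.tensor([
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

scores = token_level_rewards.sum(dim=-1)
print(scores)  # tensor([1., 1.]) -- identical, so the advantages will be identical too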

Understanding the Data Flow

Here’s how rewards flow through VeRL’s training pipeline:

flowchart TD
    A[RewardManager returns reward_tensor]
    B[compute_reward in trainer/ppo/reward.py]
    C[Store as token_level_scores]
    D[apply_kl_penalty optional]
    E[Store as token_level_rewards]
    F[compute_advantage in trainer/ppo/core_algos.py]
    G[GAE: Uses per-token rewards]
    H[GRPO: Sums to scalar]
    I[Token-level advantages]
    J[Scalar advantages]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    G --> I
    H --> J

    classDef good fill:#d4edda,stroke:#28a745,stroke-width:2px
    classDef bad fill:#f8d7da,stroke:#dc3545,stroke-width:2px

    class G good
    class H bad

The critical question: Does your advantage estimator use token_level_rewards[:, t] at each timestep, or does it sum first?

The Solution: Token-Level Preserving Estimators

You need an advantage estimator that operates on token_level_rewards[:, t] at each timestep.

Built-in: GAE ✅

File: verl/trainer/ppo/core_algos.py:213

lastgaelam = 0.0
for t in reversed(range(seq_len)):
    nextvalues = values[:, t + 1] if t < seq_len - 1 else 0.0  # bootstrap from the next token's value
    delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
    lastgaelam = delta + gamma * lam * lastgaelam  # becomes the advantage for token t

✅ Uses token_level_rewards[:, t] at every timestep

✅ Computes TD-error with your process rewards

✅ Most tested and stable

❌ Requires training a critic
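For reference, the loop above is the standard GAE recursion; the per-token process rewards $r_t$ enter through the TD error at every timestep:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \delta_t + \gamma \lambda \, \hat{A}_{t+1}
$$

with $V(s_{T+1}) = 0$ and $\hat{A}_{T+1} = 0$ beyond the last response token.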

Built-in: REINFORCE++ ✅

File: verl/trainer/ppo/core_algos.py:591

returns = torch.zeros_like(token_level_rewards)
running_return = 0.0
for t in reversed(range(seq_len)):
    running_return = token_level_rewards[:, t] + gamma * running_return
    returns[:, t] = running_return

✅ Uses token_level_rewards[:, t] at each timestep

✅ Simple, no critic needed

❌ Higher variance than GAE
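Written out, the recursion is a discounted suffix sum, so a process reward placed at token $t$ raises the return of token $t$ and every earlier token, but never a later one:

$$
R_t = r_t + \gamma R_{t+1}, \qquad R_{T+1} = 0
$$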

DeepSeekMath’s GRPO ✅

Standard GRPO collapses to scalars, but DeepSeekMath’s paper describes a token-preserving variant.

Key insight: Normalize before aggregating, not after.

# Collect all rewards from group
group_rewards = torch.cat([rewards[i, mask[i]] for i in group])

# Normalize each token by group statistics
mean_R = group_rewards.mean()
std_R = group_rewards.std()
rewards_normalized = (rewards - mean_R) / std_R

# Compute advantages as cumulative sum
advantages = rewards_normalized.flip(-1).cumsum(-1).flip(-1)

The order matters:

VeRL's GRPO:          rewards → sum → normalize → scalar ❌
DeepSeekMath's GRPO:  rewards → normalize → cumsum → token-level ✅

✅ No critic needed

✅ Group-relative normalization

✅ Simple (3 core operations)

Alternative: PRIME’s RLOO ✅

PRIME provides a custom token-level RLOO in their recipe.

# Per-sample mean first
sample_means = [rewards[i, mask[i]].mean() for i in range(n)]

# Leave-one-out baseline: shared sum of sample means, divided by (n - 1)
baseline = sum(sample_means) / (n - 1)

# Apply RLOO transform, then cumsum
rewards_rloo = rewards * (n/(n-1)) - baseline
returns = rewards_rloo.flip(-1).cumsum(-1).flip(-1)
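The n/(n-1) factor is not arbitrary: at the level of sample means, the leave-one-out baseline can be rearranged so that one pooled sum is shared by all $n$ samples in the group, which is exactly the form the pseudocode above applies elementwise to the token-level rewards:

$$
\bar{r}_i - \frac{1}{n-1} \sum_{j \neq i} \bar{r}_j
= \frac{n}{n-1} \, \bar{r}_i - \frac{1}{n-1} \sum_{j} \bar{r}_j
$$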

✅ Token-level structure preserved

✅ Leave-one-out variance reduction

❌ Not in VeRL core (requires PRIME recipe)

Comparison: DeepSeekMath GRPO vs PRIME RLOO

Both preserve token-level structure but differ in how they compute baselines:

Aspect               | DeepSeekMath GRPO               | PRIME RLOO
---------------------|---------------------------------|----------------------------------
Baseline granularity | Token-level (all tokens pooled) | Sample-level (sample means first)
Normalization        | Global z-score                  | RLOO transform + whitening

Example

Given 4 samples with different lengths:

Sample 0: [0.1, 0.2, 0.3]       (mean = 0.2)
Sample 1: [0.4, 0.5]            (mean = 0.45)
Sample 2: [0.2, 0.1, 0.2, 0.1]  (mean = 0.15)
Sample 3: [0.3, 0.4, 0.3]       (mean ≈ 0.333)

DeepSeekMath: Pools all 12 tokens → mean = 3.1 / 12 ≈ 0.258 (token-weighted)

PRIME: Pools the 4 sample means with the leave-one-out scaling → baseline = (0.2 + 0.45 + 0.15 + 0.333) / 3 ≈ 0.378 (sample-weighted)
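A quick self-contained check of both numbers (the leave-one-out baseline uses the sum-over-(n − 1) form from the PRIME pseudocode above):

import torch

samples = [
    torch.tensor([0.1, 0.2, 0.3]),
    torch.tensor([0.4, 0.5]),
    torch.tensor([0.2, 0.1, 0.2, 0.1]),
    torch.tensor([0.3, 0.4, 0.3]),
]

token_mean = torch.cat(samples).mean()                  # ≈ 0.258 (DeepSeekMath-style)
sample_means = torch.stack([s.mean() for s in samples])
loo_baseline = sample_means.sum() / (len(samples) - 1)  # ≈ 0.378 (PRIME-style)
print(token_mean, loo_baseline)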

Implementation: DeepSeekMath GRPO

DeepSeekMath’s token-preserving GRPO can be implemented as a new advantage estimator:

# Assumes this lives alongside (or imports from) verl/trainer/ppo/core_algos.py,
# where register_adv_est and AdvantageEstimator are defined;
# GRPO_PROCESS_REWARD is a new enum value you add yourself.
import torch
from collections import defaultdict

@register_adv_est(AdvantageEstimator.GRPO_PROCESS_REWARD)
def compute_grpo_with_process_reward_advantage(
    token_level_rewards, response_mask, index, epsilon=1e-6, **kwargs
):
    bsz = token_level_rewards.shape[0]

    # 1. Group samples by prompt and collect each sample's masked token rewards
    id2rewards = defaultdict(list)
    for i in range(bsz):
        sample_rewards = token_level_rewards[i][response_mask[i].bool()]
        id2rewards[index[i]].append(sample_rewards)

    # 2. Compute group-level mean/std over all pooled tokens
    id2mean, id2std = {}, {}
    for idx in id2rewards:
        group_rewards = torch.cat(id2rewards[idx])
        id2mean[idx] = group_rewards.mean()
        id2std[idx] = group_rewards.std()

    # 3. Normalize each token by its group's statistics
    rewards_normalized = torch.zeros_like(token_level_rewards)
    for i in range(bsz):
        rewards_normalized[i] = (token_level_rewards[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
    rewards_normalized = rewards_normalized * response_mask

    # 4. Advantage at token t = sum of normalized rewards from t to the end
    advantages = rewards_normalized.flip(-1).cumsum(-1).flip(-1)
    advantages = advantages * response_mask
    return advantages, advantages
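A quick shape sanity check with made-up tensors (assuming the estimator above is in scope; index plays the role of VeRL's per-prompt uid array that groups rollouts):

import numpy as np
import torch

bsz, resp_len = 4, 6
token_level_rewards = torch.rand(bsz, resp_len)
response_mask = torch.ones(bsz, resp_len)
response_mask[1, 4:] = 0                    # one shorter response
index = np.array(["q1", "q1", "q2", "q2"])  # two prompts, two rollouts each

advantages, returns = compute_grpo_with_process_reward_advantage(
    token_level_rewards=token_level_rewards,
    response_mask=response_mask,
    index=index,
)
assert advantages.shape == (bsz, resp_len)  # token-level, not collapsed to a scalar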

Key Takeaways

  1. Process rewards require token-level advantage estimators - standard GRPO/RLOO in VeRL collapse to scalars

  2. Check the data flow: Verify your estimator uses token_level_rewards[:, t] at each timestep, not sum(token_level_rewards)

  3. Four viable options:

    • GAE (most tested, requires critic)
    • REINFORCE++ (simplest, no critic)
    • DeepSeekMath GRPO (simple, no critic, group normalization)
    • PRIME RLOO (sample fairness, more complex)