Implementing Training-Free Process Rewards in VeRL


Motivation

Current RLVR (Reinforcement Learning with Verifiable Rewards) frameworks typically assign a single reward at the end of a response—correct or incorrect. But not all reasoning steps contribute equally. Some are critical insights, others are routine algebra, and some are wasteful exploration.

Process rewards assign credit to intermediate steps, enabling step-level credit assignment. This unlocks several applications:

  • Better training efficiency: Denser reward signal provides more gradient information per sample
  • Breaking out of zero pass rate: When a problem is too hard for any complete solution, partial progress can still be rewarded
  • Less verbosity: Penalize unproductive reasoning loops and overthinking

The Challenge

How do you obtain step-level reward signals? Two main approaches:

  1. Train a Process Reward Model (PRM): Requires labeled data for intermediate steps [1] or a strong model judge
  2. Training-free signals: Derive rewards from the policy itself

Note that if you have a well-trained value network, that naturally provides token-level process rewards via TD-error. Value network training is out of scope for this post—we focus on training-free approaches that work with critic-free algorithms like RLOO/GRPO.

Training-Free Process Rewards

Monte Carlo Estimation

A key insight from VinePPO [2]: language environments are naturally “resettable”—you can return to any intermediate state simply by prompting with that prefix. This enables estimating V(prefix) at any point in a reasoning trace.

VinePPO uses Monte Carlo estimation: sample K complete rollouts from each prefix and average their outcomes. This is expensive—for N steps and K samples, you need N×K rollouts per training example.
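For concreteness, the Monte Carlo estimator can be sketched in a few lines. The `sample_completion` and `is_correct` callables are hypothetical stand-ins for the rollout engine and the answer verifier:

```python
def mc_value(prefix: str, k: int, sample_completion, is_correct) -> float:
    """Estimate V(prefix) by sampling k full completions from the prefix
    and averaging their binary outcome rewards (VinePPO-style)."""
    wins = 0
    for _ in range(k):
        completion = sample_completion(prefix)  # one full rollout from this state
        wins += int(is_correct(completion))     # 1 if the final answer checks out
    return wins / k
```

Running this at every one of the N step boundaries with K samples each is exactly the N×K rollout cost noted above.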

VinePPO illustration

Log-Probability Approximation

We approximate V(prefix) more efficiently using a single forward pass:

\[V(\text{prefix}) \approx \frac{1}{n}\sum_{i=1}^{n} \log P(a_i \mid \text{prefix} + \text{force\_prompt} + a_{<i})\]

Instead of sampling K complete rollouts, we:

  1. Truncate the response at an episode boundary
  2. Append a “force answer” prompt (e.g., </think>\n\nThe answer is )
  3. Compute the mean log-probability of the ground-truth answer tokens

We use log-probability directly (not converted to probability) for numerical stability. This estimates “if the model were forced to answer now, how likely would it produce the correct answer?”
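The three steps above can be sketched as follows; `token_logprob` is a hypothetical hook standing in for a forward pass through the actor model:

```python
FORCE_PROMPT = "</think>\n\nThe answer is "  # must match the model's chat template

def v_prefix(prefix: str, answer_tokens, token_logprob) -> float:
    """Approximate V(prefix) as the mean log-prob of the ground-truth
    answer tokens when the model is forced to answer now.

    `token_logprob(context, token)` returns log P(token | context)
    under the actor model (hypothetical interface).
    """
    context = prefix + FORCE_PROMPT
    total = 0.0
    for tok in answer_tokens:
        total += token_logprob(context, tok)
        context += tok  # teacher-force the ground-truth answer
    return total / len(answer_tokens)
```

In practice all answer tokens are scored in a single forward pass over `prefix + FORCE_PROMPT + answer`; the loop here just makes the teacher-forcing explicit.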

Assumptions and limitations:

  1. A small set of correct answers (e.g., math problems with a single numerical answer). If the space of correct answers is large (e.g., open-ended instruction following), this approach won’t work.
  2. A compatible force-answer prompt. The prompt (e.g., </think>\n\nThe answer is ) must be consistent with the model’s chat template and training format.
  3. Biased estimate. Unlike Monte Carlo, this does not provide an unbiased estimate of V(prefix). We are effectively using the base model with a force-answer prompt as a prover policy [4][5]—a separate policy that completes the solution from an intermediate state. The quality of V(prefix) depends on how well this prover correlates with the true probability of success.

Episode Segmentation

To identify intermediate states, we segment reasoning traces into “episodes” using discourse markers:

EPISODE_MARKERS = [
    "Wait,", "Alternatively,", "Actually,", "Hmm,",
    "Let me ", "I need to ", "So ", "But ",
    # ... more markers
]

Each marker indicates a potential state boundary where we can evaluate V(prefix).

We also use a token length fallback: if no markers are found within a maximum token limit (e.g., 256 tokens), we split at sentence boundaries. This prevents issues where the model produces no markers at all, which would result in a single giant episode.
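A minimal segmentation sketch, using character offsets instead of token positions and omitting the length fallback:

```python
EPISODE_MARKERS = [
    "Wait,", "Alternatively,", "Actually,", "Hmm,",
    "Let me ", "I need to ", "So ", "But ",
]

def _find_all(text: str, sub: str):
    """Yield every start index of `sub` in `text`."""
    i = text.find(sub)
    while i != -1:
        yield i
        i = text.find(sub, i + 1)

def segment_episodes(text: str, markers=EPISODE_MARKERS):
    """Split a reasoning trace at discourse markers, so each episode
    begins at a marker (the first episode starts at position 0)."""
    cuts = sorted({i for m in markers for i in _find_all(text, m) if i > 0})
    episodes, start = [], 0
    for cut in cuts:
        episodes.append(text[start:cut])
        start = cut
    episodes.append(text[start:])
    return episodes
```

A production version would operate on token ids and apply the max-length fallback described above.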

flowchart LR
    A[Full Response] --> B[Segment by markers]
    B --> C[Episode 1: Problem setup]
    B --> D[Episode 2: Initial approach]
    B --> E[Episode 3: Wait, let me reconsider...]
    B --> F[Episode N: Final answer]

    C --> G["V(prefix₀)"]
    D --> H["V(prefix₁)"]
    E --> I["V(prefix₂)"]
    F --> J[Final reward]

Marginal Utility

With V(prefix) at each episode boundary, we compute marginal utility:

\[U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})\]

Since V(prefix) is in log-probability space, the difference measures the log-odds improvement from each episode.

  • U_i > 0: Episode i made progress
  • U_i < 0: Episode i was counterproductive
  • U_i ≈ 0: Episode i didn’t change much (possibly wasteful)
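In code, the marginal utilities are just first differences of the cached V(prefix) values:

```python
def marginal_utilities(values):
    """U_i = V(prefix_i) - V(prefix_{i-1}), given the ordered list of
    V(prefix) estimates in log-prob space (more negative = less likely
    to reach the correct answer)."""
    return [v - prev for prev, v in zip(values, values[1:])]
```

For example, values of `[-3.0, -2.0, -2.5]` yield utilities `[1.0, -0.5]`: the first episode made progress, the second was counterproductive.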

Implementing in VeRL

The natural approach is:

  1. Create a RewardManager that returns reward_tensor: [batch_size, seq_len] with non-zero values throughout the sequence
  2. Let VeRL’s training loop handle the rest

But there are subtle issues at every layer of the stack.

Reward Manager Architecture

VeRL provides two architectures for reward computation:

Legacy RewardManager (Synchronous)

The original approach: implement a RewardManager class that’s called synchronously during training.

class StepProgressRewardManager:
    def __call__(self, data: DataProto) -> torch.Tensor:
        # Compute rewards synchronously
        return reward_tensor  # [batch_size, seq_len]

Characteristics:

  • Has access to the actor model
  • Blocks the training loop during computation
  • Simple to implement and debug

Important: To use a custom RewardManager, you must set use_reward_loop=False in your config. Otherwise, VeRL defaults to the RewardLoopManager and silently bypasses your custom reward manager—a subtle source of bugs.

RewardLoopManager (Async) — Now Default

The default code path in VeRL when reward_model.enable=True. Runs reward computation asynchronously via a separate model server.

class CustomRewardLoopManager:
    def __init__(self, reward_model, ...):
        self.reward_model = reward_model  # Separate model instance (e.g., vLLM)

    async def compute_rewards(self, data):
        # Async reward computation
        return reward_tensor

Characteristics:

  • Async processing—doesn’t block training
  • Well-suited for external reward models: trained RMs, LLM-as-judge, rule-based verifiers
  • Does not have access to the actor model’s current weights

When to Use Which

| Use case | Recommended |
| --- | --- |
| External RM (trained reward model) | RewardLoopManager ✅ |
| LLM-as-judge | RewardLoopManager ✅ |
| Rule-based verification | Either works |
| Rewards derived from actor model (e.g., V(prefix)) | Legacy RewardManager ✅ |

Key consideration: If your reward computation requires the current policy (e.g., estimating V(prefix) for process rewards), the RewardLoopManager creates synchronization issues—the reward model copy can diverge from the actor during training. In this case, the legacy RewardManager is more appropriate.

Optimization: Pre-compute During Generation

Computing V(prefix) requires forward passes through the actor model. If done during reward computation, this blocks the training loop.

Solution: Pre-compute during generation, when the actor is already loaded.

# In generation phase:
# 1. Generate response
# 2. Segment into episodes
# 3. For each episode boundary:
#    - Construct prefix + force_answer_prompt + ground_truth
#    - Compute log_probs using actor (already loaded!)
#    - Store V(prefix) in batch.meta_info["prefix_value_cache"]

# In training phase:
# 1. Retrieve pre-computed V(prefix) from cache
# 2. Compute marginal utilities (cheap math)
# 3. Compute process rewards (cheap math)

This gave us 60-70% speedup compared to computing everything during training.
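A schematic of the handoff, with plain dicts standing in for VeRL's `DataProto`/`meta_info`; the function and key names here are illustrative, not VeRL API:

```python
def generation_phase(batch, actor_v_prefix, segment):
    """While the actor is loaded for generation, compute V(prefix) at
    every episode boundary and stash the values alongside the batch."""
    cache = []
    for response in batch["responses"]:
        prefixes = segment(response)  # cumulative prefixes at episode boundaries
        cache.append([actor_v_prefix(p) for p in prefixes])
    batch["meta_info"] = {"prefix_value_cache": cache}
    return batch

def training_phase(batch):
    """At reward time, only cheap arithmetic remains: read the cached
    values and take first differences to get per-episode utilities."""
    utilities = []
    for values in batch["meta_info"]["prefix_value_cache"]:
        utilities.append([b - a for a, b in zip(values, values[1:])])
    return utilities
```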

Advantage Estimation

Even with correct process rewards computed, most advantage estimators will destroy the fine-grained structure.

The Data Flow

Here’s how rewards flow through VeRL’s training pipeline:

flowchart TD
    A[RewardManager returns reward_tensor]
    B[compute_reward in trainer/ppo/reward.py]
    C[Store as token_level_scores]
    D[apply_kl_penalty optional]
    E[Store as token_level_rewards]
    F[compute_advantage in trainer/ppo/core_algos.py]
    G[GAE: Uses per-token rewards]
    H[GRPO: Sums to scalar]
    I[Token-level advantages]
    J[Scalar advantages]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    G --> I
    H --> J

    classDef good fill:#d4edda,stroke:#28a745,stroke-width:2px
    classDef bad fill:#f8d7da,stroke:#dc3545,stroke-width:2px

    class G good
    class H bad

The subtle issue: Most advantage estimators immediately collapse token-level rewards to scalars:

# Standard GRPO in VeRL (core_algos.py:301)
scores = token_level_rewards.sum(dim=-1)  # [batch, seq_len] → [batch]
advantages = (scores - group_mean) / group_std  # scalar per sequence

This destroys the fine-grained credit assignment you carefully designed! Your process rewards are summed into a single number before computing advantages.

The critical question: Does your advantage estimator use token_level_rewards[:, t] at each timestep, or does it sum first?

Token-Level Preserving Estimators

You need an advantage estimator that operates on token_level_rewards[:, t] at each timestep. For example, the GAE estimator function preserves this structure:

lastgaelam = 0.0
for t in reversed(range(seq_len)):
    nextvalues = values[:, t + 1] if t < seq_len - 1 else 0.0  # 0 past the end
    delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
    lastgaelam = delta + gamma * lam * lastgaelam
    advantages[:, t] = lastgaelam

If you want to keep using GRPO, you will need to implement your own estimator:

# Collect reward-bearing tokens from every response in the group
group_rewards = torch.cat([rewards[i, mask[i]] for i in group])

# Normalize by group statistics, but only at tokens that carry a reward,
# so zero-reward positions stay zero instead of being shifted by -mean/std
mean_R = group_rewards.mean()
std_R = group_rewards.std()
rewards_normalized = torch.where(mask, (rewards - mean_R) / std_R,
                                 torch.zeros_like(rewards))

# Advantage at token t = sum of normalized rewards from t to the end
advantages = rewards_normalized.flip(-1).cumsum(-1).flip(-1)

Normalization Pitfall: Mixed Reward Scales

When combining outcome rewards with process rewards, normalize them separately.

A subtle bug: normalizing outcome rewards (scale: 0-1) together with process rewards (scale: ~±0.03) makes the process reward signal negligible.
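A tiny numeric sketch of the failure mode, with made-up numbers:

```python
import statistics

# Outcome rewards live on a 0-1 scale; process rewards on a ~±0.03 scale
outcome = [0.0, 1.0, 1.0, 0.0]
process = [0.02, -0.03, 0.01, -0.01]

# Joint normalization: group statistics are dominated by the outcome scale,
# so the normalized process rewards end up nearly indistinguishable
joint = outcome + process
mu, sigma = statistics.mean(joint), statistics.pstdev(joint)
joint_norm = [(r - mu) / sigma for r in process]

# Separate normalization preserves the spread among process rewards
mu_p, sigma_p = statistics.mean(process), statistics.pstdev(process)
sep_norm = [(r - mu_p) / sigma_p for r in process]
```

After joint normalization the spread among the process rewards collapses to a small fraction of the outcome scale; normalized separately, they retain a full unit-variance spread.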

Fix: Identify and normalize each reward type separately:

# Outcome rewards at last token (like standard GRPO)
outcome_mask = ...  # last valid token of each response
id2outcome[group_idx].append(token_level_rewards[i][outcome_mask[i]])

# Process rewards at episode boundaries (separate normalization)
process_mask = nonzero_mask & ~outcome_mask
id2process[group_idx].append(token_level_rewards[i][process_mask[i]])

Summary

In this post, we discussed ways of obtaining a training-free process reward (one that does not rely on an external reward model) and walked through the implementation details in VeRL.

A few takeaways:

  1. Log-probability approximation is efficient: Estimating V(prefix) via $\log P(\text{answer} \mid \text{prefix})$ requires one forward pass, vs N×K rollouts for MC estimation

  2. Marginal utility captures step-level progress: $U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})$ measures how much each episode helps or hurts

  3. Choose the right reward manager architecture: Legacy RewardManager works better when rewards depend on the current policy; RewardLoopManager (now default) is designed for external reward models

  4. Pre-compute during generation: Move V(prefix) computation out of the training critical path for 60-70% speedup

  5. Use token-level advantage estimators: Standard GRPO collapses to scalars—use GAE or token-preserving GRPO

  6. Normalize reward types separately: Mixed-scale rewards (outcome + process) need separate normalization to preserve signal strength

References

[1] Lightman, Hunter, et al. “Let’s verify step by step.” The Twelfth International Conference on Learning Representations. 2023. https://arxiv.org/pdf/2305.20050

[2] Kazemnejad et al. (2024). VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment. https://arxiv.org/abs/2410.01679

[3] VeRL: Volcano Engine Reinforcement Learning for LLMs. https://github.com/volcengine/verl

[4] Setlur et al. (2024). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. https://arxiv.org/abs/2410.08146

[5] Qu et al. (2025). Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning. https://arxiv.org/abs/2503.07572