Implementing Training-Free Process Rewards in VeRL
Motivation
Current RLVR (RL from Verifiable Rewards) frameworks typically assign a single reward at the end of a response—correct or incorrect. But not all reasoning steps contribute equally. Some are critical insights, others are routine algebra, and some may be wasteful exploration.
Process rewards assign credit to intermediate steps, enabling step-level credit assignment. This unlocks several applications:
- Better training efficiency: Denser reward signal provides more gradient information per sample
- Breaking out of zero pass rate: When a problem is too hard for any complete solution, partial progress can still be rewarded
- Less verbosity: Penalize unproductive reasoning loops and overthinking
The Challenge
How do you obtain step-level reward signals? Two main approaches:
- Train a Process Reward Model (PRM): Requires labeled data for intermediate steps [1] or a strong model judge
- Training-free signals: Derive rewards from the policy itself
Note that if you have a well-trained value network, that naturally provides token-level process rewards via TD-error. Value network training is out of scope for this post—we focus on training-free approaches that work with critic-free algorithms like RLOO/GRPO.
Training-Free Process Rewards
Monte Carlo Estimation
A key insight from VinePPO [2]: language environments are naturally “resettable”—you can return to any intermediate state simply by prompting with that prefix. This enables estimating V(prefix) at any point in a reasoning trace.
VinePPO uses Monte Carlo estimation: sample K complete rollouts from each prefix and average their outcomes. This is expensive—for N steps and K samples, you need N×K rollouts per training example.
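To make the cost concrete, here is a minimal sketch of the Monte Carlo estimator. The `rollout_fn` sampler is a hypothetical interface (not VinePPO's actual API): it completes a solution from the given prefix and returns 1 if the final answer is correct, 0 otherwise.

```python
def mc_value(rollout_fn, prefix, K=8):
    """Monte Carlo estimate of V(prefix): sample K complete rollouts
    from the prefix and average their binary outcomes (1 = correct).
    `rollout_fn` is a hypothetical sampler interface."""
    return sum(rollout_fn(prefix) for _ in range(K)) / K

def mc_cost(num_steps, K=8):
    """Rollouts needed per training example: one MC estimate per step."""
    return num_steps * K
```

With N = 10 steps and K = 8 samples, a single training example already costs 80 full rollouts, which is what motivates the cheaper approximation below.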

Log-Probability Approximation
We approximate V(prefix) more efficiently using a single forward pass:
\[V(\text{prefix}) \approx \frac{1}{n}\sum_{i=1}^{n} \log P(a_i \mid \text{prefix} + \text{force\_prompt} + a_{<i})\]

Instead of sampling K complete rollouts, we:
- Truncate the response at an episode boundary
- Append a “force answer” prompt (e.g., `</think>\n\nThe answer is`)
- Compute the mean log-probability of the ground-truth answer tokens
We use log-probability directly (not converted to probability) for numerical stability. This estimates “if the model were forced to answer now, how likely would it produce the correct answer?”
Assumptions and limitations:
- A small set of correct answers (e.g., math problems with a single numerical answer). If the space of correct answers is large (e.g., open-ended instruction following), this approach won’t work.
- A compatible force-answer prompt. The prompt (e.g., `</think>\n\nThe answer is`) must be consistent with the model’s chat template and training format.
- Biased estimate. Unlike Monte Carlo, this does not provide an unbiased estimate of V(prefix). We are effectively using the base model with a force-answer prompt as a prover policy [4][5]—a separate policy that completes the solution from an intermediate state. The quality of V(prefix) depends on how well this prover correlates with the true probability of success.
Episode Segmentation
To identify intermediate states, we segment reasoning traces into “episodes” using discourse markers:
EPISODE_MARKERS = [
"Wait,", "Alternatively,", "Actually,", "Hmm,",
"Let me ", "I need to ", "So ", "But ",
# ... more markers
]
Each marker indicates a potential state boundary where we can evaluate V(prefix).
We also use a token length fallback: if no markers are found within a maximum token limit (e.g., 256 tokens), we split at sentence boundaries. This prevents issues where the model produces no markers at all, which would result in a single giant episode.
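A simplified version of this segmentation logic, using character counts in place of token counts (the real implementation would measure lengths with the tokenizer):

```python
import re

EPISODE_MARKERS = ["Wait,", "Alternatively,", "Actually,", "Hmm,",
                   "Let me ", "I need to ", "So ", "But "]

def segment_episodes(text, max_len=1000):
    """Split a reasoning trace at discourse markers; over-long pieces
    fall back to sentence-boundary splits. Character-based sketch."""
    pattern = "|".join(re.escape(m) for m in EPISODE_MARKERS)
    pieces, start = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > start:  # cut just before each marker occurrence
            pieces.append(text[start:m.start()])
            start = m.start()
    pieces.append(text[start:])
    out = []
    for piece in pieces:
        while len(piece) > max_len:  # fallback: sentence-boundary split
            cut = piece.rfind(". ", 0, max_len)
            cut = cut + 2 if cut != -1 else max_len
            out.append(piece[:cut])
            piece = piece[cut:]
        out.append(piece)
    return out
```

Note the markers are matched literally, so episodes always begin at a marker (except the first); the fallback guarantees no episode exceeds the length cap even for marker-free traces.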
flowchart LR
A[Full Response] --> B[Segment by markers]
B --> C[Episode 1: Problem setup]
B --> D[Episode 2: Initial approach]
B --> E[Episode 3: Wait, let me reconsider...]
B --> F[Episode N: Final answer]
C --> G["V(prefix₀)"]
D --> H["V(prefix₁)"]
E --> I["V(prefix₂)"]
F --> J[Final reward]
Marginal Utility
With V(prefix) at each episode boundary, we compute marginal utility:
\[U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})\]

Since V(prefix) is in log-probability space, the difference measures the log-probability improvement from each episode.
- U_i > 0: Episode i made progress
- U_i < 0: Episode i was counterproductive
- U_i ≈ 0: Episode i didn’t change much (possibly wasteful)
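A worked toy example (the V values are made up for illustration):

```python
def marginal_utilities(prefix_values):
    """U_i = V(prefix_i) - V(prefix_{i-1}) for i = 1..N."""
    return [v - v_prev for v_prev, v in zip(prefix_values, prefix_values[1:])]

# Toy V(prefix) trace in log-prob space, one value per episode boundary
V = [-2.3, -1.1, -1.15, -0.2]
U = marginal_utilities(V)
# U[0] > 0: progress; U[1] ≈ 0: little change; U[2] > 0: progress
```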
Implementing in VeRL
The natural approach is:
- Create a `RewardManager` that returns `reward_tensor: [batch_size, seq_len]` with non-zero values throughout the sequence
- Let VeRL’s training loop handle the rest
But there are subtle issues at every layer of the stack.
Reward Manager Architecture
VeRL provides two architectures for reward computation:
Legacy RewardManager (Synchronous)
The original approach: implement a RewardManager class that’s called synchronously during training.
class StepProgressRewardManager:
def __call__(self, data: DataProto) -> torch.Tensor:
# Compute rewards synchronously
return reward_tensor # [batch_size, seq_len]
Characteristics:
- Has access to the actor model
- Blocks the training loop during computation
- Simple to implement and debug
Important: To use a custom RewardManager, you must set use_reward_loop=False in your config. Otherwise, VeRL defaults to the RewardLoopManager and silently bypasses your custom reward manager—a subtle source of bugs.
RewardLoopManager (Async) — Now Default
The default code path in VeRL when reward_model.enable=True. Runs reward computation asynchronously via a separate model server.
class CustomRewardLoopManager:
def __init__(self, reward_model, ...):
self.reward_model = reward_model # Separate model instance (e.g., vLLM)
async def compute_rewards(self, data):
# Async reward computation
return reward_tensor
Characteristics:
- Async processing—doesn’t block training
- Well-suited for external reward models: trained RMs, LLM-as-judge, rule-based verifiers
- Does not have access to the actor model’s current weights
When to Use Which
| Use Case | Recommended |
|---|---|
| External RM (trained reward model) | RewardLoopManager ✅ |
| LLM-as-judge | RewardLoopManager ✅ |
| Rule-based verification | Either works |
| Rewards derived from actor model (e.g., V(prefix)) | Legacy RewardManager ✅ |
Key consideration: If your reward computation requires the current policy (e.g., estimating V(prefix) for process rewards), the RewardLoopManager creates synchronization issues—the reward model copy can diverge from the actor during training. In this case, the legacy RewardManager is more appropriate.
Optimization: Pre-compute During Generation
Computing V(prefix) requires forward passes through the actor model. If done during reward computation, this blocks the training loop.
Solution: Pre-compute during generation, when the actor is already loaded.
# In generation phase:
# 1. Generate response
# 2. Segment into episodes
# 3. For each episode boundary:
# - Construct prefix + force_answer_prompt + ground_truth
# - Compute log_probs using actor (already loaded!)
# - Store V(prefix) in batch.meta_info["prefix_value_cache"]
# In training phase:
# 1. Retrieve pre-computed V(prefix) from cache
# 2. Compute marginal utilities (cheap math)
# 3. Compute process rewards (cheap math)
This gave us a 60-70% speedup compared to computing everything during training.
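The cache handoff between the two phases can be sketched like this. The `prefix_value_cache` key matches the pseudocode above; the helper functions themselves are a simplified illustration, not VeRL API.

```python
def cache_prefix_values(meta_info, sample_id, prefix_values):
    """Generation phase: stash V(prefix) per sample while the actor
    is already loaded. `meta_info` is a plain dict here."""
    meta_info.setdefault("prefix_value_cache", {})[sample_id] = prefix_values

def utilities_from_cache(meta_info, sample_id):
    """Training phase: retrieve cached values and compute marginal
    utilities -- no actor forward pass needed at this point."""
    V = meta_info["prefix_value_cache"][sample_id]
    return [b - a for a, b in zip(V, V[1:])]
```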
Advantage Estimation
Even with correct process rewards computed, most advantage estimators will destroy the fine-grained structure.
The Data Flow
Here’s how rewards flow through VeRL’s training pipeline:
flowchart TD
A[RewardManager returns reward_tensor]
B[compute_reward in trainer/ppo/reward.py]
C[Store as token_level_scores]
D[apply_kl_penalty optional]
E[Store as token_level_rewards]
F[compute_advantage in trainer/ppo/core_algos.py]
G[GAE: Uses per-token rewards]
H[GRPO: Sums to scalar]
I[Token-level advantages]
J[Scalar advantages]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
F --> H
G --> I
H --> J
classDef good fill:#d4edda,stroke:#28a745,stroke-width:2px
classDef bad fill:#f8d7da,stroke:#dc3545,stroke-width:2px
class G good
class H bad
The subtle issue: Most advantage estimators immediately collapse token-level rewards to scalars:
# Standard GRPO in VeRL (core_algos.py:301)
scores = token_level_rewards.sum(dim=-1) # [batch, seq_len] → [batch]
advantages = (scores - group_mean) / group_std # scalar per sequence
This destroys the fine-grained credit assignment you carefully designed! Your process rewards are summed into a single number before computing advantages.
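To see the collapse concretely, here is a toy example with plain tensors (not VeRL code): two responses with very different per-token credit patterns become indistinguishable once summed.

```python
import torch

# Two responses with different step-level credit patterns
token_rewards = torch.tensor([
    [0.5, -0.3, 0.8],   # early insight, small detour, strong finish
    [0.0,  0.0, 1.0],   # all credit arrives at the final token
])

# Standard GRPO collapses each row to one scalar before normalization
scores = token_rewards.sum(dim=-1)  # both rows collapse to the same score
```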
The critical question: Does your advantage estimator use token_level_rewards[:, t] at each timestep, or does it sum first?
Token-Level Preserving Estimators
You need an advantage estimator that operates on token_level_rewards[:, t] at each timestep. For example, the GAE estimator function preserves this structure:
lastgaelam = 0.0
for t in reversed(range(seq_len)):
    nextvalues = values[:, t + 1] if t + 1 < seq_len else 0.0
    delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
    lastgaelam = delta + gamma * lam * lastgaelam
    advantages[:, t] = lastgaelam
If you want to keep using GRPO, you will need to implement your own estimator:
# Collect all valid-token rewards from the group's responses
group_rewards = torch.cat([rewards[i][mask[i]] for i in group_indices])
# Normalize every token by group-level statistics
mean_R = group_rewards.mean()
std_R = group_rewards.std().clamp_min(1e-6)
rewards_normalized = (rewards - mean_R) / std_R * mask
# Advantage at token t = sum of normalized rewards from t onward (reward-to-go)
advantages = rewards_normalized.flip(-1).cumsum(-1).flip(-1)
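Packaged as a self-contained function, a token-preserving estimator along these lines might look as follows. The `[group_size, seq_len]` layout and the function name are assumptions for this sketch, not VeRL's interface.

```python
import torch

def token_level_grpo_advantages(rewards, mask, eps=1e-6):
    """Token-preserving GRPO sketch: normalize every valid token's
    reward by group statistics, then take the reward-to-go (suffix
    sum) as the advantage at each token.

    rewards, mask: [group_size, seq_len]; mask is 1 for valid tokens.
    """
    valid = rewards[mask.bool()]
    mean_R, std_R = valid.mean(), valid.std().clamp_min(eps)
    normalized = (rewards - mean_R) / std_R * mask
    # Suffix sum: advantage at t aggregates normalized rewards from t onward
    return normalized.flip(-1).cumsum(-1).flip(-1)
```

The flip–cumsum–flip trick is just a vectorized suffix sum, so each token's advantage reflects everything the policy earned from that point forward.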
Normalization Pitfall: Mixed Reward Scales
When combining outcome rewards with process rewards, normalize them separately.
A subtle bug: normalizing outcome rewards (scale: 0-1) together with process rewards (scale: ~±0.03) makes the process reward signal negligible.
Fix: Identify and normalize each reward type separately:
# Outcome rewards at last token (like standard GRPO)
outcome_mask = ... # last valid token of each response
id2outcome[group_idx].append(token_level_rewards[i][outcome_mask[i]])
# Process rewards at episode boundaries (separate normalization)
process_mask = nonzero_mask & ~outcome_mask
id2process[group_idx].append(token_level_rewards[i][process_mask[i]])
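A small numerical demonstration of why joint normalization buries the process signal (the reward values are made up, chosen only to match the scales discussed above):

```python
import statistics

def normalize(xs):
    """Standardize by mean and population std (guard against zero std)."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / (sd or 1.0) for x in xs]

outcome = [1.0, 0.0, 1.0, 0.0]           # outcome rewards, scale 0-1
process = [0.03, -0.02, 0.01, -0.01]     # process rewards, scale ~±0.03

joint = normalize(outcome + process)      # one pool: dominated by outcome scale
separate = normalize(outcome) + normalize(process)

def spread(xs):
    return max(xs) - min(xs)
```

After joint normalization the process rewards are nearly flat (their differences are tiny relative to the pooled std), while separate normalization restores their relative ordering at a usable scale.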
Summary
In this post, we discussed possible ways of obtaining training-free process rewards (ones that do not rely on an external reward model) and walked through the implementation details in VeRL.
A few takeaways:
Log-probability approximation is efficient: Estimating V(prefix) via $\log P(\text{answer} \mid \text{prefix})$ requires one forward pass, vs K×N rollouts for MC estimation
Marginal utility captures step-level progress: $U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})$ measures how much each episode helps or hurts
Choose the right reward manager architecture: Legacy `RewardManager` works better when rewards depend on the current policy; `RewardLoopManager` (now default) is designed for external reward models
Pre-compute during generation: Move V(prefix) computation out of the training critical path for a 60-70% speedup
Use token-level advantage estimators: Standard GRPO collapses to scalars—use GAE or token-preserving GRPO
Normalize reward types separately: Mixed-scale rewards (outcome + process) need separate normalization to preserve signal strength
References
[1] Lightman et al. (2023). Let’s Verify Step by Step. The Twelfth International Conference on Learning Representations. https://arxiv.org/pdf/2305.20050
[2] Kazemnejad et al. (2024). VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment. https://arxiv.org/abs/2410.01679
[3] VeRL: Volcano Engine Reinforcement Learning for LLMs. https://github.com/volcengine/verl
[4] Setlur et al. (2024). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. https://arxiv.org/abs/2410.08146
[5] Qu et al. (2025). Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning. https://arxiv.org/abs/2503.07572