What’s in Pass@K?
TL;DR
- Computing pass@k requires sampling N >= k responses and applying an unbiased combinatorial estimator — not simply sampling k times and counting.
- Pass@k vs. pass@1: pass@k is a nonlinear function of the pass rate p, which saturates on easy problems. Ranking models by pass@k effectively upweights hard problems.
- Evaluation vs. checkpoint selection: For evaluation, use large N and moderate k for stability. For checkpoint selection, you only need rankings — estimate $\hat{p} = c/N$ from moderate N and extrapolate to large k via the Bernoulli formula.
- Efficient estimation for large k: Use dynamic N (sample more on hard problems) and fit a Beta distribution to the per-problem pass rates for stable extrapolation from limited samples.
- Pass@k as RL reward: It works, but is compute-inefficient — you spend equal compute on all problems then downweight easy ones. Better to upsample hard questions or allocate larger group sizes to them directly.
Computing Pass@k
Pass@k measures the probability that at least one of k sampled responses is correct. The standard way to estimate it [1] is:
- Sample N responses from the model (e.g. at temperature 0.6).
- Count the number of correct responses c.
- Compute pass@k using the unbiased estimator:
\[\text{pass@k} = 1 - \frac{\binom{N-c}{k}}{\binom{N}{k}}\]
This estimator works for any N >= k. It is equivalent to: out of all ways to choose k responses from N total, what fraction includes at least one correct response?
Common mistake: Sample only k responses and report c/k as pass@k. This gives a biased, high-variance estimate. The unbiased estimator requires N >= k, and using larger N reduces variance, giving a more stable estimate of pass@k.
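A minimal sketch of the unbiased estimator, following the numerically stable form from Chen et al. [1] (the function name and the example counts are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct (requires n >= k).

    Computes 1 - C(n - c, k) / C(n, k) as a running product so the
    binomial coefficients never overflow.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct response
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=32, c=5, k=10))  # e.g. 32 samples, 5 correct -> pass@10 ≈ 0.87
```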
Pass@1 vs. Pass@k
At first glance, the two metrics seem to carry the same information. For a single problem, both pass@1 and pass@k are monotonically increasing in c (the number of correct responses out of N). And for reasonably large N, both give stable estimates of the model’s performance under random sampling.
The key difference is in how they aggregate across problems. Let p be the per-problem pass rate (i.e. the probability of a single sample being correct). Then:
- Pass@1 is linear in p: $\text{pass@1} = p$
- Pass@k is nonlinear in p: $\text{pass@k} = 1 - (1-p)^k$
As k grows, problems with high p see their pass@k saturate toward 1. A problem with p = 0.8 and a problem with p = 0.95 both have pass@k ≈ 1 for large k — the difference between them is effectively erased.
Meanwhile, problems with low p remain far from saturation. A problem with p = 0.05 has pass@10 ≈ 0.40, while p = 0.15 gives pass@10 ≈ 0.80. The gap is amplified.
Figure: Pass@1 (blue) and pass@10 (orange) for 10 problems sorted by difficulty. On easy problems (left), pass@10 saturates near 1 while pass@1 still varies. On hard problems (right), the gap between the two metrics widens. The dashed lines show the benchmark-level averages — pass@10 is dominated by hard problems where the model still has room to improve.
The consequence: ranking models by pass@k effectively upweights hard problems. Improvements on easy problems (where both models already have high p) barely move the pass@k number, while improvements on hard problems (where p is low) show up clearly. This makes pass@k a useful complement to pass@1 when you care about a model’s ability to solve difficult tasks given multiple attempts.
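A toy illustration of this reweighting, with made-up per-problem pass rates for two hypothetical models (model A is slightly better on easy problems, model B on hard ones):

```python
import numpy as np

p_a = np.array([0.95, 0.90, 0.85, 0.05, 0.02])  # hypothetical pass rates, model A
p_b = np.array([0.80, 0.75, 0.70, 0.15, 0.10])  # hypothetical pass rates, model B

for k in (1, 10):
    print(f"k={k}: A={np.mean(1 - (1 - p_a) ** k):.3f}, B={np.mean(1 - (1 - p_b) ** k):.3f}")

# k=1 ranks A above B (the easy problems dominate the average);
# k=10 ranks B above A (the easy problems saturate, so the hard ones decide).
```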
Pass@k for Model Evaluation vs. Checkpoint Selection
The choice of N and k depends on what you’re using pass@k for. Two common use cases have quite different requirements.
Model evaluation. The goal is to report a stable number that reflects how the model performs in practice. Users typically sample once (or a handful of times), so what matters is pass@1 or pass@k for small k. The main concern is stability: you want large N so that the estimate of the pass rate p is precise, but k itself can stay moderate. Asymptotic correctness — what the model could do given many attempts — is less important, because it doesn’t match how the model is actually used.
A common misconception: Some papers justify using pass@k (with large k) as an evaluation metric because it “reflects generation diversity.” This conflates two things. Generation diversity is a property of the model’s output distribution, which is captured by the pass rate p estimated from large N. Large k doesn’t help you measure diversity — it just applies a nonlinear transform that compresses differences at the top of the distribution. To evaluate distributional properties, you need large N, not large k.
Checkpoint selection. The goal is different: you want to pick the best pretrain checkpoint for SFT, or the best SFT checkpoint for RL. Here you care about the potential of a checkpoint — not its single-shot performance, but whether it can solve the problem at all. This calls for large k, because pass@k with large k measures “does the model have this capability somewhere in its distribution?”
The cost structure also differs. For checkpoint selection, you only need the ranking between checkpoints, not the absolute score. This means we can go further than just using moderate N — we can drop the unbiased combinatorial estimator entirely and use the Bernoulli formula directly:
\[\text{pass@k} = 1 - (1 - \hat{p})^k, \quad \hat{p} = c / N\]
This sidesteps the N >= k requirement altogether. We estimate the pass rate $\hat{p}$ from N samples, and then extrapolate to any k we want. The estimate of $\hat{p}$ from moderate N (e.g. N = 32) is noisy, so the absolute pass@k values won’t be precise — but that’s fine, because we only need the ranking between checkpoints.
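As a concrete sketch (the checkpoint names and per-problem correct counts below are hypothetical), ranking checkpoints by extrapolated pass@k takes only a few lines:

```python
import numpy as np

def extrapolated_pass_at_k(correct: np.ndarray, n: int, k: int) -> float:
    """Benchmark-level pass@k via the Bernoulli formula, 1 - (1 - p_hat)^k."""
    p_hat = correct / n
    return float(np.mean(1.0 - (1.0 - p_hat) ** k))

# Hypothetical per-problem correct counts out of N = 32 for two checkpoints.
n = 32
checkpoints = {
    "ckpt_a": np.array([30, 25, 4, 1, 0]),
    "ckpt_b": np.array([28, 20, 6, 2, 1]),
}
ranking = sorted(checkpoints, key=lambda name: -extrapolated_pass_at_k(checkpoints[name], n, 256))
print(ranking)  # checkpoints ordered best-first by extrapolated pass@256
```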
| | Model Evaluation | Checkpoint Selection |
|---|---|---|
| What matters | Stability of the score | Ranking between checkpoints |
| k | Small to moderate | Large |
| N | Large (for precise p) | Moderate (e.g. 32) is often sufficient |
| Why | Reflects real usage (few samples) | Measures potential / capability |
Efficient Estimation of Pass@k for Large k
When k is large, estimating pass@k per problem is wasteful if done uniformly. Two ideas can help.
Dynamic N per problem. Not all problems need the same sampling budget. For easy problems, a small N already gives a confident estimate of $\hat{p}$. For hard problems — where $\hat{p}$ is close to 0 — the estimate is dominated by whether you observe any correct response at all. A practical strategy is to keep expanding N for hard problems until you observe 1–2 correct generations, then stop [6]. This concentrates compute where it matters most: on the hard tail of the difficulty distribution, which is exactly the region that governs pass@k for large k.
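A minimal sketch of this stopping rule, assuming a hypothetical `sample_and_check(problem)` callable that draws one response and returns whether it is correct:

```python
def estimate_pass_rate(problem, sample_and_check, min_correct=2, max_samples=1024):
    """Keep sampling until we observe `min_correct` successes or exhaust the budget."""
    n, c = 0, 0
    while c < min_correct and n < max_samples:
        c += int(sample_and_check(problem))  # hypothetical: one rollout, True if correct
        n += 1
    return c, n  # downstream: p_hat = c / n; most of the budget lands on hard problems
```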
Fitting a difficulty distribution. Rather than estimating pass@k per problem independently, we can model the distribution of pass rates across problems. Kazdan et al. [5] propose fitting a Beta distribution to the problem-level pass rates, then computing the expected pass@k under this distribution analytically. The Beta-Binomial model lets you estimate pass@k scaling from limited samples — you fit the Beta parameters $(\alpha, \beta)$ to the observed (successes, trials) counts across problems, and extrapolate to large k without ever sampling that many times. Combined with dynamic sampling (allocating more budget to hard problems), this gives reliable pass@k estimates at a fraction of the uniform-sampling cost.
Figure: Left: Beta distributions fitted to true pass rates (blue) and noisy N=16 estimates (orange dashed). Right: pass@k scaling estimated four ways — ground truth (black), Beta fit from true p (blue dashed), Beta fit from N=16 (orange dashed), and per-problem Bernoulli from N=16 (grey dotted).
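A rough sketch of the Beta-Binomial approach, with made-up counts; the MLE fit via scipy below is one reasonable choice, not necessarily the exact procedure of [5]:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

def neg_log_likelihood(log_params, correct, trials):
    """Beta-Binomial negative log-likelihood over per-problem (correct, trials) counts."""
    alpha, beta = np.exp(log_params)  # optimize in log space to keep alpha, beta > 0
    log_binom = gammaln(trials + 1) - gammaln(correct + 1) - gammaln(trials - correct + 1)
    ll = log_binom + betaln(correct + alpha, trials - correct + beta) - betaln(alpha, beta)
    return -np.sum(ll)

def fit_beta(correct, trials):
    res = minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]), args=(correct, trials))
    return np.exp(res.x)

def expected_pass_at_k(alpha, beta, k):
    """E[1 - (1 - p)^k] for p ~ Beta(alpha, beta), in closed form via Beta functions."""
    return 1.0 - np.exp(betaln(alpha, beta + k) - betaln(alpha, beta))

# Made-up correct counts out of N = 16 samples for a handful of problems.
correct = np.array([15, 12, 9, 3, 1, 0, 0])
trials = np.full_like(correct, 16)
alpha, beta = fit_beta(correct, trials)
print(expected_pass_at_k(alpha, beta, k=256))  # extrapolated benchmark-level pass@256
```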
Analysis of Using Pass@k as an RL Reward
Even if the end goal is pass@1 performance, there’s a reason to care about hard problems during RL training. Training on easy (nearly solved) problems sharpens an already confident distribution further, reducing the model’s entropy. This impairs the model’s ability to explore new solutions and hurts its performance on hard problems. Downweighting easy examples helps preserve exploration capacity, especially in the early stages of RL.
Given that pass@k upweights hard problems, a natural idea is to use pass@k as an RL reward. The mechanism [2] works as follows: divide the N rollouts for a problem into groups of k, and assign each group a reward equal to the maximum reward within that group (i.e. 1 if any response in the group is correct, 0 otherwise). All responses in the same group receive the same reward. This is effectively a Monte Carlo estimate of pass@k — and it inherits the same nonlinearity that upweights hard problems.
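A minimal sketch of the group-max reward computation (the simple mean baseline below is illustrative, not the exact advantage estimator of [2]):

```python
import numpy as np

def pass_at_k_rewards(correct: np.ndarray, k: int) -> np.ndarray:
    """Split N rollouts into groups of k; every rollout gets its group's max reward."""
    n = len(correct)
    assert n % k == 0, "N should be a multiple of the group size k"
    group_reward = correct.reshape(n // k, k).max(axis=1)  # 1 if any rollout in the group is correct
    return np.repeat(group_reward, k).astype(float)

correct = np.array([0, 1, 0, 0, 0, 0, 0, 0])     # 8 rollouts, group size k = 4
rewards = pass_at_k_rewards(correct, k=4)        # [1, 1, 1, 1, 0, 0, 0, 0]
advantages = rewards - rewards.mean()            # baseline-normalized learning signal
print(rewards, advantages)
```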
This can help in the early stages of training by encouraging exploration on hard problems. But there are two issues worth noting.
It’s an inefficient way to encourage exploration. You still spend roughly the same compute on every problem — generating the same number of rollouts regardless of difficulty. The pass@k reward then downweights easy problems after the fact: on easy problems, most groups contain at least one correct response, so the reward is 1 for nearly all groups and the advantages after baseline normalization are small. You’ve paid for those rollouts but get little learning signal from them. A more direct approach is to allocate resources differently upfront:
- Upsample hard questions. If you know which problems are hard (from pass rate estimates), sample them more frequently in training batches.
- Allocate more compute to hard problems. For example, use a larger group size for hard questions — generating more rollouts per problem gives the model more chances to find a correct solution and produces a richer advantage signal.
Both achieve the same goal of focusing learning on hard problems, but by directing compute where it matters rather than spending it uniformly and discounting it later. For recent work along these lines, see Knapsack RL [3] and AR3PO [4].
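As a toy sketch of the second option (a heuristic of my own, not the specific allocation schemes of [3] or [4]), group sizes can be scaled with estimated difficulty:

```python
import numpy as np

def group_sizes(p_hat: np.ndarray, total_rollouts: int, min_size: int = 4) -> np.ndarray:
    """Give harder problems (lower estimated pass rate) a larger share of the rollout budget."""
    weights = 1.0 - p_hat + 1e-3                 # harder problems get more weight
    raw = weights / weights.sum() * total_rollouts
    return np.maximum(min_size, np.round(raw)).astype(int)

p_hat = np.array([0.9, 0.5, 0.1, 0.02])          # estimated per-problem pass rates
print(group_sizes(p_hat, total_rollouts=64))     # the hardest problems get the largest groups
```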
You eventually need to optimize pass@1. The end goal is single-shot performance — the model should reliably produce correct answers on the first try. Optimizing pass@k encourages the model to place some probability mass on correct solutions, but it doesn’t pressure the model to make the correct solution the most likely one. At some point, training must shift back to a pass@1-aligned objective, or the model may plateau with a spread-out distribution that solves problems occasionally but not consistently.
References
[1] Chen, Mark, et al. “Evaluating Large Language Models Trained on Code.” arXiv preprint arXiv:2107.03374 (2021).
[2] Chen, Zhipeng, et al. “Pass@k training for adaptively balancing exploration and exploitation of large reasoning models.” arXiv preprint arXiv:2508.10751 (2025).
[3] Li, Ziniu, et al. “Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation.” arXiv preprint arXiv:2509.25849 (2025).
[4] Zhang, Yuheng, et al. “Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse.” arXiv preprint arXiv:2509.25808 (2025).
[5] Kazdan, Joshua, et al. “Efficient Prediction of Pass@k Scaling in Large Language Models.” arXiv preprint arXiv:2510.05197 (2025).
[6] Hu, Shengding, et al. “Predicting Emergent Abilities with Infinite Resolution Evaluation.” arXiv preprint arXiv:2310.03262 (2023).