A metric takes the full list of EvalRow objects for a system and returns a dict[str, float] of summary statistics. Metrics run after all examples have been processed.

The Metric protocol

from typing import Protocol, runtime_checkable
from context_bench.results import EvalRow

@runtime_checkable
class Metric(Protocol):
    """Aggregates per-example scores into summary stats."""

    @property
    def name(self) -> str: ...

    def compute(self, rows: list[EvalRow]) -> dict[str, float]: ...
rows is a flat list of EvalRow dataclass instances, one per (system, example) pair. Access scores with row.scores["f1"], token counts with row.input_tokens / row.output_tokens, and timing with row.latency.

The score_field parameter

Several metrics accept a score_field parameter that selects which evaluator output to aggregate. For example, MeanScore(score_field="f1") averages the f1 field from AnswerQuality, while MeanScore(score_field="mc_accuracy") averages the mc_accuracy field from MultipleChoiceAccuracy. Common score fields:
Field           Source evaluator
f1              AnswerQuality
exact_match     AnswerQuality
recall          AnswerQuality
contains        AnswerQuality
rouge_l_f1      SummarizationQuality
mc_accuracy     MultipleChoiceAccuracy
pass_at_1       CodeExecution
math_equiv      MathEquivalence
nli_accuracy    NLILabelMatch
ifeval_strict   IFEvalChecker
judge_score     LLMJudge

Built-in metrics

MeanScore

Average of score_field across all examples.
from context_bench.metrics import MeanScore

MeanScore(score_field="f1")
# Returns: {"mean_score": 0.742}

PassRate

Fraction of examples where score_field exceeds threshold.
from context_bench.metrics import PassRate

PassRate(score_field="f1", threshold=0.7)
# Returns: {"pass_rate": 0.61}
The default threshold is 0.7. Override with --threshold on the CLI.
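The aggregation behind PassRate is a simple counting rule. A minimal sketch of the same logic in plain Python, using dicts as stand-ins for EvalRow scores (this is illustrative, not the library's implementation):

```python
# Stand-ins for row.scores on three EvalRows; values are made up.
rows = [{"f1": 0.9}, {"f1": 0.5}, {"f1": 0.8}]
threshold = 0.7

# Count rows whose score exceeds the threshold, then normalize.
passing = sum(1 for scores in rows if scores.get("f1", 0.0) > threshold)
pass_rate = passing / len(rows) if rows else 0.0
print(pass_rate)  # 2 of 3 rows exceed 0.7
```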

CompressionRatio

Measures how much the system reduces token count: 1 - (output_tokens / input_tokens). Positive values mean compression; negative values mean expansion.
from context_bench.metrics import CompressionRatio

CompressionRatio()
# Returns: {
#   "compression_ratio": 0.42,
#   "mean_input_tokens": 3200.0,
#   "mean_output_tokens": 1856.0,
# }
Token counting uses tiktoken (cl100k_base) by default. Swap the tokenizer via context_bench.utils.tokens.
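The formula works on the mean token counts across all rows. A sketch with made-up token counts chosen to reproduce the 0.42 example above (illustrative, not the library's implementation):

```python
# Hypothetical per-example token counts; means are 3200 and 1856.
input_tokens = [3000, 3400]
output_tokens = [1800, 1912]

mean_in = sum(input_tokens) / len(input_tokens)
mean_out = sum(output_tokens) / len(output_tokens)

# Positive ratio => the system shrank the context; negative => it grew it.
ratio = 1 - (mean_out / mean_in)
print(round(ratio, 2))  # 0.42
```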

CostOfPass

Output tokens spent per successful completion: total_output_tokens / num_passing_examples. Lower is better. Based on the methodology from arXiv:2504.13359.
from context_bench.metrics import CostOfPass

CostOfPass(score_field="f1", threshold=0.7)
# Returns: {"cost_of_pass": 2447.0}
Use this alongside PassRate to understand the efficiency trade-off: between two systems with similar pass rates, the one with the lower cost of pass achieves its successes with fewer output tokens.
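The formula divides total output tokens by the number of passing examples. A sketch with made-up rows (illustrative, not the library's implementation):

```python
# Hypothetical rows: score plus output token count per example.
rows = [
    {"f1": 0.9, "output_tokens": 1200},
    {"f1": 0.4, "output_tokens": 1500},
    {"f1": 0.8, "output_tokens": 900},
]
threshold = 0.7

total_out = sum(r["output_tokens"] for r in rows)          # 3600
num_pass = sum(1 for r in rows if r["f1"] > threshold)     # 2
cost_of_pass = total_out / num_pass if num_pass else float("inf")
print(cost_of_pass)  # 1800.0 output tokens per passing example
```

Note that all output tokens count toward the numerator, including those spent on examples that failed.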

Latency

Per-example wall-clock timing. Reports four percentiles.
from context_bench.metrics import Latency

Latency()
# Returns: {
#   "latency_mean": 1.23,
#   "latency_median": 1.10,
#   "latency_p95": 2.81,
#   "latency_p99": 3.44,
# }
Latency is measured from the start of System.process() to the return of the result.
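The four statistics can be sketched with the standard library; this uses a simple nearest-rank percentile, which may differ slightly from how the library interpolates (illustrative, not the library's implementation):

```python
import statistics

# Hypothetical per-example latencies in seconds (row.latency values).
latencies = [0.8, 1.0, 1.1, 1.2, 1.5, 2.9]

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the value covering at least p% of the sample."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

summary = {
    "latency_mean": statistics.mean(latencies),
    "latency_median": statistics.median(latencies),
    "latency_p95": percentile(latencies, 95),
    "latency_p99": percentile(latencies, 99),
}
print(summary)
```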

PerDatasetBreakdown

Mean score sliced by dataset tag. Automatically enabled when you load more than one dataset.
from context_bench.metrics import PerDatasetBreakdown

PerDatasetBreakdown(score_field="f1")
# Returns: {
#   "dataset:hotpotqa": 0.71,
#   "dataset:gsm8k": 0.88,
# }
Each EvalRow carries a dataset field set from example["dataset"]. Rows without a dataset tag appear under "dataset:unknown".
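The slicing is a group-by on the dataset tag followed by a per-group mean. A sketch with made-up rows, including one without a tag (illustrative, not the library's implementation):

```python
from collections import defaultdict

# Hypothetical rows: dataset tag plus f1 score per example.
rows = [
    {"dataset": "hotpotqa", "f1": 0.70},
    {"dataset": "hotpotqa", "f1": 0.72},
    {"dataset": "gsm8k", "f1": 0.88},
    {"dataset": None, "f1": 0.50},  # untagged -> grouped under "unknown"
]

groups: dict[str, list[float]] = defaultdict(list)
for r in rows:
    groups[r["dataset"] or "unknown"].append(r["f1"])

breakdown = {f"dataset:{tag}": sum(v) / len(v) for tag, v in groups.items()}
print(breakdown)
```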

ParetoRank

Ranks systems on the quality-vs-cost Pareto frontier. Automatically enabled for multi-system CLI runs.
from context_bench.metrics.token_stats import ParetoRank

# Per-system compute() returns a placeholder 0.0.
# Use rank_systems() for the actual ranking after evaluate():
ranks = ParetoRank.rank_systems(
    result.summary,
    quality_field="mean_score",
    cost_field="cost_of_pass",
)
# {'kompact': 1, 'baseline': 2}
A system is on the Pareto frontier (rank 1) if no other system beats it on both quality and token cost simultaneously. The constructor accepts quality_field (default "score") and cost_field (default "cost_of_pass").
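The dominance check behind the ranking can be sketched as follows. This simplified version only distinguishes frontier systems (rank 1) from dominated ones (rank 2); the library's ParetoRank may assign deeper ranks. The system names and numbers are made up:

```python
# Hypothetical summary stats: higher mean_score is better, lower cost is better.
systems = {
    "kompact": {"mean_score": 0.74, "cost_of_pass": 1800.0},
    "baseline": {"mean_score": 0.72, "cost_of_pass": 2447.0},
}

def dominated(a: dict, b: dict) -> bool:
    """True if b is at least as good as a on both axes and strictly better on one."""
    return (
        b["mean_score"] >= a["mean_score"]
        and b["cost_of_pass"] <= a["cost_of_pass"]
        and (b["mean_score"] > a["mean_score"] or b["cost_of_pass"] < a["cost_of_pass"])
    )

ranks = {}
for name, stats in systems.items():
    others = [s for n, s in systems.items() if n != name]
    ranks[name] = 1 if not any(dominated(stats, o) for o in others) else 2
print(ranks)  # {'kompact': 1, 'baseline': 2}
```

Here "kompact" scores higher and costs fewer tokens, so it dominates "baseline" and sits alone on the frontier.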

Using metrics

Pass metrics as a list to evaluate():
from context_bench import evaluate
from context_bench.metrics import MeanScore, PassRate, CompressionRatio, Latency

result = evaluate(
    systems=[my_system],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[
        MeanScore(score_field="f1"),
        PassRate(score_field="f1", threshold=0.7),
        CompressionRatio(),
        Latency(),
    ],
)
print(result.summary)
# {"my-system": {"mean_score": 0.74, "pass_rate": 0.61, "compression_ratio": 0.42, ...}}

Implementing a custom metric

Any class with name and compute() satisfies the protocol:
class MaxScore:
    name = "max-score"

    def __init__(self, score_field: str = "f1"):
        self.score_field = score_field

    def compute(self, rows):
        scores = [r.scores.get(self.score_field, 0.0) for r in rows]
        return {"max_score": max(scores) if scores else 0.0}
Pass it alongside built-in metrics:
result = evaluate(
    ...,
    metrics=[MeanScore(score_field="f1"), MaxScore(score_field="f1")],
)
Metric is a typing.Protocol. You do not need to import or subclass anything from context-bench to define a custom metric.
