
Effectiveness Evals Reference

Top-level schema for effectiveness.yaml files.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | No | - | Default evaluator model for all evals in this file. |
| judge | string | No | - | Default judge model (format: “provider/model”). |
| timeout | number | No | 120 | Default timeout in seconds. |
| matrix | Matrix | No | - | Default matrix applied to all evals. |
| run-mode | "all" \| "variants-only" \| "current-only" | No | "all" | Default run mode for evals: “all”, “variants-only”, or “current-only”. |
| variants | Variant[] | No | - | Variant definitions available to evals. |
| evals | EffectivenessEval[] | Yes | - | List of effectiveness evals to run. |
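
Fields of the file-level matrix:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| evaluators | MatrixEntry[] | No | - | Evaluator models to run the agent with. |
| judges | MatrixEntry[] | No | - | Judge models to score the output. |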
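
As an illustration of the top-level schema, a minimal file might look like the sketch below. The eval name, prompt, and criterion text are hypothetical, and the value format for `model` is an assumption, since only `judge` is documented as “provider/model”.

```yaml
# effectiveness.yaml (illustrative sketch; all values are hypothetical)
model: claude-sonnet-4-6        # assumed identifier format for the evaluator model
judge: openai/gpt-4o            # "provider/model", per the judge field description
timeout: 120                    # seconds; same as the documented default
run-mode: all                   # one of: all, variants-only, current-only
evals:                          # the only required top-level field
  - name: adds-unit-tests
    prompt: Add unit tests for the date parsing helper.
    criteria:
      - name: tests-added
        description: The agent added at least one new test file.
        pass_threshold: 0.8
```

Only `evals` is required at the top level; every other field here could be omitted.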

Schema for individual entries in an effectiveness file’s evals array.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | - | Unique name for this eval. |
| prompt | string | Yes | - | The prompt to send to the agent in the sandbox. |
| enabled | boolean | No | true | Whether this eval is active. |
| timeout | number | No | - | Timeout in seconds. Overrides file-level default. |
| fixtures | string[] | No | - | Fixture names to run against. Default: all fixtures. |
| criteria | Criterion[] | Yes | - | Criteria the judge evaluates. All must pass. |
| variants | "all" \| string[] \| Variant[] | No | "all" | Variants to run. |
| run-mode | "all" \| "variants-only" \| "current-only" | No | - | Controls which runs to perform: “all” runs current + variants, “variants-only” skips current, “current-only” skips variants. |
| matrix | Matrix | No | - | Override the matrix for this eval. |
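
Fields of the per-eval matrix:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| evaluators | MatrixEntry[] | No | - | Evaluator models to run the agent with. |
| judges | MatrixEntry[] | No | - | Judge models to score the output. |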
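
A single eval entry that overrides file-level defaults could be sketched as follows. The eval name, prompt, fixture name, and criterion are hypothetical placeholders, not values taken from a real project.

```yaml
evals:
  - name: refactor-keeps-behavior     # hypothetical eval name
    prompt: Refactor the config loader without changing its public API.
    enabled: true                     # documented default; false deactivates the eval
    timeout: 300                      # overrides the file-level timeout (seconds)
    fixtures:                         # hypothetical fixture name; omit to use all fixtures
      - small-repo
    run-mode: current-only            # skips variants for this eval
    matrix:                           # overrides the file-level matrix
      evaluators:
        - provider: anthropic
          model: claude-sonnet-4-6
    criteria:
      - name: api-unchanged
        description: Public function signatures are unchanged.
        pass_threshold: 1
```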

Schema for judging criteria in effectiveness evals.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | - | Name of the criterion to evaluate. |
| description | string | Yes | - | What the judge should evaluate for this criterion. |
| pass_threshold | number | Yes | - | Minimum score (0-1) for this criterion to pass. |
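
A criteria list with different thresholds might look like this sketch; the criterion names and descriptions are hypothetical. Because all criteria must pass, the eval in this example would fail if either score fell below its own minimum.

```yaml
criteria:
  - name: compiles                    # hypothetical criterion
    description: The project still builds after the agent's changes.
    pass_threshold: 1                 # the judge's score must reach the maximum
  - name: follows-style-guide         # hypothetical criterion
    description: New code matches the repository's existing conventions.
    pass_threshold: 0.7               # lower minimum for a more subjective check
```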

Configuration for running evals across multiple models.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| evaluators | MatrixEntry[] | No | - | Evaluator models to run the agent with. |
| judges | MatrixEntry[] | No | - | Judge models to score the output. |

A provider/model pair used in matrix configuration.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| provider | "copilot" \| "openai" \| "anthropic" \| "vercel" | Yes | - | The model provider to use. |
| model | string | Yes | - | The model identifier (e.g. “claude-sonnet-4-6”, “gpt-4o”). |
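
Putting the two schemas together, a matrix built from provider/model entries could be written as in the sketch below. The model identifiers are the examples given in the table; how evaluators are paired with judges at run time is not stated in this reference, so the sketch makes no assumption about it.

```yaml
matrix:
  evaluators:                   # models that run the agent
    - provider: anthropic
      model: claude-sonnet-4-6
    - provider: openai
      model: gpt-4o
  judges:                       # models that score the output
    - provider: openai
      model: gpt-4o
```

The same block can appear at the top level of the file as a default or inside an individual eval as an override.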
v0.6.1