Writing Evals

Evals are YAML files that live in a skill’s evals/ directory. Each file defines one or more selection eval scenarios that test whether an agent selects the correct skill given a prompt.

skills/
  sql-queries/
    SKILL.md
    evals/
      selection.yaml      # selection eval definitions
      reports/            # generated by `dojo run`
        <run-id>/
          report.json
          logs.json

A selection eval file has file-level defaults and an array of individual evals:

# File-level defaults (apply to all evals unless overridden)
model: gpt-4o     # optional -- evaluator model
timeout: 30       # seconds, default: 30
skills: all       # which skills to include: "all" or a list of names
run-mode: all     # "all" | "variants-only" | "current-only"
variants:         # optional -- variant definitions (see Testing Variants guide)
  - name: concise
    value: "Write SQL queries"

# Individual evals
evals:
  - name: my-eval
    prompt: "Write a query to find duplicate emails"
    assert:
      - sql-queries

Each entry in the evals array defines a single test scenario:

| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Unique name for the eval |
| prompt | string | required | The natural language prompt sent to the agent |
| assert | string[], "none", or "any" | the skill name | Expected outcome (see below) |
| model | string | file-level | Override the evaluator model for this eval |
| timeout | number | file-level | Timeout in seconds |
| enabled | boolean | true | Set to false to skip this eval |
| skills | "all" or string[] | file-level | Which skills to include in the selection pool |
| run-mode | string | file-level | "all", "variants-only", or "current-only" |
| variants | "all", string[], or inline variants | "all" | Which variants to run |
| decoys | array | none | Eval-level decoys (merged with variant decoys) |
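
A "file-level" default means an eval inherits the value unless it sets its own. The following is a minimal Python sketch of that resolution, assuming the parsed YAML arrives as plain dictionaries. The names `resolve_eval`, `INHERITED_KEYS`, and `BUILTIN_DEFAULTS` are illustrative and not part of Dojo.

```python
from typing import Any

# Keys marked "file-level" in the field table: used when the eval omits them.
INHERITED_KEYS = ("model", "timeout", "skills", "run-mode")

# Defaults stated in this guide: 30 s timeout, all skills, run-mode "all", enabled.
BUILTIN_DEFAULTS: dict[str, Any] = {
    "timeout": 30,
    "skills": "all",
    "run-mode": "all",
    "enabled": True,
}


def resolve_eval(file_level: dict[str, Any], entry: dict[str, Any]) -> dict[str, Any]:
    """Fill an eval entry with file-level values and built-in defaults."""
    resolved = dict(BUILTIN_DEFAULTS)
    for key in INHERITED_KEYS:
        if key in file_level:
            resolved[key] = file_level[key]
    resolved.update(entry)  # eval-level values take priority
    return resolved


file_level = {"model": "gpt-4o", "timeout": 30}
entry = {
    "name": "my-eval",
    "prompt": "Write a query to find duplicate emails",
    "timeout": 60,
}
resolved = resolve_eval(file_level, entry)
print(resolved["model"], resolved["timeout"], resolved["enabled"])  # gpt-4o 60 True
```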

The assert field controls what counts as a passing eval:

Assert that the agent selects one of the listed skills:

assert:
  - sql-queries

You can list multiple acceptable skills:

assert:
  - sql-queries
  - data-analysis
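
The pass rule can be sketched in Python. The list case follows the description above. The handling of "none" and "any" is an assumption inferred from their names and is marked as such in the comments. `assertion_passes` is a hypothetical helper, not a Dojo function.

```python
from typing import Optional, Union


def assertion_passes(expected: Union[list[str], str], selected: Optional[str]) -> bool:
    """Check a recorded selection against an assert value.

    `selected` is the skill the agent chose, or None if it chose nothing.
    """
    if isinstance(expected, list):
        # Documented behavior: pass when the agent picks one of the listed skills.
        return selected in expected
    if expected == "none":
        # ASSUMPTION: "none" passes only when no skill was selected.
        return selected is None
    if expected == "any":
        # ASSUMPTION: "any" passes when some skill was selected, whichever it is.
        return selected is not None
    raise ValueError(f"unsupported assert value: {expected!r}")


print(assertion_passes(["sql-queries", "data-analysis"], "data-analysis"))  # True
print(assertion_passes(["sql-queries"], None))  # False
```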

Dojo does not ask the agent “which skill would you pick?” — that would test prompt comprehension, not real behavior.

Instead, Dojo registers a load_skill tool with the evaluator model and sends your prompt along with the list of available skills (names and descriptions). The agent either:

  • Calls load_skill with a skill name — Dojo records which skill was selected
  • Responds directly without calling the tool — Dojo records that no skill was selected

This tests the agent’s actual decision-making behavior.
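
As a rough illustration of the recording step, the sketch below inspects a model response for a load_skill call. The response shape (a dict with a `tool_calls` list, each call carrying `name` and `arguments`) and the `skill` argument key are hypothetical. Dojo's actual event format may differ.

```python
from typing import Optional


def record_selection(response: dict) -> Optional[str]:
    """Return the skill named in the first load_skill call, or None if there is none."""
    for call in response.get("tool_calls", []):
        if call.get("name") == "load_skill":
            # Hypothetical argument key; the real tool schema is not shown here.
            return call.get("arguments", {}).get("skill")
    return None  # the agent answered directly, so no skill was selected


tool_response = {
    "tool_calls": [{"name": "load_skill", "arguments": {"skill": "sql-queries"}}]
}
direct_response = {"content": "SELECT email, COUNT(*) FROM users GROUP BY email;"}

print(record_selection(tool_response))    # sql-queries
print(record_selection(direct_response))  # None
```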

Decoys are fake skills injected into the selection pool. They test whether the agent can discriminate between real skills and plausible-sounding distractors.

evals:
  - name: sql-with-decoys
    prompt: "Optimize this slow GROUP BY query"
    assert:
      - sql-queries
    decoys:
      - name: query-optimizer
        value: "Automatically optimizes database query execution plans"
      - name: data-formatter
        value: "Formats data output into tables and charts"

Each decoy has:

| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Name shown to the agent |
| value | string | required | Description shown to the agent |
| enabled | boolean | true | Set to false to disable this decoy |

Decoys defined at the eval level are merged with any decoys defined on the variant. If both define a decoy with the same name, the variant’s decoy takes precedence.
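
The merge rule can be pictured as a dictionary keyed by decoy name, where variant entries are written last. This is a minimal sketch with a hypothetical `merge_decoys` helper. It illustrates the precedence rule only and is not Dojo's implementation.

```python
def merge_decoys(eval_decoys: list[dict], variant_decoys: list[dict]) -> list[dict]:
    """Merge two decoy lists by name; the variant's entry wins on a name clash."""
    merged: dict[str, dict] = {}
    for decoy in eval_decoys:
        merged[decoy["name"]] = decoy
    for decoy in variant_decoys:
        merged[decoy["name"]] = decoy  # replaces an eval-level decoy of the same name
    return list(merged.values())


eval_decoys = [
    {"name": "query-optimizer", "value": "Automatically optimizes database query execution plans"},
    {"name": "data-formatter", "value": "Formats data output into tables and charts"},
]
variant_decoys = [
    {"name": "query-optimizer", "value": "Rewrites slow queries"},  # made-up variant decoy
]

for decoy in merge_decoys(eval_decoys, variant_decoys):
    print(decoy["name"], "->", decoy["value"])
```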

By default, all discovered skills are included in the selection pool (skills: all). To restrict the pool to specific skills:

# File-level
skills:
  - sql-queries
  - code-review
  - data-analysis

evals:
  - name: narrow-pool
    prompt: "Review this PR for security issues"
    assert:
      - code-review

You can also override at the eval level:

evals:
  - name: only-two
    prompt: "Review this PR"
    skills:
      - code-review
      - sql-queries
    assert:
      - code-review

The skill pool always includes any decoys on top of the specified skills.
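
Putting the pool rules together: the eval-level `skills` value, if present, replaces the file-level one, and decoys are appended either way. The sketch below uses a hypothetical `build_pool` function and assumes skills are identified by name only.

```python
from typing import Iterable, Optional, Union


def build_pool(
    discovered: list[str],
    file_skills: Union[str, list[str]] = "all",
    eval_skills: Optional[Union[str, list[str]]] = None,
    decoys: Iterable[dict] = (),
) -> list[str]:
    """Return the names offered to the agent for one eval."""
    # The eval-level setting overrides the file-level one when given.
    selection = file_skills if eval_skills is None else eval_skills
    names = list(discovered) if selection == "all" else list(selection)
    # Decoys are always added on top of the chosen skills.
    return names + [decoy["name"] for decoy in decoys]


discovered = ["sql-queries", "code-review", "data-analysis"]
pool = build_pool(
    discovered,
    eval_skills=["code-review", "sql-queries"],
    decoys=[{"name": "query-optimizer", "value": "Automatically optimizes database query execution plans"}],
)
print(pool)  # ['code-review', 'sql-queries', 'query-optimizer']
```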

Run modes control whether evals run against the current skill description, variants, or both. See Testing Variants for details.

| Mode | Runs against current | Runs against variants |
|---|---|---|
| all (default) | Yes | Yes |
| current-only | Yes | No |
| variants-only | No | Yes |
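
The same table as a small Python sketch. `run_targets` is a hypothetical helper, and the label "current" for the unmodified skill description is chosen here for illustration.

```python
def run_targets(run_mode: str, variant_names: list[str]) -> list[str]:
    """List what one eval runs against under the given run mode."""
    if run_mode not in ("all", "current-only", "variants-only"):
        raise ValueError(f"unknown run-mode: {run_mode!r}")
    targets: list[str] = []
    if run_mode in ("all", "current-only"):
        targets.append("current")  # the skill description as it stands
    if run_mode in ("all", "variants-only"):
        targets.extend(variant_names)
    return targets


print(run_targets("all", ["concise"]))            # ['current', 'concise']
print(run_targets("current-only", ["concise"]))   # ['current']
print(run_targets("variants-only", ["concise"]))  # ['concise']
```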

You don’t need to edit YAML to run a subset of evals. Use CLI flags:

# Run evals for a specific skill
dojo run sql-queries
# Run a specific eval by name
dojo run --eval basic-select
# Run a specific variant
dojo run --variant concise
# Combine filters
dojo run sql-queries --eval basic-select --variant concise

Skill and eval name filters support glob patterns.
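
To get a feel for which names a pattern would select, Python's standard `fnmatch` module can stand in for the matcher. This is an assumption: the exact glob dialect Dojo uses is not specified here, so results may differ for less common pattern syntax. The eval names below are examples only.

```python
from fnmatch import fnmatchcase


def filter_names(names: list[str], pattern: str) -> list[str]:
    """Keep the names that match a shell-style glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


eval_names = ["basic-select", "basic-join", "sql-with-decoys"]
print(filter_names(eval_names, "basic-*"))       # ['basic-select', 'basic-join']
print(filter_names(eval_names, "*decoys"))       # ['sql-with-decoys']
print(filter_names(eval_names, "basic-select"))  # ['basic-select']
```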

Use --inspect to see the evaluator’s session events — model selection, prompt, tool calls, and errors:

dojo run --inspect

Full event streams are always written to logs.json in the run’s report directory, regardless of whether --inspect is used.

v0.3.3