Writing Evals
Evals are YAML files that live in a skill’s evals/ directory. Each file defines one or more selection eval scenarios that test whether an agent selects the correct skill given a prompt.
File structure
    skills/
      sql-queries/
        SKILL.md
        evals/
          selection.yaml    # selection eval definitions
          reports/          # generated by `dojo run`
            <run-id>/
              report.json
              logs.json

Eval file schema
A selection eval file has file-level defaults and an array of individual evals:
    # File-level defaults (apply to all evals unless overridden)
    model: gpt-4o          # optional -- evaluator model
    timeout: 30            # seconds, default: 30
    skills: all            # which skills to include: "all" or a list of names
    run-mode: all          # "all" | "variants-only" | "current-only"
    variants:              # optional -- variant definitions (see Testing Variants guide)
      - name: concise
        value: "Write SQL queries"

    # Individual evals
    evals:
      - name: my-eval
        prompt: "Write a query to find duplicate emails"
        assert:
          - sql-queries

The evals array
Each entry in the evals array defines a single test scenario:
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Unique name for the eval |
| prompt | string | required | The natural language prompt sent to the agent |
| assert | string[], "none", or "any" | the skill name | Expected outcome (see below) |
| model | string | file-level | Override the evaluator model for this eval |
| timeout | number | file-level | Timeout in seconds |
| enabled | boolean | true | Set to false to skip this eval |
| skills | "all" or string[] | file-level | Which skills to include in the selection pool |
| run-mode | string | file-level | "all", "variants-only", or "current-only" |
| variants | "all", string[], or inline variants | "all" | Which variants to run |
| decoys | array | none | Eval-level decoys (merged with variant decoys) |
Assertions
The assert field controls what counts as a passing eval.

Assert that the agent selects one of the listed skills:

    assert:
      - sql-queries

You can list multiple acceptable skills:

    assert:
      - sql-queries
      - data-analysis

Assert that the agent does not select any skill:

    assert: none

Use this for prompts that should not trigger skill loading; the agent should answer directly.

Assert that the agent selects some skill (you don’t care which one):

    assert: any

If you omit assert, it defaults to the skill that owns the eval file. An eval at skills/sql-queries/evals/selection.yaml implicitly asserts ["sql-queries"].
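The pass/fail rules above can be summarized in a few lines of Python. This is only a sketch of the documented semantics, not Dojo's implementation; the function names `resolve_assert` and `eval_passes` are hypothetical, and `selected` is assumed to be `None` when the agent loads no skill.

```python
from typing import List, Optional, Union

AssertSpec = Union[str, List[str]]  # "none", "any", or a list of skill names


def resolve_assert(spec: Optional[AssertSpec], owning_skill: str) -> AssertSpec:
    """Apply the documented default: an omitted assert means the owning skill."""
    return [owning_skill] if spec is None else spec


def eval_passes(spec: AssertSpec, selected: Optional[str]) -> bool:
    """Return True if the selected skill (None = no skill loaded) satisfies spec."""
    if spec == "none":
        return selected is None
    if spec == "any":
        return selected is not None
    if isinstance(spec, str):
        raise ValueError(f"unknown assert keyword: {spec!r}")
    # Otherwise spec is a list of acceptable skill names.
    return selected in spec
```

Under these assumptions, an eval with `assert: none` fails as soon as any skill is loaded, and a list assertion fails when the agent answers directly.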
How selection evals work
Dojo does not ask the agent “which skill would you pick?”, because that would test prompt comprehension, not real behavior.
Instead, Dojo registers a load_skill tool with the evaluator model and sends your prompt along with the list of available skills (names and descriptions). The agent either:
- Calls load_skill with a skill name, and Dojo records which skill was selected
- Responds directly without calling the tool, and Dojo records that no skill was selected
This tests the agent’s actual decision-making behavior.
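To make the two outcomes concrete, here is a rough sketch of how a selection could be read off a model response. The tool schema and the shape of the `tool_calls` list are assumptions made for illustration; this page does not document the actual schema Dojo registers or the evaluator's response format.

```python
from typing import Any, Dict, List, Optional

# Hypothetical tool definition; only the tool name load_skill comes from the docs.
LOAD_SKILL_TOOL: Dict[str, Any] = {
    "name": "load_skill",
    "description": "Load a skill by name before answering.",
    "parameters": {
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"],
    },
}


def selected_skill(tool_calls: List[Dict[str, Any]]) -> Optional[str]:
    """Return the skill named in the first load_skill call, or None if absent."""
    for call in tool_calls:
        if call.get("name") == LOAD_SKILL_TOOL["name"]:
            return call.get("arguments", {}).get("name")
    # No load_skill call: the agent responded directly.
    return None
```

An empty `tool_calls` list corresponds to the "responds directly" branch, which is the outcome `assert: none` expects.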
Decoys
Decoys are fake skills injected into the selection pool. They test whether the agent can discriminate between real skills and plausible-sounding distractors.
    evals:
      - name: sql-with-decoys
        prompt: "Optimize this slow GROUP BY query"
        assert:
          - sql-queries
        decoys:
          - name: query-optimizer
            value: "Automatically optimizes database query execution plans"
          - name: data-formatter
            value: "Formats data output into tables and charts"

Each decoy has:
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Name shown to the agent |
| value | string | required | Description shown to the agent |
| enabled | boolean | true | Set to false to disable this decoy |
Decoys defined at the eval level are merged with any decoys defined on the variant. If both define a decoy with the same name, the variant’s decoy takes precedence.
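The merge rule can be sketched as a dictionary keyed by decoy name. This is an illustration of the stated precedence only; `merge_decoys` is a hypothetical helper, and dropping disabled decoys after the merge (rather than before) is an assumption the docs do not confirm.

```python
from typing import Any, Dict, List

Decoy = Dict[str, Any]  # keys: name, value, optional enabled


def merge_decoys(eval_decoys: List[Decoy], variant_decoys: List[Decoy]) -> List[Decoy]:
    """Merge eval-level and variant-level decoys; the variant wins on name clashes."""
    merged = {d["name"]: d for d in eval_decoys}
    # Variant decoys overwrite eval-level decoys that share a name.
    merged.update({d["name"]: d for d in variant_decoys})
    # Assumption: disabled decoys are filtered out after merging.
    return [d for d in merged.values() if d.get("enabled", True)]
```

With this ordering, a variant can replace the description of an eval-level decoy simply by reusing its name.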
Controlling which skills are available
By default, all discovered skills are included in the selection pool (skills: all). To restrict the pool to specific skills:
    # File-level
    skills:
      - sql-queries
      - code-review
      - data-analysis

    evals:
      - name: narrow-pool
        prompt: "Review this PR for security issues"
        assert:
          - code-review

You can also override at the eval level:
    evals:
      - name: only-two
        prompt: "Review this PR"
        skills:
          - code-review
          - sql-queries
        assert:
          - code-review

The skill pool always includes any decoys on top of the specified skills.
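One way to picture the resulting pool is as a mapping from name to description, built from the selected skills and then extended with decoys. The helper below is a hypothetical sketch; in particular, raising `KeyError` for an unknown skill name is a choice made for the example, not documented Dojo behavior.

```python
from typing import Any, Dict, List, Union


def selection_pool(
    discovered: Dict[str, str],
    skills: Union[str, List[str]],
    decoys: List[Dict[str, Any]],
) -> Dict[str, str]:
    """Build the name -> description pool shown to the agent."""
    if skills == "all":
        pool = dict(discovered)
    else:
        # Raises KeyError if a listed skill was not discovered (sketch-only choice).
        pool = {name: discovered[name] for name in skills}
    # Decoys are always added on top of the chosen skills.
    for decoy in decoys:
        pool[decoy["name"]] = decoy["value"]
    return pool
```

For the only-two eval above, the pool would hold code-review and sql-queries plus whatever decoys apply to that run.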
Run modes
Run modes control whether evals run against the current skill description, variants, or both. See Testing Variants for details.
| Mode | Runs against current | Runs against variants |
|---|---|---|
| all (default) | Yes | Yes |
| current-only | Yes | No |
| variants-only | No | Yes |
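The table translates directly into a small function. This is a sketch under stated assumptions: `run_targets` is a hypothetical name, and the label "current" for the unmodified skill description is invented for the example.

```python
from typing import List

RUN_MODES = ("all", "current-only", "variants-only")


def run_targets(run_mode: str, variant_names: List[str]) -> List[str]:
    """List what an eval runs against for a given run mode."""
    if run_mode not in RUN_MODES:
        raise ValueError(f"unknown run-mode: {run_mode!r}")
    targets: List[str] = []
    if run_mode in ("all", "current-only"):
        targets.append("current")  # the skill description as currently written
    if run_mode in ("all", "variants-only"):
        targets.extend(variant_names)
    return targets
```

Note that variants-only with no variants defined yields an empty list in this sketch; how Dojo treats that case is not stated here.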
Filtering evals at runtime
You don’t need to edit YAML to run a subset of evals. Use CLI flags:
    # Run evals for a specific skill
    dojo run sql-queries

    # Run a specific eval by name
    dojo run --eval basic-select

    # Run a specific variant
    dojo run --variant concise

    # Combine filters
    dojo run sql-queries --eval basic-select --variant concise

Skill and eval name filters support glob patterns.
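As a rough illustration of glob-style name filtering, the snippet below uses Python's standard `fnmatch` module. Which glob dialect Dojo actually implements is not specified on this page, so treat the matching rules as an assumption; the eval names other than basic-select are made up for the example.

```python
from fnmatch import fnmatchcase
from typing import List


def filter_names(names: List[str], pattern: str) -> List[str]:
    """Keep the names that match a shell-style glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical eval names for illustration.
eval_names = ["basic-select", "basic-join", "window-functions"]
matching = filter_names(eval_names, "basic-*")
```

Here `matching` contains basic-select and basic-join. When passing a pattern on the command line, quoting it keeps the shell from expanding it against local file names first.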
Debugging with inspect mode
Use --inspect to see the evaluator’s session events (model selection, prompt, tool calls, and errors):
    dojo run --inspect

Full event streams are always written to logs.json in the run’s report directory, regardless of whether --inspect is used.