Selection Evals

Evals are YAML files that live in a skill’s evals/ directory. Each file defines one or more selection eval scenarios that test whether an agent selects the correct skill given a prompt.

File structure

skills/
  sql-queries/
    SKILL.md
    evals/
      selection.yaml     # selection eval definitions
      reports/           # generated by `dojo run`
        <run-id>/
          report.json
          logs.json

Eval file schema

A selection eval file has file-level defaults and an array of individual evals:

# File-level defaults (apply to all evals unless overridden)
model: claude-sonnet-4-6   # optional -- evaluator model
timeout: 30                # seconds, default: 30
skills: all                # which skills to include: "all" or a list of names
run-mode: all              # "all" | "variants-only" | "current-only"
variants:                  # optional -- variant definitions (see the Testing Variants guide)
  - name: concise
    value: "Write SQL queries"

# Individual evals
evals:
  - name: my-eval
    prompt: "Write a query to find duplicate emails"
    assert:
      - sql-queries

The `evals` array

Each entry in the evals array defines a single test scenario:

Field	Type	Default	Description
`name`	string	required	Unique name for the eval
`prompt`	string	required	The natural language prompt sent to the agent
`assert`	`string[]`, `"none"`, or `"any"`	name of the owning skill	Expected outcome (see Assertions below)
`model`	string	file-level	Override the evaluator model for this eval
`timeout`	number	file-level	Timeout in seconds
`enabled`	boolean	`true`	Set to `false` to skip this eval
`skills`	`"all"` or `string[]`	file-level	Which skills to include in the selection pool
`run-mode`	string	file-level	`"all"`, `"variants-only"`, or `"current-only"`
`variants`	`"all"`, `string[]`, or inline variants	`"all"`	Which variants to run
`decoys`	array	none	Eval-level decoys (merged with variant decoys)
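
As an illustration of how eval-level fields interact with file-level defaults, here is a minimal sketch that uses only fields from the table above. The eval names and prompts are made up for the example.

```yaml
# File-level defaults
timeout: 30
skills: all

evals:
  # Inherits timeout and skills from the file-level defaults
  - name: basic-select
    prompt: "Write a query that lists all customers in Texas"
    assert:
      - sql-queries

  # Overrides the timeout for this eval only
  - name: long-prompt
    prompt: "Write a query that joins orders, customers, and refunds by month"
    timeout: 60
    assert:
      - sql-queries

  # Stays in the file but is skipped when the evals run
  - name: work-in-progress
    prompt: "Explain this execution plan"
    enabled: false
    assert:
      - sql-queries
```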

Assertions

The assert field controls what counts as a passing eval. It takes one of three forms: a list of skill names, none, or any.

Assert that the agent selects one of the listed skills:

assert:
  - sql-queries

You can list multiple acceptable skills:

assert:
  - sql-queries
  - data-analysis

Assert that the agent does not select any skill:

assert: none

Use this for prompts that should not trigger skill loading — the agent should answer directly.

Assert that the agent selects some skill (you don’t care which one):

assert: any

If you omit assert, it defaults to the skill that owns the eval file. An eval at skills/sql-queries/evals/selection.yaml implicitly asserts ["sql-queries"].
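
Putting these forms together, a file at skills/sql-queries/evals/selection.yaml might look like the following sketch. The eval names and prompts are illustrative, not taken from a real project.

```yaml
evals:
  # No assert: defaults to the owning skill, i.e. ["sql-queries"]
  - name: implicit-owner
    prompt: "Write a query to find duplicate emails"

  # Passes if the agent selects either of the listed skills
  - name: either-skill
    prompt: "Summarize monthly revenue from the orders table"
    assert:
      - sql-queries
      - data-analysis

  # Passes only if the agent answers without selecting a skill
  - name: no-skill-needed
    prompt: "What is the capital of France?"
    assert: none

  # Passes if the agent selects some skill, whichever one it is
  - name: some-skill
    prompt: "Help me clean up this dataset"
    assert: any
```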

How selection evals work

Dojo does not ask the agent “which skill would you pick?” — that would test prompt comprehension, not real behavior.

Instead, Dojo registers a load_skill tool with the evaluator model and sends your prompt along with the list of available skills (names and descriptions). The agent then does one of two things:

- Calls load_skill with a skill name, in which case Dojo records which skill was selected
- Responds directly without calling the tool, in which case Dojo records that no skill was selected

This tests the agent’s actual decision-making behavior.

Decoys

Decoys are fake skills injected into the selection pool. They test whether the agent can discriminate between real skills and plausible-sounding distractors.

evals:
  - name: sql-with-decoys
    prompt: "Optimize this slow GROUP BY query"
    assert:
      - sql-queries
    decoys:
      - name: query-optimizer
        value: "Automatically optimizes database query execution plans"
      - name: data-formatter
        value: "Formats data output into tables and charts"

Each decoy has:

Field	Type	Default	Description
`name`	string	required	Name shown to the agent
`value`	string	required	Description shown to the agent
`enabled`	boolean	`true`	Set to `false` to disable this decoy

Decoys defined at the eval level are merged with any decoys defined on the variant. If both define a decoy with the same name, the variant’s decoy takes precedence.
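
The enabled field makes it possible to switch a decoy off without deleting it. The sketch below reuses the decoys from the example above. It assumes that a disabled decoy is simply left out of the selection pool, which the field description suggests but does not spell out.

```yaml
evals:
  - name: sql-with-one-decoy
    prompt: "Optimize this slow GROUP BY query"
    assert:
      - sql-queries
    decoys:
      # Active: shown to the agent alongside the real skills
      - name: query-optimizer
        value: "Automatically optimizes database query execution plans"
      # Disabled: kept in the file for later use (assumed to be omitted from the pool)
      - name: data-formatter
        value: "Formats data output into tables and charts"
        enabled: false
```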

Controlling which skills are available

By default, all discovered skills are included in the selection pool (skills: all). To restrict the pool to specific skills:

# File-level
skills:
  - sql-queries
  - code-review
  - data-analysis

evals:
  - name: narrow-pool
    prompt: "Review this PR for security issues"
    assert:
      - code-review

You can also override at the eval level:

evals:
  - name: only-two
    prompt: "Review this PR"
    skills:
      - code-review
      - sql-queries
    assert:
      - code-review

The skill pool always includes any decoys on top of the specified skills.
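
To make the resulting pool concrete, here is a sketch that combines a restricted skill list with a decoy. The decoy name and description are hypothetical. The comment at the end lists the pool the agent would see under the rule described above.

```yaml
evals:
  - name: narrow-pool-with-decoy
    prompt: "Review this PR for security issues"
    skills:
      - code-review
      - sql-queries
    decoys:
      - name: security-scanner
        value: "Scans repositories for known vulnerabilities"
    assert:
      - code-review
    # Resulting pool: code-review, sql-queries, security-scanner
```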

Run modes

Run modes control whether evals run against the current skill description, variants, or both. See Testing Variants for details.

Mode	Runs against current description	Runs against variants
`all` (default)	Yes	Yes
`current-only`	Yes	No
`variants-only`	No	Yes
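
As a sketch of how the modes combine with the file-level default, the file below sets run-mode once and overrides it per eval. It uses the concise variant from the schema example. The eval names and prompts are illustrative, and the comments describe the behavior expected from the table above.

```yaml
run-mode: all
variants:
  - name: concise
    value: "Write SQL queries"

evals:
  # Inherits run-mode: all, so it runs against the current description and the variant
  - name: both
    prompt: "Write a query to find duplicate emails"

  # Runs against the current description only
  - name: baseline
    prompt: "Find customers with no orders"
    run-mode: current-only

  # Runs against variants only, restricted to the one named here
  - name: concise-only
    prompt: "Count orders per customer"
    run-mode: variants-only
    variants:
      - concise
```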

Filtering evals at runtime

You don’t need to edit YAML to run a subset of evals. Use CLI flags:

# Run evals for a specific skill
dojo run sql-queries

# Run a specific eval by name
dojo run --eval basic-select

# Run a specific variant
dojo run --variant concise

# Combine filters
dojo run sql-queries --eval basic-select --variant concise

Skill and eval name filters support glob patterns.

Debugging with inspect mode

Use --inspect to see the evaluator’s session events — model selection, prompt, tool calls, and errors:

dojo run --inspect

Full event streams are always written to logs.json in the run’s report directory, regardless of whether --inspect is used.

v0.6.1