
Effectiveness Evals

A skill can be perfectly selected but still produce garbage output. Selection evals verify that the agent picks the right skill — effectiveness evals verify that the skill actually helps the agent do the job correctly.

Effectiveness evals work by running an agent in a sandboxed subprocess with the skill preloaded, then having an LLM judge score the output against criteria you define.

skills/
  sql-queries/
    SKILL.md
    evals/
      effectiveness.yaml       # effectiveness evals
      fixtures/
        basic-join/
          tests/               # copied into sandbox as agent's cwd
            schema.sql
            setup.sh           # optional: runs before the agent
          golden/
            notes.md           # optional: freeform reference for judge
            files/             # optional: reference file state
              expected.sql
      reports/
        <run-id>/
          report.json
          logs.json

timeout: 180 # default timeout per eval (seconds)
model: anthropic/claude-sonnet-4-6 # default evaluator model
judge: anthropic/claude-opus-4-6 # default judge model
variants:
  - name: verbose-instructions
    value: |
      Write correct, performant SQL with verbose instructions including
      CTE conventions, explicit JOINs, and dialect-specific guidance.
evals:
  - name: basic-join
    prompt: "Write a SQL query that joins users and orders tables on user_id"
    criteria:
      - name: correct-join-syntax
        description: "The generated SQL uses valid JOIN syntax that would execute without errors."
        pass_threshold: 0.8
      - name: uses-explicit-join
        description: "The query uses an explicit JOIN keyword rather than implicit comma-separated table joins."
        pass_threshold: 0.7
    fixtures:
      - basic-join

Each effectiveness eval spawns a subprocess where the agent runs with:

  • The full skill directory available (SKILL.md, scripts/, references/, assets/)
  • Real tools: bash, read_file, write_file, list_files
  • A temp directory as its working directory, populated from the fixture’s tests/ folder

The agent receives your prompt and works with the fixture files to produce output. After the agent finishes (or times out), the sandbox captures the final filesystem state for judging.

Fixtures provide the starting environment for the agent. Each fixture is a subdirectory under evals/fixtures/.

Files in tests/ are copied into the agent’s working directory before it runs. This is where you place input files the agent needs to work with.

If a tests/setup.sh file exists, it runs before the agent starts. Use it to create databases, generate files, or set up state that can’t be expressed as static files.
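
The sandbox preparation described above can be sketched as follows. This is an illustration only, not Dojo's implementation: the function names `prepare_sandbox` and `snapshot`, the temp-directory prefix, and the demo fixture contents are all assumptions made for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def prepare_sandbox(fixture_dir: Path) -> Path:
    """Copy a fixture's tests/ folder into a fresh temp dir; run setup.sh if present."""
    sandbox = Path(tempfile.mkdtemp(prefix="sandbox-"))  # hypothetical prefix
    shutil.copytree(fixture_dir / "tests", sandbox, dirs_exist_ok=True)
    setup = sandbox / "setup.sh"
    if setup.exists():
        # Runs before the agent starts, with the sandbox as the working directory.
        subprocess.run(["bash", str(setup)], cwd=sandbox, check=True)
    return sandbox


def snapshot(sandbox: Path) -> dict[str, bytes]:
    """Capture the filesystem state as {relative path: file contents}."""
    return {
        p.relative_to(sandbox).as_posix(): p.read_bytes()
        for p in sorted(sandbox.rglob("*"))
        if p.is_file()
    }


# Demo with a throwaway fixture (hypothetical contents).
fixture = Path(tempfile.mkdtemp()) / "basic-join"
(fixture / "tests").mkdir(parents=True)
(fixture / "tests" / "schema.sql").write_text("CREATE TABLE users (id INTEGER);\n")
(fixture / "golden").mkdir()
(fixture / "golden" / "notes.md").write_text("judge-only notes\n")

sandbox = prepare_sandbox(fixture)
print(sorted(snapshot(sandbox)))  # ['schema.sql']
```

Only tests/ is copied, so nothing from golden/ ever lands in the agent's working directory.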

The golden/ directory provides reference material to the judge — not to the agent. The agent never sees these files.

  • golden/notes.md — Freeform notes for the judge explaining what correct output looks like, edge cases to watch for, or grading guidance.
  • golden/files/ — Reference file state. The judge can compare the agent’s output against these files.

Each eval defines one or more criteria that the judge evaluates. All criteria must pass for the eval to pass.

criteria:
  - name: correct-output
    description: "The agent produces output that matches the expected result for the given input."
    pass_threshold: 0.9
  - name: no-hardcoded-values
    description: "The solution uses variables or parameters rather than hardcoded literals."
    pass_threshold: 0.7

The pass_threshold is a score from 0 to 1. The judge assigns a score for each criterion, and the criterion passes if the score meets or exceeds the threshold.

The description tells the judge exactly what to evaluate. Be specific — vague descriptions lead to inconsistent scoring.
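
The pass rule for criteria can be written out in a few lines. This is a minimal sketch of the rule as stated here (score meets or exceeds the threshold, and every criterion must pass); the `CriterionResult` class and the example scores are hypothetical, not part of Dojo.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    score: float           # judge's score, from 0 to 1
    pass_threshold: float  # taken from the eval's criteria entry

    @property
    def passed(self) -> bool:
        # A criterion passes when its score meets or exceeds the threshold.
        return self.score >= self.pass_threshold


def eval_passed(results: list[CriterionResult]) -> bool:
    # The eval passes only if every criterion passes.
    return all(r.passed for r in results)


results = [
    CriterionResult("correct-output", score=0.92, pass_threshold=0.9),
    CriterionResult("no-hardcoded-values", score=0.65, pass_threshold=0.7),
]
print(eval_passed(results))  # False: the second criterion is below its threshold
```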

Matrix mode runs your evals across multiple evaluator models and judges. This tests whether your skill works across different models and whether scoring is consistent across judges.

matrix:
  evaluators:
    - provider: anthropic
      model: claude-sonnet-4-6
    - provider: openai
      model: gpt-4o
  judges:
    - provider: anthropic
      model: claude-opus-4-6
evals:
  - name: my-eval
    prompt: "..."
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.8

The matrix expands as each eval × each fixture × each evaluator. Agent runs are independent of judges: one agent run fans out to all judges for scoring, so adding a judge adds judge calls but no additional agent runs.
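
To estimate the cost of a matrix, the expansion can be counted directly. The sketch below assumes one scoring pass per judge per agent run and ignores variants; the fixture name `window-functions` is made up for the example.

```python
from itertools import product

evals = ["my-eval"]
fixtures = ["basic-join", "window-functions"]  # second name is hypothetical
evaluators = ["anthropic/claude-sonnet-4-6", "openai/gpt-4o"]
judges = ["anthropic/claude-opus-4-6"]

# One agent run per (eval, fixture, evaluator) combination.
agent_runs = list(product(evals, fixtures, evaluators))

# Every agent run is scored by every judge; no agent run is repeated.
judge_calls = [(run, judge) for run in agent_runs for judge in judges]

print(len(agent_runs))   # 4
print(len(judge_calls))  # 4
```

Adding a second judge would double `judge_calls` to 8 while `agent_runs` stays at 4.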

You can override the matrix at the eval level:

evals:
  - name: expensive-eval
    prompt: "..."
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.9
    matrix:
      evaluators:
        - provider: anthropic
          model: claude-sonnet-4-6
      judges:
        - provider: anthropic
          model: claude-opus-4-6

Effectiveness evals support variants to test different versions of your skill. There are two approaches: inline variants (quick, description-only) and filesystem variants (full skill directories).

Place full agentskills.io skill directories under evals/variants/:

evals/
  variants/
    verbose-guide/
      SKILL.md
      scripts/
        lint-sql.sh
      references/
        style-guide.md
    minimal/
      SKILL.md
  fixtures/
    basic-join/
      tests/
        schema.sql

Each variant directory must contain a SKILL.md. It can also include scripts/, references/, and assets/ — the full progressive disclosure model. The directory name is the variant’s identifier.

For quick tests where only the SKILL.md content differs, define variants inline:

variants:
  - name: detailed-examples
    value: |
      Write SQL with detailed examples showing CTEs, window functions, and joins.
  - name: minimal
    value: "Write SQL queries."
evals:
  - name: my-eval
    prompt: "..."
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.8
    variants: all # run against all variants (default)

Inline variants cannot carry scripts, references, or assets. Use filesystem variants when you need the full skill directory.

Inline and filesystem variants can coexist. If both define a variant with the same name, the filesystem variant wins and a warning is emitted.
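
The precedence rule can be illustrated with a small merge function. This is a sketch under stated assumptions, not Dojo's code: `merge_variants` is a hypothetical helper, and each variant is reduced to a name mapped to a string describing where it came from.

```python
import warnings


def merge_variants(inline: dict[str, str], filesystem: dict[str, str]) -> dict[str, str]:
    """Combine both variant sources; on a name collision the filesystem entry wins."""
    merged = dict(inline)
    for name, source in filesystem.items():
        if name in merged:
            warnings.warn(
                f"variant {name!r} is defined inline and on disk; using the filesystem variant"
            )
        merged[name] = source
    return merged


inline = {"detailed-examples": "inline", "minimal": "inline"}
filesystem = {
    "verbose-guide": "evals/variants/verbose-guide",
    "minimal": "evals/variants/minimal",
}
merged = merge_variants(inline, filesystem)
print(merged["minimal"])  # evals/variants/minimal
```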

You can restrict to specific variants:

variants:
  - detailed-examples

The run-mode field controls which runs execute:

Mode            Runs [current]  Runs variants  Use case
all (default)   Yes             Yes            Full coverage
current-only    Yes             No             Quick baseline check
variants-only   No              Yes            Focus on variant comparison

Set at file level or per-eval (per-eval wins):

run-mode: variants-only
evals:
  - name: my-eval
    prompt: "..."
    run-mode: all # override
    criteria:
      - name: correct
        description: "Correct output"
        pass_threshold: 0.8
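
How run-mode resolves into a set of runs can be sketched as below. The `planned_runs` helper is hypothetical, and labelling the baseline run `"current"` is an assumption borrowed from the table header and the `--variant current` flag.

```python
from typing import Optional

RUN_MODES = {
    # mode: (runs the current skill, runs variants)
    "all": (True, True),
    "current-only": (True, False),
    "variants-only": (False, True),
}


def planned_runs(
    file_mode: Optional[str], eval_mode: Optional[str], variants: list[str]
) -> list[str]:
    # The per-eval setting wins over the file-level one; "all" is the default.
    mode = eval_mode or file_mode or "all"
    run_current, run_variants = RUN_MODES[mode]
    runs = ["current"] if run_current else []
    if run_variants:
        runs.extend(variants)
    return runs


# File-level variants-only, overridden per-eval with all:
print(planned_runs("variants-only", "all", ["minimal"]))  # ['current', 'minimal']
```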

Invalid variant SKILL.md files emit a warning and are skipped — they do not fail the run. This supports iterative development of new variants alongside stable ones.

You don’t need to edit YAML to run a subset of effectiveness evals. Use CLI flags:

# Run only effectiveness evals
dojo run --eval-type effectiveness
# Run a specific fixture
dojo run --fixture basic-join
# Run a specific variant
dojo run --variant verbose-guide
# Run only the baseline (no variants)
dojo run --variant current
# Run with a specific judge
dojo run --judge-filter anthropic/claude-opus-4-6
# Combine filters
dojo run sql-queries --eval-type effectiveness --fixture basic-join
# Keep sandbox directories for debugging
dojo run --eval-type effectiveness --keep-sandbox

Effectiveness evals are more expensive than selection evals — each run involves an agent session plus one or more judge calls. Matrix mode multiplies this further.

Dojo includes built-in guardrails:

  • Warning threshold (default: 4 fixtures per skill) — Prints a warning when a skill has many fixtures.
  • Confirmation threshold (default: 12 fixtures per skill) — Requires --yes to proceed without interactive confirmation.

Configure these in dojo.toml:

[effectiveness]
warn_fixture_threshold = 4
confirm_fixture_threshold = 12

Or skip confirmation entirely:

dojo run --yes
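
The two guardrails amount to a simple threshold check. The sketch below assumes the thresholds are inclusive (a count equal to the threshold triggers it) and that `--yes` suppresses the confirmation step; neither detail is confirmed here, and `guardrail_action` is a hypothetical name.

```python
def guardrail_action(
    fixture_count: int,
    warn_at: int = 4,       # warn_fixture_threshold default
    confirm_at: int = 12,   # confirm_fixture_threshold default
    yes: bool = False,      # True when --yes is passed
) -> str:
    # Assumption: thresholds are inclusive (count >= threshold).
    if fixture_count >= confirm_at:
        return "proceed" if yes else "confirm"
    if fixture_count >= warn_at:
        return "warn"
    return "proceed"


print(guardrail_action(3))             # proceed
print(guardrail_action(4))             # warn
print(guardrail_action(12))            # confirm
print(guardrail_action(12, yes=True))  # proceed
```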
v0.6.1