Effectiveness Evals
A skill can be perfectly selected but still produce garbage output. Selection evals verify that the agent picks the right skill — effectiveness evals verify that the skill actually helps the agent do the job correctly.
Effectiveness evals work by running an agent in a sandboxed subprocess with the skill preloaded, then having an LLM judge score the output against criteria you define.
File structure

    skills/
      sql-queries/
        SKILL.md
        evals/
          effectiveness.yaml        # effectiveness evals
          fixtures/
            basic-join/
              tests/                # copied into sandbox as agent's cwd
                schema.sql
                setup.sh            # optional: runs before the agent
              golden/
                notes.md            # optional: freeform reference for judge
                files/              # optional: reference file state
                  expected.sql
          reports/
            <run-id>/
              report.json
              logs.json

The effectiveness.yaml file

    timeout: 180                          # default timeout per eval (seconds)
    model: anthropic/claude-sonnet-4-6    # default evaluator model
    judge: anthropic/claude-opus-4-6      # default judge model

    variants:
      - name: verbose-instructions
        value: |
          Write correct, performant SQL with verbose instructions including CTE conventions, explicit JOINs, and dialect-specific guidance.

    evals:
      - name: basic-join
        prompt: "Write a SQL query that joins users and orders tables on user_id"
        criteria:
          - name: correct-join-syntax
            description: "The generated SQL uses valid JOIN syntax that would execute without errors."
            pass_threshold: 0.8
          - name: uses-explicit-join
            description: "The query uses an explicit JOIN keyword rather than implicit comma-separated table joins."
            pass_threshold: 0.7
        fixtures:
          - basic-join

How the sandbox works
Each effectiveness eval spawns a subprocess where the agent runs with:
- The full skill directory available (SKILL.md, scripts/, references/, assets/)
- Real tools: bash, read_file, write_file, list_files
- A temp directory as its working directory, populated from the fixture’s tests/ folder
The agent receives your prompt and works with the fixture files to produce output. After the agent finishes (or times out), the sandbox captures the final filesystem state for judging.
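The copy-in and capture steps above can be sketched in Python. This is an illustration under stated assumptions, not Dojo's implementation: `prepare_sandbox` and `snapshot` are hypothetical names, and the agent run itself is left out.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_sandbox(fixture_dir: Path) -> Path:
    """Copy a fixture's tests/ folder into a fresh temp directory (hypothetical helper)."""
    sandbox = Path(tempfile.mkdtemp(prefix="sandbox-"))
    tests_dir = fixture_dir / "tests"
    if tests_dir.is_dir():
        # dirs_exist_ok lets us copy into the already-created temp directory.
        shutil.copytree(tests_dir, sandbox, dirs_exist_ok=True)
    return sandbox


def snapshot(sandbox: Path) -> dict[str, bytes]:
    """Capture the final filesystem state as {relative path: file contents}."""
    return {
        p.relative_to(sandbox).as_posix(): p.read_bytes()
        for p in sorted(sandbox.rglob("*"))
        if p.is_file()
    }
```

Anything the agent writes into the sandbox after `prepare_sandbox` returns shows up in the `snapshot` result alongside the original fixture files.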
Fixtures
Fixtures provide the starting environment for the agent. Each fixture is a subdirectory under evals/fixtures/.
tests/ directory
Files in tests/ are copied into the agent’s working directory before it runs. This is where you place input files the agent needs to work with.
If a tests/setup.sh file exists, it runs before the agent starts. Use it to create databases, generate files, or set up state that can’t be expressed as static files.
golden/ directory
The golden/ directory provides reference material to the judge, not to the agent. The agent never sees these files.
- golden/notes.md: Freeform notes for the judge explaining what correct output looks like, edge cases to watch for, or grading guidance.
- golden/files/: Reference file state. The judge can compare the agent’s output against these files.
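How the judge uses golden/files/ is not spelled out here. As one possible illustration, a comparison between the agent's output and the reference file state could be summarized like this. `compare_to_golden` is a hypothetical helper, and both arguments are assumed to map relative paths to file contents.

```python
def compare_to_golden(
    actual: dict[str, bytes], golden: dict[str, bytes]
) -> dict[str, list[str]]:
    """Classify each golden file as matching, differing, or missing in the agent's output."""
    report: dict[str, list[str]] = {"matching": [], "differing": [], "missing": []}
    for path, expected in sorted(golden.items()):
        if path not in actual:
            report["missing"].append(path)
        elif actual[path] == expected:
            report["matching"].append(path)
        else:
            report["differing"].append(path)
    return report
```

A byte-for-byte comparison is stricter than an LLM judge is likely to be, so a summary like this would be context for the judge rather than a verdict.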
Criteria
Each eval defines one or more criteria that the judge evaluates. All criteria must pass for the eval to pass.

    criteria:
      - name: correct-output
        description: "The agent produces output that matches the expected result for the given input."
        pass_threshold: 0.9
      - name: no-hardcoded-values
        description: "The solution uses variables or parameters rather than hardcoded literals."
        pass_threshold: 0.7

The pass_threshold is the minimum score, from 0 to 1, that a criterion needs in order to pass. The judge assigns a score for each criterion, and the criterion passes if the score meets or exceeds the threshold.
The description tells the judge exactly what to evaluate. Be specific — vague descriptions lead to inconsistent scoring.
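The pass rule can be written out as a minimal Python sketch. The `Criterion` class and both function names are hypothetical. Only the rule itself comes from the text above: a score at or above the threshold passes, and every criterion must pass.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    pass_threshold: float  # minimum passing score, between 0 and 1


def criterion_passes(criterion: Criterion, score: float) -> bool:
    """A criterion passes when the judge's score meets or exceeds its threshold."""
    return score >= criterion.pass_threshold


def eval_passes(criteria: list[Criterion], scores: dict[str, float]) -> bool:
    """An eval passes only if every one of its criteria passes."""
    return all(criterion_passes(c, scores[c.name]) for c in criteria)


criteria = [Criterion("correct-output", 0.9), Criterion("no-hardcoded-values", 0.7)]
# One criterion just below its threshold is enough to fail the whole eval.
failed = eval_passes(criteria, {"correct-output": 0.95, "no-hardcoded-values": 0.65})
passed = eval_passes(criteria, {"correct-output": 0.9, "no-hardcoded-values": 0.7})
```

Here `failed` is `False` and `passed` is `True`, because a score exactly equal to the threshold still counts as a pass.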
Matrix mode
Matrix mode runs your evals across multiple evaluator models and judges. This tests whether your skill works across different models and whether scoring is consistent across judges.

    matrix:
      evaluators:
        - provider: anthropic
          model: claude-sonnet-4-6
        - provider: openai
          model: gpt-4o
      judges:
        - provider: anthropic
          model: claude-opus-4-6

    evals:
      - name: my-eval
        prompt: "..."
        criteria:
          - name: correct
            description: "The agent produces the correct output for the given prompt."
            pass_threshold: 0.8

The matrix expands as: each eval x each fixture x each evaluator. Agent runs are independent of judges: one agent run fans out to all judges for scoring. This means adding judges is cheap (no additional agent runs).
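To see why adding judges is cheap, the expansion can be counted with a short Python sketch. `expand_matrix` is a hypothetical function, the `left-join` fixture is made up for the example, and variants are left out for simplicity.

```python
def expand_matrix(
    eval_fixtures: dict[str, list[str]],
    evaluators: list[str],
    judges: list[str],
) -> tuple[list[tuple[str, str, str]], list[tuple[tuple[str, str, str], str]]]:
    """Return (agent_runs, judge_calls) for a matrix configuration."""
    # One agent run per eval x fixture x evaluator.
    agent_runs = [
        (eval_name, fixture, evaluator)
        for eval_name, fixtures in eval_fixtures.items()
        for fixture in fixtures
        for evaluator in evaluators
    ]
    # Each finished agent run is scored by every judge.
    judge_calls = [(run, judge) for run in agent_runs for judge in judges]
    return agent_runs, judge_calls


eval_fixtures = {"my-eval": ["basic-join", "left-join"]}
evaluators = ["anthropic/claude-sonnet-4-6", "openai/gpt-4o"]

runs_one, calls_one = expand_matrix(eval_fixtures, evaluators, ["judge-a"])
runs_two, calls_two = expand_matrix(eval_fixtures, evaluators, ["judge-a", "judge-b"])
```

With one eval, two fixtures, and two evaluators there are four agent runs in both cases. The second judge only raises the judge calls from four to eight.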
You can override the matrix at the eval level:

    evals:
      - name: expensive-eval
        prompt: "..."
        criteria:
          - name: correct
            description: "The agent produces the correct output for the given prompt."
            pass_threshold: 0.9
        matrix:
          evaluators:
            - provider: anthropic
              model: claude-sonnet-4-6
          judges:
            - provider: anthropic
              model: claude-opus-4-6

Variants
Effectiveness evals support variants to test different versions of your skill. There are two approaches: inline variants (quick, description-only) and filesystem variants (full skill directories).
Filesystem variants
Place full agentskills.io skill directories under evals/variants/:

    evals/
      variants/
        verbose-guide/
          SKILL.md
          scripts/
            lint-sql.sh
          references/
            style-guide.md
        minimal/
          SKILL.md
      fixtures/
        basic-join/
          tests/
            schema.sql

Each variant directory must contain a SKILL.md. It can also include scripts/, references/, and assets/, which gives it the full progressive disclosure model. The directory name is the variant’s identifier.
Inline variants
For quick tests where only the SKILL.md content differs, define variants inline:

    variants:
      - name: detailed-examples
        value: |
          Write SQL with detailed examples showing CTEs, window functions, and joins.
      - name: minimal
        value: "Write SQL queries."

    evals:
      - name: my-eval
        prompt: "..."
        criteria:
          - name: correct
            description: "The agent produces the correct output for the given prompt."
            pass_threshold: 0.8
        variants: all # run against all variants (default)

Inline variants cannot carry scripts, references, or assets. Use filesystem variants when you need the full skill directory.
Coexistence and collision
Inline and filesystem variants can coexist. If both define a variant with the same name, the filesystem variant wins and a warning is emitted.
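The collision rule amounts to a merge in which filesystem entries overwrite inline ones. A minimal Python sketch, assuming variants are keyed by name: `merge_variants` and the warning text are hypothetical, not Dojo's actual output.

```python
import warnings


def merge_variants(inline: dict[str, str], filesystem: dict[str, str]) -> dict[str, str]:
    """Merge variant definitions by name; filesystem variants win on collision."""
    merged = dict(inline)
    for name, source in filesystem.items():
        if name in merged:
            warnings.warn(
                f"variant {name!r} is defined inline and on disk; using the filesystem variant"
            )
        merged[name] = source
    return merged
```

For example, merging an inline `minimal` with a filesystem `minimal` keeps the filesystem definition and emits one warning, while variants that exist in only one place pass through unchanged.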
Restricting variants per eval
You can restrict an eval to specific variants:

    variants:
      - detailed-examples

Run mode
The run-mode field controls which runs execute:
| Mode | Runs [current] | Runs variants | Use case |
|---|---|---|---|
| all (default) | Yes | Yes | Full coverage |
| current-only | Yes | No | Quick baseline check |
| variants-only | No | Yes | Focus on variant comparison |
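The table can be read as a lookup from mode to two flags. A minimal Python sketch follows. `RUN_MODES` and `planned_runs` are hypothetical names, and the baseline is labelled `current` here for illustration.

```python
# mode -> (runs current, runs variants), mirroring the table above
RUN_MODES = {
    "all": (True, True),
    "current-only": (True, False),
    "variants-only": (False, True),
}


def planned_runs(mode: str, variants: list[str]) -> list[str]:
    """List which skill versions would run for a given run mode."""
    runs_current, runs_variants = RUN_MODES[mode]
    runs = ["current"] if runs_current else []
    if runs_variants:
        runs.extend(variants)
    return runs
```

With variants `verbose-guide` and `minimal`, the mode `all` yields three runs, `current-only` yields one, and `variants-only` yields two.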
Set at file level or per-eval (per-eval wins):

    run-mode: variants-only

    evals:
      - name: my-eval
        prompt: "..."
        run-mode: all # override
        criteria:
          - name: correct
            description: "Correct output"
            pass_threshold: 0.8

Variant validation
Invalid variant SKILL.md files emit a warning and are skipped; they do not fail the run. This supports iterative development of new variants alongside stable ones.
Runtime filtering
You don’t need to edit YAML to run a subset of effectiveness evals. Use CLI flags:

    # Run only effectiveness evals
    dojo run --eval-type effectiveness

    # Run a specific fixture
    dojo run --fixture basic-join

    # Run a specific variant
    dojo run --variant verbose-guide

    # Run only the baseline (no variants)
    dojo run --variant current

    # Run with a specific judge
    dojo run --judge-filter anthropic/claude-opus-4-6

    # Combine filters
    dojo run sql-queries --eval-type effectiveness --fixture basic-join

    # Keep sandbox directories for debugging
    dojo run --eval-type effectiveness --keep-sandbox

Cost guardrails
Effectiveness evals are more expensive than selection evals: each run involves an agent session plus one or more judge calls. Matrix mode multiplies this further.
Dojo includes built-in guardrails:
- Warning threshold (default: 4 fixtures per skill): Prints a warning when a skill has many fixtures.
- Confirmation threshold (default: 12 fixtures per skill): Requires --yes to proceed without interactive confirmation.
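The two thresholds can be pictured as a small decision function. This Python sketch rests on assumptions the text does not confirm: `guardrail_action` is a hypothetical name, both thresholds are treated as inclusive, and `--yes` is modelled as the `assume_yes` flag.

```python
def guardrail_action(
    fixture_count: int,
    warn_threshold: int = 4,
    confirm_threshold: int = 12,
    assume_yes: bool = False,
) -> str:
    """Decide what happens before a run: 'proceed', 'warn', or 'confirm'."""
    # Assumption: a count equal to a threshold already triggers it.
    if fixture_count >= confirm_threshold:
        return "proceed" if assume_yes else "confirm"
    if fixture_count >= warn_threshold:
        return "warn"
    return "proceed"
```

Under these assumptions a skill with 3 fixtures runs silently, one with 4 prints a warning, and one with 12 asks for confirmation unless `assume_yes` is set.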
Configure these in dojo.toml:

    [effectiveness]
    warn_fixture_threshold = 4
    confirm_fixture_threshold = 12

Or skip confirmation entirely:

    dojo run --yes