Effectiveness Evals

A skill can be perfectly selected but still produce garbage output. Selection evals verify that the agent picks the right skill — effectiveness evals verify that the skill actually helps the agent do the job correctly.

Effectiveness evals work by running an agent in a sandboxed subprocess with the skill preloaded, then having an LLM judge score the output against criteria you define.

File structure

skills/
  sql-queries/
    SKILL.md
    evals/
      effectiveness.yaml     # effectiveness evals
      fixtures/
        basic-join/
          tests/             # copied into sandbox as agent's cwd
            schema.sql
            setup.sh         # optional: runs before the agent
          golden/
            notes.md         # optional: freeform reference for judge
            files/           # optional: reference file state
              expected.sql
      reports/
        <run-id>/
          report.json
          logs.json

The effectiveness.yaml file

timeout: 180                        # default timeout per eval (seconds)
model: anthropic/claude-sonnet-4-6  # default evaluator model
judge: anthropic/claude-opus-4-6    # default judge model

variants:
  - name: verbose-instructions
    value: |
      Write correct, performant SQL with verbose instructions including
      CTE conventions, explicit JOINs, and dialect-specific guidance.

evals:
  - name: basic-join
    prompt: "Write a SQL query that joins users and orders tables on user_id"
    criteria:
      - name: correct-join-syntax
        description: "The generated SQL uses valid JOIN syntax that would execute without errors."
        pass_threshold: 0.8
      - name: uses-explicit-join
        description: "The query uses an explicit JOIN keyword rather than implicit comma-separated table joins."
        pass_threshold: 0.7
    fixtures:
      - basic-join

How the sandbox works

Each effectiveness eval spawns a subprocess where the agent runs with:

The full skill directory available (SKILL.md, scripts/, references/, assets/)
Real tools: bash, read_file, write_file, list_files
A temp directory as its working directory, populated from the fixture’s tests/ folder

The agent receives your prompt and works with the fixture files to produce output. After the agent finishes (or times out), the sandbox captures the final filesystem state for judging.

Fixtures

Fixtures provide the starting environment for the agent. Each fixture is a subdirectory under evals/fixtures/.

`tests/` directory

Files in tests/ are copied into the agent’s working directory before it runs. This is where you place input files the agent needs to work with.

If a tests/setup.sh file exists, it runs before the agent starts. Use it to create databases, generate files, or set up state that can’t be expressed as static files.
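The fixture setup described above can be pictured with a short Python sketch. This is not Dojo's implementation. The function name `prepare_sandbox` is hypothetical, and running setup.sh with bash from the sandbox directory is an assumption.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def prepare_sandbox(fixture_dir: Path) -> Path:
    """Copy a fixture's tests/ folder into a fresh temp dir and run setup.sh if present."""
    sandbox = Path(tempfile.mkdtemp(prefix="eval-sandbox-"))
    # Only tests/ is copied; golden/ stays out of the agent's working directory.
    shutil.copytree(fixture_dir / "tests", sandbox, dirs_exist_ok=True)

    setup = sandbox / "setup.sh"
    if setup.is_file():
        # Assumption: setup.sh runs with the sandbox as cwd and must exit 0.
        subprocess.run(["bash", setup.name], cwd=sandbox, check=True)
    return sandbox
```

Calling `prepare_sandbox(Path("skills/sql-queries/evals/fixtures/basic-join"))` would return a temp directory containing schema.sql, ready to serve as the agent's working directory.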

`golden/` directory

The golden/ directory provides reference material to the judge — not to the agent. The agent never sees these files.

golden/notes.md — Freeform notes for the judge explaining what correct output looks like, edge cases to watch for, or grading guidance.
golden/files/ — Reference file state. The judge can compare the agent’s output against these files.

Criteria

Each eval defines one or more criteria that the judge evaluates. All criteria must pass for the eval to pass.

criteria:
  - name: correct-output
    description: "The agent produces output that matches the expected result for the given input."
    pass_threshold: 0.9
  - name: no-hardcoded-values
    description: "The solution uses variables or parameters rather than hardcoded literals."
    pass_threshold: 0.7

The pass_threshold is a score from 0 to 1. The judge assigns a score for each criterion, and the criterion passes if the score meets or exceeds the threshold.

The description tells the judge exactly what to evaluate. Be specific — vague descriptions lead to inconsistent scoring.
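The pass rule can be written in a few lines of Python. This sketch only illustrates the rule as stated (every criterion's score must meet or exceed its threshold). The function name `eval_passes` and the score dictionary shape are assumptions, not part of Dojo.

```python
from typing import Dict, List


def eval_passes(criteria: List[dict], scores: Dict[str, float]) -> bool:
    """Return True only if every criterion's score meets or exceeds its threshold."""
    return all(scores[c["name"]] >= c["pass_threshold"] for c in criteria)


criteria = [
    {"name": "correct-output", "pass_threshold": 0.9},
    {"name": "no-hardcoded-values", "pass_threshold": 0.7},
]

# A score equal to the threshold counts as a pass, so this eval passes.
print(eval_passes(criteria, {"correct-output": 0.95, "no-hardcoded-values": 0.7}))  # True
# One criterion below its threshold fails the whole eval.
print(eval_passes(criteria, {"correct-output": 0.85, "no-hardcoded-values": 1.0}))  # False
```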

Matrix mode

Matrix mode runs your evals across multiple evaluator models and judges. This tests whether your skill works across different models and whether scoring is consistent across judges.

matrix:
  evaluators:
    - provider: anthropic
      model: claude-sonnet-4-6
    - provider: openai
      model: gpt-4o
  judges:
    - provider: anthropic
      model: claude-opus-4-6

evals:
  - name: my-eval
    prompt: "..."
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.8

The matrix expands as: each eval × each fixture × each evaluator. Agent runs are independent of judges: one agent run fans out to all judges for scoring. Adding a judge therefore costs only extra judge calls, not extra agent runs.
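To make the cost of the expansion concrete, here is a small Python sketch that enumerates agent runs and judge calls. It is illustrative only. The function `plan_matrix`, the dictionary shape, and the second judge name `judge-b` are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_matrix(
    evals: Dict[str, List[str]], evaluators: List[str], judges: List[str]
) -> Tuple[list, list]:
    """Enumerate agent runs (eval, fixture, evaluator) and the judge calls fanning out from them."""
    agent_runs = [
        (name, fixture, evaluator)
        for name, fixtures in evals.items()
        for fixture in fixtures
        for evaluator in evaluators
    ]
    # Each agent run is scored by every judge; judges never trigger new agent runs.
    judge_calls = [(run, judge) for run in agent_runs for judge in judges]
    return agent_runs, judge_calls


evals = {"basic-join": ["basic-join"]}
evaluators = ["anthropic/claude-sonnet-4-6", "openai/gpt-4o"]

runs, calls = plan_matrix(evals, evaluators, ["anthropic/claude-opus-4-6"])
print(len(runs), len(calls))  # 2 2

# A second (hypothetical) judge doubles judge calls but leaves agent runs unchanged.
runs, calls = plan_matrix(evals, evaluators, ["anthropic/claude-opus-4-6", "judge-b"])
print(len(runs), len(calls))  # 2 4
```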

You can override the matrix at the eval level:

evals:
  - name: expensive-eval
    prompt: "..."
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.9
    matrix:
      evaluators:
        - provider: anthropic
          model: claude-sonnet-4-6
      judges:
        - provider: anthropic
          model: claude-opus-4-6

Variants

Effectiveness evals support variants to test different versions of your skill. There are two approaches: inline variants (quick, description-only) and filesystem variants (full skill directories).

Filesystem variants

Place full agentskills.io skill directories under evals/variants/:

evals/
  variants/
    verbose-guide/
      SKILL.md
      scripts/
        lint-sql.sh
      references/
        style-guide.md
    minimal/
      SKILL.md
  fixtures/
    basic-join/
      tests/
        schema.sql

Each variant directory must contain a SKILL.md. It can also include scripts/, references/, and assets/ — the full progressive disclosure model. The directory name is the variant’s identifier.

Inline variants

For quick tests where only the SKILL.md content differs, define variants inline:

variants:
  - name: detailed-examples
    value: |
      Write SQL with detailed examples showing CTEs, window functions, and joins.
  - name: minimal
    value: "Write SQL queries."

evals:
  - name: my-eval
    prompt: "..."
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.8
    variants: all              # run against all variants (default)

Inline variants cannot carry scripts, references, or assets. Use filesystem variants when you need the full skill directory.

Coexistence and collision

Inline and filesystem variants can coexist. If both define a variant with the same name, the filesystem variant wins and a warning is emitted.
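The collision rule amounts to a dictionary merge in which filesystem entries take precedence. The following Python sketch shows the idea. `merge_variants` is a hypothetical helper, and the warning text is invented for illustration.

```python
import warnings
from typing import Dict


def merge_variants(inline: Dict[str, str], filesystem: Dict[str, str]) -> Dict[str, str]:
    """Combine both variant sources; filesystem entries override inline ones by name."""
    merged = dict(inline)
    for name, skill_dir in filesystem.items():
        if name in merged:
            # Same name in both sources: keep the filesystem variant and warn.
            warnings.warn(f"variant '{name}' is defined inline and on disk; using the filesystem variant")
        merged[name] = skill_dir
    return merged


merged = merge_variants(
    {"minimal": "Write SQL queries.", "detailed-examples": "Write SQL with detailed examples."},
    {"minimal": "evals/variants/minimal", "verbose-guide": "evals/variants/verbose-guide"},
)
print(merged["minimal"])  # evals/variants/minimal
```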

Restricting variants per eval

To run an eval against specific variants only, list their names under that eval's variants field:

    variants:
      - detailed-examples

Run mode

The run-mode field controls which runs execute:

| Mode | Runs `[current]` | Runs variants | Use case |
| --- | --- | --- | --- |
| `all` (default) | Yes | Yes | Full coverage |
| `current-only` | Yes | No | Quick baseline check |
| `variants-only` | No | Yes | Focus on variant comparison |

Set run-mode at the file level or per eval. When both are set, the per-eval value wins:

run-mode: variants-only

evals:
  - name: my-eval
    prompt: "..."
    run-mode: all   # override
    criteria:
      - name: correct
        description: "The agent produces the correct output for the given prompt."
        pass_threshold: 0.8
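The run-mode table and the override rule can be summarized in a short Python sketch. The function `planned_runs` and its parameter names are hypothetical. It only mirrors the behavior described here and is not Dojo's code.

```python
from typing import List, Optional


def planned_runs(
    variants: List[str], file_mode: Optional[str] = None, eval_mode: Optional[str] = None
) -> List[str]:
    """Resolve run-mode (per-eval wins, default 'all') and list the skill versions to run."""
    mode = eval_mode or file_mode or "all"
    if mode not in {"all", "current-only", "variants-only"}:
        raise ValueError(f"unknown run-mode: {mode}")
    runs = []
    if mode != "variants-only":
        runs.append("[current]")  # the baseline skill
    if mode != "current-only":
        runs.extend(variants)
    return runs


# File says variants-only, but the eval overrides it with all.
print(planned_runs(["minimal"], file_mode="variants-only", eval_mode="all"))  # ['[current]', 'minimal']
print(planned_runs(["minimal"], file_mode="variants-only"))  # ['minimal']
```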

Variant validation

Invalid variant SKILL.md files emit a warning and are skipped — they do not fail the run. This supports iterative development of new variants alongside stable ones.
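A warn-and-skip pass over evals/variants/ might look like the Python sketch below. What Dojo treats as "invalid" is not specified here, so the sketch assumes the simplest case: SKILL.md is missing or empty. The name `discover_variants` is hypothetical.

```python
import warnings
from pathlib import Path
from typing import List


def discover_variants(variants_dir: Path) -> List[str]:
    """Return names of usable variant directories, warning about and skipping the rest."""
    usable = []
    for child in sorted(variants_dir.iterdir()):
        if not child.is_dir():
            continue
        skill = child / "SKILL.md"
        # Assumption: "invalid" means SKILL.md is missing or empty.
        if not skill.is_file() or not skill.read_text().strip():
            warnings.warn(f"skipping variant '{child.name}': missing or empty SKILL.md")
            continue
        usable.append(child.name)
    return usable
```

Because the function warns instead of raising, a half-written variant does not block runs of the variants that are already stable.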

Runtime filtering

You don’t need to edit YAML to run a subset of effectiveness evals. Use CLI flags:

# Run only effectiveness evals
dojo run --eval-type effectiveness

# Run a specific fixture
dojo run --fixture basic-join

# Run a specific variant
dojo run --variant verbose-guide

# Run only the baseline (no variants)
dojo run --variant current

# Run with a specific judge
dojo run --judge-filter anthropic/claude-opus-4-6

# Combine filters
dojo run sql-queries --eval-type effectiveness --fixture basic-join

# Keep sandbox directories for debugging
dojo run --eval-type effectiveness --keep-sandbox

Cost guardrails

Effectiveness evals are more expensive than selection evals — each run involves an agent session plus one or more judge calls. Matrix mode multiplies this further.

Dojo includes built-in guardrails:

Warning threshold (default: 4 fixtures per skill): prints a warning when a skill's fixture count triggers this threshold.
Confirmation threshold (default: 12 fixtures per skill): requires --yes to proceed without interactive confirmation.

Configure these in dojo.toml:

[effectiveness]
warn_fixture_threshold = 4
confirm_fixture_threshold = 12

Or skip confirmation entirely:

dojo run --yes
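As a rough mental model, the two thresholds behave like the Python sketch below. It is a guess at the logic, not Dojo's source. The function `guardrail_action` is hypothetical, and two behaviors are assumptions: that a threshold triggers once the fixture count reaches it, and that passing --yes still leaves the warning in place.

```python
def guardrail_action(
    fixture_count: int, warn_at: int = 4, confirm_at: int = 12, yes: bool = False
) -> str:
    """Return 'ok', 'warn', or 'confirm' for a skill with the given number of fixtures."""
    # Assumption: thresholds trigger at >=; the exact comparison is not documented here.
    if fixture_count >= confirm_at and not yes:
        return "confirm"  # interactive confirmation needed
    if fixture_count >= warn_at:
        return "warn"
    return "ok"


print(guardrail_action(3))             # ok
print(guardrail_action(5))             # warn
print(guardrail_action(12))            # confirm
print(guardrail_action(12, yes=True))  # warn
```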

v0.6.1