
Skills Dojo

Test and improve your AI agent skills with automated evals

Selection Evals

Does the agent pick the right skill? Dojo watches the agent’s tool calls and checks its choice against the skill you expected.

Effectiveness Evals

Does the skill actually work? Dojo scores how well an agent uses a skill with an LLM-as-a-judge.

Variant Testing

A/B test different skill variations to find the optimal phrasing and instructions for your agent.

Reporting

Every run saves a detailed report with the agent’s reasoning, so you can review evals and debug failures.

Eval Sandboxing

All skill evals run in an isolated sandbox environment for security and eval consistency.

Multi-Provider Support

Anthropic, OpenAI, GitHub Copilot, and the Vercel AI SDK are currently supported.

Dojo is a CLI that automates the tedious parts of evaluating agent skills. You write your evals as YAML, and Dojo handles the rest: presenting prompts to the evaluator model, checking results against your assertions, and saving reports.

You can test variants of skill descriptions, add decoys to stress-test discrimination, and verify that the agent knows when no skill is needed at all.
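To make variants, decoys, and the no-skill case concrete, here is a minimal Python sketch of how selection results could be tallied per variant. It is not Dojo's implementation or its YAML schema: `SelectionResult`, `pass_rate_by_variant`, the variant labels, and the skill names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectionResult:
    """One selection-eval outcome (hypothetical shape, not Dojo's report format)."""
    variant: str             # label of the skill-description variant under test
    expected: Optional[str]  # skill the agent should load; None means "no skill needed"
    picked: Optional[str]    # skill the agent actually loaded; None means it loaded nothing


def pass_rate_by_variant(results: list[SelectionResult]) -> dict[str, float]:
    """Return, for each variant, the fraction of cases where picked == expected."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in results:
        total[r.variant] += 1
        if r.picked == r.expected:
            passed[r.variant] += 1
    return {variant: passed[variant] / total[variant] for variant in total}


# Hypothetical outcomes: "pdf-import" plays the role of a decoy skill.
results = [
    SelectionResult("terse", expected="pdf-export", picked="pdf-export"),
    SelectionResult("terse", expected="pdf-export", picked="pdf-import"),  # fooled by the decoy
    SelectionResult("terse", expected=None, picked=None),                  # correctly loaded nothing
    SelectionResult("detailed", expected="pdf-export", picked="pdf-export"),
    SelectionResult("detailed", expected="pdf-export", picked="pdf-export"),
    SelectionResult("detailed", expected=None, picked=None),
]
rates = pass_rate_by_variant(results)
```

Treating "no skill needed" as an expected value of `None` lets the same equality check cover all three situations: the right skill, a decoy, and an unnecessary load.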

Skills Dojo assumes your agent skills follow the agent skills specification.

Vercel found that agents fail to select the right skill about 21% of the time. That’s roughly 1 in 5 situations where the agent should have loaded the appropriate skill but didn’t.

But cramming everything into a single AGENTS.md file isn’t the answer either. Skills exist for a reason — they let you encode focused, domain-specific instructions that agents can load on demand.

Evals let you iterate on your skill descriptions and instructions until the agent reliably picks the right tool and uses it effectively. Without measurement, you’re guessing.

Vercel Report: AGENTS.md outperforms skills in our agent evals

  1. Define skills — Each skill is a directory with a SKILL.md file containing frontmatter (name, description) and instructions for the agent.
  2. Write evals — YAML files in a skill’s evals/ directory define prompts, variants, and expected outcomes.
  3. Run — Dojo sends each prompt to an evaluator model along with the available skills. For selection evals, it tracks which skill the model picks. For effectiveness evals, it runs the agent in a sandbox and judges the output.
  4. Check — Selection evals compare the agent’s choice against your assertion. Effectiveness evals check whether the judge’s scores meet your pass thresholds.
  5. Analyze — Every run saves a report with model reasoning logs so you can understand why the agent made its choices.
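The Check step for effectiveness evals can be sketched as a comparison of judge scores against pass thresholds. This is a hedged illustration, not Dojo's code: the `Verdict` type, the `check_effectiveness` function, the criterion names, and the 0 to 1 score scale are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Outcome of one effectiveness check (hypothetical shape)."""
    passed: bool
    failures: list[str]


def check_effectiveness(scores: dict[str, float], thresholds: dict[str, float]) -> Verdict:
    """Pass only if every thresholded criterion has a judge score at or above its minimum."""
    failures = []
    for criterion, minimum in thresholds.items():
        score = scores.get(criterion)
        if score is None:
            # A criterion the judge never scored counts as a failure, not a silent pass.
            failures.append(f"{criterion}: no score from judge")
        elif score < minimum:
            failures.append(f"{criterion}: {score} < {minimum}")
    return Verdict(passed=not failures, failures=failures)


# Hypothetical judge scores and thresholds on an assumed 0-1 scale.
verdict = check_effectiveness(
    scores={"correctness": 0.9, "completeness": 0.6},
    thresholds={"correctness": 0.8, "completeness": 0.7},
)
```

Collecting every failing criterion, rather than stopping at the first, keeps the verdict useful when reviewing a report to debug a failed run.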
v0.6.1