Selection Evals
Does the agent pick the right skill? Dojo watches the agent’s tool calls and checks.
Effectiveness Evals
Does the skill actually work? Dojo scores how well an agent uses a skill with an LLM-as-a-Judge.
Variant Testing
A/B test different skill variations to find the optimal phrasing and instructions for your agent.
Reporting
Every run saves a detailed report with the agent’s reasoning, so you can review evals and debug failures.
Eval Sandboxing
All skill evals run in an isolated sandbox environment for security and eval consistency.
Multi-Provider Support
Anthropic, OpenAI, GitHub Copilot, and Vercel AI SDK are currently supported.
Dojo is a CLI that automates the tedious parts of evaluating agent skills. You write your evals as YAML, and Dojo handles the rest: presenting prompts to the evaluator model, checking results against your assertions, and saving reports.
You can test variants of skill descriptions, add decoys to stress-test discrimination, and verify that the agent knows when no skill is needed at all.
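To make the idea of a selection eval concrete, here is a minimal sketch of the kind of check described above: given the tool calls an agent made, decide whether it loaded the expected skill, ignored the decoys, or correctly loaded nothing. This is not Dojo's implementation or schema. The `SelectionCase` shape, the `load_skill` tool name, and the skill names are all hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


# Hypothetical shape for illustration only; Dojo's real YAML schema
# and report format may differ.
@dataclass
class SelectionCase:
    prompt: str
    expected_skill: Optional[str]  # None means "no skill should be loaded"


def loaded_skills(tool_calls: list[dict]) -> set[str]:
    """Collect the names of every skill the agent loaded."""
    return {
        call["args"]["skill"]
        for call in tool_calls
        if call.get("tool") == "load_skill"  # assumed tool name
    }


def passes(case: SelectionCase, tool_calls: list[dict]) -> bool:
    """Pass when the agent loaded only the expected skill, or nothing if none is expected."""
    loaded = loaded_skills(tool_calls)
    if case.expected_skill is None:
        return not loaded
    return loaded == {case.expected_skill}


# A case with a decoy: "release-notes" sounds similar but is the wrong skill.
case = SelectionCase("Generate a changelog from git history", "changelog-writer")
right = [
    {"tool": "read_file", "args": {"path": "README.md"}},
    {"tool": "load_skill", "args": {"skill": "changelog-writer"}},
]
decoy = [{"tool": "load_skill", "args": {"skill": "release-notes"}}]

# A case where no skill is needed at all.
no_skill_case = SelectionCase("What is 2 + 2?", None)
```

Calling `passes(case, right)` succeeds, `passes(case, decoy)` fails because the agent fell for the decoy, and `passes(no_skill_case, [])` succeeds because the agent loaded nothing.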
Skills Dojo assumes your agent skills follow the agent skills specification.
Vercel found that agents fail to select the right skill about 21% of the time. That’s roughly 1 in 5 situations where the agent should have loaded the appropriate skill but didn’t.
But cramming everything into a single AGENTS.md file isn’t the answer either. Skills exist for a reason — they let you encode focused, domain-specific instructions that agents can load on demand.
Evals let you iterate on your skill descriptions and instructions until the agent reliably picks the right tool and uses it effectively. Without measurement, you’re guessing.
Vercel Report: AGENTS.md outperforms skills in our agent evals
Each skill is a SKILL.md file containing frontmatter (name, description) and instructions for the agent.
Files in the evals/ directory define prompts, variants, and expected outcomes.
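As a rough illustration of the skill layout described above, the sketch below splits a SKILL.md document into its frontmatter fields and its instruction body, and checks that `name` and `description` are present. It assumes the frontmatter is fenced by `---` lines and holds flat `key: value` pairs. It is not a full YAML parser, and the example skill is made up.

```python
from __future__ import annotations

REQUIRED_FIELDS = {"name", "description"}


def parse_skill_md(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md document into (frontmatter fields, instruction body).

    Assumption: frontmatter is flat `key: value` lines between two `---` fences.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")

    # Find the closing fence, skipping the opening one at index 0.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        raise ValueError("frontmatter is missing its closing '---'")

    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip()

    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing frontmatter fields: {sorted(missing)}")

    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


# Made-up skill used only to exercise the parser.
EXAMPLE_SKILL = """---
name: changelog-writer
description: Write a changelog from git history. Use when asked for release notes.
---
# Instructions
Summarize commits grouped by type.
"""

fields, body = parse_skill_md(EXAMPLE_SKILL)
```

The `description` field is the text an agent sees when deciding whether to load the skill, so it is the part that selection evals and variant tests exercise most directly.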