Selection Evals
Test whether an agent selects the correct skill for a given scenario.
Variants
Define alternative descriptions for a skill and evaluate if the agent still selects it.
Decoys
Add fake skills to the selection pool to test skill selection discrimination.
Reporting
Every run generates a detailed report and logs of the agent’s reasoning process for analysis and debugging.
At its base, Dojo is simply a CLI that automates away the annoying parts of running agent skills evaluations. Define your evals for every skill as a simple YAML file, and Dojo will handle the rest: presenting the prompt and skill options to your evaluator model, checking the results against your assertions, and saving detailed reports for analysis.
You can describe variants, mimic real-world scenarios with decoys, and even test whether your agent can correctly identify when no skill is needed at all.
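As a rough sketch of what an eval file covering these cases might look like — note that every field name below is illustrative, not Dojo's actual schema:

```yaml
# Hypothetical selection eval for a "generate-changelog" skill.
# Field names are illustrative only, not Dojo's real YAML schema.
skill: generate-changelog
prompts:
  - prompt: "Summarize the changes since the last release"
    expect: selected        # the agent should load this skill
  - prompt: "What's the weather like today?"
    expect: not_selected    # no skill applies; the agent should pick none
variants:
  # Alternative descriptions to check the skill is still selected
  - description: "Builds a changelog from recent commit history"
decoys:
  # Fake skills added to the pool to test selection discrimination
  - name: generate-release-notes
    description: "Drafts marketing copy for a product launch"
```

A run would then present each prompt alongside the real and decoy skill descriptions to the evaluator model and check its choice against the expected outcome.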
Skills Dojo assumes your agent skills follow the agent skills specification.
Vercel published an analysis comparing AGENTS.md files against Skills and found that agents fail at skill selection 21% of the time. In other words, in roughly 1 in 5 situations where an agent should have loaded the appropriate skill, it didn't.
However, when it comes to encoding the informational and operational context an agent needs for cross-cutting concerns, throwing everything into a single AGENTS.md file isn't a viable solution either.
This is where evals come in for the Skills your team is writing: they let you test and iterate on your skills' descriptions and instructions, ensuring your agent selects the right tool for the job when it matters most.
Source: Vercel Report, "AGENTS.md outperforms skills in our agent evals"
SKILL.md — a file containing frontmatter (name, description) and instructions.
evals/ — a directory defining prompts, variants, and expected outcomes.
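Putting that together, a skill directory might be laid out like this (the layout and the eval filename are illustrative, inferred from the spec fields above):

```
my-skill/
├── SKILL.md             # frontmatter (name, description) + instructions
└── evals/
    └── selection.yaml   # prompts, variants, and expected outcomes
```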