When your LLM feature keeps regressing, /ai-evals builds a failure-mode rubric and test set, so you can tell if the next change made it better. — Claude Skill
A Claude Skill for Claude Code by Refound — run /ai-evals in Claude
Build a pass/fail eval rubric and test set from real failure traces
- Error analysis workflow (Hamel Husain, Shreya Shankar): open coding of failure traces, cluster into patterns, convert to rubric items
- Evals as the real PRD: per Brendan Foody, 'if the model is the product, the eval is the PRD' — executable requirements over prose
- Pass/fail binary decisions only, no 1-5 Likert scales that produce meaningless averages
- LLM-as-judge scaffolding with a human-validation loop, so the judge is anchored to expert agreement
- Coverage report: which failure modes are tested, which are still blind spots
Who this is for
Your support-bot accuracy drops after a prompt tweak and you can't tell why. /ai-evals runs error analysis on 50 traces, clusters 6 failure modes, and builds a pass/fail rubric you can re-run after every change, following the workflow Hamel Husain and Shreya Shankar teach.
Your team argues over whether a response is 'good'. /ai-evals forces the rubric conversation: each dimension gets a specific, measurable criterion with examples. Replaces Likert averages with the pass/fail decisions Brendan Foody calls the new PRD.
You're using GPT-4 to score your Claude outputs and trusting the score. /ai-evals adds a 20-trace human-validation step so you know the judge agrees with domain experts before you let it gate your CI.
Your new agent writes code, calls tools, and edits docs. /ai-evals designs a test set that covers the top 5 failure modes you've already seen plus synthetic adversarial cases, so you can ship with a regression bar instead of vibes.
How it works
Paste 20-50 real traces of your LLM output (success and failure)
Run open coding: label each trace with what's wrong in plain language
Cluster the labels into 4-8 failure-mode categories with examples
Generate a pass/fail rubric plus an LLM-as-judge prompt anchored to the categories
Get a re-runnable eval set with coverage report and a human-validation checklist
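The re-runnable eval set from step 5 can be as simple as a list of named pass/fail checks applied to every trace. A minimal sketch in Python; the `Trace` fields and the two example checks are illustrative assumptions, not the skill's actual output format:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trace:
    """One logged LLM interaction (fields here are illustrative)."""
    question: str
    answer: str
    cited_doc_ids: list[str] = field(default_factory=list)

# Each rubric item is a named pass/fail predicate -- no 1-5 scores.
RUBRIC: dict[str, Callable[[Trace], bool]] = {
    "policy_grounded": lambda t: bool(t.cited_doc_ids)
        and all(d.startswith("doc-") for d in t.cited_doc_ids),
    "in_scope_answered": lambda t: "I cannot help" not in t.answer,
}

def run_eval(traces: list[Trace]) -> dict[str, float]:
    """Re-run after every change: returns the pass rate per rubric item."""
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in RUBRIC.items()
    }

traces = [
    Trace("What is the refund window?", "30 days, see doc-12.", ["doc-12"]),
    Trace("Can I change my plan?", "I cannot help with that.", []),
]
print(run_eval(traces))  # → {'policy_grounded': 0.5, 'in_scope_answered': 0.5}
```

Because every check returns a boolean, the per-dimension pass rate is directly comparable across runs, which is what makes the set a regression bar rather than a vibe check.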
Example
Input:
- 50 support-bot conversations from last week, 12 flagged by CSAT as bad
- Model: Claude 3.5 Sonnet, RAG over help docs
- Complaint: 'it makes up policies'
Failure modes found:
1. Hallucinated policy: 18/50 traces (biggest)
2. Wrong doc cited: 9/50
3. Refuses in-scope question: 7/50
4. Cold tone: 5/50
5. Hedge-spam: 4/50
6. Wrong language: 2/50
Rubric:
- policy_grounded: every policy claim cites a real doc ID (PASS/FAIL)
- right_citation: cited doc actually supports the claim (PASS/FAIL)
- in_scope_answered: in-scope questions get a real answer (PASS/FAIL)
- tone_acceptable: no 'I cannot help'-style refusals (PASS/FAIL)
Judge prompt: "You are judging a support-bot reply. For each dimension output PASS or FAIL and one sentence why. Anchor: compare against the reference doc provided. Do NOT use Likert scales. [rubric injected]"
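Wired up, a judge prompt like this becomes two small pure functions: one that assembles the anchored prompt, one that parses the judge's free-text verdicts back into booleans. The actual model call is left to whatever SDK you use; the function names and output format below are assumptions for illustration:

```python
RUBRIC_ITEMS = ["policy_grounded", "right_citation", "in_scope_answered", "tone_acceptable"]

def build_judge_prompt(reply: str, reference_doc: str, rubric: list[str]) -> str:
    """Assemble the anchored judge prompt; send it with your own SDK call."""
    dims = "\n".join(f"- {d}" for d in rubric)
    return (
        "You are judging a support-bot reply. For each dimension output\n"
        "'<dimension>: PASS' or '<dimension>: FAIL' and one sentence why.\n"
        "Anchor: compare against the reference doc provided. Do NOT use Likert scales.\n\n"
        f"Dimensions:\n{dims}\n\nReference doc:\n{reference_doc}\n\nReply:\n{reply}"
    )

def parse_verdicts(judge_output: str, rubric: list[str]) -> dict[str, bool]:
    """Pull the PASS/FAIL per dimension out of the judge's free-text answer."""
    verdicts = {}
    for line in judge_output.splitlines():
        for dim in rubric:
            if line.strip().lower().startswith(dim + ":"):
                verdicts[dim] = "PASS" in line.upper().split(":", 1)[1]
    return verdicts
```

Keeping prompt assembly and verdict parsing as pure functions means both can be unit-tested without an API key, and the parser fails closed: a dimension the judge forgets to mention simply never appears in the result.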
Validation and targets:
- Validate the judge: label 20 traces yourself, check agreement with the judge exceeds 85%
- Baseline: current model scores 64% policy_grounded
- Target: 90% before ship
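The 85% agreement gate is a one-liner to compute. A sketch assuming you hold parallel lists of human and judge labels for the same hand-reviewed traces:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of hand-labeled traces where the LLM judge matches the expert."""
    assert len(human) == len(judge) and human, "need paired, non-empty label lists"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

# Illustrative labels for 5 of the ~20 hand-reviewed traces.
human = [True, True, False, True, False]
judge = [True, True, False, False, False]

score = judge_agreement(human, judge)  # 4 of 5 match → 0.8
if score < 0.85:
    print(f"Judge agreement {score:.0%} is below the 85% bar; do not gate CI on it yet.")
```

Only once this number clears the bar does it make sense to let the judge's verdicts block a deploy.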
AI Evals
Help the user create systematic evaluations for AI products using insights from AI practitioners.
How to Help
When the user asks for help with AI evals:
- Understand what they're evaluating - Ask what AI feature or model they're testing and what "good" looks like
- Help design the eval approach - Suggest rubrics, test cases, and measurement methods
- Guide implementation - Help them think through edge cases, scoring criteria, and iteration cycles
- Connect to product requirements - Ensure evals align with actual user needs, not just technical metrics
Core Principles
Evals are the new PRD
Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products—they're not optional quality checks, they're core specifications.
Evals are a core product skill
Hamel Husain & Shreya Shankar: "Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders." This isn't just for ML engineers—product people need to master this.
The workflow matters
Building good evals involves error analysis, open coding (writing down what's wrong), clustering failure patterns, and creating rubrics. It's a systematic process, not a one-time test.
Questions to Help Users
- "What does 'good' look like for this AI output?"
- "What are the most common failure modes you've seen?"
- "How will you know if the model got better or worse?"
- "Are you measuring what users actually care about?"
- "Have you manually reviewed enough outputs to understand failure patterns?"
Common Mistakes to Flag
- Skipping manual review - You can't write good evals without first understanding failure patterns through manual trace analysis
- Using vague criteria - "The output should be good" isn't an eval; you need specific, measurable criteria
- LLM-as-judge without validation - If using an LLM to judge, you must validate that judge against human experts
- Likert scales over binary - Force Pass/Fail decisions; 1-5 scales produce meaningless averages
Deep Dive
For both guest insights in full, see references/guest-insights.md
Related Skills
- Building with LLMs
- AI Product Strategy
- Evaluating New Technology
Reference documents
AI Evaluation (Evals) - All Guest Insights
2 guests, 2 mentions
Hamel Husain & Shreya Shankar
"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."
Insight: The guests explicitly define this as a 'new skill' distinct from traditional software testing or general AI strategy. It involves a specific multi-step workflow: error analysis, open coding, and clustering failure patterns into rubrics.
Brendan Foody
"If the model is the product, then the eval is the product requirement document."
Insight: The guest explicitly states we are entering the 'era of evals' and describes it as a core bottleneck for AI labs. It involves creating rubrics, benchmarks, and systematic tests to measure model capabilities.