AI Skill · Build Evals · Product & Engineering

When your LLM feature keeps regressing, /ai-evals builds a failure-mode rubric and test set, so you can tell if the next change made it better. — Claude Skill

A Claude Skill for Claude Code by Refound — run /ai-evals in Claude

Compatible with ChatGPT · Claude · Gemini · OpenClaw

Build a pass/fail eval rubric and test set from real failure traces

  • Error analysis workflow (Hamel Husain, Shreya Shankar): open coding of failure traces, cluster into patterns, convert to rubric items
  • Evals as the real PRD: per Brendan Foody, 'if the model is the product, the eval is the PRD' — executable requirements over prose
  • Pass/fail binary decisions only, no 1-5 Likert scales that produce meaningless averages
  • LLM-as-judge scaffolding with a human-validation loop, so the judge is anchored to expert agreement
  • Coverage report: which failure modes are tested, which are still blind spots

Who this is for

LLM feature ships, then silently regresses on a prompt change

Your support-bot accuracy drops after a prompt tweak and you can't tell why. /ai-evals runs error analysis on 50 traces, clusters 6 failure modes, and builds a pass/fail rubric you can re-run after every change, following Hamel Husain and Shreya Shankar's workflow.

'Good output' is vague and PMs disagree

Your team argues over whether a response is 'good'. /ai-evals forces the rubric conversation: each dimension gets a specific, measurable criterion with examples. It replaces Likert averages with pass/fail decisions, the executable requirements Brendan Foody calls the new PRD.

LLM-as-judge but nobody validated the judge

You're using GPT-4 to score your Claude outputs and trusting the score. /ai-evals adds a 20-trace human-validation step so you know the judge agrees with domain experts before you let it gate your CI.

Launching an agentic feature with no safety net

Your new agent writes code, calls tools, and edits docs. /ai-evals designs a test set that covers the top 5 failure modes you've already seen plus synthetic adversarial cases, so you can ship with a regression bar instead of vibes.

How it works

1. Paste 20-50 real traces of your LLM output (success and failure)

2. Run open coding: label each trace with what's wrong in plain language

3. Cluster the labels into 4-8 failure-mode categories with examples

4. Generate a pass/fail rubric plus an LLM-as-judge prompt anchored to the categories

5. Get a re-runnable eval set with coverage report and a human-validation checklist
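
The steps above end in a re-runnable harness; a minimal sketch of what that can look like, assuming each rubric dimension compiles down to a binary check over a trace (the `Trace` fields, doc IDs, and checks here are illustrative, not the skill's actual output):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trace:
    question: str
    answer: str
    cited_doc_ids: list[str]

# Hypothetical set of doc IDs that actually exist in the help center.
KNOWN_DOCS = {"refund-policy-v2", "shipping-faq"}

# Illustrative rubric: each dimension is a binary pass/fail check, no Likert scores.
RUBRIC: dict[str, Callable[[Trace], bool]] = {
    "policy_grounded": lambda t: all(d in KNOWN_DOCS for d in t.cited_doc_ids),
    "in_scope_answered": lambda t: "I cannot help" not in t.answer,
}

def run_evals(traces: list[Trace]) -> dict[str, float]:
    """Return the pass rate per rubric dimension, re-runnable after every change."""
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in RUBRIC.items()
    }

traces = [
    Trace("Can I get a refund?", "Yes, per refund-policy-v2...", ["refund-policy-v2"]),
    Trace("Where is my order?", "I cannot help with that.", []),
]
print(run_evals(traces))  # pass rate per dimension, e.g. a fraction between 0 and 1
```

Because every dimension is a named function, the coverage report falls out for free: any failure-mode cluster with no corresponding `RUBRIC` entry is a blind spot.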

Example

Your traces (50 conversations)
50 support-bot conversations from last week
12 flagged by CSAT as bad
Model: Claude 3.5 Sonnet, RAG over help docs
Complaint: 'it makes up policies'
30 minutes later
Failure Modes (6 clusters)
1. Hallucinated policy       18/50 traces   (biggest)
2. Wrong doc cited            9/50
3. Refuses in-scope question  7/50
4. Tone: cold                 5/50
5. Hedge-spam                 4/50
6. Wrong language             2/50
Pass/Fail Rubric
policy_grounded:    every policy claim cites a real doc ID  PASS/FAIL
right_citation:     cited doc actually supports the claim  PASS/FAIL
in_scope_answered:  in-scope questions get a real answer   PASS/FAIL
tone_acceptable:    warm, direct tone, not cold/robotic    PASS/FAIL
LLM-as-Judge Prompt
You are judging a support-bot reply. For each dimension output PASS or FAIL and one sentence why. Anchor: compare against the reference doc provided. Do NOT use Likert scales.
[rubric injected]
Next Steps
→ Validate the judge: label 20 traces yourself, check agreement with judge >85%
→ Baseline: current model scores 64% policy_grounded
→ Target: 90% before ship
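
The "validate the judge" step in Next Steps is mechanical once both label sets exist; a sketch of the agreement check, with made-up sample labels (True = PASS, False = FAIL):

```python
# 20 human expert labels vs 20 LLM-judge labels for one rubric dimension.
# Sample data for illustration only.
human = [True] * 12 + [False] * 8
judge = [True] * 11 + [False, True] + [False] * 7

def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of traces where judge and human agree."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

rate = agreement(human, judge)
print(f"judge-human agreement: {rate:.0%}")
if rate < 0.85:
    print("Do NOT gate CI on this judge yet; refine the judge prompt and re-check.")
```

Run this per rubric dimension, not just overall: a judge can clear 85% in aggregate while being unreliable on the one dimension that matters most.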

Metrics this improves

  • Time to Value (Product & Engineering): Automated eval suites shorten the iteration loop on AI features
  • Content Quality (Product & Engineering): Executable evals catch regressions before they reach users
  • Data Quality (Product & Engineering): LLM-as-judge rubrics enforce ground-truth labeling discipline

AI Evals

Help the user create systematic evaluations for AI products using insights from AI practitioners.

How to Help

When the user asks for help with AI evals:

  1. Understand what they're evaluating - Ask what AI feature or model they're testing and what "good" looks like
  2. Help design the eval approach - Suggest rubrics, test cases, and measurement methods
  3. Guide implementation - Help them think through edge cases, scoring criteria, and iteration cycles
  4. Connect to product requirements - Ensure evals align with actual user needs, not just technical metrics

Core Principles

Evals are the new PRD

Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products—they're not optional quality checks, they're core specifications.

Evals are a core product skill

Hamel Husain & Shreya Shankar: "Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders." This isn't just for ML engineers—product people need to master this.

The workflow matters

Building good evals involves error analysis, open coding (writing down what's wrong), clustering failure patterns, and creating rubrics. It's a systematic process, not a one-time test.
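
The open-coding-to-cluster step in that workflow can be sketched in a few lines, assuming free-text labels have already been written per failing trace (the labels and the keyword-to-category map are hypothetical; in practice the clustering is a human judgment call, not a keyword match):

```python
from collections import Counter

# One plain-language open-coding label per failing trace (sample data).
labels = [
    "cites a policy that does not exist",
    "made up a refund window",
    "refused a question our docs cover",
    "linked the wrong help article",
    "cites a policy that does not exist",
]

# Hypothetical mapping from label phrases to failure-mode categories.
CATEGORIES = {
    "hallucinated_policy": ("does not exist", "made up"),
    "wrong_doc_cited": ("wrong help article",),
    "refuses_in_scope": ("refused",),
}

def categorize(label: str) -> str:
    for cat, keys in CATEGORIES.items():
        if any(k in label for k in keys):
            return cat
    return "uncategorized"  # uncategorized labels signal a missing category

clusters = Counter(categorize(l) for l in labels)
print(clusters.most_common())  # biggest failure mode first
```

The ordered counts are exactly what the skill turns into rubric dimensions: the largest cluster becomes the first pass/fail criterion.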

Questions to Help Users

  • "What does 'good' look like for this AI output?"
  • "What are the most common failure modes you've seen?"
  • "How will you know if the model got better or worse?"
  • "Are you measuring what users actually care about?"
  • "Have you manually reviewed enough outputs to understand failure patterns?"

Common Mistakes to Flag

  • Skipping manual review - You can't write good evals without first understanding failure patterns through manual trace analysis
  • Using vague criteria - "The output should be good" isn't an eval; you need specific, measurable criteria
  • LLM-as-judge without validation - If using an LLM to judge, you must validate that judge against human experts
  • Likert scales over binary - Force Pass/Fail decisions; 1-5 scales produce meaningless averages
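
The last point can be shown with numbers: two models with the same 1-5 average can behave completely differently, and only a binary pass bar surfaces it (scores below are made up for illustration):

```python
model_a = [3, 3, 3, 3, 3, 3]   # consistently mediocre
model_b = [5, 5, 5, 1, 1, 1]   # great or unusable, nothing in between

def mean(xs: list[int]) -> float:
    return sum(xs) / len(xs)

def pass_rate(xs: list[int]) -> float:
    # Force the binary decision: PASS means a score of 4 or better.
    return sum(s >= 4 for s in xs) / len(xs)

print(mean(model_a), mean(model_b))            # identical averages hide the difference
print(pass_rate(model_a), pass_rate(model_b))  # pass rates expose it
```

Both averages come out to 3.0, but the pass rates are 0% versus 50%: the Likert mean tells you nothing about how often users actually get an acceptable answer.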

Deep Dive

For both guest insights, see references/guest-insights.md

Related Skills

  • Building with LLMs
  • AI Product Strategy
  • Evaluating New Technology

Reference documents

AI Evaluation (Evals) - All Guest Insights

2 guests, 2 mentions


Hamel Husain & Shreya Shankar

"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."

Insight: The guests explicitly define this as a 'new skill' that is distinct from traditional software testing or general AI strategy. It involves a specific multi-step workflow (Error Analysis, Open Coding, A

Brendan Foody

"If the model is the product, then the eval is the product requirement document."

Insight: The guest explicitly states we are entering the 'era of evals' and describes it as a core bottleneck for AI labs. It involves creating rubrics, benchmarks, and systematic tests to measure model capabilities.