When your LLM feature keeps regressing, /ai-evals builds a failure-mode rubric and test set, so you can tell if the next change made it better. — Claude Skill
A Claude Skill for Claude Code by Refound — run /ai-evals in Claude
Build a pass/fail eval rubric and test set from real failure traces
- Error analysis workflow (Hamel Husain, Shreya Shankar): open coding of failure traces, cluster into patterns, convert to rubric items
- Evals as the real PRD: per Brendan Foody, 'if the model is the product, the eval is the PRD' — executable requirements over prose
- Pass/fail binary decisions only, no 1-5 Likert scales that produce meaningless averages
- LLM-as-judge scaffolding with a human-validation loop, so the judge is anchored to expert agreement
- Coverage report: which failure modes are tested, which are still blind spots
Who this is for
Your support-bot accuracy drops after a prompt tweak and you can't tell why. /ai-evals runs error analysis on 50 traces, clusters 6 failure modes, and builds a pass/fail rubric you can re-run after every change, following the workflow Hamel Husain and Shreya Shankar teach.
Your team argues over whether a response is 'good'. /ai-evals forces the rubric conversation: each dimension gets a specific, measurable criterion with examples. Replaces Likert averages with the pass/fail decisions Brendan Foody calls the new PRD.
You're using GPT-4 to score your Claude outputs and trusting the score. /ai-evals adds a 20-trace human-validation step so you know the judge agrees with domain experts before you let it gate your CI.
Your new agent writes code, calls tools, and edits docs. /ai-evals designs a test set that covers the top 5 failure modes you've already seen plus synthetic adversarial cases, so you can ship with a regression bar instead of vibes.
How it works
Paste 20-50 real traces of your LLM output (success and failure)
Run open coding: label each trace with what's wrong in plain language
Cluster the labels into 4-8 failure-mode categories with examples
Generate a pass/fail rubric plus an LLM-as-judge prompt anchored to the categories
Get a re-runnable eval set with coverage report and a human-validation checklist
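The re-runnable eval set from step 5 can be as simple as a list of named pass/fail checks applied to every trace. A minimal sketch in Python; the `Trace` fields and the two example checks are illustrative assumptions, not the skill's actual output format:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trace:
    """One logged LLM interaction (fields here are illustrative)."""
    question: str
    answer: str
    cited_doc_ids: list[str] = field(default_factory=list)

# Each rubric item is a named pass/fail predicate -- no 1-5 scores.
RUBRIC: dict[str, Callable[[Trace], bool]] = {
    "policy_grounded": lambda t: bool(t.cited_doc_ids)
        and all(d.startswith("doc-") for d in t.cited_doc_ids),
    "in_scope_answered": lambda t: "I cannot help" not in t.answer,
}

def run_eval(traces: list[Trace]) -> dict[str, float]:
    """Re-run after every change: returns the pass rate per rubric item."""
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in RUBRIC.items()
    }

traces = [
    Trace("What is the refund window?", "30 days, see doc-12.", ["doc-12"]),
    Trace("Can I change my plan?", "I cannot help with that.", []),
]
print(run_eval(traces))  # → {'policy_grounded': 0.5, 'in_scope_answered': 0.5}
```

Because every check returns a boolean, the per-dimension pass rate is directly comparable across runs, which is what makes the set a regression bar rather than a vibe check.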
Example
Input:
- 50 support-bot conversations from last week, 12 flagged by CSAT as bad
- Model: Claude 3.5 Sonnet, RAG over help docs
- Complaint: 'it makes up policies'
Failure modes found:
1. Hallucinated policy: 18/50 traces (biggest)
2. Wrong doc cited: 9/50
3. Refuses in-scope question: 7/50
4. Cold tone: 5/50
5. Hedge-spam: 4/50
6. Wrong language: 2/50
Rubric:
- policy_grounded: every policy claim cites a real doc ID (PASS/FAIL)
- right_citation: cited doc actually supports the claim (PASS/FAIL)
- in_scope_answered: in-scope questions get a real answer (PASS/FAIL)
- tone_acceptable: no 'I cannot help'-style refusals (PASS/FAIL)
Judge prompt: "You are judging a support-bot reply. For each dimension output PASS or FAIL and one sentence why. Anchor: compare against the reference doc provided. Do NOT use Likert scales. [rubric injected]"
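Wired up, a judge prompt like this becomes two small pure functions: one that assembles the anchored prompt, one that parses the judge's free-text verdicts back into booleans. The actual model call is left to whatever SDK you use; the function names and output format below are assumptions for illustration:

```python
RUBRIC_ITEMS = ["policy_grounded", "right_citation", "in_scope_answered", "tone_acceptable"]

def build_judge_prompt(reply: str, reference_doc: str, rubric: list[str]) -> str:
    """Assemble the anchored judge prompt; send it with your own SDK call."""
    dims = "\n".join(f"- {d}" for d in rubric)
    return (
        "You are judging a support-bot reply. For each dimension output\n"
        "'<dimension>: PASS' or '<dimension>: FAIL' and one sentence why.\n"
        "Anchor: compare against the reference doc provided. Do NOT use Likert scales.\n\n"
        f"Dimensions:\n{dims}\n\nReference doc:\n{reference_doc}\n\nReply:\n{reply}"
    )

def parse_verdicts(judge_output: str, rubric: list[str]) -> dict[str, bool]:
    """Pull the PASS/FAIL per dimension out of the judge's free-text answer."""
    verdicts = {}
    for line in judge_output.splitlines():
        for dim in rubric:
            if line.strip().lower().startswith(dim + ":"):
                verdicts[dim] = "PASS" in line.upper().split(":", 1)[1]
    return verdicts
```

Keeping prompt assembly and verdict parsing as pure functions means both can be unit-tested without an API key, and the parser fails closed: a dimension the judge forgets to mention simply never appears in the result.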
Validation and targets:
- Validate the judge: label 20 traces yourself, check agreement with the judge exceeds 85%
- Baseline: current model scores 64% policy_grounded
- Target: 90% before ship
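The 85% agreement gate is a one-liner to compute. A sketch assuming you hold parallel lists of human and judge labels for the same hand-reviewed traces:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of hand-labeled traces where the LLM judge matches the expert."""
    assert len(human) == len(judge) and human, "need paired, non-empty label lists"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

# Illustrative labels for 5 of the ~20 hand-reviewed traces.
human = [True, True, False, True, False]
judge = [True, True, False, False, False]

score = judge_agreement(human, judge)  # 4 of 5 match → 0.8
if score < 0.85:
    print(f"Judge agreement {score:.0%} is below the 85% bar; do not gate CI on it yet.")
```

Only once this number clears the bar does it make sense to let the judge's verdicts block a deploy.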
AI Evals
Help the user create systematic evaluations for AI products using insights from AI practitioners.
How to Help
When the user asks for help with AI evals:
- Understand what they're evaluating - Ask what AI feature or model they're testing and what "good" looks like
- Help design the eval approach - Suggest rubrics, test cases, and measurement methods
- Guide implementation - Help them think through edge cases, scoring criteria, and iteration cycles
- Connect to product requirements - Ensure evals align with actual user needs, not just technical metrics
Core Principles
Evals are the new PRD
Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products—they're not optional quality checks, they're core specifications.
Evals are a core product skill
Hamel Husain & Shreya Shankar: "Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders." This isn't just for ML engineers—product people need to master this.
The workflow matters
Building good evals involves error analysis, open coding (writing down what's wrong), clustering failure patterns, and creating rubrics. It's a systematic process, not a one-time test.
Questions to Help Users
- "What does 'good' look like for this AI output?"
- "What are the most common failure modes you've seen?"
- "How will you know if the model got better or worse?"
- "Are you measuring what users actually care about?"
- "Have you manually reviewed enough outputs to understand failure patterns?"
Common Mistakes to Flag
- Skipping manual review - You can't write good evals without first understanding failure patterns through manual trace analysis
- Using vague criteria - "The output should be good" isn't an eval; you need specific, measurable criteria
- LLM-as-judge without validation - If using an LLM to judge, you must validate that judge against human experts
- Likert scales over binary - Force Pass/Fail decisions; 1-5 scales produce meaningless averages
Deep Dive
For both guest insights in full, see references/guest-insights.md
Related Skills
- Building with LLMs
- AI Product Strategy
- Evaluating New Technology
Reference documents
AI Evaluation (Evals) - All Guest Insights
2 guests, 2 mentions
Hamel Husain & Shreya Shankar
"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."
Insight: The guests explicitly define this as a 'new skill' distinct from traditional software testing or general AI strategy. It involves a specific multi-step workflow: error analysis, open coding, and clustering failure patterns into rubrics.
Brendan Foody
"If the model is the product, then the eval is the product requirement document."
Insight: The guest explicitly states we are entering the 'era of evals' and describes it as a core bottleneck for AI labs. It involves creating rubrics, benchmarks, and systematic tests to measure model capabilities.