
When your LLM feature keeps regressing, /ai-evals builds a failure-mode rubric and test set, so you can tell if the next change made it better. — Claude Skill

Claude skill for Claude Code · By: Refound · Run: /ai-evals (inside Claude) · Updated: April 11, 2026

Compatible with ChatGPT · Claude · Gemini · OpenClaw

Build a pass/fail eval rubric and test set from real failure traces

  • Error analysis workflow (Hamel Husain, Shreya Shankar): open coding of failure traces, cluster into patterns, convert to rubric items
  • Evals as the real PRD: per Brendan Foody, 'if the model is the product, the eval is the PRD' — executable requirements over prose
  • Pass/fail binary decisions only, no 1-5 Likert scales that produce meaningless averages
  • LLM-as-judge scaffolding with a human-validation loop, so the judge is anchored to expert agreement
  • Coverage report: which failure modes are tested, which are still blind spots
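The rubric scaffolding described above can be sketched as a minimal Python structure. Everything here is illustrative, not the skill's actual output: the dimension names and the string-matching checks are hypothetical stand-ins for real, trace-derived criteria.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    """One binary dimension of the rubric: a name, a criterion, and a check."""
    name: str
    criterion: str
    check: Callable[[str], bool]  # True means PASS on this dimension

def run_rubric(items: list[RubricItem], output: str) -> dict[str, str]:
    """Apply every rubric item to one model output; binary verdicts only."""
    return {i.name: ("PASS" if i.check(output) else "FAIL") for i in items}

# Hypothetical checks for a support-bot reply
rubric = [
    RubricItem("policy_grounded", "every policy claim cites a doc ID",
               lambda out: "[doc:" in out),
    RubricItem("no_refusal", "no 'I cannot help'-style refusals",
               lambda out: "i cannot help" not in out.lower()),
]

print(run_rubric(rubric, "Refunds take 5 business days [doc:refund-101]."))
# → {'policy_grounded': 'PASS', 'no_refusal': 'PASS'}
```

In practice each `check` would be an LLM-as-judge call rather than a string match, but the shape stays the same: named dimensions, binary verdicts, no scores to average.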

Use Cases

LLM feature ships, then silently regresses on a prompt change

Your support-bot accuracy drops after a prompt tweak and you can't tell why. /ai-evals runs error analysis on 50 traces, clusters 6 failure modes, and builds a pass/fail rubric you can re-run after every change, following Hamel Husain and Shreya Shankar's error-analysis workflow.

'Good output' is vague and PMs disagree

Your team argues over whether a response is 'good'. /ai-evals forces the rubric conversation: each dimension gets a specific, measurable criterion with examples. Replaces Likert averages with the pass/fail decisions Brendan Foody calls the new PRD.

LLM-as-judge but nobody validated the judge

You're using GPT-4 to score your Claude outputs and trusting the score. /ai-evals adds a 20-trace human-validation step so you know the judge agrees with domain experts before you let it gate your CI.
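That validation gate reduces to a simple agreement check. A minimal sketch (the 20 labels below are hypothetical sample data, not real traces):

```python
def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of traces where the LLM judge's PASS/FAIL matches the human expert's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must align trace-for-trace")
    hits = sum(h == j for h, j in zip(human_labels, judge_labels))
    return hits / len(human_labels)

# 20 hand-labeled traces vs. the judge's verdicts on the same traces (hypothetical)
human = ["PASS"] * 13 + ["FAIL"] * 7
judge = ["PASS"] * 12 + ["FAIL"] * 8  # judge flips one human PASS to FAIL

rate = judge_agreement(human, judge)
print(f"agreement: {rate:.0%}")  # 19/20 = 95%
if rate < 0.85:
    raise SystemExit("Judge disagrees with experts too often; fix the judge prompt before gating CI.")
```

The 85% threshold mirrors the bar used later in this page's example; pick whatever agreement level your domain experts consider trustworthy.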

Launching an agentic feature with no safety net

Your new agent writes code, calls tools, and edits docs. /ai-evals designs a test set that covers the top 5 failure modes you've already seen plus synthetic adversarial cases, so you can ship with a regression bar instead of vibes.

How It Works

1. Paste 20-50 real traces of your LLM output (successes and failures)
2. Run open coding: label each trace with what's wrong, in plain language
3. Cluster the labels into 4-8 failure-mode categories with examples
4. Generate a pass/fail rubric plus an LLM-as-judge prompt anchored to the categories
5. Get a re-runnable eval set with a coverage report and a human-validation checklist
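The open-coding and clustering steps can be sketched in a few lines. The labels and the cluster mapping below are illustrative inventions; in the real workflow the labels come from reading your own traces.

```python
from collections import Counter

# Open coding: one plain-language label per failing trace (hypothetical)
open_codes = [
    "invented a refund policy",
    "cited the wrong help doc",
    "invented a refund policy",
    "refused an in-scope billing question",
    "invented a refund policy",
    "reply was cold and robotic",
]

# Clustering: map free-form labels to named failure-mode categories
clusters = {
    "invented a refund policy": "hallucinated_policy",
    "cited the wrong help doc": "wrong_citation",
    "refused an in-scope billing question": "unwarranted_refusal",
    "reply was cold and robotic": "tone",
}

counts = Counter(clusters[code] for code in open_codes)
for mode, n in counts.most_common():
    print(f"{mode}: {n}/{len(open_codes)} traces")
# The most frequent cluster is the failure mode to target first
```

Sorting by frequency is the point of the exercise: the biggest cluster becomes the first rubric dimension and the first thing you fix.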

Example

Your traces (50 conversations)

  • 50 support-bot conversations from last week
  • 12 flagged by CSAT as bad
  • Model: Claude 3.5 Sonnet, RAG over help docs
  • Complaint: "it makes up policies"

30 minutes later:

Failure Modes (6 clusters)

```
1. Hallucinated policy        18/50 traces   (biggest)
2. Wrong doc cited             9/50
3. Refuses in-scope question   7/50
4. Tone: cold                  5/50
5. Hedge-spam                  4/50
6. Wrong language              2/50
```

Pass/Fail Rubric

```
policy_grounded:    every policy claim cites a real doc ID   PASS/FAIL
right_citation:     cited doc actually supports the claim    PASS/FAIL
in_scope_answered:  in-scope questions get a real answer     PASS/FAIL
tone_acceptable:    no "I cannot help"-style refusals        PASS/FAIL
```

LLM-as-Judge Prompt

> You are judging a support-bot reply. For each dimension output PASS or FAIL and one sentence why. Anchor: compare against the reference doc provided. Do NOT use Likert scales.
> [rubric injected]

Next Steps

→ Validate the judge: label 20 traces yourself and check that agreement with the judge exceeds 85%
→ Baseline: the current model scores 64% on policy_grounded
→ Target: 90% before ship
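The "[rubric injected]" step in the judge prompt above could be assembled with a plain string template. The template text, field names, and sample data below are an illustrative sketch, not the skill's actual prompt:

```python
JUDGE_TEMPLATE = """You are judging a support-bot reply. For each dimension output PASS or FAIL and one sentence why. Anchor: compare against the reference doc provided. Do NOT use Likert scales.

Rubric:
{rubric}

Reference doc:
{reference}

Reply to judge:
{reply}
"""

# Hypothetical rubric lines, reference doc, and reply
rubric_lines = [
    "policy_grounded: every policy claim cites a real doc ID",
    "right_citation: cited doc actually supports the claim",
]

prompt = JUDGE_TEMPLATE.format(
    rubric="\n".join(rubric_lines),
    reference="refund-101: Refunds are processed within 5 business days.",
    reply="Refunds take 5 business days [doc:refund-101].",
)
print(prompt)
```

Injecting the reference doc alongside the rubric is what anchors the judge: it grades against a ground truth in the prompt rather than its own priors.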

Metrics Improved

  • Time to Value (Product & Engineering): automated eval suites shorten the iteration loop on AI features
  • Content Quality (Product & Engineering): executable evals catch regressions before they reach users
  • Data Quality (Product & Engineering): LLM-as-judge rubrics enforce ground-truth labeling discipline

Supported Tools

Want to try AI Evals?

Choose how to get started.

Run in Claude Code
Free. Open source.

Install and run this skill locally on your machine.

1. Install Claude Code

Open a terminal on your computer and paste this command:

2. Install the skill

This command downloads the skill and all its files to your machine:

To use it across all your projects, add -g at the end.

3. Run it

Start Claude Code, then type the command:

Next
View the source on GitHub
Use in ElasticFlow
Team & collaboration features

Run the skill in your browser. Share results, manage access, and collaborate with your team. No terminal required.

14-day free trial. Cancel anytime.

AI Evals

Help the user create systematic evaluations for AI products using insights from AI practitioners.

How to Help

When the user asks for help with AI evals:

  1. Understand what they're evaluating - Ask what AI feature or model they're testing and what "good" looks like
  2. Help design the eval approach - Suggest rubrics, test cases, and measurement methods
  3. Guide implementation - Help them think through edge cases, scoring criteria, and iteration cycles
  4. Connect to product requirements - Ensure evals align with actual user needs, not just technical metrics

Core Principles

Evals are the new PRD

Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products—they're not optional quality checks, they're core specifications.

Evals are a core product skill

Hamel Husain & Shreya Shankar: "Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders." This isn't just for ML engineers—product people need to master this.

The workflow matters

Building good evals involves error analysis, open coding (writing down what's wrong), clustering failure patterns, and creating rubrics. It's a systematic process, not a one-time test.

Questions to Help Users

  • "What does 'good' look like for this AI output?"
  • "What are the most common failure modes you've seen?"
  • "How will you know if the model got better or worse?"
  • "Are you measuring what users actually care about?"
  • "Have you manually reviewed enough outputs to understand failure patterns?"

Common Mistakes to Flag

  • Skipping manual review - You can't write good evals without first understanding failure patterns through manual trace analysis
  • Using vague criteria - "The output should be good" isn't an eval; you need specific, measurable criteria
  • LLM-as-judge without validation - If using an LLM to judge, you must validate that judge against human experts
  • Likert scales over binary - Force Pass/Fail decisions; 1-5 scales produce meaningless averages
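A tiny numeric illustration of the last point, with hypothetical scores: a Likert average hides a bimodal distribution that a forced binary decision exposes.

```python
# Six Likert scores on a bimodal feature: outputs are either great or broken
likert = [5, 5, 1, 1, 5, 1]
average = sum(likert) / len(likert)
print(average)  # 3.0 — reads as "okay", hides the split

# Forced binary decision: PASS only if the output actually works
passes = [s >= 4 for s in likert]
pass_rate = sum(passes) / len(passes)
print(pass_rate)  # 0.5 — half the outputs fail outright
```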

Deep Dive

For both guest insights in full, see references/guest-insights.md

Related Skills

  • Building with LLMs
  • AI Product Strategy
  • Evaluating New Technology

Reference documents

AI Evaluation (Evals) - All Guest Insights

2 guests, 2 mentions


Hamel Husain & Shreya Shankar


"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."

Insight: The guests explicitly define this as a 'new skill' that is distinct from traditional software testing or general AI strategy. It involves a specific multi-step workflow (Error Analysis, Open Coding, A

Brendan Foody


"If the model is the product, then the eval is the product requirement document."

Insight: The guest explicitly states we are entering the 'era of evals' and describes it as a core bottleneck for AI labs. It involves creating rubrics, benchmarks, and systematic tests to measure model capabilities.