Decide whether an experiment should ship, stop, or keep running (Claude Skill)
A Claude Skill for Claude Code by Paweł Huryn. Run /ab-test-analysis in Claude. Updated Jun 12, 2026. Source: phuryn/pm-skills@ab-test-analysis
Reads experiment results, sample size, conversion changes, guardrail metrics, and business context to recommend a clear ship, stop, or continue decision.
- Explains experiment results in plain language instead of only reporting a p-value or dashboard screenshot.
- Checks primary metric, sample size, segment differences, and guardrail metrics before recommending a decision.
- Separates meaningful lift from noise, novelty effects, broken tracking, or mixed segment behavior.
- Returns a decision memo with evidence, risk, next test idea, and what a human should confirm.
A growth marketer screenshots the experiment dashboard, says the test is up, and debates confidence in a meeting.
Run /ab-test-analysis with the result table and context. The skill returns a decision, evidence, risks, and follow-up test.
Who this is for
Turn experiment results into clear launch, stop, or continue decisions.
Understand experiment impact on user behavior, product risk, and next iteration.
Spot tracking, sample, and guardrail issues before stakeholders trust the readout.
What it does
Turn Optimizely, Amplitude, or GA results into a decision memo.
Check whether a conversion lift came with revenue, support, speed, or retention risk.
Find tracking, segment, sample size, or timing problems before trusting the result.
How it works
Share the experiment goal, variants, dates, traffic, sample size, and metric results.
Add guardrail metrics such as churn, revenue, refund rate, support tickets, or page speed if available.
The skill interprets lift, confidence, practical significance, and business risk.
It recommends ship, stop, keep running, or re-run with a cleaner design.
Input options
Hypothesis, variants, dates, traffic split, audience, and success metric.
Example
Experiment: Pricing page CTA copy.
Variant A: Start free trial.
Variant B: Build my plan.
Dates: June 1-14.
Results:
- A: 24,100 visitors, 1,084 trials, 4.5% conversion.
- B: 23,900 visitors, 1,267 trials, 5.3% conversion.
- Confidence shown in tool: 96%.
Guardrails:
- Paid conversion after trial: A 18.4%, B 17.9%.
- Support questions about pricing increased 11% for B.
Need: ship, stop, or continue, and what to tell leadership.
Ship to 50% first, not 100%. Variant B improves trial start rate from 4.5% to 5.3%, but paid conversion is slightly lower and pricing questions increased.
The lift is likely real for trial starts. The business impact is not proven until trial quality and support load are watched for one more week.
Pricing confusion may be rising. Add a pricing FAQ link near the CTA before full rollout.
The new CTA increases trial starts by about 18% relative, but we will roll out gradually while monitoring paid conversion and pricing support tickets.
Confirm attribution window, whether paid conversion is mature enough, and whether support ticket tagging is consistent.
Works with
Compare result tables and write the decision memo.
Use experiment results, variants, confidence, and traffic allocation.
Check product behavior, activation, retention, and segment impact.
Use traffic, conversion, and acquisition context.
Works anywhere
Paste the notes, exports, screenshots, or summaries you already have. The skill works without a connected system.
Connect the relevant support, analytics, CRM, or data tool when you want fresher source evidence.
Want to use A/B Test Analysis?
Choose how to get started.
Install and run this skill locally on your computer.
Open a terminal on your computer and paste this command: npx skills add phuryn/pm-skills@ab-test-analysis
This downloads the skill with all its files to your computer. Add -g at the end to make it available in all your projects.
Start Claude Code, then type the command: /ab-test-analysis
A/B Test Analysis
Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.
Context
You are analyzing A/B test results for $ARGUMENTS.
If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.
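As a rough illustration of reading an export directly, the sketch below aggregates a per-visitor CSV into visitors and conversions per variant. The column names `variant` and `converted` are assumptions made for this example; real Optimizely, Amplitude, or GA exports will use their own headers and may already be aggregated.

```python
import csv
import io
from collections import defaultdict


def summarize_export(csv_file):
    """Count visitors and conversions per variant.

    Assumes hypothetical columns 'variant' and 'converted' (0 or 1),
    with one row per visitor.
    """
    totals = defaultdict(lambda: {"visitors": 0, "conversions": 0})
    for row in csv.DictReader(csv_file):
        group = totals[row["variant"]]
        group["visitors"] += 1
        group["conversions"] += int(row["converted"])
    return dict(totals)


# Tiny in-memory sample standing in for a real export file.
sample = io.StringIO("variant,converted\nA,0\nA,1\nB,1\nB,1\nB,0\n")
summary = summarize_export(sample)
for name, counts in sorted(summary.items()):
    rate = counts["conversions"] / counts["visitors"]
    print(f"{name}: {counts['visitors']} visitors, {rate:.1%} conversion")
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` sample.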
Instructions
1. Understand the experiment:
   - What was the hypothesis?
   - What was changed (the variant)?
   - What is the primary metric? Any guardrail metrics?
   - How long did the test run?
   - What is the traffic split?
2. Validate the test setup:
   - Sample size: Is the sample large enough for the expected effect size?
     - Use the formula (per variant): n = 2 × (Z_{α/2} + Z_β)² × p × (1 - p) / MDE², where p is the baseline conversion rate and MDE is the absolute minimum detectable effect. For α = 0.05 and 80% power, Z_{α/2} ≈ 1.96 and Z_β ≈ 0.84.
     - Flag if the test is underpowered (<80% power)
   - Duration: Did the test run for at least 1-2 full business cycles?
   - Randomization: Any evidence of sample ratio mismatch (SRM)?
   - Novelty/primacy effects: Was there enough time to wash out initial behavior changes?
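The setup checks above can be sketched with the standard library alone. This is a minimal sketch, not the skill's own script: `required_sample_size` uses the common two-proportion approximation with an explicit power term, so the result corresponds to the 80% power threshold, and `srm_p_value` runs a chi-squared goodness-of-fit test against the planned split. Both function names, the 0.5 percentage point MDE, and the 0.001 SRM alert threshold are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test.

    mde_abs is the absolute minimum detectable effect (0.005 = 0.5 points).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = 2 * baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-squared (1 degree of freedom) p-value for sample ratio mismatch."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # Survival function of chi-squared with 1 df, via the normal distribution.
    return math.erfc(math.sqrt(chi2 / 2))


needed = required_sample_size(baseline_rate=0.045, mde_abs=0.005)
srm_p = srm_p_value(24_100, 23_900)
print(f"Needed per variant: {needed:,}")
print(f"SRM p-value: {srm_p:.3f}")
if srm_p < 0.001:  # assumed alert threshold, adjust to your own policy
    print("Possible sample ratio mismatch: check assignment and tracking.")
```

A very small SRM p-value suggests the observed split differs from the planned one by more than chance, which usually points to an assignment or tracking problem rather than a real effect.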
3. Calculate statistical significance:
   - Conversion rate for control and variant
   - Relative lift: (variant - control) / control × 100
   - p-value: Using a two-tailed z-test or chi-squared test
   - Confidence interval: 95% CI for the difference
   - Statistical significance: Is p < 0.05?
   - Practical significance: Is the lift meaningful for the business?

   If the user provides raw data, generate and run a Python script to calculate these.
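One way such a script might look is sketched below: a two-tailed two-proportion z-test with a 95% confidence interval for the absolute difference. The function name and return keys are illustrative, and the counts are taken from the pricing page example earlier on this page. A dashboard's reported confidence may come from a different method, so its figure need not match this calculation.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_control, n_control, conv_variant, n_variant, alpha=0.05):
    """Two-tailed z-test for a difference in conversion rates."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    diff = p_v - p_c

    # Pooled standard error for the hypothesis test (assumes no true difference).
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval of the difference.
    se_diff = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff

    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift_pct": diff / p_c * 100,
        "z": z,
        "p_value": p_value,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
        "significant": p_value < alpha,
    }


result = two_proportion_test(1_084, 24_100, 1_267, 23_900)
print(f"Relative lift: {result['relative_lift_pct']:.1f}%")
print(f"p-value: {result['p_value']:.5f}")
print(f"95% CI for the difference: {result['ci_low']:.4f} to {result['ci_high']:.4f}")
```

Statistical significance only answers whether the difference is likely real. Whether the lower bound of the interval is large enough to matter remains a business judgment.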
4. Check guardrail metrics:
   - Did any guardrail metrics (revenue, engagement, page load time) degrade?
   - A winning primary metric with degraded guardrails may not be a true win.
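A simple guardrail screen could look like the sketch below. It flags any guardrail whose relative change moves in the harmful direction by more than a tolerance. The 2% tolerance, the dictionary fields, and the metric names are assumptions for illustration, and this is a threshold check rather than a significance test, so a flagged metric is a prompt to investigate, not proof of harm.

```python
def check_guardrails(guardrails, tolerance_pct=2.0):
    """Return (name, relative change in %) for guardrails that degraded.

    Each guardrail is a dict with 'name', 'control', 'variant', and
    'higher_is_better'. tolerance_pct is an assumed threshold.
    """
    flagged = []
    for metric in guardrails:
        change_pct = (metric["variant"] - metric["control"]) / metric["control"] * 100
        # Express the change so that a positive number always means "worse".
        harmful_pct = -change_pct if metric["higher_is_better"] else change_pct
        if harmful_pct > tolerance_pct:
            flagged.append((metric["name"], round(change_pct, 1)))
    return flagged


flagged = check_guardrails([
    {"name": "paid_conversion", "control": 18.4, "variant": 17.9, "higher_is_better": True},
    # Support volume expressed as an index, with control set to 100.
    {"name": "pricing_support_questions", "control": 100, "variant": 111, "higher_is_better": False},
    {"name": "page_load_seconds", "control": 1.20, "variant": 1.21, "higher_is_better": False},
])
for name, change in flagged:
    print(f"Guardrail degraded: {name} ({change:+.1f}%)")
```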
5. Interpret results:

   | Outcome | Recommendation |
   |---|---|
   | Significant positive lift, no guardrail issues | Ship it: roll out to 100% |
   | Significant positive lift, guardrail concerns | Investigate: understand trade-offs before shipping |
   | Not significant, positive trend | Extend the test: need more data or larger effect |
   | Not significant, flat | Stop the test: no meaningful difference detected |
   | Significant negative lift | Don't ship: revert to control, analyze why |
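The outcome-to-recommendation mapping above can be written as a small function. This is a sketch under stated assumptions: the `flat_band_pct` cutoff that separates a "positive trend" from "flat" is hypothetical, and a non-significant negative trend, which the mapping does not cover, is treated here as "Stop".

```python
def recommend(p_value, lift_pct, guardrails_degraded, alpha=0.05, flat_band_pct=1.0):
    """Map test results to a recommendation.

    flat_band_pct is an assumed cutoff: a non-significant lift at or
    below it is treated as flat.
    """
    significant = p_value < alpha
    if significant and lift_pct > 0:
        return "Investigate" if guardrails_degraded else "Ship"
    if significant and lift_pct < 0:
        return "Don't ship"
    if lift_pct > flat_band_pct:
        return "Extend"
    return "Stop"


print(recommend(p_value=0.01, lift_pct=8.0, guardrails_degraded=False))
print(recommend(p_value=0.01, lift_pct=8.0, guardrails_degraded=True))
print(recommend(p_value=0.20, lift_pct=3.0, guardrails_degraded=False))
```

The function only encodes the mapping. Context such as rollout risk or an immature guardrail metric can still justify a more cautious call, for example a staged rollout.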
6. Provide the analysis summary:

   ## A/B Test Results: [Test Name]
   **Hypothesis**: [What we expected]
   **Duration**: [X days] | **Sample**: [N control / M variant]

   | Metric | Control | Variant | Lift | p-value | Significant? |
   |---|---|---|---|---|---|
   | [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
   | [Guardrail] | ... | ... | ... | ... | ... |

   **Recommendation**: [Ship / Extend / Stop / Investigate]
   **Reasoning**: [Why]
   **Next steps**: [What to do]
Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.
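If the summary is produced by a script, a small renderer can fill the template. The sketch below is illustrative only: the function name, argument layout, and the values in the usage call are assumptions, with the figures borrowed from the pricing page example on this page.

```python
def render_summary(test_name, hypothesis, days, n_control, n_variant,
                   rows, recommendation, reasoning, next_steps):
    """Build the markdown summary. Each row is a tuple of
    (metric, control, variant, lift, p_value, significant)."""
    lines = [
        f"## A/B Test Results: {test_name}",
        f"**Hypothesis**: {hypothesis}",
        f"**Duration**: {days} days | **Sample**: {n_control:,} control / {n_variant:,} variant",
        "",
        "| Metric | Control | Variant | Lift | p-value | Significant? |",
        "|---|---|---|---|---|---|",
    ]
    for metric, control, variant, lift, p_value, significant in rows:
        flag = "Yes" if significant else "No"
        lines.append(f"| {metric} | {control} | {variant} | {lift} | {p_value} | {flag} |")
    lines += [
        "",
        f"**Recommendation**: {recommendation}",
        f"**Reasoning**: {reasoning}",
        f"**Next steps**: {next_steps}",
    ]
    return "\n".join(lines)


report = render_summary(
    test_name="Pricing page CTA copy",
    hypothesis="'Build my plan' increases trial starts",
    days=14,
    n_control=24_100,
    n_variant=23_900,
    rows=[("Trial start rate", "4.5%", "5.3%", "+17.9%", "<0.001", True)],
    recommendation="Investigate",
    reasoning="Trial starts rose, but paid conversion dipped and pricing questions increased.",
    next_steps="Monitor paid conversion and support tickets for one more week.",
)
print(report)
```

Writing the returned string to a `.md` file, for example with `pathlib.Path.write_text`, covers the "save as markdown" step.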
Further Reading
Reference documents
name: ab-test-analysis
description: "Analyze A/B test results with statistical significance, sample size validation, confidence intervals, and ship/extend/stop recommendations. Use when evaluating experiment results, checking if a test reached significance, interpreting split test data, or deciding whether to ship a variant."
Source marketplace page: https://github.com/phuryn/pm-skills/blob/HEAD/pm-data-analytics/skills/ab-test-analysis/SKILL.md
Install command: npx skills add phuryn/pm-skills@ab-test-analysis