Decide whether an experiment should ship, stop, or keep running. — Claude Skill

Name: A/B Test Analysis
Author: Pawel Huryn

A Claude Skill for Claude Code by Pawel Huryn: run /ab-test-analysis in Claude · Updated Jun 13, 2026 · vphuryn/pm-skills@ab-test-analysis

Compatible with: ChatGPT, Claude, Claude Code, Claude Desktop, Codex / Codex CLI, Cursor, Gemini, Hermes (via Continue / Cline), OpenClaw, Windsurf

Reads experiment results, sample size, conversion changes, guardrail metrics, and business context to recommend a clear ship, stop, or continue decision.

Explains experiment results in plain language instead of only reporting a p-value or dashboard screenshot.
Checks primary metric, sample size, segment differences, and guardrail metrics before recommending a decision.
Separates meaningful lift from noise, novelty effects, broken tracking, or mixed segment behavior.
Returns a decision memo with evidence, risk, next test idea, and what a human should confirm.

Today

A growth marketer screenshots the experiment dashboard, says the test is up, and debates confidence in a meeting.

With /ab-test-analysis

Run /ab-test-analysis with the result table and context. The skill returns a decision, evidence, risks, and a follow-up test.

1. Paste result table
2. Check guardrails
3. Interpret decision risk
4. Write ship/stop/continue memo

Who this is for

Growth Marketer

Turn experiment results into clear launch, stop, or continue decisions.

Product Manager

Understand experiment impact on user behavior, product risk, and next iteration.

Analytics Engineer

Spot tracking, sample, and guardrail issues before stakeholders trust the readout.

What it does

Growth experiment readout

Turn Optimizely, Amplitude, or GA results into a decision memo.

Guardrail review

Check whether a conversion lift came with revenue, support, speed, or retention risk.

Experiment design critique

Find tracking, segment, sample size, or timing problems before trusting the result.

How it works

Share the experiment goal, variants, dates, traffic, sample size, and metric results.

Add guardrail metrics such as churn, revenue, refund rate, support tickets, or page speed if available.

The skill interprets lift, confidence, practical significance, and business risk.

It recommends ship, stop, keep running, or re-run with a cleaner design.
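Practical significance is a separate question from statistical significance: a lift can be real yet too small to matter. A minimal sketch, assuming the team agrees in advance on a minimum worthwhile relative lift (the function name and the `min_relative_lift` parameter are hypothetical), compares a 95% CI for the absolute difference in conversion rates against that threshold:

```python
def practical_significance(ci_low, ci_high, baseline_rate, min_relative_lift):
    """Classify a CI for the absolute difference (variant rate - control rate).

    min_relative_lift is a hypothetical business threshold, e.g. 0.05 for +5%.
    """
    # Convert the relative threshold into the same absolute units as the CI.
    threshold = baseline_rate * min_relative_lift
    if ci_low >= threshold:
        return "practically significant"  # even the pessimistic bound clears the bar
    if ci_high < threshold:
        return "below business threshold"  # even the optimistic bound falls short
    return "inconclusive"  # the CI straddles the threshold


# An illustrative CI of roughly +0.2 to +1.2 pp on a 6% baseline, two thresholds.
print(practical_significance(0.0022, 0.0122, 0.06, 0.02))
print(practical_significance(0.0022, 0.0122, 0.06, 0.10))
```

With a +2% bar the whole interval clears it; with a +10% bar (0.6 pp in absolute terms) the interval straddles it, so the result is inconclusive on practical grounds even though it is statistically significant.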

Input options

Experiment setup

Hypothesis, variants, dates, traffic split, audience, and success metric.

Example

Experiment results

Hypothesis: changing CTA from 'Start trial' to 'Create your workspace' increases trial starts.
Duration: 21 days
Traffic split: 50/50
Control: 18,420 visitors, 1,105 trial starts, revenue per visitor $3.12
Variant: 18,390 visitors, 1,236 trial starts, revenue per visitor $3.20
Guardrails: checkout error rate, page load time, paid conversion after 7 days.
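The trial-start comparison can be reproduced from these counts. A minimal sketch, using only the Python standard library, of a two-tailed pooled two-proportion z-test (one of the tests named in the skill instructions) with an unpooled (Wald) 95% CI for the difference; the function name is illustrative:

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x_c, n_c, x_v, n_v, alpha=0.05):
    """Two-tailed pooled z-test plus an unpooled CI for (variant rate - control rate)."""
    p_c, p_v = x_c / n_c, x_v / n_v
    diff = p_v - p_c
    # Under H0 both arms share one conversion rate, so pool them for the test.
    p_pool = (x_c + x_v) / (n_c + n_v)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # The confidence interval uses the unpooled standard error of the difference.
    se_diff = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "rate_control": p_c,
        "rate_variant": p_v,
        "relative_lift_pct": diff / p_c * 100,
        "z": z,
        "p_value": p_value,
        "ci_diff": (diff - z_crit * se_diff, diff + z_crit * se_diff),
    }


result = two_proportion_ztest(1105, 18420, 1236, 18390)
for key, value in result.items():
    print(key, value)
```

On these counts the sketch gives conversion rates of about 6.0% and 6.7%, a relative lift of about +12.0%, z ≈ 2.84, a two-tailed p-value of about 0.0045, and a 95% CI of roughly +0.2 to +1.2 percentage points. A chi-squared test on the same 2×2 table without continuity correction gives the same p-value, because its statistic equals z².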

A/B test result summary

A/B Test Results: pricing CTA

**Hypothesis:** Workspace-oriented CTA increases trial starts by making the next step clearer.
**Duration:** 21 days | **Sample:** 18,420 control / 18,390 variant
**Setup check:** full business cycles covered; traffic split is balanced; guardrails available.
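The "traffic split is balanced" check corresponds to the sample ratio mismatch (SRM) step in the skill instructions. A minimal sketch, assuming the intended split is 50/50, runs a chi-squared goodness-of-fit test on the visitor counts; the 0.001 alarm level is a commonly used conservative choice assumed here, not something the source specifies:

```python
from math import sqrt
from statistics import NormalDist


def srm_check(n_control, n_variant, expected_control_share=0.5, alarm_p=0.001):
    """Chi-squared goodness-of-fit test for sample ratio mismatch (1 degree of freedom)."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # With one degree of freedom, chi2 is the square of a standard normal variable.
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return chi2, p_value, p_value < alarm_p


chi2, p_value, mismatch = srm_check(18420, 18390)
print(f"chi2={chi2:.4f} p={p_value:.3f} mismatch={mismatch}")
```

For 18,420 vs 18,390 visitors the statistic is about 0.024 with p ≈ 0.88, so the counts show no sign of SRM.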

Metric table

| Metric | Control | Variant | Lift | p-value | Significant? |
|---|---:|---:|---:|---:|---|
| Trial start rate | 6.0% | 6.7% | +12.0% | 0.005 | Yes |
| Revenue per visitor | $3.12 | $3.20 | +2.6% | 0.41 | No |
| Checkout error rate | 1.1% | 1.2% | +0.1 pp | 0.62 | No |
| Page load p75 | 2.1s | 2.2s | +0.1s | - | No concern |
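Guardrail rows like these can be screened mechanically before a human reviews them. A minimal sketch, assuming the team sets a maximum acceptable degradation per guardrail in advance (the tolerances below are hypothetical, not from the source); a stricter version would compare the upper confidence bound of the degradation, rather than the point estimate, against the tolerance:

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    control: float
    variant: float
    max_degradation: float  # hypothetical tolerance, in the metric's own units
    higher_is_worse: bool = True

    def degradation(self) -> float:
        """Positive values mean the variant is worse than control."""
        delta = self.variant - self.control
        return delta if self.higher_is_worse else -delta


def breached(guardrails):
    """Return the names of guardrails whose degradation exceeds their tolerance."""
    return [g.name for g in guardrails if g.degradation() > g.max_degradation]


example = [
    Guardrail("Checkout error rate (pp)", 1.1, 1.2, max_degradation=0.25),
    Guardrail("Page load p75 (s)", 2.1, 2.2, max_degradation=0.2),
]
print(breached(example))
```

With these tolerances neither guardrail is breached. Paid conversion after 7 days, listed as a guardrail in the input, would be added with `higher_is_worse=False` once that cohort matures.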

Recommendation

**Ship it to 100%.** The primary metric shows a statistically significant positive lift, and the guardrails reported so far did not degrade. Treat the revenue lift as directional only, because it is not yet significant.
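Revenue per visitor is a mean of a skewed per-visitor value (most visitors spend nothing), so the proportion test used for trial starts does not apply to it. A minimal sketch, assuming per-visitor revenue can be exported for each arm, uses a Welch-style standard error with a normal approximation, which is usually reasonable at tens of thousands of visitors per arm; the function name and toy data are illustrative, and heavy outliers may call for capping or a bootstrap instead:

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def mean_difference_test(control, variant):
    """Large-sample test for a difference in means with unequal variances.

    Returns (difference in means, z statistic, two-tailed p-value).
    """
    diff = fmean(variant) - fmean(control)
    se = sqrt(variance(control) / len(control) + variance(variant) / len(variant))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p_value


# Toy per-visitor revenue values (not the example's data), for illustration only.
control = [0.0] * 90 + [10.0] * 10
variant = [0.0] * 85 + [10.0] * 15
print(mean_difference_test(control, variant))
```

A non-significant result here supports reporting the revenue change as directional rather than proven.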

Next steps

1. Roll out the variant CTA.
2. Monitor paid conversion for one more cohort window.
3. Add a follow-up test on the onboarding step after trial start.
4. Document that this test proved an improvement in trial starts, not yet in revenue.

Metrics this improves

Conversion Rate

+5-20%

Marketing

Statistical Significance

Decision risk reduced

Marketing

Metric Trust

+20-40%

Marketing

Works with

Google Sheets

manual

Compare result tables and write the decision memo.

Optimizely

manual

Use experiment results, variants, confidence, and traffic allocation.

Amplitude

manual

Check product behavior, activation, retention, and segment impact.

Google Analytics

manual

Use traffic, conversion, and acquisition context.

Works anywhere

Standalone

No setup required

Paste the notes, exports, screenshots, or summaries you already have. The skill works without a connected system.

Connected

CRM + tools integrated

Connect the relevant support, analytics, CRM, or data tool when you want fresher source evidence.

Want to use A/B Test Analysis?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run it

Start Claude Code, then type the command:

then

Use on ElasticFlow

Team and collaboration features

Run skills from your browser. Share results, manage access, collaborate with your team. No terminal needed.

Free 14-day trial. Cancel anytime.

A/B Test Analysis

Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.

Context

You are analyzing A/B test results for $ARGUMENTS.

If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.
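When the input is a one-row-per-visitor export rather than a summary table, the first step is to aggregate visitors and conversions per arm. A minimal sketch using the standard `csv` module; the column names `variant` and `converted` are hypothetical and will differ between Optimizely, Amplitude, and GA exports, and the outcome column is assumed to hold 0/1 values:

```python
import csv
import io
from collections import defaultdict


def aggregate_conversions(csv_file, group_col="variant", outcome_col="converted"):
    """Count visitors and conversions per experiment arm from a per-visitor export."""
    counts = defaultdict(lambda: {"visitors": 0, "conversions": 0})
    for row in csv.DictReader(csv_file):
        arm = counts[row[group_col].strip()]
        arm["visitors"] += 1
        arm["conversions"] += int(row[outcome_col])  # assumes 0/1 values
    return dict(counts)


# Inline sample so the sketch runs as-is; for a real export, pass
# open("export.csv", newline="") (hypothetical file name) instead.
sample = io.StringIO(
    "visitor_id,variant,converted\n"
    "1,control,0\n"
    "2,variant,1\n"
    "3,control,1\n"
    "4,variant,0\n"
)
print(aggregate_conversions(sample))
```

The resulting per-arm counts are the inputs a two-proportion z-test or chi-squared test needs.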

Instructions

Understand the experiment:
- What was the hypothesis?
- What was changed (the variant)?
- What is the primary metric? Any guardrail metrics?
- How long did the test run?
- What is the traffic split?
Validate the test setup:
- Sample size: Is the sample large enough for the expected effect size?
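  - Use the formula (n per variant; p is the baseline conversion rate; MDE is the minimum detectable effect as an absolute difference): n = ((Zα/2 + Zβ)² × 2 × p × (1-p)) / MDE², where Zα/2 ≈ 1.96 for α = 0.05 and Zβ ≈ 0.84 for 80% power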
  - Flag if the test is underpowered (<80% power)
- Duration: Did the test run for at least 1-2 full business cycles?
- Randomization: Any evidence of sample ratio mismatch (SRM)?
- Novelty/primacy effects: Was there enough time to wash out initial behavior changes?
Calculate statistical significance:
- Conversion rate for control and variant
- Relative lift: (variant - control) / control × 100
- p-value: Using a two-tailed z-test or chi-squared test
- Confidence interval: 95% CI for the difference
- Statistical significance: Is p < 0.05?
- Practical significance: Is the lift meaningful for the business?
If the user provides raw data, generate and run a Python script to calculate these.
Check guardrail metrics:
- Did any guardrail metrics (revenue, engagement, page load time) degrade?
- A winning primary metric with degraded guardrails may not be a true win
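The sample-size and power checks above can be sketched with the standard library. Note that a power target needs a Zβ term alongside Zα/2. This sketch assumes a two-sided test at α = 0.05, uses the normal approximation with the baseline rate p for both arms, and takes baseline and MDE from planning rather than from the observed result, since power computed from the observed lift adds little. Function names are illustrative:

```python
from math import ceil, sqrt
from statistics import NormalDist

_std_normal = NormalDist()


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed per arm: n = (z_{1-a/2} + z_{power})^2 * 2p(1-p) / MDE^2."""
    z_alpha = _std_normal.inv_cdf(1 - alpha / 2)
    z_beta = _std_normal.inv_cdf(power)
    p = baseline_rate
    return ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / mde_abs ** 2)


def achieved_power(baseline_rate, mde_abs, n_per_group, alpha=0.05):
    """Approximate power to detect mde_abs with n_per_group visitors per arm."""
    z_alpha = _std_normal.inv_cdf(1 - alpha / 2)
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return _std_normal.cdf(mde_abs / se - z_alpha)


# Planning example: 6% baseline, detect an absolute +1 pp change.
n = sample_size_per_group(0.06, 0.01)
print(n, round(achieved_power(0.06, 0.01, n), 3))
print("underpowered" if achieved_power(0.06, 0.01, 5000) < 0.80 else "ok")
```

For a 6% baseline and a 1 pp MDE this comes to about 8,854 visitors per arm; a test with 5,000 per arm would be flagged as underpowered.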

Interpret results:

| Outcome | Recommendation |
|---|---|
| Significant positive lift, no guardrail issues | Ship it: roll out to 100% |
| Significant positive lift, guardrail concerns | Investigate: understand trade-offs before shipping |
| Not significant, positive trend | Extend the test: need more data or larger effect |
| Not significant, flat | Stop the test: no meaningful difference detected |
| Significant negative lift | Don't ship: revert to control, analyze why |
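The decision table can be encoded as a small function so the rule is applied the same way every time. A minimal sketch; the source does not define a "positive trend", so `min_positive_lift` is a hypothetical cut-off the team would agree on before the test, and a non-significant negative lift is treated here as flat:

```python
def recommend(p_value, relative_lift, guardrail_concern, alpha=0.05, min_positive_lift=0.0):
    """Map a test outcome to a recommendation from the decision table."""
    significant = p_value < alpha
    if significant and relative_lift > 0:
        return "Investigate" if guardrail_concern else "Ship"
    if significant and relative_lift < 0:
        return "Don't ship"
    if relative_lift > min_positive_lift:
        return "Extend"  # not significant, but trending positive
    return "Stop"  # not significant and flat (or negative)


print(recommend(0.005, 0.12, guardrail_concern=False))
print(recommend(0.30, 0.05, guardrail_concern=False, min_positive_lift=0.02))
```

Keeping the thresholds as explicit parameters makes it visible when a readout used a different cut-off than the one agreed up front.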

Provide the analysis summary:

## A/B Test Results: [Test Name]

**Hypothesis**: [What we expected]
**Duration**: [X days] | **Sample**: [N control / M variant]

| Metric | Control | Variant | Lift | p-value | Significant? |
|---|---|---|---|---|---|
| [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
| [Guardrail] | ... | ... | ... | ... | ... |

**Recommendation**: [Ship / Extend / Stop / Investigate]
**Reasoning**: [Why]
**Next steps**: [What to do]

Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.
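The summary template can be filled programmatically and saved as markdown. A minimal sketch; the function names and the default output path are hypothetical, and rows are passed as already-formatted strings in the template's column order:

```python
from pathlib import Path


def render_summary(name, hypothesis, days, n_control, n_variant, rows,
                   recommendation, reasoning, next_steps):
    """Fill the A/B test summary template and return it as a markdown string."""
    lines = [
        f"## A/B Test Results: {name}",
        "",
        f"**Hypothesis**: {hypothesis}",
        f"**Duration**: {days} days | **Sample**: {n_control:,} control / {n_variant:,} variant",
        "",
        "| Metric | Control | Variant | Lift | p-value | Significant? |",
        "|---|---|---|---|---|---|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    lines += [
        "",
        f"**Recommendation**: {recommendation}",
        f"**Reasoning**: {reasoning}",
        f"**Next steps**: {next_steps}",
    ]
    return "\n".join(lines) + "\n"


def save_summary(markdown, path="ab_test_results.md"):
    """Write the summary to disk (hypothetical default file name)."""
    Path(path).write_text(markdown, encoding="utf-8")


summary = render_summary(
    "pricing CTA",
    "Workspace-oriented CTA increases trial starts",
    21, 18420, 18390,
    [("Trial start rate", "6.0%", "6.7%", "+12.0%", "< 0.05", "Yes"),
     ("Revenue per visitor", "$3.12", "$3.20", "+2.6%", "0.41", "No")],
    "Ship", "Significant primary lift; guardrails held", "Monitor paid conversion",
)
print(summary)
```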

Reference documents

name: ab-test-analysis
description: "Analyze A/B test results with statistical significance, sample size validation, confidence intervals, and ship/extend/stop recommendations. Use when evaluating experiment results, checking if a test reached significance, interpreting split test data, or deciding whether to ship a variant."
