
Classify incidents, coordinate updates, and turn the response into follow-up work. — Claude Skill

Name: Incident Response
Author: NickCrew

A Claude Skill for Claude Code by NickCrew · run /incident-response in Claude · Updated Jun 13, 2026 · vmain@1d565c1

Compatible with: ChatGPT, Claude, Claude Code, Codex / Codex CLI, Cursor, Gemini

Helps teams handle outages and service degradations by establishing severity, an owner, a status cadence, containment actions, customer-safe and stakeholder updates, and blameless postmortem actions.

Classifies severity so the team knows whether this is P0, P1, P2, or P3.
Defines who owns the incident, how often to update people, and what channel is the source of truth.
Turns noisy Slack, ticket, alert, and customer notes into customer-safe status updates.
Separates immediate containment, customer communication, internal coordination, and post-incident follow-up.
Creates postmortem actions with owners, due dates, and prevention checks after the incident is resolved.

Today

Support, engineering, and leadership discuss an incident in scattered channels while customers receive late or inconsistent updates.

With /incident-response

Run /incident-response to classify severity, assign an owner, set update cadence, draft status updates, and capture the postmortem trail.

1. Paste incident facts and timeline
2. Classify severity and impact
3. Draft updates and actions
4. Turn the timeline into postmortem follow-up

Who this is for

Support Lead

Turn incident facts into clear severity, customer updates, escalation paths, and follow-up actions.

Project Manager

Coordinate owners, timeline, next updates, and post-incident action items across teams.

What it does

Active incident update

Turn scattered incident facts into a clear status update for customers, support, and leadership.

Severity and ownership

Decide how serious the incident is, who owns coordination, and when the next update is due.

Postmortem preparation

Convert the response timeline into root cause, contributing factors, and concrete action items.

How it works

Collect symptoms, user impact, affected services, timeline, current owner, and any customer reports.

Classify severity and choose the communication cadence.

Separate immediate containment from deeper root-cause work.

Draft customer, support, and leadership updates that say what is known, what is not known, and when the next update will arrive.

Record the incident timeline and action tracker while facts are changing.

After resolution, produce a blameless postmortem outline with specific follow-up actions.

Input options

Symptoms and impact

What users are experiencing, how many are affected, and whether data, revenue, or security is involved.

Example

What the user pastes

09:04 support reports checkout failures from 18 customers.
09:07 payments dashboard shows card authorization errors up 32%.
09:10 engineering suspects the new fraud rule rollout.
09:14 Sales says two enterprise trials are blocked.
Impact: checkout fails for Safari users in US/EU. No data loss. Workaround: retry on Chrome works for some users.
Current channels: #support-urgent, #payments-eng, Zendesk tickets, Jira bug PAY-1842.
Need: severity, customer update, owner, internal update, next actions, and postmortem notes.

Useful result

How it reads the incident

Classify as P1: a revenue-critical path is degraded for a meaningful user segment, but there is no total outage, security issue, or data loss. Treat Safari checkout as the affected customer experience, not “all payments down.”

Incident command

Incident commander: Support Lead until Payments Engineering names a technical lead. Source of truth: #inc-checkout-errors. Linked ticket: PAY-1842. Update cadence: every 30 minutes until mitigation, then every 60 minutes until resolved. Next external update due 09:45 UTC.

Customer-safe update

We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. We will share the next update by 09:45 UTC.

Internal stakeholder update

Impact: 18 reported customers plus two blocked enterprise trials. Suspected trigger: fraud-rule rollout. Current hypothesis is not confirmed. Payments Engineering is checking browser-specific authorization failures and preparing rollback if the rule is implicated.

Action tracker

Now: pause or roll back the fraud rule rollout. Owner: Payments Engineering.
Now: link all Zendesk tickets to PAY-1842. Owner: Support.
Next: monitor checkout success by browser and region. Owner: Analytics/Payments.
Next: prepare a saved support reply with workaround and next-update time. Owner: Support Lead.

Postmortem seed

Questions: why did the rollout reach Safari users without browser-specific checkout monitoring? Why did Sales hear from trials before the team had a status update? Follow-up actions should include segmented checkout alerts, rollout checklist update, and incident-linking instructions for Support.

Human review

Confirm severity, legal/compliance wording, and whether the workaround is safe to publish before sending any customer-facing update.

Metrics this improves

Ticket Cycle Time

Helps support and engineering move urgent incidents through ownership and next actions faster.

Operations

Issue Hygiene

Turns incident notes into clear bugs, follow-up actions, owners, and due dates.

Operations

Works with

Slack

manual

Use incident channels, updates, and responder notes as the timeline source.

Jira

manual

Track incident follow-up actions, bugs, and postmortem tasks.

Zendesk

manual

Use customer reports and support tickets to understand user impact.

Want to use Incident Response?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run it

Start Claude Code, then type /incident-response.


Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

When to Use

Production incident in progress (outage, degradation, data loss)
Designing circuit breakers, bulkheads, or fallback strategies
Conducting or planning chaos engineering exercises
Writing or reviewing postmortem documents
Establishing on-call procedures and escalation paths

Avoid when:

The issue is a development-time bug with no production impact
Designing general system architecture (use system-design instead)

Quick Reference

Topic	Load reference
Triage Framework	`skills/incident-response/references/triage-framework.md`
Postmortem Patterns	`skills/incident-response/references/postmortem-patterns.md`

Incident Response Workflow

Phase 1: Detect

Alert fires or user report received
Confirm the issue is real (not a false positive)
Identify affected services and user impact scope

Phase 2: Triage

Classify severity (P0-P3)
Assign incident commander
Open communication channel (war room, Slack channel)
Begin status page updates

Phase 3: Contain

Stop the bleeding: rollback, feature flag, traffic shift
Prevent cascade: circuit breakers, load shedding, bulkhead isolation
Communicate: stakeholder updates every 15 minutes for P0, every 30 minutes for P1

Phase 4: Resolve

Implement fix (minimal viable fix first)
Validate in staging if time permits
Deploy with monitoring and rollback plan ready
Confirm recovery with metrics returning to baseline

Phase 5: Postmortem

Document timeline within 48 hours
Conduct blameless review with all participants
Identify root cause and contributing factors
Assign action items with owners and deadlines
Update runbooks and alerting based on lessons learned

Severity Framework

Level	Impact	Response Time	Examples
P0	Complete outage, data loss, security breach	Immediate (< 5 min)	Service down, data corruption, credential leak
P1	Major feature broken, significant user impact	< 30 min	Payment processing failing for one region, auth broken for a region
P2	Degraded performance, partial feature loss	< 4 hours	Elevated latency, non-critical feature unavailable
P3	Minor issue, workaround available	Next business day	UI glitch, slow report generation, cosmetic error

Output

Incident timeline and severity classification
Containment actions taken
Postmortem document with action items
Updated runbooks and alerting rules

Common Mistakes

Skipping severity classification and treating everything as P0
Making changes without a rollback plan
Forgetting to communicate status to stakeholders
Writing postmortems that assign blame instead of identifying systemic issues
Not following up on postmortem action items

Reference documents

name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
keywords: incident response, outage, postmortem, triage, incident, response

Triage Framework

Severity classification, cascade prevention, communication protocols, and escalation paths for production incidents. Use during active incidents or when establishing incident response procedures.

Severity Classification (P0-P3)

P0 -- Critical

Definition: Complete service outage, active data loss, or security breach affecting all users.

Attribute	Requirement
Response time	< 5 minutes
Incident commander	Required (senior engineer or SRE)
Communication cadence	Every 15 minutes to stakeholders
War room	Immediately opened
Escalation	VP/Director notified within 15 minutes
Postmortem	Required within 48 hours

Examples:

Production database unreachable
Authentication service completely down
Active data corruption or loss
Security breach with confirmed exfiltration
Payment processing halted

P1 -- High

Definition: Major feature broken or significant degradation affecting a large subset of users.

Attribute	Requirement
Response time	< 30 minutes
Incident commander	Required
Communication cadence	Every 30 minutes to stakeholders
War room	Opened if not resolved in 30 minutes
Escalation	Manager notified within 30 minutes
Postmortem	Required within 1 week

Examples:

Payment processing failing for one region
Search functionality returning errors for 20%+ of queries
API latency 10x above normal
Mobile app crash on launch for specific OS version

P2 -- Medium

Definition: Degraded performance or partial feature loss with workarounds available.

Attribute	Requirement
Response time	< 4 hours
Incident commander	Optional (on-call engineer handles)
Communication cadence	Status update at start and resolution
War room	Not required
Escalation	If unresolved after 8 hours
Postmortem	Recommended

Examples:

Elevated latency (2-3x normal) on non-critical endpoints
Background job processing delayed
Non-critical third-party integration down
Report generation slow but functional

P3 -- Low

Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.

Attribute	Requirement
Response time	Next business day
Incident commander	Not required
Communication cadence	Ticket update on resolution
War room	Not required
Escalation	Not required
Postmortem	Not required

Examples:

UI rendering glitch in edge case
Non-critical cron job failed (will retry)
Slow dashboard load for internal tool
Minor logging error that does not affect functionality

Severity Decision Tree

Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
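
As an illustration, the decision tree can be written as a short function that asks the questions in the same order. This is a minimal sketch, not part of the skill itself: the `IncidentFacts` class and its field names are hypothetical stand-ins for the answers a responder would supply.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Answers to the decision-tree questions (field names are illustrative)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken_for_many: bool = False
    performance_significantly_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Walk the questions top to bottom; the first 'yes' decides the level."""
    if facts.data_loss_or_corruption:
        return "P0"
    if facts.security_breach:
        return "P0"
    if facts.primary_service_down:
        return "P0"
    if facts.major_feature_broken_for_many:
        return "P1"
    if facts.performance_significantly_degraded:
        return "P2"
    return "P3"


# A major feature broken for many users, with no data loss, breach, or total outage.
print(classify_severity(IncidentFacts(major_feature_broken_for_many=True)))  # P1
```

Because the checks run in order, an incident that is both degraded and losing data is still classified P0.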

Cascade Prevention

Circuit Breakers

Automatically stop calling a failing dependency to prevent cascading failure.

Implementation checklist:

Every external dependency has a circuit breaker
Failure thresholds are tuned per dependency (not one-size-fits-all)
OPEN state returns a meaningful fallback (cached data, degraded response, error)
HALF-OPEN probes are lightweight (health check, not full request)
Circuit breaker state is observable (metrics, dashboard)
Alerts fire when a circuit breaker opens

Configuration template:

Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
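
To make the configuration template concrete, here is a minimal single-threaded sketch of a circuit breaker. The class name, parameter names, and the injectable `clock` are assumptions made for illustration, and the HALF-OPEN probe here simply lets the next real call through, which is a simplification of the lightweight health-check probe the checklist recommends.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Opens after N failures within T seconds; probes again after the reset timeout."""

    def __init__(self, failure_threshold: int, window_seconds: float,
                 reset_timeout: float, clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures: List[float] = []        # timestamps of recent failures
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn: Callable, fallback: Callable):
        if self.state == "OPEN":
            return fallback()  # do not touch the failing dependency
        try:
            result = fn()
        except Exception:
            self._record_failure()
            return fallback()
        # Any success (including a HALF_OPEN probe) closes the circuit.
        self.failures.clear()
        self.opened_at = None
        return result

    def _record_failure(self) -> None:
        now = self.clock()
        if self.state == "HALF_OPEN":
            self.opened_at = now  # failed probe: re-open and restart the timeout
            return
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t <= self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now
```

A caller would wrap each dependency call as `breaker.call(fetch_live, serve_cached)`, where both function names are placeholders for the real request and the chosen fallback. A production version would also need locking and the metrics and alerts listed in the checklist.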

Bulkhead Isolation

Partition resources so failure in one area cannot exhaust resources for another.

Patterns:

Thread pool isolation: Separate thread pools per dependency
Connection pool isolation: Dedicated connection pools per downstream service
Process isolation: Critical and non-critical workloads in separate processes
Infrastructure isolation: Separate clusters for critical vs batch workloads

Checklist:

Critical path dependencies have dedicated resource pools
Non-critical background work cannot starve critical request handling
Resource limits are set per pool (max connections, max threads)
Pool exhaustion triggers alerts, not silent queuing
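
One way to sketch the "dedicated resource pools" idea is a semaphore per dependency that fails fast when the pool is full. The `Bulkhead` class, the `BulkheadFull` exception, and the pool names and sizes below are hypothetical, chosen only to show the pattern.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a pool has no free slot, so callers can alert instead of queuing."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: exhaustion is surfaced immediately, not queued silently.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} pool exhausted")
        try:
            yield
        finally:
            self._slots.release()


# Separate pools: a saturated reporting pool cannot consume payment capacity.
payments = Bulkhead("payments", max_concurrent=2)
reports = Bulkhead("reports", max_concurrent=1)

with reports.slot():
    with payments.slot():
        print("payment call proceeds while reports is busy")
    try:
        with reports.slot():
            pass
    except BulkheadFull as exc:
        print(exc)  # reports pool exhausted
```

The non-blocking acquire is the design choice that matches the last checklist item: a full pool raises an error the caller can count and alert on.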

Load Shedding

Intentionally drop low-priority work to preserve capacity for high-priority traffic.

Priority tiers:

Priority	Traffic Type	Shed When
Critical	Health checks, authentication	Never
High	Core user requests	> 95% capacity
Medium	Secondary features, analytics	> 80% capacity
Low	Background jobs, prefetch	> 70% capacity

Implementation:

Use request priority headers or path-based classification
Return 503 with Retry-After header for shed requests
Monitor shed rate as a metric (shedding > 0 is an alert)
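
The priority tiers can be expressed as a small lookup, sketched below. The thresholds come from the table above; the function names, the lowercase tier keys, and the 30-second `Retry-After` default are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# Utilization above which each tier is shed; None means never shed.
SHED_THRESHOLDS: Dict[str, Optional[float]] = {
    "critical": None,
    "high": 0.95,
    "medium": 0.80,
    "low": 0.70,
}


def should_shed(priority: str, utilization: float) -> bool:
    """utilization is a fraction of capacity, e.g. 0.85 for 85%."""
    threshold = SHED_THRESHOLDS[priority]
    return threshold is not None and utilization > threshold


def handle(priority: str, utilization: float,
           retry_after_seconds: int = 30) -> Tuple[int, Dict[str, str]]:
    """Return (status code, headers); shed requests get 503 plus Retry-After."""
    if should_shed(priority, utilization):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}


print(handle("low", 0.75))     # (503, {'Retry-After': '30'})
print(handle("medium", 0.75))  # (200, {})
```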

Graceful Degradation Strategies

Strategy	Description	Example
Feature flags	Disable non-critical features	Turn off recommendations during high load
Cached fallback	Serve stale data	Show cached search results when search service is down
Read-only mode	Disable writes	Allow browsing but not purchasing during payment outage
Static fallback	Serve pre-generated content	Show static landing page when CMS is down
Queue and retry	Accept but defer processing	Accept orders, process when backend recovers

Communication Protocols

Status Page Updates

Template for status page entry:

[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]

Impact: [Brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]

Update cadence:

P0: Every 15 minutes until resolved
P1: Every 30 minutes until resolved
P2: At start and resolution
P3: At resolution only
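
A small helper can fill in the template and compute the "Next update" line from the cadence rules. This is a sketch under assumptions: the function names are hypothetical, timestamps are taken to be UTC, and P2 and P3 are rendered as "On resolution" because they have no interval-based cadence.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Minutes between updates while unresolved; None means no interval-based cadence.
CADENCE_MINUTES: Dict[str, Optional[int]] = {"P0": 15, "P1": 30, "P2": None, "P3": None}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    minutes = CADENCE_MINUTES[severity]
    return None if minutes is None else last_update + timedelta(minutes=minutes)


def render_status(ts: datetime, status: str, impact: str,
                  current: str, severity: str) -> str:
    due = next_update_due(severity, ts)
    next_line = due.strftime("%H:%M UTC") if due else "On resolution"
    return (
        f"{ts.strftime('%Y-%m-%d %H:%M UTC')} - {status}\n\n"
        f"Impact: {impact}\n"
        f"Current status: {current}\n"
        f"Next update: {next_line}"
    )


# Illustrative values only; the date is made up.
posted = datetime(2026, 1, 1, 9, 15, tzinfo=timezone.utc)
print(render_status(posted, "Investigating",
                    "Elevated checkout errors for some Safari users",
                    "Isolating the cause", "P1"))
```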

Stakeholder Notification Template

Subject: [P0/P1] [Service] - [Brief impact description]

Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [Name]
War room: [Link/channel]

Internal Communication Rules

One source of truth: All updates go through the incident channel, not DMs
Facts, not speculation: Share what you know, flag what you suspect
Timestamp everything: Every action and observation gets a timestamp
No blame: Focus on what happened, not who caused it
Clear handoffs: When rotating, explicitly hand off context

Escalation Paths

Escalation Triggers

Condition	Action
P0 not acknowledged in 5 min	Page backup on-call
P0/P1 not mitigated in 30 min	Escalate to engineering manager
P0 not resolved in 1 hour	Escalate to VP/Director
Any severity affecting revenue	Notify finance and business stakeholders
Security incident confirmed	Notify security team and legal
Data breach suspected	Invoke data breach response plan
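
The time-based and revenue rows of the trigger table can be checked mechanically, as in the sketch below. The function and parameter names are hypothetical, and the security and data-breach rows are left out because they depend on human confirmation rather than elapsed time.

```python
from typing import List


def escalation_actions(severity: str, minutes_elapsed: int, acknowledged: bool,
                       mitigated: bool, resolved: bool,
                       affects_revenue: bool = False) -> List[str]:
    """Return the escalations that are due, in the order the table lists them."""
    actions = []
    if severity == "P0" and not acknowledged and minutes_elapsed >= 5:
        actions.append("Page backup on-call")
    if severity in ("P0", "P1") and not mitigated and minutes_elapsed >= 30:
        actions.append("Escalate to engineering manager")
    if severity == "P0" and not resolved and minutes_elapsed >= 60:
        actions.append("Escalate to VP/Director")
    if affects_revenue:
        actions.append("Notify finance and business stakeholders")
    return actions


# A P1 that touches revenue and is still unmitigated after 35 minutes.
print(escalation_actions("P1", 35, acknowledged=True, mitigated=False,
                         resolved=False, affects_revenue=True))
```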

Escalation Checklist

Primary on-call paged and acknowledged
If no acknowledgment in 5 min, secondary on-call paged
Incident commander assigned
Relevant team leads notified
Status page updated
Customer support briefed with talking points
Executive stakeholders notified (P0/P1 only)

On-Call Responsibilities

During incident:

Acknowledge page within 5 minutes
Assess severity and open incident channel
Begin investigation and document findings in real time
Coordinate with other teams as needed
Provide status updates at the required cadence

After incident:

Ensure monitoring confirms resolution
Draft incident timeline
Schedule postmortem if required
Update runbooks with any new learnings
Hand off to next on-call if shift ends during incident

Postmortem Patterns

Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.

Blameless Postmortem Structure

Core Principles

No blame: Focus on systems, processes, and conditions -- not individuals
Assume good intent: Everyone involved was doing their best with the information available
Learn, don't punish: The goal is prevention, not accountability
Share widely: Postmortems are organizational learning, not team shame

Postmortem Document Template

# Incident Postmortem: [Title]

**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** [Draft | Review | Final]

## Summary

[1-2 sentence description of what happened and the user impact]

## Impact

- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |

## Root Cause

[Detailed description of the root cause. What condition or change led to the failure?]

## Contributing Factors

- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]

## What Went Well

- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]

## What Went Poorly

- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | P1 | [Date] | Open |
| 2 | [Action description] | [Name] | P2 | [Date] | Open |

## Lessons Learned

[Key takeaways that should inform future design, process, or tooling decisions]

Postmortem Meeting Facilitation

Before the meeting:

Draft the postmortem document and share 24 hours in advance
All participants review the timeline for accuracy
Incident commander prepares the root cause analysis

During the meeting (60-90 min):

Timeline review (15 min): Walk through events, correct errors, fill gaps
Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
Contributing factors (15 min): What made the incident worse or harder to resolve?
What went well (10 min): Reinforce effective practices
Action items (20 min): Define concrete, assignable, time-bounded actions

After the meeting:

Finalize the document within 24 hours
Distribute to the broader organization
Enter action items into the tracking system
Schedule follow-up review for action item completion

Root Cause Analysis Techniques

5 Whys

Repeatedly ask "why" to drill past symptoms to the underlying cause.

Example:

Problem: Users received duplicate order confirmation emails.

Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.

Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.

Guidelines:

Stop when you reach a systemic cause you can fix (process, tooling, design)
Do not stop at "human error" -- ask why the system allowed the error
Some incidents have multiple root causes; run 5 Whys for each branch
Answers should be factual, not speculative

Fishbone Diagram (Ishikawa)

Categorize contributing factors across standard dimensions.

                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike

Standard categories:

People: Knowledge gaps, staffing, communication
Process: Missing runbooks, unclear procedures, approval bottlenecks
Technology: Bugs, missing features, architectural gaps
Environment: Infrastructure, capacity, configuration
Monitoring: Missing alerts, incorrect thresholds, observability gaps
External: Third-party outages, traffic spikes, attacks

Fault Tree Analysis

Work backward from the failure to identify all possible causes.

Top event: Service outage
├── AND: Load balancer failure
│   ├── OR: Config error
│   └── OR: Health check misconfigured
└── AND: No failover triggered
    ├── OR: Failover not configured
    └── OR: Failover health check also failed

When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.
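
A fault tree like the one above can be evaluated in code to test which combinations of leaf events produce the top event. The sketch below assumes one reading of the diagram: both top-level branches must occur for an outage, and each branch occurs if any of its leaves does. The `Node` class and `occurs` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: List["Node"] = field(default_factory=list)


def occurs(node: Node, observed: Set[str]) -> bool:
    """True if this node's event happens, given the set of observed leaf events."""
    if node.gate == "LEAF":
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


outage = Node("Service outage", "AND", [
    Node("Load balancer failure", "OR", [
        Node("Config error"),
        Node("Health check misconfigured"),
    ]),
    Node("No failover triggered", "OR", [
        Node("Failover not configured"),
        Node("Failover health check also failed"),
    ]),
])

print(occurs(outage, {"Config error"}))                             # False
print(occurs(outage, {"Config error", "Failover not configured"}))  # True
```

Under this reading, a load balancer config error alone does not cause an outage; it takes a failover gap as well, which is the kind of interaction 5 Whys tends to miss.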

Action Item Tracking

Action Item Quality Criteria

Every action item must be:

Specific: Clear description of what to do (not "improve monitoring")
Assignable: One owner, not a team
Time-bounded: Due date, not "when we get to it"
Verifiable: Clear definition of done
Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)
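
These criteria are simple enough to lint automatically before items go into the tracker. The sketch below is one possible check: the `ActionItem` fields, the vague-phrase list, and the sample owner name are all hypothetical, and a non-empty owner string cannot tell a person from a team, so that part still needs a human look.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Descriptions too vague to act on (an illustrative list, not exhaustive).
VAGUE = {"improve monitoring", "be more careful", "add more tests"}
PRIORITIES = {"P1", "P2", "P3"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done_when: Optional[str] = None  # definition of done
    priority: Optional[str] = None


def quality_problems(item: ActionItem) -> List[str]:
    """Return the criteria this item fails; an empty list means it passes."""
    problems = []
    if item.description.strip().lower() in VAGUE:
        problems.append("not specific")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.done_when:
        problems.append("no definition of done")
    if item.priority not in PRIORITIES:
        problems.append("no priority")
    return problems


print(quality_problems(ActionItem("Improve monitoring")))
```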

Action Item Categories

Category	Description	Examples
Detection	Improve ability to notice the problem	Add alert, improve dashboard
Prevention	Stop the problem from occurring	Fix bug, add validation, improve architecture
Mitigation	Reduce impact when it happens	Add circuit breaker, improve rollback, write runbook
Process	Improve team response	Update on-call procedures, conduct training

Tracking and Follow-Up

Enter all action items into the team's issue tracker immediately
Tag with postmortem and incident ID for traceability
Review open postmortem action items weekly in team standup
Escalate overdue P1 items to engineering manager
Close action items only when verified complete (not just "code merged")

Action Item Anti-Patterns

Anti-Pattern	Problem	Better Alternative
"Be more careful"	Not actionable	Automate the check
"Improve monitoring"	Too vague	"Add alert for X metric when > Y for Z minutes"
"No owner assigned"	Will not get done	Assign a specific person
"Due: TBD"	Will be deprioritized	Set a concrete date
"Add more tests"	Unbounded	"Add regression test for this specific failure mode"

Chaos Engineering Patterns

Fault Injection

Intentionally introduce failures to verify resilience.

Common fault types:

Fault	Tool/Method	Validates
Kill service instance	Process kill, pod delete	Auto-restart, health checks
Network latency	tc netem, Toxiproxy	Timeout handling, circuit breakers
Network partition	iptables, DNS override	Failover, split-brain handling
Disk full	fallocate, dd	Graceful degradation, alerting
CPU exhaustion	stress-ng	Autoscaling, load shedding
Dependency failure	Mock returning 500s	Fallback paths, error handling
Clock skew	chrony offset	Time-dependent logic

Fault Injection Checklist

Hypothesis defined: "We believe [X] will happen when [fault]"
Blast radius limited (single instance, canary, staging)
Rollback mechanism ready (kill switch for the experiment)
Monitoring in place to observe the effect
Team is aware the experiment is running
Abort criteria defined (stop if real user impact exceeds N%)

Game Days

Structured exercises where teams practice incident response against simulated failures.

Game Day Planning Template:

## Game Day: [Title]

**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [Name]
**Participants:** [Team members]

### Scenario
[Description of the simulated incident]

### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed

### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time

### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?

Game day cadence:

Quarterly for critical services
After major architecture changes
When onboarding new on-call engineers
After any P0 incident (test the fixes)

Resilience Testing Methodology

Resilience Maturity Levels

Level	Description	Activities
1 - Reactive	Fix failures after they happen	Postmortems, basic monitoring
2 - Aware	Know where failures could happen	Failure mode analysis, risk registry
3 - Proactive	Test for failures before they happen	Chaos experiments in staging
4 - Continuous	Regularly validate resilience in production	Automated chaos, game days
5 - Anti-fragile	Systems improve through failure	Feedback loops, auto-remediation

Resilience Testing Checklist

For each critical service, validate:

Single instance failure: Service recovers when one instance dies
Dependency timeout: Service handles slow dependencies gracefully
Dependency outage: Service degrades (not crashes) when dependency is down
Network partition: Service handles split-brain scenarios
Load spike: Service sheds load or scales under 3x normal traffic
Disk full: Service alerts and degrades before crashing
Configuration error: Service fails fast with clear error on bad config
Rollback: Previous version can be deployed within 5 minutes
Data corruption: Backup restore has been tested within the last quarter

Steady-State Hypothesis

Before running any chaos experiment, define what "normal" looks like:

Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing

Experiment: Kill 1 of 3 service instances

Hypothesis: Steady state metrics return to within 10% of baseline
            within 60 seconds of the fault injection.

Abort if: Error rate > 5% for more than 30 seconds.
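
The steady-state definition and abort rule above translate directly into checks an experiment runner could call. This is a sketch under assumptions: the `Metrics` fields and function names are hypothetical, rates are expressed as fractions, and the abort check assumes one error-rate sample per second.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    success_rate: float    # fraction, e.g. 0.9995
    p99_latency_ms: float
    error_rate: float      # fraction, e.g. 0.0005
    alerts_firing: int


def is_steady(m: Metrics) -> bool:
    """Thresholds taken from the steady-state definition above."""
    return (m.success_rate > 0.999 and m.p99_latency_ms < 200
            and m.error_rate < 0.001 and m.alerts_firing == 0)


def within_tolerance(baseline: float, observed: float, tolerance: float = 0.10) -> bool:
    """True if observed is within tolerance (as a fraction) of baseline."""
    return abs(observed - baseline) <= tolerance * baseline


def should_abort(error_rate_samples: List[float], limit: float = 0.05,
                 max_seconds: int = 30) -> bool:
    """True once the error rate has exceeded limit for more than max_seconds in a row."""
    run = 0
    for rate in error_rate_samples:
        run = run + 1 if rate > limit else 0
        if run > max_seconds:
            return True
    return False


print(is_steady(Metrics(0.9995, 150.0, 0.0005, 0)))  # True
print(should_abort([0.06] * 31))                     # True
```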


Customer-safe update

We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. We will share the next update by 09:45 UTC.

Internal stakeholder update

Impact: 18 reported customers plus two blocked enterprise trials. Suspected trigger: fraud-rule rollout. Current hypothesis is not confirmed. Payments Engineering is checking browser-specific authorization failures and preparing rollback if the rule is implicated.

Action tracker

Now: pause or roll back the fraud rule rollout. Owner: Payments Engineering.
Now: link all Zendesk tickets to PAY-1842. Owner: Support.
Next: monitor checkout success by browser and region. Owner: Analytics/Payments.
Next: prepare a saved support reply with workaround and next-update time. Owner: Support Lead.

Postmortem seed

Questions: why did the rollout reach Safari users without browser-specific checkout monitoring? Why did Sales hear from trials before the team had a status update? Follow-up actions should include segmented checkout alerts, rollout checklist update, and incident-linking instructions for Support.

Human review

Confirm severity, legal/compliance wording, and whether the workaround is safe to publish before sending any customer-facing update.

Metrics this improves

Ticket Cycle Time

Helps support and engineering move urgent incidents through ownership and next actions faster.

Operations

Issue Hygiene

Turns incident notes into clear bugs, follow-up actions, owners, and due dates.

Operations

Works with

Slack (manual): Use incident channels, updates, and responder notes as the timeline source.

Jira (manual): Track incident follow-up actions, bugs, and postmortem tasks.

Zendesk (manual): Use customer reports and support tickets to understand user impact.

Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

When to Use

Production incident in progress (outage, degradation, data loss)
Designing circuit breakers, bulkheads, or fallback strategies
Conducting or planning chaos engineering exercises
Writing or reviewing postmortem documents
Establishing on-call procedures and escalation paths

Avoid when:

The issue is a development-time bug with no production impact
Designing general system architecture (use system-design instead)

Quick Reference

Topic	Load reference
Triage Framework	`skills/incident-response/references/triage-framework.md`
Postmortem Patterns	`skills/incident-response/references/postmortem-patterns.md`

Incident Response Workflow

Phase 1: Detect

Alert fires or user report received
Confirm the issue is real (not a false positive)
Identify affected services and user impact scope

Phase 2: Triage

Classify severity (P0-P3)
Assign incident commander
Open communication channel (war room, Slack channel)
Begin status page updates

Phase 3: Contain

Stop the bleeding: rollback, feature flag, traffic shift
Prevent cascade: circuit breakers, load shedding, bulkhead isolation
Communicate: stakeholder updates every 15 minutes for P0, every 30 minutes for P1

Phase 4: Resolve

Implement fix (minimal viable fix first)
Validate in staging if time permits
Deploy with monitoring and rollback plan ready
Confirm recovery with metrics returning to baseline

Phase 5: Postmortem

Document timeline within 48 hours
Conduct blameless review with all participants
Identify root cause and contributing factors
Assign action items with owners and deadlines
Update runbooks and alerting based on lessons learned

Severity Framework

Level	Impact	Response Time	Examples
P0	Complete outage, data loss, security breach	Immediate (< 5 min)	Service down, data corruption, credential leak
P1	Major feature broken, significant user impact	< 30 min	Payment processing failing for one region, auth broken for a region
P2	Degraded performance, partial feature loss	< 4 hours	Elevated latency, non-critical feature unavailable
P3	Minor issue, workaround available	Next business day	UI glitch, slow report generation, cosmetic error

Output

Incident timeline and severity classification
Containment actions taken
Postmortem document with action items
Updated runbooks and alerting rules

Common Mistakes

Skipping severity classification and treating everything as P0
Making changes without a rollback plan
Forgetting to communicate status to stakeholders
Writing postmortems that assign blame instead of identifying systemic issues
Not following up on postmortem action items

Reference documents

Triage Framework

Severity classification, cascade prevention, communication protocols, and escalation paths for production incidents. Use during active incidents or when establishing incident response procedures.

Severity Classification (P0-P3)

P0 -- Critical

Definition: Complete service outage affecting all users, active data loss or corruption, or a confirmed security breach.

Attribute	Requirement
Response time	< 5 minutes
Incident commander	Required (senior engineer or SRE)
Communication cadence	Every 15 minutes to stakeholders
War room	Immediately opened
Escalation	VP/Director notified within 15 minutes
Postmortem	Required within 48 hours

Examples:

Production database unreachable
Authentication service completely down
Active data corruption or loss
Security breach with confirmed exfiltration
Payment processing halted

P1 -- High

Definition: Major feature broken or significant degradation affecting a large subset of users.

Attribute	Requirement
Response time	< 30 minutes
Incident commander	Required
Communication cadence	Every 30 minutes to stakeholders
War room	Opened if not resolved in 30 minutes
Escalation	Manager notified within 30 minutes
Postmortem	Required within 1 week

Examples:

Payment processing failing for one region
Search functionality returning errors for 20%+ of queries
API latency 10x above normal
Mobile app crash on launch for specific OS version

P2 -- Medium

Definition: Degraded performance or partial feature loss with workarounds available.

Attribute	Requirement
Response time	< 4 hours
Incident commander	Optional (on-call engineer handles)
Communication cadence	Status update at start and resolution
War room	Not required
Escalation	If unresolved after 8 hours
Postmortem	Recommended

Examples:

Elevated latency (2-3x normal) on non-critical endpoints
Background job processing delayed
Non-critical third-party integration down
Report generation slow but functional

P3 -- Low

Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.

Attribute	Requirement
Response time	Next business day
Incident commander	Not required
Communication cadence	Ticket update on resolution
War room	Not required
Escalation	Not required
Postmortem	Not required

Examples:

UI rendering glitch in edge case
Non-critical cron job failed (will retry)
Slow dashboard load for internal tool
Minor logging error that does not affect functionality

Severity Decision Tree

Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
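The tree above can be read as a chain of early returns. The following is a minimal sketch, assuming the answers to each question have already been gathered by a human; the `IncidentFacts` class and its field names are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Answers to the decision-tree questions (hypothetical field names)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken_for_many: bool = False
    performance_significantly_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Walk the questions in the same order as the tree and return P0-P3."""
    if facts.data_loss_or_corruption:
        return "P0"
    if facts.security_breach:
        return "P0"
    if facts.primary_service_down:
        return "P0"
    if facts.major_feature_broken_for_many:
        return "P1"
    if facts.performance_significantly_degraded:
        return "P2"
    return "P3"
```

Because the checks run in order, an incident that is both a data-loss event and a performance problem is classified by the most severe matching branch.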

Cascade Prevention

Circuit Breakers

Automatically stop calling a failing dependency to prevent cascading failure.

Implementation checklist:

Every external dependency has a circuit breaker
Failure thresholds are tuned per dependency (not one-size-fits-all)
OPEN state returns a meaningful fallback (cached data, degraded response, error)
HALF-OPEN probes are lightweight (health check, not full request)
Circuit breaker state is observable (metrics, dashboard)
Alerts fire when a circuit breaker opens

Configuration template:

Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
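To make the template fields concrete, here is a small illustrative circuit breaker. It is a sketch under stated assumptions, not a production implementation: the class name, the `clock` parameter, and the choice to use the real call as the HALF-OPEN probe (rather than a separate lightweight health check) are all assumptions made for brevity.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: N failures within T seconds opens the circuit."""

    def __init__(self, failure_threshold, window_seconds, reset_timeout,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # "[N] failures"
        self.window_seconds = window_seconds        # "in [T] seconds"
        self.reset_timeout = reset_timeout          # "Reset timeout"
        self._clock = clock                         # injectable for tests
        self._failures = []                         # timestamps of recent failures
        self._opened_at = None
        self.state = "CLOSED"

    def call(self, func, fallback):
        """Run func unless the circuit is open; use fallback on failure."""
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.reset_timeout:
                return fallback()      # fail fast, do not touch the dependency
            self.state = "HALF_OPEN"   # allow one probe through
        try:
            result = func()
        except Exception:
            self._record_failure(now)
            return fallback()
        self._failures.clear()
        self.state = "CLOSED"
        return result

    def _record_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)            # probe failed, reopen immediately
            return
        # Keep only failures inside the sliding window, then add this one.
        self._failures = [t for t in self._failures
                          if now - t <= self.window_seconds]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
        self._failures.clear()
```

A caller would wrap each dependency in its own instance, for example `breaker.call(fetch_prices, lambda: cached_prices)`, where both function names are hypothetical. Exposing `state` as a metric covers the observability item in the checklist.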

Bulkhead Isolation

Partition resources so failure in one area cannot exhaust resources for another.

Patterns:

Thread pool isolation: Separate thread pools per dependency
Connection pool isolation: Dedicated connection pools per downstream service
Process isolation: Critical and non-critical workloads in separate processes
Infrastructure isolation: Separate clusters for critical vs batch workloads

Checklist:

Critical path dependencies have dedicated resource pools
Non-critical background work cannot starve critical request handling
Resource limits are set per pool (max connections, max threads)
Pool exhaustion triggers alerts, not silent queuing
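One simple way to approximate a bulkhead inside a single process is a bounded semaphore per dependency. The sketch below is illustrative only; the `Bulkhead` class and its `rejected` counter are hypothetical names, and a real system would export the counter to its metrics pipeline.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a pool has no free slots, instead of queuing silently."""


class Bulkhead:
    """Caps concurrent calls for one dependency so it cannot starve others."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejected = 0  # count of rejected calls; alert when this grows

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: exhaustion is surfaced as an error right away.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise BulkheadFullError(f"{self.name} pool exhausted")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving the critical path and background work separate `Bulkhead` instances means a flood of background calls exhausts only its own slots.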

Load Shedding

Intentionally drop low-priority work to preserve capacity for high-priority traffic.

Priority tiers:

Priority	Traffic Type	Shed When
Critical	Health checks, authentication	Never
High	Core user requests	> 95% capacity
Medium	Secondary features, analytics	> 80% capacity
Low	Background jobs, prefetch	> 70% capacity

Implementation:

Use request priority headers or path-based classification
Return 503 with Retry-After header for shed requests
Monitor shed rate as a metric (shedding > 0 is an alert)
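The priority tiers translate directly into a lookup table. This is a minimal sketch, assuming utilization is available as a fraction between 0 and 1 and that each request already carries a priority label; the function names and the default `Retry-After` value are assumptions.

```python
# Utilization above which each tier is shed; None means never shed.
SHED_THRESHOLDS = {
    "critical": None,
    "high": 0.95,
    "medium": 0.80,
    "low": 0.70,
}


def should_shed(priority: str, utilization: float) -> bool:
    """Return True when a request of this priority should be dropped."""
    threshold = SHED_THRESHOLDS[priority]
    return threshold is not None and utilization > threshold


def handle(priority: str, utilization: float, retry_after_seconds: int = 30):
    """Return (status_code, headers) for a request under current load."""
    if should_shed(priority, utilization):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}
```

At 75% utilization, `handle("low", 0.75)` returns a 503 with a `Retry-After` header, while high-priority requests are still served.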

Graceful Degradation Strategies

Strategy	Description	Example
Feature flags	Disable non-critical features	Turn off recommendations during high load
Cached fallback	Serve stale data	Show cached search results when search service is down
Read-only mode	Disable writes	Allow browsing but not purchasing during payment outage
Static fallback	Serve pre-generated content	Show static landing page when CMS is down
Queue and retry	Accept but defer processing	Accept orders, process when backend recovers

Communication Protocols

Status Page Updates

Template for status page entry:

[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]

Impact: [Brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]

Update cadence:

P0: Every 15 minutes until resolved
P1: Every 30 minutes until resolved
P2: At start and resolution
P3: At resolution only
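The cadence and the status page template can be combined so that every entry states its own next-update time. The helper names below are hypothetical, and the sketch assumes all timestamps are already in UTC.

```python
from datetime import datetime, timedelta, timezone

# Only P0 and P1 have a recurring cadence; P2 and P3 update at fixed points.
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=30),
}


def next_update_due(severity, last_update):
    """Return when the next update is due, or None if there is no fixed cadence."""
    interval = UPDATE_INTERVAL.get(severity)
    return None if interval is None else last_update + interval


def render_status_entry(timestamp, status, impact, current_status, next_update):
    """Fill in the status page template; next_update may be None."""
    next_line = next_update.strftime("%H:%M UTC") if next_update else "On resolution"
    return (
        f"{timestamp.strftime('%Y-%m-%d %H:%M UTC')} - {status}\n\n"
        f"Impact: {impact}\n"
        f"Current status: {current_status}\n"
        f"Next update: {next_line}"
    )
```

For a P1 update posted at 09:15 UTC, `next_update_due` yields 09:45 UTC, which can be passed straight into `render_status_entry`.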

Stakeholder Notification Template

Subject: [P0/P1] [Service] - [Brief impact description]

Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [Name]
War room: [Link/channel]

Internal Communication Rules

One source of truth: All updates go through the incident channel, not DMs
Facts, not speculation: Share what you know, flag what you suspect
Timestamp everything: Every action and observation gets a timestamp
No blame: Focus on what happened, not who caused it
Clear handoffs: When rotating, explicitly hand off context

Escalation Paths

Escalation Triggers

Condition	Action
P0 not acknowledged in 5 min	Page backup on-call
P0/P1 not mitigated in 30 min	Escalate to engineering manager
P0 not resolved in 1 hour	Escalate to VP/Director
Any severity affecting revenue	Notify finance and business stakeholders
Security incident confirmed	Notify security team and legal
Data breach suspected	Invoke data breach response plan
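The trigger table can be expressed as a pure function that returns the actions currently due. This is a sketch only: the parameter names are assumptions, and elapsed time is measured from incident start for every rule, which a real paging system may handle differently.

```python
def escalation_actions(
    severity,
    minutes_elapsed,
    *,
    acknowledged=True,
    mitigated=False,
    resolved=False,
    affects_revenue=False,
    security_incident_confirmed=False,
    data_breach_suspected=False,
):
    """Return the escalation actions from the trigger table that apply now."""
    actions = []
    if severity == "P0" and not acknowledged and minutes_elapsed >= 5:
        actions.append("Page backup on-call")
    if (severity in ("P0", "P1") and not (mitigated or resolved)
            and minutes_elapsed >= 30):
        actions.append("Escalate to engineering manager")
    if severity == "P0" and not resolved and minutes_elapsed >= 60:
        actions.append("Escalate to VP/Director")
    if affects_revenue:
        actions.append("Notify finance and business stakeholders")
    if security_incident_confirmed:
        actions.append("Notify security team and legal")
    if data_breach_suspected:
        actions.append("Invoke data breach response plan")
    return actions
```

Running this on a timer during an incident gives the incident commander a checklist of escalations that are due, without relying on memory under pressure.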

Escalation Checklist

Primary on-call paged and acknowledged
If no acknowledgment in 5 min, secondary on-call paged
Incident commander assigned
Relevant team leads notified
Status page updated
Customer support briefed with talking points
Executive stakeholders notified (P0/P1 only)

On-Call Responsibilities

During incident:

Acknowledge page within 5 minutes
Assess severity and open incident channel
Begin investigation and document findings in real time
Coordinate with other teams as needed
Provide status updates at the required cadence

After incident:

Ensure monitoring confirms resolution
Draft incident timeline
Schedule postmortem if required
Update runbooks with any new learnings
Hand off to next on-call if shift ends during incident

Postmortem Patterns

Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.

Blameless Postmortem Structure

Core Principles

No blame: Focus on systems, processes, and conditions -- not individuals
Assume good intent: Everyone involved was doing their best with the information available
Learn, don't punish: The goal is prevention, not punishment
Share widely: Postmortems are organizational learning, not team shame

Postmortem Document Template

# Incident Postmortem: [Title]

**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** [Draft | Review | Final]

## Summary

[1-2 sentence description of what happened and the user impact]

## Impact

- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |

## Root Cause

[Detailed description of the root cause. What condition or change led to the failure?]

## Contributing Factors

- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]

## What Went Well

- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]

## What Went Poorly

- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | P1 | [Date] | Open |
| 2 | [Action description] | [Name] | P2 | [Date] | Open |

## Lessons Learned

[Key takeaways that should inform future design, process, or tooling decisions]

Postmortem Meeting Facilitation

Before the meeting:

Draft the postmortem document and share 24 hours in advance
All participants review the timeline for accuracy
Incident commander prepares the root cause analysis

During the meeting (60-90 min):

Timeline review (15 min): Walk through events, correct errors, fill gaps
Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
Contributing factors (15 min): What made the incident worse or harder to resolve?
What went well (10 min): Reinforce effective practices
Action items (20 min): Define concrete, assignable, time-bounded actions

After the meeting:

Finalize the document within 24 hours
Distribute to the broader organization
Enter action items into the tracking system
Schedule follow-up review for action item completion

Root Cause Analysis Techniques

5 Whys

Repeatedly ask "why" to drill past symptoms to the underlying cause.

Example:

Problem: Users received duplicate order confirmation emails.

Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.

Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.

Guidelines:

Stop when you reach a systemic cause you can fix (process, tooling, design)
Do not stop at "human error" -- ask why the system allowed the error
Some incidents have multiple root causes; run 5 Whys for each branch
Answers should be factual, not speculative

Fishbone Diagram (Ishikawa)

Categorize contributing factors across standard dimensions.

                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike

Standard categories:

People: Knowledge gaps, staffing, communication
Process: Missing runbooks, unclear procedures, approval bottlenecks
Technology: Bugs, missing features, architectural gaps
Environment: Infrastructure, capacity, configuration
Monitoring: Missing alerts, incorrect thresholds, observability gaps
External: Third-party outages, traffic spikes, attacks

Fault Tree Analysis

Work backward from the failure to identify all possible causes.

Top event: Service outage (AND: all branches must occur)
├── Load balancer failure (OR: any cause is enough)
│   ├── Config error
│   └── Health check misconfigured
└── No failover triggered (OR: any cause is enough)
    ├── Failover not configured
    └── Failover health check also failed

When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.
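A fault tree can also be evaluated mechanically against the basic events confirmed during an investigation. The sketch below reads the example tree as follows: the outage requires both the load balancer failure and the missing failover, and each of those is triggered by either of its leaves. The tuple representation and the `evaluate` function are illustrative choices, not part of the skill.

```python
def evaluate(node, observed):
    """Return True if the event at this node occurs given the observed leaves.

    A node is either a leaf name (str) or a tuple (gate, children),
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return node in observed
    gate, children = node
    results = [evaluate(child, observed) for child in children]
    return all(results) if gate == "AND" else any(results)


# The example tree: outage = LB failure AND no failover.
SERVICE_OUTAGE = ("AND", [
    ("OR", ["Config error", "Health check misconfigured"]),
    ("OR", ["Failover not configured", "Failover health check also failed"]),
])
```

With only `"Config error"` observed the top event does not occur, because failover would still have worked; adding `"Failover not configured"` makes it occur.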

Action Item Tracking

Action Item Quality Criteria

Every action item must be:

Specific: Clear description of what to do (not "improve monitoring")
Assignable: One owner, not a team
Time-bounded: Due date, not "when we get to it"
Verifiable: Clear definition of done
Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)
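These criteria are easy to check automatically before action items are entered into a tracker. The following is a rough sketch; the `ActionItem` fields and the short list of vague phrases are assumptions for illustration, and a real check would need a richer notion of "specific".

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Phrases taken as too vague to act on (illustrative, not exhaustive).
VAGUE = {"improve monitoring", "be more careful", "add more tests"}
PRIORITIES = {"P1", "P2", "P3"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None      # one person, not a team
    due: Optional[date] = None
    done_when: Optional[str] = None  # definition of done
    priority: Optional[str] = None


def quality_problems(item: ActionItem) -> List[str]:
    """Return the quality criteria this action item fails, if any."""
    problems = []
    if item.description.strip().lower() in VAGUE:
        problems.append("not specific")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.done_when:
        problems.append("no definition of done")
    if item.priority not in PRIORITIES:
        problems.append("no priority")
    return problems
```

An empty list means the item passes all five checks; anything else can be sent back to the postmortem author before the meeting ends.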

Action Item Categories

Category	Description	Examples
Detection	Improve ability to notice the problem	Add alert, improve dashboard
Prevention	Stop the problem from occurring	Fix bug, add validation, improve architecture
Mitigation	Reduce impact when it happens	Add circuit breaker, improve rollback, write runbook
Process	Improve team response	Update on-call procedures, conduct training

Tracking and Follow-Up

Enter all action items into the team's issue tracker immediately
Tag with postmortem and incident ID for traceability
Review open postmortem action items weekly in team standup
Escalate overdue P1 items to engineering manager
Close action items only when verified complete (not just "code merged")

Action Item Anti-Patterns

Anti-Pattern	Problem	Better Alternative
"Be more careful"	Not actionable	Automate the check
"Improve monitoring"	Too vague	"Add alert for X metric when > Y for Z minutes"
"No owner assigned"	Will not get done	Assign a specific person
"Due: TBD"	Will be deprioritized	Set a concrete date
"Add more tests"	Unbounded	"Add regression test for this specific failure mode"

Chaos Engineering Patterns

Fault Injection

Intentionally introduce failures to verify resilience.

Common fault types:

Fault	Tool/Method	Validates
Kill service instance	Process kill, pod delete	Auto-restart, health checks
Network latency	tc netem, Toxiproxy	Timeout handling, circuit breakers
Network partition	iptables, DNS override	Failover, split-brain handling
Disk full	fallocate, dd	Graceful degradation, alerting
CPU exhaustion	stress-ng	Autoscaling, load shedding
Dependency failure	Mock returning 500s	Fallback paths, error handling
Clock skew	chrony offset	Time-dependent logic

Fault Injection Checklist

Hypothesis defined: "We believe [X] will happen when [fault]"
Blast radius limited (single instance, canary, staging)
Rollback mechanism ready (kill switch for the experiment)
Monitoring in place to observe the effect
Team is aware the experiment is running
Abort criteria defined (stop if real user impact exceeds N%)

Game Days

Structured exercises where teams practice incident response against simulated failures.

Game Day Planning Template:

## Game Day: [Title]

**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [Name]
**Participants:** [Team members]

### Scenario
[Description of the simulated incident]

### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed

### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time

### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?

Game day cadence:

Quarterly for critical services
After major architecture changes
When onboarding new on-call engineers
After any P0 incident (test the fixes)

Resilience Testing Methodology

Resilience Maturity Levels

Level	Description	Activities
1 - Reactive	Fix failures after they happen	Postmortems, basic monitoring
2 - Aware	Know where failures could happen	Failure mode analysis, risk registry
3 - Proactive	Test for failures before they happen	Chaos experiments in staging
4 - Continuous	Regularly validate resilience in production	Automated chaos, game days
5 - Anti-fragile	Systems improve through failure	Feedback loops, auto-remediation

Resilience Testing Checklist

For each critical service, validate:

Single instance failure: Service recovers when one instance dies
Dependency timeout: Service handles slow dependencies gracefully
Dependency outage: Service degrades (not crashes) when dependency is down
Network partition: Service handles split-brain scenarios
Load spike: Service sheds load or scales under 3x normal traffic
Disk full: Service alerts and degrades before crashing
Configuration error: Service fails fast with clear error on bad config
Rollback: Previous version can be deployed within 5 minutes
Data corruption: Backup restore has been tested within the last quarter

Steady-State Hypothesis

Before running any chaos experiment, define what "normal" looks like:

Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing

Experiment: Kill 1 of 3 service instances

Hypothesis: Steady-state metrics return to within 10% of baseline
            within 60 seconds of the fault injection.

Abort if: Error rate > 5% for more than 30 seconds.
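The abort rule and the 10% tolerance can both be checked in code while an experiment runs. This is a minimal sketch, assuming metrics arrive as periodic samples with an error rate expressed as a fraction; the `Sample` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds_since_fault: float
    error_rate: float  # fraction, so 0.05 means 5%


def should_abort(samples, max_error_rate=0.05, max_duration_s=30):
    """True if the error rate stayed above the limit for more than max_duration_s."""
    breach_start = None
    for sample in samples:
        if sample.error_rate > max_error_rate:
            if breach_start is None:
                breach_start = sample.seconds_since_fault
            if sample.seconds_since_fault - breach_start > max_duration_s:
                return True
        else:
            breach_start = None  # breach must be continuous
    return False


def within_baseline(value, baseline, tolerance=0.10):
    """True if value is within the given fractional tolerance of baseline."""
    return abs(value - baseline) <= tolerance * baseline
```

A brief dip back under 5% resets the 30-second timer, so only a sustained breach aborts the experiment. `within_baseline` can be applied to each steady-state metric 60 seconds after the fault to accept or reject the hypothesis.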
