Classify incidents, coordinate updates, and turn the response into follow-up work. — Claude Skill
A Claude Skill for Claude Code by NickCrew. Run /incident-response in Claude. Updated Jun 13, 2026 · vmain@1d565c1
Helps teams handle outages or service degradations by setting severity, owner, and status cadence, then producing containment actions, customer-safe updates, stakeholder updates, and blameless postmortem actions.
- Classifies severity so the team knows whether this is P0, P1, P2, or P3.
- Defines who owns the incident, how often to update people, and what channel is the source of truth.
- Turns noisy Slack, ticket, alert, and customer notes into customer-safe status updates.
- Separates immediate containment, customer communication, internal coordination, and post-incident follow-up.
- Creates postmortem actions with owners, due dates, and prevention checks after the incident is resolved.
Support, engineering, and leadership discuss an incident in scattered channels while customers receive late or inconsistent updates.
Run /incident-response to classify severity, assign an owner, set update cadence, draft status updates, and capture the postmortem trail.
What it does
Turn scattered incident facts into a clear status update for customers, support, and leadership.
Decide how serious the incident is, who owns coordination, and when the next update is due.
Convert the response timeline into root cause, contributing factors, and concrete action items.
How it works
Collect symptoms, user impact, affected services, timeline, current owner, and any customer reports.
Classify severity and choose the communication cadence.
Separate immediate containment from deeper root-cause work.
Draft customer, support, and leadership updates that say what is known, what is not known, and when the next update will arrive.
Record the incident timeline and action tracker while facts are changing.
After resolution, produce a blameless postmortem outline with specific follow-up actions.
Input options
What users are experiencing, how many are affected, and whether data, revenue, or security is involved.
Example
09:04 Support reports checkout failures from 18 customers. 09:07 Payments dashboard shows authorization errors up 32%. 09:10 Engineering suspects fraud-rule rollout. 09:14 Sales says two enterprise trials are blocked. Impact: Safari users in US/EU. No data loss. Workaround: another browser may work.
| Field | Decision |
|---|---|
| Severity | P1 - revenue path degraded for a meaningful user segment |
| Incident commander | Support Lead until Payments Engineering names technical lead |
| Source of truth | #inc-checkout-errors |
| Update cadence | Every 30 minutes until mitigation |
| Linked issue | PAY-1842 |
We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. Next update by 09:45 UTC.
| Time | Action | Owner | Status |
|---|---|---|---|
| Now | Pause or roll back fraud-rule rollout | Payments Eng | In progress |
| Now | Link all Zendesk tickets to PAY-1842 | Support | Needed |
| Next | Monitor success rate by browser and region | Analytics | Needed |
| Question | Follow-up candidate |
|---|---|
| Why did browser-specific failures escape rollout checks? | Add checkout alerts by browser and region |
| Why did Sales hear first from trials? | Add enterprise-trial notification path during P1 incidents |
Incident Response
Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.
When to Use
- Production incident in progress (outage, degradation, data loss)
- Designing circuit breakers, bulkheads, or fallback strategies
- Conducting or planning chaos engineering exercises
- Writing or reviewing postmortem documents
- Establishing on-call procedures and escalation paths
Avoid when:
- The issue is a development-time bug with no production impact
- Designing general system architecture (use system-design instead)
Quick Reference
| Topic | Load reference |
|---|---|
| Triage Framework | skills/incident-response/references/triage-framework.md |
| Postmortem Patterns | skills/incident-response/references/postmortem-patterns.md |
Incident Response Workflow
Phase 1: Detect
- Alert fires or user report received
- Confirm the issue is real (not a false positive)
- Identify affected services and user impact scope
Phase 2: Triage
- Classify severity (P0-P3)
- Assign incident commander
- Open communication channel (war room, Slack channel)
- Begin status page updates
Phase 3: Contain
- Stop the bleeding: rollback, feature flag, traffic shift
- Prevent cascade: circuit breakers, load shedding, bulkhead isolation
- Communicate: stakeholder updates every 15 minutes for P0, every 30 minutes for P1
Phase 4: Resolve
- Implement fix (minimal viable fix first)
- Validate in staging if time permits
- Deploy with monitoring and rollback plan ready
- Confirm recovery with metrics returning to baseline
Phase 5: Postmortem
- Document timeline within 48 hours
- Conduct blameless review with all participants
- Identify root cause and contributing factors
- Assign action items with owners and deadlines
- Update runbooks and alerting based on lessons learned
Severity Framework
| Level | Impact | Response Time | Examples |
|---|---|---|---|
| P0 | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| P1 | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for region |
| P2 | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| P3 | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |
Output
- Incident timeline and severity classification
- Containment actions taken
- Postmortem document with action items
- Updated runbooks and alerting rules
Common Mistakes
- Skipping severity classification and treating everything as P0
- Making changes without a rollback plan
- Forgetting to communicate status to stakeholders
- Writing postmortems that assign blame instead of identifying systemic issues
- Not following up on postmortem action items
Reference documents
name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
keywords:
- incident response
- outage
- postmortem
- triage
- incident
- response
Triage Framework
Severity classification, cascade prevention, communication protocols, and escalation paths for production incidents. Use during active incidents or when establishing incident response procedures.
Severity Classification (P0-P3)
P0 -- Critical
Definition: Complete service outage, active data loss, or security breach affecting all users.
| Attribute | Requirement |
|---|---|
| Response time | < 5 minutes |
| Incident commander | Required (senior engineer or SRE) |
| Communication cadence | Every 15 minutes to stakeholders |
| War room | Immediately opened |
| Escalation | VP/Director notified within 15 minutes |
| Postmortem | Required within 48 hours |
Examples:
- Production database unreachable
- Authentication service completely down
- Active data corruption or loss
- Security breach with confirmed exfiltration
- Payment processing halted
P1 -- High
Definition: Major feature broken or significant degradation affecting a large subset of users.
| Attribute | Requirement |
|---|---|
| Response time | < 30 minutes |
| Incident commander | Required |
| Communication cadence | Every 30 minutes to stakeholders |
| War room | Opened if not resolved in 30 minutes |
| Escalation | Manager notified within 30 minutes |
| Postmortem | Required within 1 week |
Examples:
- Payment processing failing for one region
- Search functionality returning errors for 20%+ of queries
- API latency 10x above normal
- Mobile app crash on launch for specific OS version
P2 -- Medium
Definition: Degraded performance or partial feature loss with workarounds available.
| Attribute | Requirement |
|---|---|
| Response time | < 4 hours |
| Incident commander | Optional (on-call engineer handles) |
| Communication cadence | Status update at start and resolution |
| War room | Not required |
| Escalation | If unresolved after 8 hours |
| Postmortem | Recommended |
Examples:
- Elevated latency (2-3x normal) on non-critical endpoints
- Background job processing delayed
- Non-critical third-party integration down
- Report generation slow but functional
P3 -- Low
Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.
| Attribute | Requirement |
|---|---|
| Response time | Next business day |
| Incident commander | Not required |
| Communication cadence | Ticket update on resolution |
| War room | Not required |
| Escalation | Not required |
| Postmortem | Not required |
Examples:
- UI rendering glitch in edge case
- Non-critical cron job failed (will retry)
- Slow dashboard load for internal tool
- Minor logging error that does not affect functionality
Severity Decision Tree
Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
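The decision tree above can be sketched as a short function that asks the questions in the same order and stops at the first "yes". This is a minimal illustration; the `IncidentFacts` class and its field names are assumptions made for the example, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Yes/no answers to the decision-tree questions (hypothetical field names)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken_for_many: bool = False
    performance_significantly_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Walk the questions top to bottom; the first 'yes' decides the level."""
    if facts.data_loss_or_corruption:
        return "P0"
    if facts.security_breach:
        return "P0"
    if facts.primary_service_down:
        return "P0"
    if facts.major_feature_broken_for_many:
        return "P1"
    if facts.performance_significantly_degraded:
        return "P2"
    return "P3"
```

Because the checks run in tree order, an incident that is both a security breach and a performance degradation is classified by the more severe branch.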
Cascade Prevention
Circuit Breakers
Automatically stop calling a failing dependency to prevent cascading failure.
Implementation checklist:
- Every external dependency has a circuit breaker
- Failure thresholds are tuned per dependency (not one-size-fits-all)
- OPEN state returns a meaningful fallback (cached data, degraded response, error)
- HALF-OPEN probes are lightweight (health check, not full request)
- Circuit breaker state is observable (metrics, dashboard)
- Alerts fire when a circuit breaker opens
Configuration template:
Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
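One way to read the configuration template is as the constructor arguments of a small breaker object. The sketch below is an assumption-laden illustration, not a production implementation: the class name, the sliding-window counting, and the injectable `clock` are all choices made for the example, and it is not thread-safe.

```python
import time
from collections import deque


class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF_OPEN breaker (illustrative, not thread-safe)."""

    def __init__(self, failure_threshold, window_seconds, reset_timeout,
                 fallback, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # [N] failures ...
        self.window_seconds = window_seconds        # ... in [T] seconds
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.fallback = fallback                    # callable returning the fallback value
        self.clock = clock                          # injectable so tests can control time
        self.state = "CLOSED"
        self._failures = deque()                    # timestamps of recent failures
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.reset_timeout:
                return self.fallback()              # fail fast, skip the dependency
            self.state = "HALF_OPEN"                # timeout elapsed: allow one probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure(now)
            return self.fallback()
        if self.state == "HALF_OPEN":               # probe succeeded: close again
            self.state = "CLOSED"
            self._failures.clear()
        return result

    def _record_failure(self, now):
        if self.state == "HALF_OPEN":               # probe failed: reopen immediately
            self._trip(now)
            return
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
        self._failures.clear()
```

For simplicity this sketch uses the next real request as the HALF-OPEN probe; the checklist above recommends a lightweight health check instead, which would replace `func` during the probe. Exposing `state` as a metric covers the observability item.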
Bulkhead Isolation
Partition resources so failure in one area cannot exhaust resources for another.
Patterns:
- Thread pool isolation: Separate thread pools per dependency
- Connection pool isolation: Dedicated connection pools per downstream service
- Process isolation: Critical and non-critical workloads in separate processes
- Infrastructure isolation: Separate clusters for critical vs batch workloads
Checklist:
- Critical path dependencies have dedicated resource pools
- Non-critical background work cannot starve critical request handling
- Resource limits are set per pool (max connections, max threads)
- Pool exhaustion triggers alerts, not silent queuing
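A bulkhead with a per-pool limit can be approximated with a bounded semaphore per dependency. The sketch below is one possible shape, assuming a hypothetical `Bulkhead` class; it rejects work when the pool is full rather than queuing silently, and counts rejections so an alert can be attached.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a pool has no free slot; callers should degrade or shed."""


class Bulkhead:
    """Caps concurrent calls for one dependency (hypothetical helper)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejected = 0  # export as a metric and alert when it increases

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: a full pool fails fast instead of queuing.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise BulkheadFull(f"{self.name} pool exhausted")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means a slow non-critical dependency can exhaust only its own slots.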
Load Shedding
Intentionally drop low-priority work to preserve capacity for high-priority traffic.
Priority tiers:
| Priority | Traffic Type | Shed When |
|---|---|---|
| Critical | Health checks, authentication | Never |
| High | Core user requests | > 95% capacity |
| Medium | Secondary features, analytics | > 80% capacity |
| Low | Background jobs, prefetch | > 70% capacity |
Implementation:
- Use request priority headers or path-based classification
- Return 503 with Retry-After header for shed requests
- Monitor shed rate as a metric and alert whenever it rises above zero
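The priority tiers translate directly into a threshold lookup. In the sketch below, the function names, the lowercase priority keys, and the default `Retry-After` value of 5 seconds are assumptions for illustration; utilization is expressed as a fraction between 0 and 1.

```python
# Shed a tier once utilization exceeds its threshold; None means never shed.
SHED_ABOVE = {
    "critical": None,  # health checks, authentication
    "high": 0.95,      # core user requests
    "medium": 0.80,    # secondary features, analytics
    "low": 0.70,       # background jobs, prefetch
}


def should_shed(priority: str, utilization: float) -> bool:
    """Return True when a request of this priority should be dropped."""
    threshold = SHED_ABOVE[priority]
    return threshold is not None and utilization > threshold


def admit(priority: str, utilization: float, retry_after_seconds: int = 5):
    """Return (status_code, headers) for an incoming request."""
    if should_shed(priority, utilization):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}
```

At 75% utilization, for example, low-priority work is shed while medium-priority work is still admitted.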
Graceful Degradation Strategies
| Strategy | Description | Example |
|---|---|---|
| Feature flags | Disable non-critical features | Turn off recommendations during high load |
| Cached fallback | Serve stale data | Show cached search results when search service is down |
| Read-only mode | Disable writes | Allow browsing but not purchasing during payment outage |
| Static fallback | Serve pre-generated content | Show static landing page when CMS is down |
| Queue and retry | Accept but defer processing | Accept orders, process when backend recovers |
Communication Protocols
Status Page Updates
Template for status page entry:
[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]
Impact: [Brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]
Update cadence:
- P0: Every 15 minutes until resolved
- P1: Every 30 minutes until resolved
- P2: At start and resolution
- P3: At resolution only
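The template and cadence can be combined into a small helper that computes when the next update is due and renders the entry. This is a sketch under assumptions: the function names are invented for the example, timestamps are assumed to be in UTC, and P2/P3 are modeled as having no timed update.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Timed cadence per severity; None means "no timed update, post at resolution".
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=30),
    "P2": None,
    "P3": None,
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return the deadline for the next status update, or None if untimed."""
    interval = UPDATE_INTERVAL[severity]  # raises KeyError for unknown severity
    return None if interval is None else last_update + interval


def status_entry(timestamp: datetime, status: str, impact: str,
                 current_status: str, next_update: Optional[datetime]) -> str:
    """Render one status page entry in the template's four-line layout."""
    next_text = next_update.strftime("%H:%M UTC") if next_update else "At resolution"
    return "\n".join([
        f"{timestamp.strftime('%Y-%m-%d %H:%M UTC')} - {status}",
        f"Impact: {impact}",
        f"Current status: {current_status}",
        f"Next update: {next_text}",
    ])


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(status_entry(now, "Investigating", "Elevated checkout errors",
                       "Isolating the cause", next_update_due("P1", now)))
```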
Stakeholder Notification Template
Subject: [P0/P1] [Service] - [Brief impact description]
Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [Name]
War room: [Link/channel]
Internal Communication Rules
- One source of truth: All updates go through the incident channel, not DMs
- Facts, not speculation: Share what you know, flag what you suspect
- Timestamp everything: Every action and observation gets a timestamp
- No blame: Focus on what happened, not who caused it
- Clear handoffs: When rotating, explicitly hand off context
Escalation Paths
Escalation Triggers
| Condition | Action |
|---|---|
| P0 not acknowledged in 5 min | Page backup on-call |
| P0/P1 not mitigated in 30 min | Escalate to engineering manager |
| P0 not resolved in 1 hour | Escalate to VP/Director |
| Any severity affecting revenue | Notify finance and business stakeholders |
| Security incident confirmed | Notify security team and legal |
| Data breach suspected | Invoke data breach response plan |
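The trigger table can be expressed as a rule check over the current incident state. The sketch below assumes a hypothetical `IncidentState` record and treats each time limit as "at least N minutes since the incident opened"; a real pager or incident tool would supply these fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    """Snapshot of an incident (hypothetical field names)."""
    severity: str
    minutes_open: int
    acknowledged: bool = False
    mitigated: bool = False
    resolved: bool = False
    affects_revenue: bool = False
    security_incident_confirmed: bool = False
    data_breach_suspected: bool = False


def escalation_actions(s: IncidentState) -> List[str]:
    """Return every escalation action whose condition currently holds."""
    actions = []
    if s.severity == "P0" and not s.acknowledged and s.minutes_open >= 5:
        actions.append("Page backup on-call")
    if s.severity in ("P0", "P1") and not s.mitigated and s.minutes_open >= 30:
        actions.append("Escalate to engineering manager")
    if s.severity == "P0" and not s.resolved and s.minutes_open >= 60:
        actions.append("Escalate to VP/Director")
    if s.affects_revenue:
        actions.append("Notify finance and business stakeholders")
    if s.security_incident_confirmed:
        actions.append("Notify security team and legal")
    if s.data_breach_suspected:
        actions.append("Invoke data breach response plan")
    return actions
```

Running this check on every status update, rather than once, keeps time-based escalations from being missed during a long incident.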
Escalation Checklist
- Primary on-call paged and acknowledged
- If no acknowledgment in 5 min, secondary on-call paged
- Incident commander assigned
- Relevant team leads notified
- Status page updated
- Customer support briefed with talking points
- Executive stakeholders notified (P0/P1 only)
On-Call Responsibilities
During incident:
- Acknowledge page within 5 minutes
- Assess severity and open incident channel
- Begin investigation and document findings in real time
- Coordinate with other teams as needed
- Provide status updates at the required cadence
After incident:
- Ensure monitoring confirms resolution
- Draft incident timeline
- Schedule postmortem if required
- Update runbooks with any new learnings
- Hand off to next on-call if shift ends during incident
Postmortem Patterns
Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.
Blameless Postmortem Structure
Core Principles
- No blame: Focus on systems, processes, and conditions -- not individuals
- Assume good intent: Everyone involved was doing their best with the information available
- Learn, don't punish: The goal is prevention, not accountability
- Share widely: Postmortems are organizational learning, not team shame
Postmortem Document Template
# Incident Postmortem: [Title]
**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** [Draft | Review | Final]
## Summary
[1-2 sentence description of what happened and the user impact]
## Impact
- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]
## Timeline
All times in [timezone].
| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |
## Root Cause
[Detailed description of the root cause. What condition or change led to the failure?]
## Contributing Factors
- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]
## What Went Well
- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]
## What Went Poorly
- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | P1 | [Date] | Open |
| 2 | [Action description] | [Name] | P2 | [Date] | Open |
## Lessons Learned
[Key takeaways that should inform future design, process, or tooling decisions]
Postmortem Meeting Facilitation
Before the meeting:
- Draft the postmortem document and share 24 hours in advance
- All participants review the timeline for accuracy
- Incident commander prepares the root cause analysis
During the meeting (60-90 min):
- Timeline review (15 min): Walk through events, correct errors, fill gaps
- Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
- Contributing factors (15 min): What made the incident worse or harder to resolve?
- What went well (10 min): Reinforce effective practices
- Action items (20 min): Define concrete, assignable, time-bounded actions
After the meeting:
- Finalize the document within 24 hours
- Distribute to the broader organization
- Enter action items into the tracking system
- Schedule follow-up review for action item completion
Root Cause Analysis Techniques
5 Whys
Repeatedly ask "why" to drill past symptoms to the underlying cause.
Example:
Problem: Users received duplicate order confirmation emails.
Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.
Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.
Guidelines:
- Stop when you reach a systemic cause you can fix (process, tooling, design)
- Do not stop at "human error" -- ask why the system allowed the error
- Some incidents have multiple root causes; run 5 Whys for each branch
- Answers should be factual, not speculative
Fishbone Diagram (Ishikawa)
Categorize contributing factors across standard dimensions.
                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike
Standard categories:
- People: Knowledge gaps, staffing, communication
- Process: Missing runbooks, unclear procedures, approval bottlenecks
- Technology: Bugs, missing features, architectural gaps
- Environment: Infrastructure, capacity, configuration
- Monitoring: Missing alerts, incorrect thresholds, observability gaps
- External: Third-party outages, traffic spikes, attacks
Fault Tree Analysis
Work backward from the failure to identify all possible causes.
Top event: Service outage
├── AND: Load balancer failure
│   ├── OR: Config error
│   └── OR: Health check misconfigured
└── AND: No failover triggered
    ├── OR: Failover not configured
    └── OR: Failover health check also failed
When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.
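A fault tree can be evaluated mechanically once its gates are written down. The sketch below encodes one reading of the example tree, which is an assumption: the outage requires both branches (AND), and each branch is triggered by any one of its listed causes (OR). The `Node` class and `occurs` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """A fault-tree node: a basic event (LEAF) or an AND/OR gate over children."""
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: List["Node"] = field(default_factory=list)


def occurs(node: Node, observed: Set[str]) -> bool:
    """Return True if this event follows from the set of observed basic events."""
    if node.gate == "LEAF":
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


# The example tree: outage = load balancer failure AND no failover triggered.
SERVICE_OUTAGE = Node("Service outage", "AND", [
    Node("Load balancer failure", "OR", [
        Node("Config error"),
        Node("Health check misconfigured"),
    ]),
    Node("No failover triggered", "OR", [
        Node("Failover not configured"),
        Node("Failover health check also failed"),
    ]),
])
```

Evaluating the tree against different sets of observed events shows which combinations are sufficient for the outage, which helps decide where a single fix breaks the most failure paths.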
Action Item Tracking
Action Item Quality Criteria
Every action item must be:
- Specific: Clear description of what to do (not "improve monitoring")
- Assignable: One owner, not a team
- Time-bounded: Due date, not "when we get to it"
- Verifiable: Clear definition of done
- Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)
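These criteria are easy to check automatically before a postmortem is finalized. The sketch below is a rough illustration: the `ActionItem` fields and the short `VAGUE_PHRASES` list are assumptions, and a string check cannot tell a person from a team, so the owner rule only catches a missing owner.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

VAGUE_PHRASES = ("improve monitoring",)  # illustrative; extend per team
VALID_PRIORITIES = {"P1", "P2", "P3"}


@dataclass
class ActionItem:
    """One postmortem action item (hypothetical field names)."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    definition_of_done: str = ""
    priority: str = ""


def quality_problems(item: ActionItem) -> List[str]:
    """Return the quality criteria this item fails, in checklist order."""
    problems = []
    text = item.description.strip().lower()
    if not text or text in VAGUE_PHRASES:
        problems.append("not specific")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.definition_of_done.strip():
        problems.append("no definition of done")
    if item.priority not in VALID_PRIORITIES:
        problems.append("no valid priority")
    return problems
```

An empty result means the item passes the mechanical checks; whether the description is genuinely specific still needs a human reviewer.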
Action Item Categories
| Category | Description | Examples |
|---|---|---|
| Detection | Improve ability to notice the problem | Add alert, improve dashboard |
| Prevention | Stop the problem from occurring | Fix bug, add validation, improve architecture |
| Mitigation | Reduce impact when it happens | Add circuit breaker, improve rollback, write runbook |
| Process | Improve team response | Update on-call procedures, conduct training |
Tracking and Follow-Up
- Enter all action items into the team's issue tracker immediately
- Tag with `postmortem` and incident ID for traceability
- Review open postmortem action items weekly in team standup
- Escalate overdue P1 items to engineering manager
- Close action items only when verified complete (not just "code merged")
Action Item Anti-Patterns
| Anti-Pattern | Problem | Better Alternative |
|---|---|---|
| "Be more careful" | Not actionable | Automate the check |
| "Improve monitoring" | Too vague | "Add alert for X metric when > Y for Z minutes" |
| "No owner assigned" | Will not get done | Assign a specific person |
| "Due: TBD" | Will be deprioritized | Set a concrete date |
| "Add more tests" | Unbounded | "Add regression test for this specific failure mode" |
Chaos Engineering Patterns
Fault Injection
Intentionally introduce failures to verify resilience.
Common fault types:
| Fault | Tool/Method | Validates |
|---|---|---|
| Kill service instance | Process kill, pod delete | Auto-restart, health checks |
| Network latency | tc netem, Toxiproxy | Timeout handling, circuit breakers |
| Network partition | iptables, DNS override | Failover, split-brain handling |
| Disk full | fallocate, dd | Graceful degradation, alerting |
| CPU exhaustion | stress-ng | Autoscaling, load shedding |
| Dependency failure | Mock returning 500s | Fallback paths, error handling |
| Clock skew | chrony offset | Time-dependent logic |
Fault Injection Checklist
- Hypothesis defined: "We believe [X] will happen when [fault]"
- Blast radius limited (single instance, canary, staging)
- Rollback mechanism ready (kill switch for the experiment)
- Monitoring in place to observe the effect
- Team is aware the experiment is running
- Abort criteria defined (stop if real user impact exceeds N%)
Game Days
Structured exercises where teams practice incident response against simulated failures.
Game Day Planning Template:
## Game Day: [Title]
**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [Name]
**Participants:** [Team members]
### Scenario
[Description of the simulated incident]
### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed
### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time
### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?
Game day cadence:
- Quarterly for critical services
- After major architecture changes
- When onboarding new on-call engineers
- After any P0 incident (test the fixes)
Resilience Testing Methodology
Resilience Maturity Levels
| Level | Description | Activities |
|---|---|---|
| 1 - Reactive | Fix failures after they happen | Postmortems, basic monitoring |
| 2 - Aware | Know where failures could happen | Failure mode analysis, risk registry |
| 3 - Proactive | Test for failures before they happen | Chaos experiments in staging |
| 4 - Continuous | Regularly validate resilience in production | Automated chaos, game days |
| 5 - Anti-fragile | Systems improve through failure | Feedback loops, auto-remediation |
Resilience Testing Checklist
For each critical service, validate:
- Single instance failure: Service recovers when one instance dies
- Dependency timeout: Service handles slow dependencies gracefully
- Dependency outage: Service degrades (not crashes) when dependency is down
- Network partition: Service handles split-brain scenarios
- Load spike: Service sheds load or scales under 3x normal traffic
- Disk full: Service alerts and degrades before crashing
- Configuration error: Service fails fast with clear error on bad config
- Rollback: Previous version can be deployed within 5 minutes
- Data corruption: Backup restore has been tested within the last quarter
Steady-State Hypothesis
Before running any chaos experiment, define what "normal" looks like:
Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing
Experiment: Kill 1 of 3 service instances
Hypothesis: Steady-state metrics return to within 10% of baseline
within 60 seconds of the fault injection.
Abort if: Error rate > 5% for more than 30 seconds.
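The steady-state definition and abort rule can be written as plain predicates that an experiment runner evaluates on each metrics sample. This is a sketch under assumptions: the `Metrics` fields, the function names, and the fixed sampling interval are invented for the example, and rates are fractions (0.001 means 0.1%).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    """One metrics sample (hypothetical shape)."""
    success_rate: float    # fraction of successful requests, 0..1
    p99_latency_ms: float
    error_rate: float      # fraction of failed requests, 0..1
    alerts_firing: int


def in_steady_state(m: Metrics) -> bool:
    """Check the four steady-state conditions listed above."""
    return (m.success_rate > 0.999 and m.p99_latency_ms < 200
            and m.error_rate < 0.001 and m.alerts_firing == 0)


def within_tolerance(baseline: float, observed: float, tolerance: float = 0.10) -> bool:
    """True if the observed value is within 10% (by default) of baseline."""
    return abs(observed - baseline) <= tolerance * abs(baseline)


def should_abort(error_rates: Sequence[float], sample_interval_s: float,
                 threshold: float = 0.05, max_seconds: float = 30) -> bool:
    """Abort once error rate stays above 5% for more than 30 seconds.

    Each consecutive sample above the threshold counts for one interval.
    """
    breach_seconds = 0.0
    for rate in error_rates:
        breach_seconds = breach_seconds + sample_interval_s if rate > threshold else 0.0
        if breach_seconds > max_seconds:
            return True
    return False
```

With 10-second samples, four consecutive readings above 5% trigger an abort, while a single healthy reading in between resets the count.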