ElasticFlow
HubAll SkillsBy DepartmentBy RoleBy ToolBy MetricMCPsPublishers
WebsiteLoginSign Up
ElasticFlow

Transform your business with AI-powered workflow automation. One unified platform for all your enterprise needs.

Follow us

Platform

  • Features
  • Benefits
  • Use Cases
  • Workflow Library

Use Cases

  • Sales
  • Marketing
  • Finance & Legal
  • HR

Catalogue

  • Departments
  • Roles
  • Tools
  • Metrics
  • Platforms

Growth

  • Referral Program
  • Partners

Legal

  • Privacy Policy
  • Terms of Service
  • Cookie Policy
  • Acceptable Use
  • Security
  • SLA

© 2026 ElasticFlow. All rights reserved.

Incident Response

Classify incidents, coordinate updates, and turn the response into follow-up work.

A Claude Skill for Claude Code by NickCrew. Run /incident-response in Claude. Updated Jun 13, 2026. Version: main@1d565c1.

Compatible with: ChatGPT, Claude, Claude Code, Codex / Codex CLI, Cursor, Gemini

Helps teams handle outages or service degradations with severity, owner, status cadence, containment actions, customer-safe updates, stakeholder updates, and blameless postmortem actions.

  • Classifies severity so the team knows whether this is P0, P1, P2, or P3.
  • Defines who owns the incident, how often to update people, and what channel is the source of truth.
  • Turns noisy Slack, ticket, alert, and customer notes into customer-safe status updates.
  • Separates immediate containment, customer communication, internal coordination, and post-incident follow-up.
  • Creates postmortem actions with owners, due dates, and prevention checks after the incident is resolved.
Today

Support, engineering, and leadership discuss an incident in scattered channels while customers receive late or inconsistent updates.

With /incident-response

Run /incident-response to classify severity, assign an owner, set update cadence, draft status updates, and capture the postmortem trail.

  1. Paste incident facts and timeline
  2. Classify severity and impact
  3. Draft updates and actions
  4. Turn the timeline into postmortem follow-up

Who this is for

Support Lead

Turn incident facts into clear severity, customer updates, escalation paths, and follow-up actions.

Project Manager

Coordinate owners, timeline, next updates, and post-incident action items across teams.


What it does

Active incident update

Turn scattered incident facts into a clear status update for customers, support, and leadership.

Severity and ownership

Decide how serious the incident is, who owns coordination, and when the next update is due.

Postmortem preparation

Convert the response timeline into root cause, contributing factors, and concrete action items.

How it works

  1. Collect symptoms, user impact, affected services, timeline, current owner, and any customer reports.
  2. Classify severity and choose the communication cadence.
  3. Separate immediate containment from deeper root-cause work.
  4. Draft customer, support, and leadership updates that say what is known, what is not known, and when the next update will arrive.
  5. Record the incident timeline and action tracker while facts are changing.
  6. After resolution, produce a blameless postmortem outline with specific follow-up actions.

Input options

Symptoms and impact

What users are experiencing, how many are affected, and whether data, revenue, or security is involved.

Example

Incident timeline
09:04 Support reports checkout failures from 18 customers.
09:07 Payments dashboard shows authorization errors up 32%.
09:10 Engineering suspects fraud-rule rollout.
09:14 Sales says two enterprise trials are blocked.
Impact: Safari users in US/EU. No data loss. Workaround: another browser may work.
Incident response artifact
Severity and command
| Field | Decision |
|---|---|
| Severity | P1 - revenue path degraded for a meaningful user segment |
| Incident commander | Support Lead until Payments Engineering names technical lead |
| Source of truth | #inc-checkout-errors |
| Update cadence | Every 30 minutes until mitigation |
| Linked issue | PAY-1842 |
Customer update
We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. Next update by 09:45 UTC.
Action tracker
| Time | Action | Owner | Status |
|---|---|---|---|
| Now | Pause or roll back fraud-rule rollout | Payments Eng | In progress |
| Now | Link all Zendesk tickets to PAY-1842 | Support | Needed |
| Next | Monitor success rate by browser and region | Analytics | Needed |
Postmortem seed
| Question | Follow-up candidate |
|---|---|
| Why did browser-specific failures escape rollout checks? | Add checkout alerts by browser and region |
| Why did Sales hear first from trials? | Add enterprise-trial notification path during P1 incidents |

Metrics this improves

Ticket Cycle Time
Helps support and engineering move urgent incidents through ownership and next actions faster.
Operations
Issue Hygiene
Turns incident notes into clear bugs, follow-up actions, owners, and due dates.
Operations

Works with

Slack
manual

Use incident channels, updates, and responder notes as the timeline source.

Jira
manual

Track incident follow-up actions, bugs, and postmortem tasks.

Zendesk
manual

Use customer reports and support tickets to understand user impact.

Want to use Incident Response?

Choose how to get started.

Run in Claude Code
Free. Open source.

Install and run this skill locally on your computer.

1
Install Claude Code

Open a terminal on your computer and paste this command:

2
Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

3
Run it

Start Claude Code, then type the command:


Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

When to Use

  • Production incident in progress (outage, degradation, data loss)
  • Designing circuit breakers, bulkheads, or fallback strategies
  • Conducting or planning chaos engineering exercises
  • Writing or reviewing postmortem documents
  • Establishing on-call procedures and escalation paths

Avoid when:

  • The issue is a development-time bug with no production impact
  • Designing general system architecture (use system-design instead)

Quick Reference

| Topic | Load reference |
|---|---|
| Triage Framework | skills/incident-response/references/triage-framework.md |
| Postmortem Patterns | skills/incident-response/references/postmortem-patterns.md |

Incident Response Workflow

Phase 1: Detect

  • Alert fires or user report received
  • Confirm the issue is real (not a false positive)
  • Identify affected services and user impact scope

Phase 2: Triage

  • Classify severity (P0-P3)
  • Assign incident commander
  • Open communication channel (war room, Slack channel)
  • Begin status page updates

Phase 3: Contain

  • Stop the bleeding: rollback, feature flag, traffic shift
  • Prevent cascade: circuit breakers, load shedding, bulkhead isolation
  • Communicate: stakeholder updates every 15 minutes for P0 and every 30 minutes for P1

Phase 4: Resolve

  • Implement fix (minimal viable fix first)
  • Validate in staging if time permits
  • Deploy with monitoring and rollback plan ready
  • Confirm recovery with metrics returning to baseline

Phase 5: Postmortem

  • Document timeline within 48 hours
  • Conduct blameless review with all participants
  • Identify root cause and contributing factors
  • Assign action items with owners and deadlines
  • Update runbooks and alerting based on lessons learned

Severity Framework

| Level | Impact | Response Time | Examples |
|---|---|---|---|
| P0 | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| P1 | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for region |
| P2 | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| P3 | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |

Output

  • Incident timeline and severity classification
  • Containment actions taken
  • Postmortem document with action items
  • Updated runbooks and alerting rules

Common Mistakes

  • Skipping severity classification and treating everything as P0
  • Making changes without a rollback plan
  • Forgetting to communicate status to stakeholders
  • Writing postmortems that assign blame instead of identifying systemic issues
  • Not following up on postmortem action items

Reference documents


name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
keywords:

  • incident response
  • outage
  • postmortem
  • triage
  • incident
  • response

Triage Framework

Severity classification, cascade prevention, communication protocols, and escalation paths for production incidents. Use during active incidents or when establishing incident response procedures.

Severity Classification (P0-P3)

P0 -- Critical

Definition: Complete service outage, active data loss, or security breach affecting all users.

| Attribute | Requirement |
|---|---|
| Response time | < 5 minutes |
| Incident commander | Required (senior engineer or SRE) |
| Communication cadence | Every 15 minutes to stakeholders |
| War room | Immediately opened |
| Escalation | VP/Director notified within 15 minutes |
| Postmortem | Required within 48 hours |

Examples:

  • Production database unreachable
  • Authentication service completely down
  • Active data corruption or loss
  • Security breach with confirmed exfiltration
  • Payment processing halted

P1 -- High

Definition: Major feature broken or significant degradation affecting a large subset of users.

| Attribute | Requirement |
|---|---|
| Response time | < 30 minutes |
| Incident commander | Required |
| Communication cadence | Every 30 minutes to stakeholders |
| War room | Opened if not resolved in 30 minutes |
| Escalation | Manager notified within 30 minutes |
| Postmortem | Required within 1 week |

Examples:

  • Payment processing failing for one region
  • Search functionality returning errors for 20%+ of queries
  • API latency 10x above normal
  • Mobile app crash on launch for specific OS version

P2 -- Medium

Definition: Degraded performance or partial feature loss with workarounds available.

| Attribute | Requirement |
|---|---|
| Response time | < 4 hours |
| Incident commander | Optional (on-call engineer handles) |
| Communication cadence | Status update at start and resolution |
| War room | Not required |
| Escalation | If unresolved after 8 hours |
| Postmortem | Recommended |

Examples:

  • Elevated latency (2-3x normal) on non-critical endpoints
  • Background job processing delayed
  • Non-critical third-party integration down
  • Report generation slow but functional

P3 -- Low

Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.

| Attribute | Requirement |
|---|---|
| Response time | Next business day |
| Incident commander | Not required |
| Communication cadence | Ticket update on resolution |
| War room | Not required |
| Escalation | Not required |
| Postmortem | Not required |

Examples:

  • UI rendering glitch in edge case
  • Non-critical cron job failed (will retry)
  • Slow dashboard load for internal tool
  • Minor logging error that does not affect functionality

Severity Decision Tree

Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
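
The decision tree above can be sketched as a short function. This is a minimal illustration, not part of the skill itself: the `IncidentFacts` class and its field names are hypothetical, chosen to mirror the five questions in the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentFacts:
    """Yes/no answers to the decision-tree questions (hypothetical field names)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken: bool = False
    performance_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Walk the questions top to bottom and return the first level that matches."""
    if (facts.data_loss_or_corruption
            or facts.security_breach
            or facts.primary_service_down):
        return "P0"
    if facts.major_feature_broken:
        return "P1"
    if facts.performance_degraded:
        return "P2"
    return "P3"
```

Because the questions are checked in order, an incident that is both a broken major feature and a performance degradation is classified as P1, the more severe of the two.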

Cascade Prevention

Circuit Breakers

Automatically stop calling a failing dependency to prevent cascading failure.

Implementation checklist:

  • Every external dependency has a circuit breaker
  • Failure thresholds are tuned per dependency (not one-size-fits-all)
  • OPEN state returns a meaningful fallback (cached data, degraded response, error)
  • HALF-OPEN probes are lightweight (health check, not full request)
  • Circuit breaker state is observable (metrics, dashboard)
  • Alerts fire when a circuit breaker opens

Configuration template:

Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
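
As a rough illustration of how the template fields map to behaviour, here is a minimal single-threaded sketch. The `CircuitBreaker` class and its parameter names are assumptions made for this example, not an API the skill provides. For brevity, the HALF-OPEN probe reuses the real call, whereas the checklist above recommends a lightweight health check.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `failure_threshold` failures within `window_seconds`."""

    def __init__(self, failure_threshold: int, window_seconds: float,
                 reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "HALF-OPEN"
        return "OPEN"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        state = self.state
        if state == "OPEN":
            return fallback()  # do not touch the failing dependency
        try:
            result = func()
        except Exception:
            self._record_failure(probe=(state == "HALF-OPEN"))
            return fallback()
        # Any success closes the circuit and clears the failure history.
        self._failures.clear()
        self._opened_at = None
        return result

    def _record_failure(self, probe: bool) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        # A failed probe re-opens immediately; otherwise apply the threshold.
        if probe or len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

The injectable `clock` makes the state transitions testable without sleeping. A production version would also need locking and would export `state` as a metric so that an alert can fire when the circuit opens.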

Bulkhead Isolation

Partition resources so failure in one area cannot exhaust resources for another.

Patterns:

  • Thread pool isolation: Separate thread pools per dependency
  • Connection pool isolation: Dedicated connection pools per downstream service
  • Process isolation: Critical and non-critical workloads in separate processes
  • Infrastructure isolation: Separate clusters for critical vs batch workloads

Checklist:

  • Critical path dependencies have dedicated resource pools
  • Non-critical background work cannot starve critical request handling
  • Resource limits are set per pool (max connections, max threads)
  • Pool exhaustion triggers alerts, not silent queuing
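
One way to picture the per-pool limit and the "alerts, not silent queuing" item is a semaphore that rejects work when no slot is free. The sketch below is an assumption-laden illustration: the `Bulkhead` and `BulkheadFull` names are hypothetical, and the `rejected` counter stands in for whatever metric would drive an alert.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFull(Exception):
    """Raised when a pool has no free slot, so callers fail fast instead of queuing."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejected = 0  # in practice, export this and alert on it

    def run(self, func: Callable[[], T]) -> T:
        # Non-blocking acquire: pool exhaustion is surfaced, never queued silently.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise BulkheadFull(f"{self.name} pool exhausted")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance means a slow reporting service can exhaust only its own slots, while a separately sized pool for the critical path stays available.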

Load Shedding

Intentionally drop low-priority work to preserve capacity for high-priority traffic.

Priority tiers:

| Priority | Traffic Type | Shed When |
|---|---|---|
| Critical | Health checks, authentication | Never |
| High | Core user requests | > 95% capacity |
| Medium | Secondary features, analytics | > 80% capacity |
| Low | Background jobs, prefetch | > 70% capacity |

Implementation:

  • Use request priority headers or path-based classification
  • Return 503 with Retry-After header for shed requests
  • Monitor shed rate as a metric (shedding > 0 is an alert)
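
The priority tiers and the 503 response can be combined into a small admission check. This is a sketch under assumptions: `shed_response` is a hypothetical helper, `utilization` is taken to be a fraction of capacity measured elsewhere, and the default `Retry-After` value of 5 seconds is illustrative.

```python
from typing import Dict, Optional, Tuple

# Utilization above which each tier is shed; None means never shed.
SHED_ABOVE: Dict[str, Optional[float]] = {
    "critical": None,
    "high": 0.95,
    "medium": 0.80,
    "low": 0.70,
}


def shed_response(priority: str, utilization: float,
                  retry_after_seconds: int = 5
                  ) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) if the request should be shed, else None."""
    threshold = SHED_ABOVE[priority]
    if threshold is not None and utilization > threshold:
        return 503, {"Retry-After": str(retry_after_seconds)}
    return None  # admit: pass the request on to its normal handler
```

At 75% utilization, for example, only the low tier is shed: `shed_response("low", 0.75)` returns a 503 while medium, high, and critical requests are admitted.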

Graceful Degradation Strategies

| Strategy | Description | Example |
|---|---|---|
| Feature flags | Disable non-critical features | Turn off recommendations during high load |
| Cached fallback | Serve stale data | Show cached search results when search service is down |
| Read-only mode | Disable writes | Allow browsing but not purchasing during payment outage |
| Static fallback | Serve pre-generated content | Show static landing page when CMS is down |
| Queue and retry | Accept but defer processing | Accept orders, process when backend recovers |
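
To make one of these strategies concrete, the following sketch shows the cached-fallback idea: remember the last good answer per key and serve it, flagged as stale, when the live call fails. `CachedFallback` is a hypothetical wrapper written for this example, and it deliberately omits cache expiry and size limits.

```python
from typing import Callable, Dict, Tuple


class CachedFallback:
    """Serves the last successful result when the live dependency fails."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._last_good: Dict[str, str] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale). Re-raises if there is nothing cached."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key], True
            raise
        self._last_good[key] = value
        return value, False
```

Returning the `is_stale` flag lets the caller label degraded results in the UI rather than presenting stale data as fresh.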

Communication Protocols

Status Page Updates

Template for status page entry:

[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]

Impact: [Brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]

Update cadence:

  • P0: Every 15 minutes until resolved
  • P1: Every 30 minutes until resolved
  • P2: At start and resolution
  • P3: At resolution only
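
The template and cadence can be tied together by computing the "Next update" line from the severity. The sketch below assumes all timestamps are in UTC; `next_update_due` and `render_status` are hypothetical helper names, and the wording used for P2/P3 ("At resolution") is a simplification of the cadence list above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minutes between updates; P2 and P3 have no recurring cadence.
CADENCE_MINUTES = {"P0": 15, "P1": 30}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if only due at resolution."""
    minutes = CADENCE_MINUTES.get(severity)
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


def render_status(now: datetime, status: str, severity: str,
                  impact: str, current: str) -> str:
    """Fill in the status page template for one update."""
    due = next_update_due(severity, now)
    next_update = f"{due:%H:%M} UTC" if due else "At resolution"
    return "\n".join([
        f"{now:%Y-%m-%d %H:%M} UTC - {status}",
        "",
        f"Impact: {impact}",
        f"Current status: {current}",
        f"Next update: {next_update}",
    ])


# Illustrative values only.
posted = datetime(2026, 1, 1, 9, 15, tzinfo=timezone.utc)
print(render_status(posted, "Investigating", "P1",
                    "Elevated checkout errors for some users",
                    "Isolating the cause"))
```

Deriving the next-update time from the severity, rather than typing it by hand, keeps the published commitment consistent with the cadence the team has agreed to.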

Stakeholder Notification Template

Subject: [P0/P1] [Service] - [Brief impact description]

Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [Name]
War room: [Link/channel]

Internal Communication Rules

  1. One source of truth: All updates go through the incident channel, not DMs
  2. Facts, not speculation: Share what you know, flag what you suspect
  3. Timestamp everything: Every action and observation gets a timestamp
  4. No blame: Focus on what happened, not who caused it
  5. Clear handoffs: When rotating, explicitly hand off context

Escalation Paths

Escalation Triggers

| Condition | Action |
|---|---|
| P0 not acknowledged in 5 min | Page backup on-call |
| P0/P1 not mitigated in 30 min | Escalate to engineering manager |
| P0 not resolved in 1 hour | Escalate to VP/Director |
| Any severity affecting revenue | Notify finance and business stakeholders |
| Security incident confirmed | Notify security team and legal |
| Data breach suspected | Invoke data breach response plan |
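
The three time-based triggers lend themselves to a simple check that a bot or pager integration could run periodically. This sketch covers only those timed rows; the revenue, security, and data-breach triggers depend on judgment calls and are left out. The function name and its keyword arguments are hypothetical.

```python
from typing import List


def escalation_actions(severity: str, minutes_elapsed: float, *,
                       acknowledged: bool, mitigated: bool,
                       resolved: bool) -> List[str]:
    """Return the timed escalations that are currently due for an incident."""
    actions: List[str] = []
    if severity == "P0" and not acknowledged and minutes_elapsed >= 5:
        actions.append("Page backup on-call")
    if severity in ("P0", "P1") and not mitigated and minutes_elapsed >= 30:
        actions.append("Escalate to engineering manager")
    if severity == "P0" and not resolved and minutes_elapsed >= 60:
        actions.append("Escalate to VP/Director")
    return actions
```

For instance, a P1 that is acknowledged but still unmitigated after 35 minutes yields only the engineering-manager escalation, while P2 and P3 incidents never trigger a timed escalation here.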

Escalation Checklist

  • Primary on-call paged and acknowledged
  • If no acknowledgment in 5 min, secondary on-call paged
  • Incident commander assigned
  • Relevant team leads notified
  • Status page updated
  • Customer support briefed with talking points
  • Executive stakeholders notified (P0/P1 only)

On-Call Responsibilities

During incident:

  • Acknowledge page within 5 minutes
  • Assess severity and open incident channel
  • Begin investigation and document findings in real time
  • Coordinate with other teams as needed
  • Provide status updates at the required cadence

After incident:

  • Ensure monitoring confirms resolution
  • Draft incident timeline
  • Schedule postmortem if required
  • Update runbooks with any new learnings
  • Hand off to next on-call if shift ends during incident

Postmortem Patterns

Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.

Blameless Postmortem Structure

Core Principles

  1. No blame: Focus on systems, processes, and conditions -- not individuals
  2. Assume good intent: Everyone involved was doing their best with the information available
  3. Learn, don't punish: The goal is prevention, not accountability
  4. Share widely: Postmortems are organizational learning, not team shame

Postmortem Document Template

# Incident Postmortem: [Title]

**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** [Draft | Review | Final]

## Summary

[1-2 sentence description of what happened and the user impact]

## Impact

- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |

## Root Cause

[Detailed description of the root cause. What condition or change led to the failure?]

## Contributing Factors

- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]

## What Went Well

- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]

## What Went Poorly

- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | P1 | [Date] | Open |
| 2 | [Action description] | [Name] | P2 | [Date] | Open |

## Lessons Learned

[Key takeaways that should inform future design, process, or tooling decisions]

Postmortem Meeting Facilitation

Before the meeting:

  • Draft the postmortem document and share 24 hours in advance
  • All participants review the timeline for accuracy
  • Incident commander prepares the root cause analysis

During the meeting (60-90 min):

  1. Timeline review (15 min): Walk through events, correct errors, fill gaps
  2. Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
  3. Contributing factors (15 min): What made the incident worse or harder to resolve?
  4. What went well (10 min): Reinforce effective practices
  5. Action items (20 min): Define concrete, assignable, time-bounded actions

After the meeting:

  • Finalize the document within 24 hours
  • Distribute to the broader organization
  • Enter action items into the tracking system
  • Schedule follow-up review for action item completion

Root Cause Analysis Techniques

5 Whys

Repeatedly ask "why" to drill past symptoms to the underlying cause.

Example:

Problem: Users received duplicate order confirmation emails.

Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.

Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.

Guidelines:

  • Stop when you reach a systemic cause you can fix (process, tooling, design)
  • Do not stop at "human error" -- ask why the system allowed the error
  • Some incidents have multiple root causes; run 5 Whys for each branch
  • Answers should be factual, not speculative

Fishbone Diagram (Ishikawa)

Categorize contributing factors across standard dimensions.

                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike

Standard categories:

  • People: Knowledge gaps, staffing, communication
  • Process: Missing runbooks, unclear procedures, approval bottlenecks
  • Technology: Bugs, missing features, architectural gaps
  • Environment: Infrastructure, capacity, configuration
  • Monitoring: Missing alerts, incorrect thresholds, observability gaps
  • External: Third-party outages, traffic spikes, attacks

Fault Tree Analysis

Work backward from the failure to identify all possible causes.

Top event: Service outage
├── AND: Load balancer failure
│   ├── OR: Config error
│   └── OR: Health check misconfigured
└── AND: No failover triggered
    ├── OR: Failover not configured
    └── OR: Failover health check also failed

When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.
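
A fault tree can also be evaluated mechanically once the basic events are known. The sketch below encodes one reading of the example tree, which is an interpretation rather than something the source states outright: the outage requires both branches (AND), and each branch is satisfied by either of its leaves (OR). The `Gate` class and `occurs` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Gate:
    """An intermediate event combining its children with AND or OR."""
    name: str
    kind: str  # "AND" or "OR"
    children: List["Node"]


# A node is either a gate or the name of a basic event.
Node = Union[Gate, str]


def occurs(node: Node, events: Dict[str, bool]) -> bool:
    """Return True if the event at `node` occurs given the basic events."""
    if isinstance(node, str):
        return events.get(node, False)
    results = [occurs(child, events) for child in node.children]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


OUTAGE = Gate("Service outage", "AND", [
    Gate("Load balancer failure", "OR",
         ["Config error", "Health check misconfigured"]),
    Gate("No failover triggered", "OR",
         ["Failover not configured", "Failover health check also failed"]),
])
```

Under this reading, a config error alone does not produce the outage; it takes a load balancer failure combined with a failover that did not trigger, which is the kind of interaction 5 Whys tends to miss.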

Action Item Tracking

Action Item Quality Criteria

Every action item must be:

  • Specific: Clear description of what to do (not "improve monitoring")
  • Assignable: One owner, not a team
  • Time-bounded: Due date, not "when we get to it"
  • Verifiable: Clear definition of done
  • Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)
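
Several of these criteria can be checked automatically before an action item is accepted into the tracker. The sketch below is illustrative: the `ActionItem` fields are hypothetical, and the list of vague phrases is borrowed from the Action Item Anti-Patterns table in this document. It cannot verify that the owner is a person rather than a team.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Descriptions that are too vague to act on.
VAGUE_PHRASES = ("be more careful", "improve monitoring", "add more tests")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None   # "P1", "P2", or "P3"
    done_when: Optional[str] = None  # definition of done


def quality_problems(item: ActionItem) -> List[str]:
    """Return the quality criteria this action item fails, if any."""
    problems: List[str] = []
    if item.description.strip().rstrip(".").lower() in VAGUE_PHRASES:
        problems.append("description is too vague")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if item.priority not in ("P1", "P2", "P3"):
        problems.append("no priority")
    if not item.done_when:
        problems.append("no definition of done")
    return problems
```

An empty list means the item passes the mechanical checks; a reviewer still needs to judge whether the description is specific enough.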

Action Item Categories

| Category | Description | Examples |
|---|---|---|
| Detection | Improve ability to notice the problem | Add alert, improve dashboard |
| Prevention | Stop the problem from occurring | Fix bug, add validation, improve architecture |
| Mitigation | Reduce impact when it happens | Add circuit breaker, improve rollback, write runbook |
| Process | Improve team response | Update on-call procedures, conduct training |

Tracking and Follow-Up

  • Enter all action items into the team's issue tracker immediately
  • Tag with postmortem and incident ID for traceability
  • Review open postmortem action items weekly in team standup
  • Escalate overdue P1 items to engineering manager
  • Close action items only when verified complete (not just "code merged")

Action Item Anti-Patterns

| Anti-Pattern | Problem | Better Alternative |
|---|---|---|
| "Be more careful" | Not actionable | Automate the check |
| "Improve monitoring" | Too vague | "Add alert for X metric when > Y for Z minutes" |
| "No owner assigned" | Will not get done | Assign a specific person |
| "Due: TBD" | Will be deprioritized | Set a concrete date |
| "Add more tests" | Unbounded | "Add regression test for this specific failure mode" |

Chaos Engineering Patterns

Fault Injection

Intentionally introduce failures to verify resilience.

Common fault types:

| Fault | Tool/Method | Validates |
|---|---|---|
| Kill service instance | Process kill, pod delete | Auto-restart, health checks |
| Network latency | tc netem, Toxiproxy | Timeout handling, circuit breakers |
| Network partition | iptables, DNS override | Failover, split-brain handling |
| Disk full | fallocate, dd | Graceful degradation, alerting |
| CPU exhaustion | stress-ng | Autoscaling, load shedding |
| Dependency failure | Mock returning 500s | Fallback paths, error handling |
| Clock skew | chrony offset | Time-dependent logic |

Fault Injection Checklist

  • Hypothesis defined: "We believe [X] will happen when [fault]"
  • Blast radius limited (single instance, canary, staging)
  • Rollback mechanism ready (kill switch for the experiment)
  • Monitoring in place to observe the effect
  • Team is aware the experiment is running
  • Abort criteria defined (stop if real user impact exceeds N%)

Game Days

Structured exercises where teams practice incident response against simulated failures.

Game Day Planning Template:

## Game Day: [Title]

**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [Name]
**Participants:** [Team members]

### Scenario
[Description of the simulated incident]

### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed

### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time

### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?

Game day cadence:

  • Quarterly for critical services
  • After major architecture changes
  • When onboarding new on-call engineers
  • After any P0 incident (test the fixes)

Resilience Testing Methodology

Resilience Maturity Levels

| Level | Description | Activities |
|---|---|---|
| 1 - Reactive | Fix failures after they happen | Postmortems, basic monitoring |
| 2 - Aware | Know where failures could happen | Failure mode analysis, risk registry |
| 3 - Proactive | Test for failures before they happen | Chaos experiments in staging |
| 4 - Continuous | Regularly validate resilience in production | Automated chaos, game days |
| 5 - Anti-fragile | Systems improve through failure | Feedback loops, auto-remediation |

Resilience Testing Checklist

For each critical service, validate:

  • Single instance failure: Service recovers when one instance dies
  • Dependency timeout: Service handles slow dependencies gracefully
  • Dependency outage: Service degrades (not crashes) when dependency is down
  • Network partition: Service handles split-brain scenarios
  • Load spike: Service sheds load or scales under 3x normal traffic
  • Disk full: Service alerts and degrades before crashing
  • Configuration error: Service fails fast with clear error on bad config
  • Rollback: Previous version can be deployed within 5 minutes
  • Data corruption: Backup restore has been tested within the last quarter

Steady-State Hypothesis

Before running any chaos experiment, define what "normal" looks like:

Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing

Experiment: Kill 1 of 3 service instances

Hypothesis: Steady state metrics return to within 10% of baseline
            within 60 seconds of the fault injection.

Abort if: Error rate > 5% for more than 30 seconds.
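
The steady-state thresholds and the abort rule can be expressed as two small checks that an experiment harness might call. This is a sketch under assumptions: the `Metrics` class is hypothetical, rates are expressed as fractions rather than percentages, and error-rate samples are assumed to arrive as `(seconds_since_start, error_rate)` pairs in time order.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Metrics:
    success_rate: float    # fraction, e.g. 0.9995
    p99_latency_ms: float
    error_rate: float      # fraction, e.g. 0.0005
    alerts_firing: int


def is_steady(m: Metrics) -> bool:
    """Check a snapshot against the steady-state definition."""
    return (m.success_rate > 0.999
            and m.p99_latency_ms < 200
            and m.error_rate < 0.001
            and m.alerts_firing == 0)


def should_abort(samples: Iterable[Tuple[float, float]],
                 limit: float = 0.05, max_seconds: float = 30.0) -> bool:
    """True if the error rate stayed above `limit` for more than `max_seconds`."""
    breach_start: Optional[float] = None
    for t, rate in samples:
        if rate > limit:
            if breach_start is None:
                breach_start = t  # start of a continuous breach
            elif t - breach_start > max_seconds:
                return True
        else:
            breach_start = None  # recovered, so reset the timer
    return False
```

Checking `is_steady` before injecting the fault guards against running an experiment on a system that is already unhealthy, and `should_abort` turns the abort criterion into something a script can enforce rather than a person watching a dashboard.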