Classify incidents, coordinate updates, and turn the response into follow-up work.

Name: Incident Response
Author: NickCrew

A Claude Skill for Claude Code by NickCrew. Run /incident-response in Claude. Updated Jun 18, 2026. Version: main@1d565c1

Compatible with: ChatGPT, Claude, Claude Code, Codex / Codex CLI, Cursor, Gemini

Helps teams manage outages or service degradations with severity, ownership, status cadence, containment, customer updates, and a blameless postmortem.

Classifies severity so the team knows whether this is P0, P1, P2, or P3.
Defines who owns the incident, how often to update people, and which channel is the source of truth.
Turns noisy Slack, ticket, alert, and customer notes into customer-safe status updates.
Separates immediate containment, customer communication, internal coordination, and post-incident follow-up.
Creates postmortem actions with owners, due dates, and prevention checks after the incident is resolved.

Today

Support, engineering, and leadership discuss an incident in scattered channels while customers receive late or inconsistent updates.

With /incident-response

Run /incident-response to classify severity, assign an owner, set the update cadence, draft status updates, and capture the postmortem trail.

1. Paste incident facts and timeline
2. Classify severity and impact
3. Draft updates and actions
4. Turn the timeline into postmortem follow-up

Who it's for

Support Lead

Turn incident facts into clear severity, customer updates, escalation paths, and follow-up actions.

Project Manager

Coordinate owners, timeline, next updates, and post-incident action items across teams.

What it does

Active incident update

Turn scattered incident facts into a clear status update for customers, support, and leadership.

Severity and ownership

Decide how serious the incident is, who owns coordination, and when the next update is due.

Postmortem preparation

Convert the response timeline into root cause, contributing factors, and concrete action items.

How it works

Collect symptoms, user impact, affected services, timeline, current owner, and any customer reports.

Classify severity and choose the communication cadence.

Separate immediate containment from deeper root-cause work.

Draft customer, support, and leadership updates that say what is known, what is not known, and when the next update will arrive.

Record the incident timeline and action tracker while facts are changing.

After resolution, produce a blameless postmortem outline with specific follow-up actions.

Input options

Symptoms and impact

What users are experiencing, how many are affected, and whether data, revenue, or security is involved.

Example

What the user pastes

09:04 Support reports checkout failures from 18 customers.
09:07 Payments dashboard shows card authorization errors up 32%.
09:10 Engineering suspects the new fraud rule rollout.
09:14 Sales says two enterprise trials are blocked.
Impact: checkout fails for Safari users in US/EU. No data loss. Workaround: retry on Chrome works for some users.
Current channels: #suporte-urgent, #payments-eng, Zendesk tickets, Jira bug PAY-1842.
Need: severity, customer update, owner, internal update, next actions, and postmortem notes.

Useful result

How it reads the incident

Classify as P1: a revenue-critical path is degraded for a meaningful user segment, but there is no total outage, security issue, or data loss. Treat Safari checkout as the affected customer experience, not “all payments down.”

Incident command

Incident commander: support lead until Payments engineering names a technical lead. Source of truth: #inc-checkout-errors. Linked ticket: PAY-1842. Update cadence: every 30 minutes until mitigation, then every 60 minutes until resolved. Next external update due 09:45 UTC.

Customer-safe update

We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. We will share the next update by 09:45 UTC.

Internal stakeholder update

Impact: 18 reported customers plus two blocked enterprise trials. Suspected trigger: fraud-rule rollout. The current hypothesis is not confirmed. Payments engineering is checking browser-specific authorization failures and preparing a rollback if the rule is implicated.

Action tracker

Now: pause or roll back the fraud rule rollout. Owner: Payments engineering.
Now: link all Zendesk tickets to PAY-1842. Owner: support.
Next: monitor checkout success by browser and region. Owner: analytics/Payments.
Next: prepare a saved support reply with the workaround and next-update time. Owner: support lead.

Postmortem seed

Questions: why did the rollout reach Safari users without browser-specific checkout monitoring? Why did sales hear from trials before the team had a status update? Follow-up actions should include segmented checkout alerts, a rollout checklist update, and incident-linking instructions for support.

Human review

Confirm severity, legal/compliance wording, and whether the workaround is safe to publish before sending any customer-facing update.

Metrics it improves

Ticket cycle time

Helps support and engineering move urgent incidents through ownership and next actions faster.

Operations

Issue hygiene

Turns incident notes into clear bugs, follow-up actions, owners, and due dates.

Operations

Works with

Slack

Manual

Use incident channels, updates, and responder notes as the timeline source.

Jira

Manual

Track incident follow-up actions, bugs, and postmortem tasks.

Zendesk

Manual

Use customer reports and support tickets to understand user impact.

Want to use Incident Response?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run

Start Claude Code, then type the command /incident-response.


Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

When to use

Production incident in progress (outage, degradation, data loss)
Designing circuit breakers, bulkheads, or fallback strategies
Conducting or planning chaos engineering exercises
Writing or reviewing postmortem documents
Establishing on-call procedures and escalation paths

Avoid when:

The issue is a development-time bug with no production impact
Designing general system architecture (use system-design instead)

Quick Reference

Topic	Load reference
Triage Framework	¤KEEP0¤
Postmortem Patterns	¤KEEP0¤

Incident Response Workflow

Phase 1: Detect

Alert fires or user report received
Confirm the issue is real (not a false positive)
Identify affected services and user impact scope

Phase 2: Triage

Classify severity (P0-P3)
Assign incident commander
Open communication channel (war room, Slack channel)
Begin status page updates

Phase 3: Contain

Stop the bleeding: rollback, feature flag, traffic shift
Prevent cascade: circuit breakers, load shedding, bulkhead isolation
Communicate: stakeholder updates every 15 minutes for P0/P1

Phase 4: Resolve

Implement fix (minimal viable fix first)
Validate in staging if time permits
Deploy with monitoring and a rollback plan ready
Confirm recovery with metrics returning to baseline

Phase 5: Postmortem

Document timeline within 48 hours
Conduct blameless review with all participants
Identify root cause and contributing factors
Assign action items with owners and deadlines
Update runbooks and alerting based on lessons learned

Severity Framework

Level	Impact	Response Time	Examples
P0	Complete outage, data loss, security breach	Immediate (< 5 min)	Service down, data corruption, credential leak
P1	Major feature broken, significant user impact	< 30 min	Payment processing failed, auth broken for region
P2	Degraded performance, partial feature loss	< 4 hours	Elevated latency, non-critical feature unavailable
P3	Minor issue, workaround available	Next business day	UI glitch, slow report generation, cosmetic error

Output

Incident timeline and severity classification
Containment actions taken
Postmortem document with action items
Updated runbooks and alerting rules

Common Mistakes

Skipping severity classification and treating everything as P0
Making changes without a rollback plan
Forgetting to communicate status to stakeholders
Writing postmortems that assign blame instead of identifying systemic issues
Not following up on postmortem action items

Reference documents

name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
keywords:
- incident response
- outage
- postmortem
- triage
- incident
- response

Triage Framework

Severity classification, cascade prevention, communication protocols, and escalation paths for production incidents. Use during active incidents or when establishing incident response procedures.

Severity Classification (P0-P3)

P0 -- Critical

Definition: Complete service outage, active data loss, or security breach affecting all users.

Attribute	Requirement
Response time	< 5 minutes
Incident commander	Required (senior engineer or SRE)
Communication cadence	Every 15 minutes to stakeholders
War room	Immediately opened
Escalation	VP/Director notified within 15 minutes
Postmortem	Required within 48 hours

Examples:

Production database unreachable
Authentication service completely down
Active data corruption or loss
Security breach with confirmed exfiltration
Payment processing halted

P1 -- High

Definition: Major feature broken or significant degradation affecting a large subset of users.

Attribute	Requirement
Response time	< 30 minutes
Incident commander	Required
Communication cadence	Every 30 minutes to stakeholders
War room	Opened if not resolved in 30 minutes
Escalation	Manager notified within 30 minutes
Postmortem	Required within 1 week

Examples:

Payment processing failing for one region
Search functionality returning errors for 20%+ of queries
API latency 10x above normal
Mobile app crash on launch for specific OS version

P2 -- Medium

Definition: Degraded performance or partial feature loss with workarounds available.

Attribute	Requirement
Response time	< 4 hours
Incident commander	Optional (on-call engineer handles)
Communication cadence	Status update at start and resolution
War room	Not required
Escalation	If unresolved after 8 hours
Postmortem	Recommended

Examples:

Elevated latency (2-3x normal) on non-critical endpoints
Background job processing delayed
Non-critical third-party integration down
Report generation slow but functional

P3 -- Low

Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.

Attribute	Requirement
Response time	Next business day
Incident commander	Not required
Communication cadence	Ticket update on resolution
War room	Not required
Escalation	Not required
Postmortem	Not required

Examples:

UI rendering glitch in edge case
Non-critical cron job failed (will retry)
Slow dashboard load for internal tool
Minor logging error that does not affect functionality

Severity Decision Tree

Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
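
The decision tree above can be sketched as a small function. This is a minimal illustration, not part of the skill itself: the `IncidentFacts` type and its field names are hypothetical, and each field is assumed to hold a yes/no answer to one question in the tree.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Yes/no answers to the tree's questions (hypothetical field names)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken_for_many: bool = False
    performance_significantly_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Walk the tree top to bottom; the first "yes" decides the level."""
    if facts.data_loss_or_corruption:
        return "P0"
    if facts.security_breach:
        return "P0"
    if facts.primary_service_down:
        return "P0"
    if facts.major_feature_broken_for_many:
        return "P1"
    if facts.performance_significantly_degraded:
        return "P2"
    return "P3"
```

Because the questions are checked in order, an incident that is both a security breach and a performance degradation is classified by the more severe branch.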

Cascade Prevention

Circuit Breakers

Automatically stop calling a failing dependency to prevent cascading failure.

Implementation checklist:

Every external dependency has a circuit breaker
Failure thresholds are tuned per dependency (not one-size-fits-all)
OPEN state returns a meaningful fallback (cached data, degraded response, error)
HALF-OPEN probes are lightweight (health check, not full request)
Circuit breaker state is observable (metrics, dashboard)
Alerts fire when a circuit breaker opens

Configuration template:

Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
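
To make the template fields concrete, here is a minimal circuit breaker sketch. The class and parameter names are illustrative assumptions that mirror the template (failure threshold, window, reset timeout, fallback); it is not a production implementation. For brevity the HALF-OPEN probe reuses the real call, whereas the checklist above recommends a lightweight health check.

```python
import time


class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF-OPEN breaker (illustrative sketch)."""

    def __init__(self, failure_threshold, window_seconds, reset_timeout,
                 fallback, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # [N] failures ...
        self.window_seconds = window_seconds        # ... in [T] seconds
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.fallback = fallback                    # callable returning the fallback value
        self.clock = clock                          # injectable for tests
        self.state = "CLOSED"
        self.failures = []                          # timestamps of recent failures
        self.opened_at = None

    def call(self, func):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.reset_timeout:
                return self.fallback()              # short-circuit: do not call the dependency
            self.state = "HALF-OPEN"                # timeout elapsed: allow one probe
        try:
            result = func()
        except Exception:
            self._record_failure(now)
            return self.fallback()
        self.state = "CLOSED"                       # success closes the circuit
        self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.state == "HALF-OPEN":
            self._open(now)                         # failed probe reopens immediately
            return
        # keep only failures inside the sliding window, then add this one
        self.failures = [t for t in self.failures if now - t < self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()
```

Exposing `state` as a plain attribute is a stand-in for the observability item in the checklist; a real service would export it as a metric and alert on the transition to OPEN.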

Bulkhead Isolation

Partition resources so failure in one area cannot exhaust resources for another.

Patterns:

Thread pool isolation: Separate thread pools per dependency
Connection pool isolation: Dedicated connection pools per downstream service
Process isolation: Critical and non-critical workloads in separate processes
Infrastructure isolation: Separate clusters for critical vs batch workloads

Checklist:

Critical path dependencies have dedicated resource pools
Non-critical background work cannot starve critical request handling
Resource limits are set per pool (max connections, max threads)
Pool exhaustion triggers alerts, not silent queuing
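
One way to picture a per-dependency resource limit is a semaphore that rejects work when the pool is full instead of queuing it silently. The sketch below is an assumption-laden illustration: `Bulkhead` and `BulkheadFull` are hypothetical names, and the `rejected` counter stands in for whatever metric would drive the exhaustion alert.

```python
import threading


class BulkheadFull(RuntimeError):
    """Raised when a pool has no free slot (hypothetical exception type)."""


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects rather than queues."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejected = 0  # export as a metric so exhaustion can alert

    def run(self, func):
        # Non-blocking acquire: a full pool fails fast instead of waiting.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise BulkheadFull(f"{self.name} pool exhausted")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means a slow dependency can only exhaust its own slots, which is the isolation property the patterns above describe.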

Load Shedding

Intentionally drop low-priority work to preserve capacity for high-priority traffic.

Priority tiers:

Priority	Traffic Type	Shed When
Critical	Health checks, authentication	Never
High	Core user requests	> 95% capacity
Medium	Secondary features, analytics	> 80% capacity
Low	Background jobs, prefetch	> 70% capacity

Implementation:

Use request priority headers or path-based classification
Return 503 with Retry-After header for shed requests
Monitor shed rate as a metric (shedding > 0 is an alert)
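
The priority tiers translate directly into a threshold lookup. The following sketch assumes utilization is available as a fraction between 0 and 1 and that the request priority has already been classified; the function names and the default `Retry-After` value are illustrative, not prescribed by the source.

```python
# Utilization above which each tier is shed, taken from the priority table.
SHED_ABOVE = {
    "critical": None,  # never shed
    "high": 0.95,
    "medium": 0.80,
    "low": 0.70,
}


def should_shed(priority: str, utilization: float) -> bool:
    """Return True when a request of this priority should be dropped."""
    threshold = SHED_ABOVE[priority]
    return threshold is not None and utilization > threshold


def handle(priority: str, utilization: float, retry_after_seconds: int = 5):
    """Return (status, headers): 503 plus Retry-After when the request is shed."""
    if should_shed(priority, utilization):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}
```

At 75% utilization, for example, only the low tier is shed; medium traffic continues until utilization passes 80%.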

Graceful Degradation Strategies

Strategy	Description	Example
Feature flags	Disable non-critical features	Turn off recommendations during high load
Cached fallback	Serve stale data	Show cached search results when search service is down
Read-only mode	Disable writes	Allow browsing but not purchasing during payment outage
Static fallback	Serve pre-generated content	Show static landing page when CMS is down
Queue and retry	Accept but defer processing	Accept orders, process when backend recovers

Communication Protocols

Status Page Updates

Template for status page entry:

[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]

Impact: [Brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]

Update cadence:

P0: Every 15 minutes until resolved
P1: Every 30 minutes until resolved
P2: At start and resolution
P3: At resolution only
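
The template and cadence can be combined so that each entry states when the next one is due. This is a rough sketch under stated assumptions: timestamps are timezone-aware UTC values, and the function names and the "At resolution" wording for P2/P3 are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

# Minutes between updates while unresolved; None means no periodic updates.
CADENCE_MINUTES = {"P0": 15, "P1": 30, "P2": None, "P3": None}


def next_update_due(severity: str, now: datetime):
    """Return when the next update is due, or None if only due at resolution."""
    minutes = CADENCE_MINUTES[severity]
    return None if minutes is None else now + timedelta(minutes=minutes)


def status_entry(severity, status, impact, current_status, now):
    """Fill the status page template; `now` is assumed to be in UTC."""
    due = next_update_due(severity, now)
    next_line = due.strftime("%H:%M UTC") if due else "At resolution"
    return (
        f"{now.strftime('%Y-%m-%d %H:%M UTC')} - {status}\n\n"
        f"Impact: {impact}\n"
        f"Current status: {current_status}\n"
        f"Next update: {next_line}"
    )
```

Computing the next-update time from the severity keeps the published commitment consistent with the cadence rules rather than relying on whoever writes the update to remember them.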

Stakeholder Notification Template

Subject: [P0/P1] [Service] - [Brief impact description]

Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [Name]
War room: [Link/channel]

Internal Communication Rules

One source of truth: All updates go through the incident channel, not DMs
Facts, not speculation: Share what you know, flag what you suspect
Timestamp everything: Every action and observation gets a timestamp
No blame: Focus on what happened, not who caused it
Clear handoffs: When rotating, explicitly hand off context

Escalation Paths

Escalation Triggers

Condition	Action
P0 not acknowledged in 5 min	Page backup on-call
P0/P1 not mitigated in 30 min	Escalate to engineering manager
P0 not resolved in 1 hour	Escalate to VP/Director
Any severity affecting revenue	Notify finance and business stakeholders
Security incident confirmed	Notify security team and legal
Data breach suspected	Invoke data breach response plan
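
The trigger table can be read as a set of independent rules evaluated against the incident's current state. The sketch below is one possible encoding; the function signature and flag names are hypothetical, and treating "in 5 min" as "5 or more minutes elapsed" is an assumption.

```python
def escalation_actions(severity, minutes_elapsed, acknowledged, mitigated, resolved,
                       affects_revenue=False, security_confirmed=False,
                       data_breach_suspected=False):
    """Return the escalation actions currently due, in table order."""
    actions = []
    if severity == "P0" and not acknowledged and minutes_elapsed >= 5:
        actions.append("Page backup on-call")
    if severity in ("P0", "P1") and not mitigated and minutes_elapsed >= 30:
        actions.append("Escalate to engineering manager")
    if severity == "P0" and not resolved and minutes_elapsed >= 60:
        actions.append("Escalate to VP/Director")
    if affects_revenue:  # applies at any severity
        actions.append("Notify finance and business stakeholders")
    if security_confirmed:
        actions.append("Notify security team and legal")
    if data_breach_suspected:
        actions.append("Invoke data breach response plan")
    return actions
```

Because the rules are independent, a long-running P0 can trigger several escalations at once, which matches how the table is written.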

Escalation Checklist

Primary on-call paged and acknowledged
If no acknowledgment in 5 min, secondary on-call paged
Incident commander assigned
Relevant team leads notified
Status page updated
Customer support briefed with talking points
Executive stakeholders notified (P0/P1 only)

On-Call Responsibilities

During incident:

Acknowledge page within 5 minutes
Assess severity and open incident channel
Begin investigation and document findings in real time
Coordinate with other teams as needed
Provide status updates at the required cadence

After incident:

Ensure monitoring confirms resolution
Draft incident timeline
Schedule postmortem if required
Update runbooks with any new learnings
Hand off to next on-call if shift ends during incident

Postmortem Patterns

Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.

Blameless Postmortem Structure

Core Principles

No blame: Focus on systems, processes, and conditions -- not individuals
Assume good intent: Everyone involved was doing their best with the information available
Learn, don't punish: The goal is prevention, not accountability
Share widely: Postmortems are organizational learning, not team shame

Postmortem Document Template

# Incident Postmortem: [Title]

**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** [Draft | Review | Final]

## Summary

[1-2 sentence description of what happened and the user impact]

## Impact

- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |

## Root Cause

[Detailed description of the root cause. What condition or change led to the failure?]

## Contributing Factors

- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]

## What Went Well

- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]

## What Went Poorly

- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | P1 | [Date] | Open |
| 2 | [Action description] | [Name] | P2 | [Date] | Open |

## Lessons Learned

[Key takeaways that should inform future design, process, or tooling decisions]

Postmortem Meeting Facilitation

Before the meeting:

Draft the postmortem document and share it 24 hours in advance
All participants review the timeline for accuracy
Incident commander prepares the root cause analysis

During the meeting (60-90 min):

Timeline review (15 min): Walk through events, correct errors, fill gaps
Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
Contributing factors (15 min): What made the incident worse or harder to resolve?
What went well (10 min): Reinforce effective practices
Action items (20 min): Define concrete, assignable, time-bounded actions

After the meeting:

Finalize the document within 24 hours
Distribute to the broader organization
Enter action items in the tracking system
Schedule follow-up review for action item completion

Root Cause Analysis Techniques

5 Whys

Repeatedly ask "why" to drill past symptoms to the underlying cause.

Example:

Problem: Users received duplicate order confirmation emails.

Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.

Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.

Guidelines:

Stop when you reach a systemic cause you can fix (process, tooling, design)
Do not stop at "human error" -- ask why the system allowed the error
Some incidents have multiple root causes; run 5 Whys for each branch
Answers should be factual, not speculative

Fishbone Diagram (Ishikawa)

Categorize contributing factors across standard dimensions.

                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike

Standard categories:

People: Knowledge gaps, staffing, communication
Process: Missing runbooks, unclear procedures, approval bottlenecks
Technology: Bugs, missing features, architectural gaps
Environment: Infrastructure, capacity, configuration
Monitoring: Missing alerts, incorrect thresholds, observability gaps
External: Third-party outages, traffic spikes, attacks

Fault Tree Analysis

Work backward from the failure to identify all possible causes.

Top event: Service outage
├── AND: Load balancer failure
│   ├── OR: Config error
│   └── OR: Health check misconfigured
└── AND: No failover triggered
    ├── OR: Failover not configured
    └── OR: Failover health check also failed

When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.

Action Item Tracking

Action Item Quality Criteria

Every action item must be:

Specific: Clear description of what to do (not "improve monitoring")
Assignable: One owner, not a team
Time-bounded: Due date, not "when we get to it"
Verifiable: Clear definition of done
Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)

Action Item Categories

Category	Description	Examples
Detection	Improve ability to notice the problem	Add alert, improve dashboard
Prevention	Stop the problem from occurring	Fix bug, add validation, improve architecture
Mitigation	Reduce impact when it happens	Add circuit breaker, improve rollback, write runbook
Process	Improve team response	Update on-call procedures, conduct training

Tracking and Follow-Up

Enter all action items in the team's issue tracker immediately
Tag with ¤KEEP0¤ and incident ID for traceability
Review open postmortem action items weekly in team standup
Escalate overdue P1 items to engineering manager
Close action items only when verified complete (not just "code merged")

Action Item Anti-Patterns

Anti-Pattern	Problem	Better Alternative
"Be more careful"	Not actionable	Automate the check
"Improve monitoring"	Too vague	"Add alert for X metric when > Y for Z minutes"
"No owner assigned"	Will not get done	Assign a specific person
"Due: TBD"	Will be deprioritized	Set a concrete date
"Add more tests"	Unbounded	"Add regression test for this specific failure mode"
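
The quality criteria and anti-patterns lend themselves to a simple lint pass before action items are filed. The sketch below is a hypothetical helper: the `ActionItem` fields, the list of vague phrases, and the problem labels are assumptions, and it cannot detect every criterion (for example, whether the owner is a person rather than a team).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Phrases the anti-pattern table calls out as vague or unbounded.
VAGUE_PHRASES = ("be more careful", "improve monitoring", "add more tests")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    priority: Optional[str]
    done_when: Optional[str]  # definition of done


def lint_action_item(item: ActionItem) -> list:
    """Return a list of quality problems; an empty list means the item passes."""
    problems = []
    if item.description.strip().lower().rstrip(".") in VAGUE_PHRASES:
        problems.append("not specific")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.done_when:
        problems.append("no definition of done")
    if item.priority not in ("P1", "P2", "P3"):
        problems.append("no priority")
    return problems
```

A check like this could run when items are entered in the tracker, so that "Due: TBD" or ownerless items are caught before the postmortem is finalized.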

Chaos Engineering Patterns

Fault Injection

Intentionally introduce failures to verify resilience.

Common fault types:

Fault	Tool/Method	Validates
Kill service instance	Process kill, pod delete	Auto-restart, health checks
Network latency	tc netem, Toxiproxy	Timeout handling, circuit breakers
Network partition	iptables, DNS override	Failover, split-brain handling
Disk full	fallocate, dd	Graceful degradation, alerting
CPU exhaustion	stress-ng	Autoscaling, load shedding
Dependency failure	Mock returning 500s	Fallback paths, error handling
Clock skew	chrony offset	Time-dependent logic

Fault Injection Checklist

Hypothesis defined: "We believe [X] will happen when [fault]"
Blast radius limited (single instance, canary, staging)
Rollback mechanism ready (kill switch for the experiment)
Monitoring in place to observe the effect
Team is aware the experiment is running
Abort criteria defined (stop if real user impact exceeds N%)

Game Days

Structured exercises where teams practice incident response against simulated failures.

Game Day Planning Template:

## Game Day: [Title]

**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [Name]
**Participants:** [Team members]

### Scenario
[Description of the simulated incident]

### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed

### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time

### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?

Game day cadence:

Quarterly for critical services
After major architecture changes
When onboarding new on-call engineers
After any P0 incident (test the fixes)

Resilience Testing Methodology

Resilience Maturity Levels

Level	Description	Activities
1 - Reactive	Fix failures after they happen	Postmortems, basic monitoring
2 - Aware	Know where failures could happen	Failure mode analysis, risk registry
3 - Proactive	Test for failures before they happen	Chaos experiments in staging
4 - Continuous	Regularly validate resilience in production	Automated chaos, game days
5 - Anti-fragile	Systems improve through failure	Feedback loops, auto-remediation

Resilience Testing Checklist

For each critical service, validate:

Single instance failure: Service recovers when one instance dies
Dependency timeout: Service handles slow dependencies gracefully
Dependency outage: Service degrades (not crashes) when dependency is down
Network partition: Service handles split-brain scenarios
Load spike: Service sheds load or scales under 3x normal traffic
Disk full: Service alerts and degrades before crashing
Configuration error: Service fails fast with a clear error on bad config
Rollback: Previous version can be deployed within 5 minutes
Data corruption: Backup restore has been tested within the last quarter

Steady-State Hypothesis

Before running any chaos experiment, define what "normal" looks like:

Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing

Experiment: Kill 1 of 3 service instances

Hypothesis: Steady state metrics remain within 10% of baseline
within 60 seconds of fault injection.

Abort if: Error rate > 5% for more than 30 seconds.
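
The steady-state thresholds and the abort rule can be expressed as two small checks. This is a minimal sketch under assumptions: rates are given as fractions, error-rate samples arrive as time-ordered `(seconds, rate)` pairs, and the `Metrics` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    success_rate: float    # fraction, e.g. 0.9995 for 99.95%
    p99_latency_ms: float
    error_rate: float      # fraction, e.g. 0.0005 for 0.05%
    alerts_firing: int


def is_steady(m: Metrics) -> bool:
    """True when all four steady-state conditions hold."""
    return (m.success_rate > 0.999 and m.p99_latency_ms < 200
            and m.error_rate < 0.001 and m.alerts_firing == 0)


def should_abort(error_rate_samples, threshold=0.05, max_seconds=30):
    """Abort when the error rate stays above threshold for more than max_seconds.

    error_rate_samples: time-ordered (seconds_since_injection, error_rate) pairs.
    """
    breach_start = None
    for t, rate in error_rate_samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = t           # start of a continuous breach
            if t - breach_start > max_seconds:
                return True
        else:
            breach_start = None            # recovered: reset the breach window
    return False
```

Checking `is_steady` before injecting the fault guards against running an experiment on a system that is already unhealthy, and `should_abort` gives the kill switch an unambiguous trigger.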

Disponível em: English Français 한국어 Português

Skill de IAManage incidenteOperações

Classify incidentes, coordinate updates, e transformar o resposta em follow-up work. — Claude Skill

Um Skill Claude para Claude Code por NickCrew — executar /incident-response no Claude·Atualizado em 18 de jun. de 2026·vmain@1d565c1

Compatível comChatGPT

ClaudeClaude CodeCodex / Codex CLI

Cursor

Gemini

Ajuda equipas a gerir outages ou degradações de serviço com severidade, responsável, cadência de estado, contenção, atualizações para clientes e postmortem sem culpa.

Classifies severidade so o equipa knows whether this is P0, P1, P2, ou P3.
Defines who owns o incidente, how often para update people, e what channel is o fonte de truth.
transforma noisy Slack, ticket, alert, e cliente notes em cliente-safe status updates.
Separates immediate containment, cliente communication, internal coordination, e post-incidente follow-up.
cria postmortem ações com responsáveis, due dates, e prevention verifica depois o incidente is resolved.

VocêHoje

suporte, engenharia, e leadership discuss an incidente in scattered channels while clientes receive late ou inconsistent updates.

Com /incident-response

Run /incident-response para classify severidade, atribuir an responsável, set update cadence, draft status updates, e capture o postmortem trail.

1 Paste incidente facts e timeline2 Classify severidade e impacto3 Draft updates e ações4 transformar o timeline em postmortem follow-up

Who it's for

Support Lead

Turn incident facts into clear severity, customer updates, escalation paths, and follow-up actions.

Project Manager

Coordinate owners, timeline, next updates, and post-incident action items across teams.

What it does

Active incident update

Turn scattered incident facts into a clear status update for customers, support, and leadership.

Severity and ownership

Decide how serious the incident is, who owns coordination, and when the next update is due.

Postmortem preparation

Convert the response timeline into root cause, contributing factors, and concrete action items.

How it works

Collect symptoms, user impact, affected services, timeline, current owner, and any customer reports.

Classify severity and choose the communication cadence.

Separate immediate containment from deeper root-cause work.

Draft customer, support, and leadership updates that say what is known, what is not known, and when the next update will arrive.

Record the incident timeline and action tracker while facts are changing.

After resolution, produce a blameless postmortem outline with specific follow-up actions.

Input options

Symptoms and impact

What users are experiencing, how many are affected, and whether data, revenue, or security is involved.

Example

What the user pastes

09:04 Support reports checkout failures from 18 customers.
09:07 Payments dashboard shows card authorization errors up 32%.
09:10 Engineering suspects the new fraud rule rollout.
09:14 Sales says two enterprise trials are blocked.
Impact: checkout fails for Safari users in US/EU. No data loss. Workaround: retry on Chrome works for some users.
Current channels: #support-urgent, #payments-eng, Zendesk tickets, Jira bug PAY-1842.
Need: severity, customer update, owner, internal update, next actions, and postmortem notes.

Useful output

How it reads the incident

Classify as P1: a revenue-critical path is degraded for a meaningful user segment, but there is no total outage, security issue, or data loss. Treat Safari checkout as the affected customer experience, not "all payments down."

Incident command

Incident commander: support lead until Payments engineering names a technical lead. Source of truth: #inc-checkout-errors. Linked ticket: PAY-1842. Update cadence: every 30 minutes until mitigation, then every 60 minutes until resolved. Next external update due 09:45 UTC.

Customer-safe update

We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. We will share the next update by 09:45 UTC.

Internal stakeholder update

Impact: 18 reported customers plus two blocked enterprise trials. Suspected trigger: fraud-rule rollout. The current hypothesis is not confirmed. Payments engineering is checking browser-specific authorization failures and preparing a rollback if the rule is implicated.

Action tracker

Now: pause or roll back the fraud rule rollout. Owner: Payments engineering.
Now: link all Zendesk tickets to PAY-1842. Owner: support.
Next: monitor checkout success by browser and region. Owner: analytics/Payments.
Next: prepare a saved support reply with the workaround and next-update time. Owner: support lead.

Postmortem seed

Questions: why did the rollout reach Safari users without browser-specific checkout monitoring? Why did sales hear from trials before the team had a status update? Follow-up actions should include segmented checkout alerts, a rollout checklist update, and incident-linking instructions for support.

Human review

Confirm severity, legal/compliance wording, and whether the workaround is safe to publish before sending any customer-facing update.

Metrics it improves

Ticket cycle time (Operations)

Helps support and engineering move urgent incidents through ownership and next actions faster.

Issue hygiene (Operations)

Turns incident notes into clear bugs, follow-up actions, owners, and due dates.

Works with

Slack (manual)

Use incident channels, updates, and responder notes as the timeline source.

Jira (manual)

Track incident follow-up actions, bugs, and postmortem tasks.

Zendesk (manual)

Use customer reports and support tickets to understand user impact.

Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

When to use

Production incident in progress (outage, degradation, data loss)
Designing circuit breakers, bulkheads, or fallback strategies
Conducting or planning chaos engineering exercises
Writing or reviewing postmortem documents
Establishing on-call procedures and escalation paths

Avoid when:

The issue is a development-time bug with no production impact
Designing general system architecture (use system-design instead)

Quick Reference

Topic	Load reference
Triage Framework	¤KEEP0¤
Postmortem Patterns	¤KEEP0¤

Incident Response Workflow

Phase 1: Detect

Alert fires or user report received
Confirm the issue is real (not a false positive)
Identify affected services and user impact scope

Phase 2: Triage

Classify severity (P0-P3)
Assign incident commander
Open communication channel (war room, Slack channel)
Begin status page updates

Phase 3: Contain

Stop the bleeding: rollback, feature flag, traffic shift
Prevent cascade: circuit breakers, load shedding, bulkhead isolation
Communicate: stakeholder updates every 15 minutes for P0/P1

Phase 4: Resolve

Implement fix (minimal viable fix first)
Validate in staging if time permits
Deploy with monitoring and rollback plan ready
Confirm recovery with metrics returning to baseline

Phase 5: Postmortem

Document timeline within 48 hours
Conduct blameless review with all participants
Identify root cause and contributing factors
Assign action items with owners and deadlines
Update runbooks and alerting based on lessons learned

Severity Framework

Level	Impact	Response Time	Examples
P0	Complete outage, data loss, security breach	Immediate (< 5 min)	Service down, data corruption, credential leak
P1	Major feature broken, significant user impact	< 30 min	Payment processing failed, auth broken for a region
P2	Degraded performance, partial feature loss	< 4 hours	Elevated latency, non-critical feature unavailable
P3	Minor issue, workaround available	Next business day	UI glitch, slow report generation, cosmetic error

Output

Incident timeline and severity classification
Containment actions taken
Postmortem document with action items
Updated runbooks and alerting rules

Common Mistakes

Skipping severity classification and treating everything as P0
Making changes without a rollback plan
Forgetting to communicate status to stakeholders
Writing postmortems that assign blame instead of identifying systemic issues
Not following up on postmortem action items

Reference documents

Triage Framework

Severity Classification (P0-P3)

P0 -- Critical

Definition: Complete service outage, active data loss, or security breach affecting all users.

Attribute	Requirement
Response time	< 5 minutes
Incident commander	Required (senior engineer or SRE)
Communication cadence	Every 15 minutes to stakeholders
War room	Immediately opened
Escalation	VP/Director notified within 15 minutes
Postmortem	Required within 48 hours

Examples:

Production database unreachable
Authentication service completely down
Active data corruption or loss
Security breach with confirmed exfiltration
Payment processing halted

P1 -- High

Definition: Major feature broken or significant degradation affecting a large subset of users.

Attribute	Requirement
Response time	< 30 minutes
Incident commander	Required
Communication cadence	Every 30 minutes to stakeholders
War room	Opened if not resolved in 30 minutes
Escalation	Manager notified within 30 minutes
Postmortem	Required within 1 week

Examples:

Payment processing failing for one region
Search functionality returning errors for 20%+ of queries
API latency 10x above normal
Mobile app crash on launch for specific OS version

P2 -- Medium

Definition: Degraded performance or partial feature loss with workarounds available.

Attribute	Requirement
Response time	< 4 hours
Incident commander	Optional (on-call engineer handles)
Communication cadence	Status update at start and resolution
War room	Not required
Escalation	If unresolved after 8 hours
Postmortem	Recommended

Examples:

Elevated latency (2-3x normal) on non-critical endpoints
Background job processing delayed
Non-critical third-party integration down
Report generation slow but functional

P3 -- Low

Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.

Attribute	Requirement
Response time	Next business day
Incident commander	Not required
Communication cadence	Ticket update on resolution
War room	Not required
Escalation	Not required
Postmortem	Not required

Examples:

UI rendering glitch in edge case
Non-critical cron job failed (will retry)
Slow dashboard load for internal tool
Minor logging error that does not affect functionality

Severity Decision Tree

Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
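
The decision tree reads naturally as a chain of early returns. The following is a minimal Python sketch, assuming the answers to the five questions are already known; the `IncidentFacts` class and its field names are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Answers to the decision-tree questions (field names are illustrative)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken: bool = False
    performance_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Walk the tree top to bottom; the first 'yes' decides the level."""
    if facts.data_loss_or_corruption:
        return "P0"
    if facts.security_breach:
        return "P0"
    if facts.primary_service_down:
        return "P0"
    if facts.major_feature_broken:
        return "P1"
    if facts.performance_degraded:
        return "P2"
    return "P3"


print(classify_severity(IncidentFacts(major_feature_broken=True)))  # P1
```

Because the questions are checked in order, an incident that is both a security breach and a performance degradation is classified by the more severe answer.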

Cascade Prevention

Circuit Breakers

Automatically stop calling a failing dependency to prevent cascading failure.

Implementation checklist:

Every external dependency has a circuit breaker
Failure thresholds are tuned per dependency (not one-size-fits-all)
OPEN state returns a meaningful fallback (cached data, degraded response, error)
HALF-OPEN probes are lightweight (health check, not full request)
Circuit breaker state is observable (metrics, dashboard)
Alerts fire when a circuit breaker opens

Configuration template:

Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
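
To make the configuration template concrete, here is a small circuit breaker sketch in Python. It is one possible implementation under stated assumptions, not the skill's own code: the `CircuitBreaker` class, its parameter names, and the injectable `clock` are all hypothetical, and the sketch is single-threaded (no locking).

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int, window_seconds: float,
                 reset_timeout: float, fallback: Callable[[], object],
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # [N] failures ...
        self.window_seconds = window_seconds        # ... in [T] seconds
        self.reset_timeout = reset_timeout          # time spent OPEN before probing
        self.fallback = fallback                    # what callers get while OPEN
        self.clock = clock
        self.state = "CLOSED"
        self.failures = []                          # timestamps of recent failures
        self.opened_at = 0.0

    def call(self, func: Callable[[], object]) -> object:
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.reset_timeout:
                return self.fallback()
            self.state = "HALF_OPEN"  # reset timeout elapsed: allow one probe
        try:
            result = func()
        except Exception:
            self._record_failure(now)
            return self.fallback()
        # Any success closes the circuit and clears the failure history.
        self.state = "CLOSED"
        self.failures.clear()
        return result

    def _record_failure(self, now: float) -> None:
        if self.state == "HALF_OPEN":
            self._open(now)  # failed probe: reopen immediately
            return
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t <= self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()


# Usage with a fake clock so the example is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, window_seconds=10, reset_timeout=30,
                         fallback=lambda: "cached response", clock=lambda: now[0])


def failing():
    raise ConnectionError("dependency down")


breaker.call(failing)
breaker.call(failing)
print(breaker.state)  # OPEN
```

In a real service the state transitions would also emit a metric and an alert, as the checklist requires; that part is omitted here.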

Bulkhead Isolation

Partition resources so failure in one area cannot exhaust resources for another.

Patterns:

Thread pool isolation: Separate thread pools per dependency
Connection pool isolation: Dedicated connection pools per downstream service
Process isolation: Critical and non-critical workloads in separate processes
Infrastructure isolation: Separate clusters for critical vs batch workloads

Checklist:

Critical path dependencies have dedicated resource pools
Non-critical background work cannot starve critical request handling
Resource limits are set per pool (max connections, max threads)
Pool exhaustion triggers alerts, not silent queuing
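
One way to sketch a per-dependency resource limit is a semaphore that rejects work when the pool is full instead of queuing it. The `Bulkhead` class and `BulkheadFullError` below are hypothetical names used for illustration; a production version would also raise an alert on rejection.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a pool has no free slots, so exhaustion is visible rather than queued."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' exhausted")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate pools: exhausting 'reports' cannot consume 'payments' capacity.
payments = Bulkhead("payments", max_concurrent=2)
reports = Bulkhead("reports", max_concurrent=1)
print(payments.run(lambda: "charged"))  # charged
```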

Load Shedding

Intentionally drop low-priority work to preserve capacity for high-priority traffic.

Priority tiers:

Priority	Traffic Type	Shed When
Critical	Health checks, authentication	Never
High	Core user requests	> 95% capacity
Medium	Secondary features, analytics	> 80% capacity
Low	Background jobs, prefetch	> 70% capacity

Implementation:

Use request priority headers or path-based classification
Return 503 with a Retry-After header for shed requests
Monitor shed rate as a metric (shedding > 0 is an alert)
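
The priority tiers translate directly into a threshold lookup. This is a minimal sketch, assuming utilization is available as a fraction between 0 and 1; the function names and the 30-second default for `Retry-After` are illustrative choices, not values from the skill.

```python
# Utilization above which each tier is shed; None means never shed.
SHED_THRESHOLDS = {"critical": None, "high": 0.95, "medium": 0.80, "low": 0.70}


def should_shed(priority: str, utilization: float) -> bool:
    threshold = SHED_THRESHOLDS[priority]
    return threshold is not None and utilization > threshold


def handle(priority: str, utilization: float, retry_after_seconds: int = 30):
    """Return (status_code, headers) for a request at the given load."""
    if should_shed(priority, utilization):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}


print(handle("low", 0.75))   # (503, {'Retry-After': '30'})
print(handle("high", 0.75))  # (200, {})
```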

Graceful Degradation Strategies

Strategy	Description	Example
Feature flags	Disable non-critical features	Turn off recommendations during high load
Cached fallback	Serve stale data	Show cached search results when search service is down
Read-only mode	Disable writes	Allow browsing but not purchasing during payment outage
Static fallback	Serve pre-generated content	Show static landing page when CMS is down
Queue and retry	Accept but defer processing	Accept orders, process when backend recovers
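
As an illustration of the cached-fallback row, the sketch below serves the last good search results when the live call fails. The `search_with_fallback` function, the in-memory dict cache, and the `stale` marker are assumptions made for the example.

```python
from typing import Callable, Dict, List


def search_with_fallback(query: str,
                         search_fn: Callable[[str], List[str]],
                         cache: Dict[str, List[str]]) -> dict:
    try:
        results = search_fn(query)
    except Exception:
        if query in cache:
            # Degraded mode: stale results are better than an error page.
            return {"results": cache[query], "stale": True}
        raise  # nothing cached: surface the failure to the caller
    cache[query] = results  # refresh the cache on every success
    return {"results": results, "stale": False}


cache: Dict[str, List[str]] = {}
print(search_with_fallback("shoes", lambda q: ["sneaker", "boot"], cache))
# {'results': ['sneaker', 'boot'], 'stale': False}
```

Marking the response as stale lets the caller decide whether to show a "results may be out of date" notice.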

Communication Protocols

Status Page Updates

Template for a status page entry:

[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]

Impact: [brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]

Update cadence:

P0: Every 15 minutes until resolved
P1: Every 30 minutes until resolved
P2: At start and resolution
P3: At resolution only
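
The cadence can be encoded as a lookup that computes when the next update is due. This is a small sketch; the `next_update_due` helper is hypothetical, and it returns `None` for P2 and P3 because those levels have no recurring interval.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Recurring cadence from the list above; P2 and P3 have no recurring interval.
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status page update is due, or None if the
    severity only requires updates at start and/or resolution."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval


last = datetime(2026, 1, 1, 9, 15, tzinfo=timezone.utc)
print(next_update_due("P1", last).isoformat())  # 2026-01-01T09:45:00+00:00
print(next_update_due("P3", last))              # None
```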

Stakeholder Notification Template

Subject: [P0/P1] [Service] - [brief impact description]

Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [name]
War room: [Link/channel]

Internal Communication Rules

One source of truth: All updates go through the incident channel, not DMs
Facts, not speculation: Share what you know, flag what you suspect
Timestamp everything: Every action and observation gets a timestamp
No blame: Focus on what happened, not who caused it
Clear handoffs: When rotating, explicitly hand off context

Escalation Paths

Escalation Triggers

Condition	Action
P0 not acknowledged in 5 min	Page backup on-call
P0/P1 not mitigated in 30 min	Escalate to engineering manager
P0 not resolved in 1 hour	Escalate to VP/Director
Any severity affecting revenue	Notify finance and business stakeholders
Security incident confirmed	Notify security team and legal
Data breach suspected	Invoke data breach response plan
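
The time-based rows of the trigger table can be checked mechanically. The sketch below covers only those three rows; the revenue, security, and data-breach triggers are event-driven and are left out. The function name and parameters are hypothetical.

```python
from typing import List


def escalations_due(severity: str, minutes_elapsed: float,
                    acknowledged: bool, mitigated: bool, resolved: bool) -> List[str]:
    """Return the time-based escalation actions that are currently due."""
    actions = []
    if severity == "P0" and not acknowledged and minutes_elapsed >= 5:
        actions.append("Page backup on-call")
    if severity in ("P0", "P1") and not mitigated and minutes_elapsed >= 30:
        actions.append("Escalate to engineering manager")
    if severity == "P0" and not resolved and minutes_elapsed >= 60:
        actions.append("Escalate to VP/Director")
    return actions


print(escalations_due("P0", 35, acknowledged=True, mitigated=False, resolved=False))
# ['Escalate to engineering manager']
```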

Escalation Checklist

Primary on-call paged and acknowledged
If no acknowledgment in 5 min, secondary on-call paged
Incident commander assigned
Relevant team leads notified
Status page updated
Customer support briefed with talking points
Executive stakeholders notified (P0/P1 only)

On-Call Responsibilities

During incident:

Acknowledge page within 5 minutes
Assess severity and open incident channel
Begin investigation and document findings in real time
Coordinate with other teams as needed
Provide status updates at the required cadence

After incident:

Ensure monitoring confirms resolution
Draft incident timeline
Schedule postmortem if required
Update runbooks with any new learnings
Hand off to next on-call if shift ends during incident

Postmortem Patterns

Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.

Blameless Postmortem Structure

Core Principles

No blame: Focus on systems, processes, and conditions -- not individuals
Assume good intent: Everyone involved was doing their best with the information available
Learn, don't punish: The goal is prevention, not accountability
Share widely: Postmortems are organizational learning, not team shame

Postmortem Document Template

# Incident Postmortem: [Title]

**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [name]
**Author:** [name]
**Status:** [Draft | Review | Final]

## Summary

[1-2 sentence description of what happened and the user impact]

## Impact

- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |

## Root Cause

[Detailed description of the root cause. What condition or change led to the failure?]

## Contributing Factors

- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]

## What Went Well

- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]

## What Went Poorly

- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [name] | P1 | [Date] | Open |
| 2 | [Action description] | [name] | P2 | [Date] | Open |

## Lessons Learned

[Key takeaways that should inform future design, process, or tooling decisions]

Postmortem Meeting Facilitation

Before the meeting:

Draft the postmortem document and share it 24 hours in advance
All participants review the timeline for accuracy
Incident commander prepares the root cause analysis

During the meeting (60-90 min):

Timeline review (15 min): Walk through events, correct errors, fill gaps
Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
Contributing factors (15 min): What made the incident worse or harder to resolve?
What went well (10 min): Reinforce effective practices
Action items (20 min): Define concrete, assignable, time-bounded actions

After the meeting:

Finalize the document within 24 hours
Distribute to the broader organization
Enter action items in the tracking system
Schedule a follow-up review for action item completion

Root Cause Analysis Techniques

5 Whys

Repeatedly ask "why" to drill past symptoms to the underlying cause.

Example:

Problem: Users received duplicate order confirmation emails.

Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.

Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.

Guidelines:

Stop when you reach a systemic cause you can fix (process, tooling, design)
Do not stop at "human error" -- ask why the system allowed the error
Some incidents have multiple root causes; run 5 Whys for each branch
Answers should be factual, not speculative

Fishbone Diagram (Ishikawa)

Categorize contributing factors across standard dimensions.

                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike

Standard categories:

People: Knowledge gaps, staffing, communication
Process: Missing runbooks, unclear procedures, approval bottlenecks
Technology: Bugs, missing features, architectural gaps
Environment: Infrastructure, capacity, configuration
Monitoring: Missing alerts, incorrect thresholds, observability gaps
External: Third-party outages, traffic spikes, attacks

Fault Tree Analysis

Work backward from the failure to identify all possible causes.

Top event: Service outage
├── AND: Load balancer failure
│   ├── OR: Config error
│   └── OR: Health check misconfigured
└── AND: No failover triggered
    ├── OR: Failover not configured
    └── OR: Failover health check also failed

When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.
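
A fault tree can be evaluated as nested AND/OR gates over observed conditions. The sketch below assumes one reading of the diagram: the outage requires both second-level events (AND), and each of those occurs if any of its leaves is observed (OR). The `Node` class and `occurs` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: List["Node"] = field(default_factory=list)


def occurs(node: Node, observed: Set[str]) -> bool:
    """True if this event occurs given the set of observed leaf conditions."""
    if node.gate == "LEAF":
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


outage = Node("Service outage", "AND", [
    Node("Load balancer failure", "OR", [
        Node("Config error"),
        Node("Health check misconfigured"),
    ]),
    Node("No failover triggered", "OR", [
        Node("Failover not configured"),
        Node("Failover health check also failed"),
    ]),
])

print(occurs(outage, {"Config error"}))                             # False
print(occurs(outage, {"Config error", "Failover not configured"}))  # True
```

The first call returns False because a load balancer failure alone does not cause an outage under this reading: failover would still have to fail as well.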

Action Item Tracking

Action Item Quality Criteria

Each action item must be:

Specific: Clear description of what to do (not "improve monitoring")
Assignable: One owner, not a team
Time-bounded: Due date, not "when we get to it"
Verifiable: Clear definition of done
Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)
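
These criteria are simple enough to lint automatically before an item enters the tracker. The sketch below is illustrative only: the `ActionItem` fields, the `lint_action_item` function, and the short list of vague phrasings (taken from examples this document warns against) are assumptions, and a string match cannot fully judge specificity.

```python
from dataclasses import dataclass
from typing import List, Optional

# Phrasings this document calls out as too vague; extend as needed.
VAGUE_DESCRIPTIONS = {"improve monitoring", "be more careful", "add more tests"}
VALID_PRIORITIES = {"P1", "P2", "P3"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due_date: Optional[str] = None            # e.g. an ISO date string
    definition_of_done: Optional[str] = None
    priority: Optional[str] = None


def lint_action_item(item: ActionItem) -> List[str]:
    """Return the quality criteria the item fails; an empty list means it passes."""
    problems = []
    if item.description.strip().lower() in VAGUE_DESCRIPTIONS:
        problems.append("not specific")
    if not item.owner:
        problems.append("no owner")
    if not item.due_date or item.due_date.strip().upper() == "TBD":
        problems.append("no due date")
    if not item.definition_of_done:
        problems.append("not verifiable")
    if item.priority not in VALID_PRIORITIES:
        problems.append("not prioritized")
    return problems


print(lint_action_item(ActionItem("Improve monitoring", due_date="TBD")))
# ['not specific', 'no owner', 'no due date', 'not verifiable', 'not prioritized']
```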

Action Item Categories

Category	Description	Examples
Detection	Improve ability to notice the problem	Add alert, improve dashboard
Prevention	Stop the problem from occurring	Fix bug, add validation, improve architecture
Mitigation	Reduce impact when it happens	Add circuit breaker, improve rollback, write runbook
Process	Improve team response	Update on-call procedures, conduct training

Tracking and Follow-Up

Enter all action items in the team's issue tracker immediately
Tag with ¤KEEP0¤ and incident ID for traceability
Review open postmortem action items weekly in team standup
Escalate overdue P1 items to engineering manager
Close action items only when verified complete (not just "code merged")

Action Item Anti-Patterns

Anti-Pattern	Problem	Better Alternative
"Be more careful"	Not actionable	Automate the check
"Improve monitoring"	Too vague	"Add alert for X metric when > Y for Z minutes"
"No owner assigned"	Will not get done	Assign a specific person
"Due: TBD"	Will be deprioritized	Set a concrete date
"Add more tests"	Unbounded	"Add regression test for this specific failure mode"

Chaos Engineering Patterns

Fault Injection

Intentionally introduce failures to verify resilience.

Common fault types:

Fault	Tool/Method	Validates
Kill service instance	Process kill, pod delete	Auto-restart, health checks
Network latency	tc netem, Toxiproxy	Timeout handling, circuit breakers
Network partition	iptables, DNS override	Failover, split-brain handling
Disk full	fallocate, dd	Graceful degradation, alerting
CPU exhaustion	stress-ng	Autoscaling, load shedding
Dependency failure	Mock returning 500s	Fallback paths, error handling
Clock skew	chrony offset	Time-dependent logic

Fault Injection Checklist

Hypothesis defined: "We believe [X] will happen when [fault]"
Blast radius limited (single instance, canary, staging)
Rollback mechanism ready (kill switch for the experiment)
Monitoring in place to observe the effect
Team is aware the experiment is running
Abort criteria defined (stop if real user impact exceeds N%)

Game Days

Structured exercises where teams practice incident response against simulated failures.

Game Day Planning Template:

## Game Day: [Title]

**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [name]
**Participants:** [team members]

### Scenario
[Description of the simulated incident]

### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to the correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed

### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time

### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?

Game day cadence:

Quarterly for critical services
After major architecture changes
When onboarding new on-call engineers
After any P0 incident (test the fixes)

Resilience Testing Methodology

Resilience Maturity Levels

Level	Description	Activities
1 - Reactive	Fix failures after they happen	Postmortems, basic monitoring
2 - Aware	Know where failures could happen	Failure mode analysis, risk registry
3 - Proactive	Test for failures before they happen	Chaos experiments in staging
4 - Continuous	Regularly validate resilience in production	Automated chaos, game days
5 - Anti-fragile	Systems improve through failure	Feedback loops, auto-remediation

Resilience Testing Checklist

For each critical service, validate:

Single instance failure: Service recovers when one instance dies
Dependency timeout: Service handles slow dependencies gracefully
Dependency outage: Service degrades (not crashes) when dependency is down
Network partition: Service handles split-brain scenarios
Load spike: Service sheds load or scales under 3x normal traffic
Disk full: Service alerts and degrades before crashing
Configuration error: Service fails fast with a clear error on bad config
Rollback: Previous version can be deployed within 5 minutes
Data corruption: Backup restore has been tested within the last quarter

Steady-State Hypothesis

Before running any chaos experiment, define what "normal" looks like:

Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing

Experiment: Kill 1 of 3 service instances

Hypothesis: Steady-state metrics remain within 10% of baseline
within 60 seconds of fault injection.

Abort if: Error rate > 5% for more than 30 seconds.
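
The steady-state definition, the 10% tolerance, and the abort rule can each be expressed as a small check. This is a sketch under stated assumptions: rates are fractions (0.001 means 0.1%), error-rate samples arrive at a fixed interval (5 seconds here, an arbitrary choice), and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    success_rate: float    # fraction of requests that succeeded
    p99_latency_ms: float
    error_rate: float      # fraction of requests that failed
    alerts_firing: int


def is_steady_state(m: Metrics) -> bool:
    """The four 'normal' conditions from the steady-state definition."""
    return (m.success_rate > 0.999 and m.p99_latency_ms < 200
            and m.error_rate < 0.001 and m.alerts_firing == 0)


def within_tolerance(value: float, baseline: float, tolerance: float = 0.10) -> bool:
    """True if value is within the given fraction of baseline."""
    return abs(value - baseline) <= tolerance * abs(baseline)


def should_abort(error_rates: Sequence[float], sample_interval_s: float = 5,
                 threshold: float = 0.05, max_duration_s: float = 30) -> bool:
    """True if the error rate stayed above threshold for more than max_duration_s."""
    consecutive = 0
    for rate in error_rates:
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive * sample_interval_s > max_duration_s:
            return True
    return False


print(is_steady_state(Metrics(0.9995, 150, 0.0005, 0)))  # True
print(within_tolerance(215, 200))                        # True
print(should_abort([0.06] * 7))                          # True
```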
