Classify incidents, coordinate updates, and turn the response into follow-up work. - Claude Skill
A Claude Skill for Claude Code by NickCrew. Run /incident-response in Claude · Updated Jun 18, 2026 · vmain@1d565c1
Helps teams manage outages or service degradations with severity, ownership, status cadence, containment, customer updates, and a blameless postmortem.
- Classifies severity so the team knows whether this is P0, P1, P2, or P3.
- Defines who owns the incident, how often to update people, and which channel is the source of truth.
- Turns noisy Slack, ticket, alert, and customer notes into customer-safe status updates.
- Separates immediate containment, customer communication, internal coordination, and post-incident follow-up.
- Creates postmortem actions with owners, due dates, and prevention checks after the incident is resolved.
Support, engineering, and leadership discuss an incident in scattered channels while customers receive late or inconsistent updates.
Run /incident-response to classify severity, assign an owner, set an update cadence, draft status updates, and capture the postmortem trail.
Who it's for
What it does
Turn scattered incident facts into a clear status update for customers, support, and leadership.
Decide how serious the incident is, who owns coordination, and when the next update is due.
Convert the response timeline into root cause, contributing factors, and concrete action items.
How it works
Collect symptoms, user impact, affected services, timeline, current owner, and any customer reports.
Classify severity and choose the communication cadence.
Separate immediate containment from deeper root-cause work.
Draft customer, support, and leadership updates that say what is known, what is not known, and when the next update will arrive.
Record the incident timeline and action tracker while facts are changing.
After resolution, produce a blameless postmortem outline with specific follow-up actions.
Input options
What users are experiencing, how many are affected, and whether data, revenue, or security is involved.
Example
09:04 Support reports checkout failures from 18 customers. 09:07 Payments dashboard shows card authorization errors up 32%. 09:10 Engineering suspects the new fraud rule rollout. 09:14 Sales says two enterprise trials are blocked. Impact: checkout fails for Safari users in US/EU. No data loss. Workaround: retry on Chrome works for some users. Current channels: #support-urgent, #payments-eng, Zendesk tickets, Jira bug PAY-1842. Need: severity, customer update, owner, internal update, next actions, and postmortem notes.
Classify as P1: a revenue-critical path is degraded for a meaningful user segment, but there is no total outage, security issue, or data loss. Treat Safari checkout as the affected customer experience, not “all payments down.”
Incident commander: support lead until Payments engineering names a technical lead. Source of truth: #inc-checkout-errors. Linked ticket: PAY-1842. Update cadence: every 30 minutes until mitigation, then every 60 minutes until resolved. Next external update due 09:45 UTC.
We are investigating elevated checkout errors affecting some Safari users in the US and EU. Some customers may be able to complete checkout in another browser while we isolate the cause. We will share the next update by 09:45 UTC.
Impact: 18 reported customers plus two blocked enterprise trials. Suspected trigger: fraud-rule rollout. The current hypothesis is not confirmed. Payments engineering is checking browser-specific authorization failures and preparing a rollback if the rule is implicated.
Now: pause or roll back the fraud rule rollout. Owner: Payments engineering. Now: link all Zendesk tickets to PAY-1842. Owner: support. Next: monitor checkout success by browser and region. Owner: analytics/Payments. Next: prepare a saved support reply with the workaround and next-update time. Owner: support lead.
Questions: why did the rollout reach Safari users without browser-specific checkout monitoring? Why did sales hear from trials before the team had a status update? Follow-up actions should include segmented checkout alerts, a rollout checklist update, and incident-linking instructions for support.
Confirm severity, legal/compliance wording, and whether the workaround is safe to publish before sending any customer-facing update.
Metrics it improves
Works with
Want to use Incident Response?
Choose how to get started.
Install and run this skill locally on your computer.
Open a terminal on your computer and paste this command:
This downloads the skill with all its files to your computer:
Add -g at the end to make it available in all your projects.
Start Claude Code, then type the command:
Incident Response
Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.
When to use
- Production incident in progress (outage, degradation, data loss)
- Designing circuit breakers, bulkheads, or fallback strategies
- Conducting or planning chaos engineering exercises
- Writing or reviewing postmortem documents
- Establishing on-call procedures and escalation paths
Avoid when:
- The issue is a development-time bug with no production impact
- Designing general system architecture (use system-design instead)
Quick Reference
| Topic | Load reference |
|---|---|
| Triage Framework | ¤KEEP0¤ |
| Postmortem Patterns | ¤KEEP0¤ |
Incident Response Workflow
Phase 1: Detect
- Alert fires or user report received
- Confirm the issue is real (not a false positive)
- Identify affected services and user impact scope
Phase 2: Triage
- Classify severity (P0-P3)
- Assign incident commander
- Open communication channel (war room, Slack channel)
- Begin status page updates
Phase 3: Contain
- Stop the bleeding: rollback, feature flag, traffic shift
- Prevent cascade: circuit breakers, load shedding, bulkhead isolation
- Communicate: stakeholder updates every 15 minutes for P0/P1
Phase 4: Resolve
- Implement fix (minimal viable fix first)
- Validate in staging if time permits
- Deploy with monitoring and a rollback plan ready
- Confirm recovery with metrics returning to baseline
Phase 5: Postmortem
- Document timeline within 48 hours
- Conduct blameless review with all participants
- Identify root cause and contributing factors
- Assign action items with owners and deadlines
- Update runbooks and alerting based on lessons learned
Severity Framework
| Level | Impact | Response Time | Examples |
|---|---|---|---|
| P0 | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| P1 | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for region |
| P2 | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| P3 | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |
Output
- Incident timeline and severity classification
- Containment actions taken
- Postmortem document with action items
- Updated runbooks and alerting rules
Common Mistakes
- Skipping severity classification and treating everything as P0
- Making changes without a rollback plan
- Forgetting to communicate status to stakeholders
- Writing postmortems that assign blame instead of identifying systemic issues
- Not following up on postmortem action items
Reference documents
name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
keywords:
- incident response
- outage
- postmortem
- triage
- incident
- response
Triage Framework
Severity classification, cascade prevention, communication protocols, and escalation paths for production incidents. Use during active incidents or when establishing incident response procedures.
Severity Classification (P0-P3)
P0 -- Critical
Definition: Complete service outage, active data loss, or security breach affecting all users.
| Attribute | Requirement |
|---|---|
| Response time | < 5 minutes |
| Incident commander | Required (senior engineer or SRE) |
| Communication cadence | Every 15 minutes to stakeholders |
| War room | Immediately opened |
| Escalation | VP/Director notified within 15 minutes |
| Postmortem | Required within 48 hours |
Examples:
- Production database unreachable
- Authentication service completely down
- Active data corruption or loss
- Security breach with confirmed exfiltration
- Payment processing halted
P1 -- High
Definition: Major feature broken or significant degradation affecting a large subset of users.
| Attribute | Requirement |
|---|---|
| Response time | < 30 minutes |
| Incident commander | Required |
| Communication cadence | Every 30 minutes to stakeholders |
| War room | Opened if not resolved in 30 minutes |
| Escalation | Manager notified within 30 minutes |
| Postmortem | Required within 1 week |
Examples:
- Payment processing failing for one region
- Search functionality returning errors for 20%+ of queries
- API latency 10x above normal
- Mobile app crash on launch for specific OS version
P2 -- Medium
Definition: Degraded performance or partial feature loss with workarounds available.
| Attribute | Requirement |
|---|---|
| Response time | < 4 hours |
| Incident commander | Optional (on-call engineer handles) |
| Communication cadence | Status update at start and resolution |
| War room | Not required |
| Escalation | If unresolved after 8 hours |
| Postmortem | Recommended |
Examples:
- Elevated latency (2-3x normal) on non-critical endpoints
- Background job processing delayed
- Non-critical third-party integration down
- Report generation slow but functional
P3 -- Low
Definition: Minor issue with minimal user impact. Workaround exists or issue is cosmetic.
| Attribute | Requirement |
|---|---|
| Response time | Next business day |
| Incident commander | Not required |
| Communication cadence | Ticket update on resolution |
| War room | Not required |
| Escalation | Not required |
| Postmortem | Not required |
Examples:
- UI rendering glitch in edge case
- Non-critical cron job failed (will retry)
- Slow dashboard load for internal tool
- Minor logging error that does not affect functionality
Severity Decision Tree
Is data being lost or corrupted?
├─ Yes → P0
└─ No
   Is there a security breach?
   ├─ Yes → P0
   └─ No
      Is the primary service completely down?
      ├─ Yes → P0
      └─ No
         Is a major feature broken for many users?
         ├─ Yes → P1
         └─ No
            Is performance significantly degraded?
            ├─ Yes → P2
            └─ No → P3
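The decision tree can be sketched as a short function that checks each question in order and returns the first match. This is an illustrative sketch only: the `IncidentSignals` type and its field names are assumptions made for the example, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Answers to the decision-tree questions (hypothetical field names)."""
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    primary_service_down: bool = False
    major_feature_broken_for_many_users: bool = False
    performance_significantly_degraded: bool = False


def classify_severity(signals: IncidentSignals) -> str:
    """Walk the tree top to bottom and return the first matching level."""
    if signals.data_loss_or_corruption:
        return "P0"
    if signals.security_breach:
        return "P0"
    if signals.primary_service_down:
        return "P0"
    if signals.major_feature_broken_for_many_users:
        return "P1"
    if signals.performance_significantly_degraded:
        return "P2"
    return "P3"


# A major feature is broken for many users, with no data loss or breach.
checkout_incident = IncidentSignals(major_feature_broken_for_many_users=True)
print(classify_severity(checkout_incident))  # P1
```

Because the checks run in tree order, an incident that is both degraded and losing data still comes out as P0.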
Cascade Prevention
Circuit Breakers
Automatically stop calling a failing dependency to prevent cascading failure.
Implementation checklist:
- Every external dependency has a circuit breaker
- Failure thresholds are tuned per dependency (not one-size-fits-all)
- OPEN state returns a meaningful fallback (cached data, degraded response, error)
- HALF-OPEN probes are lightweight (health check, not full request)
- Circuit breaker state is observable (metrics, dashboard)
- Alerts fire when a circuit breaker opens
Configuration template:
Dependency: [service name]
Failure threshold: [N] failures in [T] seconds
Reset timeout: [T] seconds
Fallback: [cached response | error message | degraded mode]
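As a rough illustration of how the template fields map onto behavior, here is a minimal single-threaded circuit breaker sketch. The class, its parameter names, and the injectable `clock` are assumptions made for this example. A production breaker would also need locking, metrics, and alerting, as the checklist requires.

```python
import time


class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF-OPEN breaker for one dependency (sketch)."""

    def __init__(self, failure_threshold, window_seconds, reset_timeout,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # [N] failures ...
        self.window_seconds = window_seconds        # ... in [T] seconds
        self.reset_timeout = reset_timeout          # seconds before a probe is allowed
        self.clock = clock                          # injectable so tests can fake time
        self.failures = []                          # timestamps of recent failures
        self.opened_at = None                       # None means the circuit is CLOSED

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF-OPEN"
        return "OPEN"

    def call(self, func, fallback):
        if self.state == "OPEN":
            return fallback()  # do not touch the failing dependency
        try:
            result = func()  # a normal call, or the HALF-OPEN probe
        except Exception:
            self._record_failure()
            return fallback()
        # Success closes the circuit and clears the failure history.
        self.failures.clear()
        self.opened_at = None
        return result

    def _record_failure(self):
        now = self.clock()
        if self.opened_at is not None:
            self.opened_at = now  # failed probe: reopen and restart the timer
            return
        cutoff = now - self.window_seconds
        self.failures = [t for t in self.failures if t > cutoff] + [now]
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now


breaker = CircuitBreaker(failure_threshold=3, window_seconds=10, reset_timeout=30)
print(breaker.call(lambda: "live response", lambda: "cached response"))  # live response
```

In this sketch the probe is simply the next real call. To meet the "lightweight probe" checklist item, `func` could be swapped for a health check while the state is HALF-OPEN.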
Bulkhead Isolation
Partition resources so failure in one area cannot exhaust resources for another.
Patterns:
- Thread pool isolation: Separate thread pools per dependency
- Connection pool isolation: Dedicated connection pools per downstream service
- Process isolation: Critical and non-critical workloads in separate processes
- Infrastructure isolation: Separate clusters for critical vs batch workloads
Checklist:
- Critical path dependencies have dedicated resource pools
- Non-critical background work cannot starve critical request handling
- Resource limits are set per pool (max connections, max threads)
- Pool exhaustion triggers alerts, not silent queuing
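One simple way to approximate a bulkhead in application code is a per-dependency concurrency cap that rejects work when full instead of queuing it. The sketch below uses a semaphore. The `Bulkhead` class and the pool names are hypothetical, and the `rejected` counter stands in for whatever metric would drive the exhaustion alert.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queuing."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejected = 0  # export as a metric so exhaustion is visible

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: a full pool fails fast rather than silently queuing.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise RuntimeError(f"bulkhead '{self.name}' exhausted")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate limits keep slow report generation from starving payment calls.
payments = Bulkhead("payments", max_concurrent=20)
reports = Bulkhead("reports", max_concurrent=2)
print(payments.run(lambda: "charged"))  # charged
```

Each pool has its own limit, so exhausting `reports` leaves every `payments` slot available.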
Load Shedding
Intentionally drop low-priority work to preserve capacity for high-priority traffic.
Priority tiers:
| Priority | Traffic Type | Shed When |
|---|---|---|
| Critical | Health checks, authentication | Never |
| High | Core user requests | > 95% capacity |
| Medium | Secondary features, analytics | > 80% capacity |
| Low | Background jobs, prefetch | > 70% capacity |
Implementation:
- Use request priority headers or path-based classification
- Return 503 with Retry-After header for shed requests
- Monitor shed rate as a metric (shedding > 0 is an alert)
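The tier table translates directly into a lookup of shed thresholds. The following sketch assumes utilization is available as a fraction between 0 and 1. The function names and the default `Retry-After` value are illustrative choices, and wiring into a real web framework is left out.

```python
# Utilization above which each tier is shed; None means the tier is never shed.
SHED_THRESHOLDS = {
    "critical": None,
    "high": 0.95,
    "medium": 0.80,
    "low": 0.70,
}


def should_shed(priority: str, utilization: float) -> bool:
    """Return True when a request of this priority should be dropped."""
    threshold = SHED_THRESHOLDS[priority]
    return threshold is not None and utilization > threshold


def handle(priority: str, utilization: float, retry_after_seconds: int = 5):
    """Return (status, headers) for a request; framework wiring is omitted."""
    if should_shed(priority, utilization):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}


print(handle("low", 0.75))     # (503, {'Retry-After': '5'})
print(handle("medium", 0.75))  # (200, {})
```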
Graceful Degradation Strategies
| Strategy | Description | Example |
|---|---|---|
| Feature flags | Disable non-critical features | Turn off recommendations during high load |
| Cached fallback | Serve stale data | Show cached search results when search service is down |
| Read-only mode | Disable writes | Allow browsing but not purchasing during payment outage |
| Static fallback | Serve pre-generated content | Show static landing page when CMS is down |
| Queue and retry | Accept but defer processing | Accept orders, process when backend recovers |
Communication Protocols
Status Page Updates
Template for status page entry:
[TIMESTAMP] - [STATUS: Investigating | Identified | Monitoring | Resolved]
Impact: [Brief description of user-visible impact]
Current status: [What we know and what we're doing]
Next update: [When to expect the next update]
Update cadence:
- P0: Every 15 minutes until resolved
- P1: Every 30 minutes until resolved
- P2: At start and resolution
- P3: At resolution only
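To keep the "next update" promise honest, the cadence can be computed rather than guessed. This is a small sketch. The mapping mirrors the list above, while the function name and the example timestamp are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minutes between updates while unresolved; None = no recurring updates
# (P2 posts at start and resolution, P3 at resolution only).
UPDATE_INTERVAL_MINUTES = {"P0": 15, "P1": 30, "P2": None, "P3": None}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if none is scheduled."""
    interval = UPDATE_INTERVAL_MINUTES[severity]
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


# Hypothetical timestamp for a P1 incident whose last update went out at 09:15 UTC.
last = datetime(2026, 6, 18, 9, 15, tzinfo=timezone.utc)
print(next_update_due("P1", last).isoformat())  # 2026-06-18T09:45:00+00:00
```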
Stakeholder Notification Template
Subject: [P0/P1] [Service] - [Brief impact description]
Severity: P[0-3]
Start time: [ISO 8601 timestamp]
Impact: [Who is affected and how]
Current status: [What we know]
Actions taken: [What we've done so far]
ETA: [If known, otherwise "investigating"]
Next update: [When]
Incident commander: [Name]
War room: [Link/channel]
Internal Communication Rules
- One source of truth: All updates go through the incident channel, not DMs
- Facts, not speculation: Share what you know, flag what you suspect
- Timestamp everything: Every action and observation gets a timestamp
- No blame: Focus on what happened, not who caused it
- Clear handoffs: When rotating, explicitly hand off context
Escalation Paths
Escalation Triggers
| Condition | Action |
|---|---|
| P0 not acknowledged in 5 min | Page backup on-call |
| P0/P1 not mitigated in 30 min | Escalate to engineering manager |
| P0 not resolved in 1 hour | Escalate to VP/Director |
| Any severity affecting revenue | Notify finance and business stakeholders |
| Security incident confirmed | Notify security team and legal |
| Data breach suspected | Invoke data breach response plan |
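The trigger table can be evaluated mechanically against the current incident state, for example from a periodic check in an incident bot. The sketch below is one possible encoding. `IncidentState` and its fields are hypothetical, and a resolved incident is treated as also mitigated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    """Snapshot of an incident (hypothetical fields for illustration)."""
    severity: str  # "P0" to "P3"
    minutes_open: int
    acknowledged: bool = False
    mitigated: bool = False
    resolved: bool = False
    affects_revenue: bool = False
    security_confirmed: bool = False
    data_breach_suspected: bool = False


def escalation_actions(s: IncidentState) -> List[str]:
    """Return every escalation action whose trigger condition currently holds."""
    actions = []
    if s.severity == "P0" and not s.acknowledged and s.minutes_open >= 5:
        actions.append("Page backup on-call")
    if (s.severity in ("P0", "P1") and not (s.mitigated or s.resolved)
            and s.minutes_open >= 30):
        actions.append("Escalate to engineering manager")
    if s.severity == "P0" and not s.resolved and s.minutes_open >= 60:
        actions.append("Escalate to VP/Director")
    if s.affects_revenue:
        actions.append("Notify finance and business stakeholders")
    if s.security_confirmed:
        actions.append("Notify security team and legal")
    if s.data_breach_suspected:
        actions.append("Invoke data breach response plan")
    return actions


checkout = IncidentState(severity="P1", minutes_open=35,
                         acknowledged=True, affects_revenue=True)
print(escalation_actions(checkout))
# ['Escalate to engineering manager', 'Notify finance and business stakeholders']
```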
Escalation Checklist
- Primary on-call paged and acknowledged
- If no acknowledgment in 5 min, secondary on-call paged
- Incident commander assigned
- Relevant team leads notified
- Status page updated
- Customer support briefed with talking points
- Executive stakeholders notified (P0/P1 only)
On-Call Responsibilities
During incident:
- Acknowledge page within 5 minutes
- Assess severity and open incident channel
- Begin investigation and document findings in real time
- Coordinate with other teams as needed
- Provide status updates at the required cadence
After incident:
- Ensure monitoring confirms resolution
- Draft incident timeline
- Schedule postmortem if required
- Update runbooks with any new learnings
- Hand off to next on-call if shift ends during incident
Postmortem Patterns
Blameless postmortem structure, root cause analysis techniques, action item tracking, and chaos engineering patterns. Use after incident resolution or when designing resilience testing programs.
Blameless Postmortem Structure
Core Principles
- No blame: Focus on systems, processes, and conditions -- not individuals
- Assume good intent: Everyone involved was doing their best with the information available
- Learn, don't punish: The goal is prevention, not accountability
- Share widely: Postmortems are organizational learning, not team shame
Postmortem Document Template
# Incident Postmortem: [Title]
**Date:** [Incident date]
**Severity:** P[0-3]
**Duration:** [Start time] to [End time] ([total duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** [Draft | Review | Final]
## Summary
[1-2 sentence description of what happened and the user impact]
## Impact
- **Users affected:** [Number or percentage]
- **Duration:** [How long users experienced the issue]
- **Revenue impact:** [If applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA violations]
## Timeline
All times in [timezone].
| Time | Event |
|------|-------|
| HH:MM | [First signal / alert fired] |
| HH:MM | [On-call acknowledged] |
| HH:MM | [Severity classified as P_] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Issue confirmed resolved] |
| HH:MM | [Monitoring confirmed stable] |
## Root Cause
[Detailed description of the root cause. What condition or change led to the failure?]
## Contributing Factors
- [Factor 1: e.g., missing monitoring for this failure mode]
- [Factor 2: e.g., deployment during high-traffic period]
- [Factor 3: e.g., no automated rollback configured]
## What Went Well
- [Thing 1: e.g., alert fired within 2 minutes of impact]
- [Thing 2: e.g., team coordinated effectively in war room]
- [Thing 3: e.g., rollback was smooth and fast]
## What Went Poorly
- [Thing 1: e.g., took 20 minutes to identify the failing service]
- [Thing 2: e.g., no runbook existed for this failure mode]
- [Thing 3: e.g., status page was not updated for 30 minutes]
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | P1 | [Date] | Open |
| 2 | [Action description] | [Name] | P2 | [Date] | Open |
## Lessons Learned
[Key takeaways that should inform future design, process, or tooling decisions]
Postmortem Meeting Facilitation
Before the meeting:
- Draft the postmortem document and share it 24 hours in advance
- All participants review the timeline for accuracy
- Incident commander prepares the root cause analysis
During the meeting (60-90 min):
- Timeline review (15 min): Walk through events, correct errors, fill gaps
- Root cause discussion (20 min): Apply 5 Whys or fishbone analysis
- Contributing factors (15 min): What made the incident worse or harder to resolve?
- What went well (10 min): Reinforce effective practices
- Action items (20 min): Define concrete, assignable, time-bounded actions
After the meeting:
- Finalize the document within 24 hours
- Distribute to the broader organization
- Enter action items in the tracking system
- Schedule follow-up review for action item completion
Root Cause Analysis Techniques
5 Whys
Repeatedly ask "why" to drill past symptoms to the underlying cause.
Example:
Problem: Users received duplicate order confirmation emails.
Why 1: The email service sent the confirmation twice.
Why 2: The order completion event was published twice.
Why 3: The order service retried after a timeout.
Why 4: The message broker acknowledged slowly under load.
Why 5: The broker's disk was 95% full, causing write delays.
Root cause: No disk usage monitoring or alerting on the message broker.
Action: Add disk usage alerting at 80% threshold + auto-scaling.
Guidelines:
- Stop when you reach a systemic cause you can fix (process, tooling, design)
- Do not stop at "human error" -- ask why the system allowed the error
- Some incidents have multiple root causes; run 5 Whys for each branch
- Answers should be factual, not speculative
Fishbone Diagram (Ishikawa)
Categorize contributing factors across standard dimensions.
                    ┌─ People: On-call unfamiliar with service
                    ├─ Process: No rollback runbook existed
Duplicate emails ───├─ Technology: No idempotency on email sends
                    ├─ Environment: Broker disk at 95%
                    ├─ Monitoring: No disk usage alerts
                    └─ External: Upstream traffic spike
Standard categories:
- People: Knowledge gaps, staffing, communication
- Process: Missing runbooks, unclear procedures, approval bottlenecks
- Technology: Bugs, missing features, architectural gaps
- Environment: Infrastructure, capacity, configuration
- Monitoring: Missing alerts, incorrect thresholds, observability gaps
- External: Third-party outages, traffic spikes, attacks
Fault Tree Analysis
Work backward from the failure to identify all possible causes.
Top event: Service outage
├── AND: Load balancer failure
│   ├── OR: Config error
│   └── OR: Health check misconfigured
└── AND: No failover triggered
    ├── OR: Failover not configured
    └── OR: Failover health check also failed
When to use: Complex incidents with multiple interacting failures where 5 Whys is insufficient.
Action Item Tracking
Action Item Quality Criteria
Every action item must be:
- Specific: Clear description of what to do (not "improve monitoring")
- Assignable: One owner, not a team
- Time-bounded: Due date, not "when we get to it"
- Verifiable: Clear definition of done
- Prioritized: P1 (before next on-call rotation), P2 (this sprint), P3 (this quarter)
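Some of these criteria can be checked automatically before an action item is accepted into a tracker. The sketch below is only a rough lint under stated assumptions. `ActionItem`, the field names, and the short list of vague phrases are hypothetical, and specificity in particular still needs human review.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Descriptions that are known to be too vague (illustrative, not exhaustive).
VAGUE_DESCRIPTIONS = ("improve monitoring", "be more careful", "add more tests")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None               # one named person
    due: Optional[date] = None
    definition_of_done: Optional[str] = None
    priority: Optional[str] = None            # "P1", "P2", or "P3"


def quality_problems(item: ActionItem) -> List[str]:
    """Return the quality criteria this action item fails, in list order."""
    problems = []
    if item.description.strip().lower() in VAGUE_DESCRIPTIONS:
        problems.append("not specific")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if not item.definition_of_done:
        problems.append("no definition of done")
    if item.priority not in ("P1", "P2", "P3"):
        problems.append("no valid priority")
    return problems


vague = ActionItem("Improve monitoring")
print(quality_problems(vague))
# ['not specific', 'no owner', 'no due date', 'no definition of done', 'no valid priority']
```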
Action Item Categories
| Category | Description | Examples |
|---|---|---|
| Detection | Improve ability to notice the problem | Add alert, improve dashboard |
| Prevention | Stop the problem from occurring | Fix bug, add validation, improve architecture |
| Mitigation | Reduce impact when it happens | Add circuit breaker, improve rollback, write runbook |
| Process | Improve team response | Update on-call procedures, conduct training |
Tracking and Follow-Up
- Enter all action items in the team's issue tracker immediately
- Tag with ¤KEEP0¤ and incident ID for traceability
- Review open postmortem action items weekly in team standup
- Escalate overdue P1 items to engineering manager
- Close action items only when verified complete (not just "code merged")
Action Item Anti-Patterns
| Anti-Pattern | Problem | Better Alternative |
|---|---|---|
| "Be more careful" | Not actionable | Automate the check |
| "Improve monitoring" | Too vague | "Add alert for X metric when > Y for Z minutes" |
| "No owner assigned" | Will not get done | Assign a specific person |
| "Due: TBD" | Will be deprioritized | Set a concrete date |
| "Add more tests" | Unbounded | "Add regression test for this specific failure mode" |
Chaos Engineering Patterns
Fault Injection
Intentionally introduce failures to verify resilience.
Common fault types:
| Fault | Tool/Method | Validates |
|---|---|---|
| Kill service instance | Process kill, pod delete | Auto-restart, health checks |
| Network latency | tc netem, Toxiproxy | Timeout handling, circuit breakers |
| Network partition | iptables, DNS override | Failover, split-brain handling |
| Disk full | fallocate, dd | Graceful degradation, alerting |
| CPU exhaustion | stress-ng | Autoscaling, load shedding |
| Dependency failure | Mock returning 500s | Fallback paths, error handling |
| Clock skew | chrony offset | Time-dependent logic |
Fault Injection Checklist
- Hypothesis defined: "We believe [X] will happen when [fault]"
- Blast radius limited (single instance, canary, staging)
- Rollback mechanism ready (kill switch for the experiment)
- Monitoring in place to observe the effect
- Team is aware the experiment is running
- Abort criteria defined (stop if real user impact exceeds N%)
Game Days
Structured exercises where teams practice incident response against simulated failures.
Game Day Planning Template:
## Game Day: [Title]
**Date:** [Date and time]
**Duration:** [Expected duration]
**Facilitator:** [Name]
**Participants:** [Team members]
### Scenario
[Description of the simulated incident]
### Objectives
- [ ] Validate alerting detects the failure within [N] minutes
- [ ] Validate team can triage to correct severity
- [ ] Validate mitigation can be applied within [N] minutes
- [ ] Validate communication protocols are followed
### Ground Rules
- This is practice, not evaluation
- Facilitator controls the scenario progression
- Anyone can call "stop" if real production impact is detected
- Document all observations in real time
### Debrief Questions
1. Did alerts fire as expected?
2. Was the right team engaged quickly enough?
3. Were runbooks adequate?
4. What would we do differently in a real incident?
Game day cadence:
- Quarterly for critical services
- After major architecture changes
- When onboarding new on-call engineers
- After any P0 incident (test the fixes)
Resilience Testing Methodology
Resilience Maturity Levels
| Level | Description | Activities |
|---|---|---|
| 1 - Reactive | Fix failures after they happen | Postmortems, basic monitoring |
| 2 - Aware | Know where failures could happen | Failure mode analysis, risk registry |
| 3 - Proactive | Test for failures before they happen | Chaos experiments in staging |
| 4 - Continuous | Regularly validate resilience in production | Automated chaos, game days |
| 5 - Anti-fragile | Systems improve through failure | Feedback loops, auto-remediation |
Resilience Testing Checklist
For each critical service, validate:
- Single instance failure: Service recovers when one instance dies
- Dependency timeout: Service handles slow dependencies gracefully
- Dependency outage: Service degrades (not crashes) when dependency is down
- Network partition: Service handles split-brain scenarios
- Load spike: Service sheds load or scales under 3x normal traffic
- Disk full: Service alerts and degrades before crashing
- Configuration error: Service fails fast with a clear error on bad config
- Rollback: Previous version can be deployed within 5 minutes
- Data corruption: Backup restore has been tested within the last quarter
Steady-State Hypothesis
Before running any chaos experiment, define what "normal" looks like:
Steady state:
- Request success rate > 99.9%
- p99 latency < 200ms
- Error rate < 0.1%
- No alerts firing
Experiment: Kill 1 of 3 service instances
Hypothesis: Steady state metrics remain within 10% of baseline
within 60 seconds of fault injection.
Abort if: Error rate > 5% for more than 30 seconds.
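The abort rule and the "within 10% of baseline" check are easy to get subtly wrong by hand, so they may be worth encoding. The sketch below assumes metrics arrive as time-ordered samples of `(seconds since injection, error rate as a fraction)`. The function names and the sample data are illustrative, not part of the skill.

```python
from typing import Iterable, Tuple


def should_abort(samples: Iterable[Tuple[float, float]],
                 max_error_rate: float = 0.05,
                 max_duration_s: float = 30.0) -> bool:
    """True if the error rate stays above the limit for more than max_duration_s."""
    breach_started = None
    for t, error_rate in samples:
        if error_rate > max_error_rate:
            if breach_started is None:
                breach_started = t  # start of a continuous breach
            if t - breach_started > max_duration_s:
                return True
        else:
            breach_started = None  # recovered; reset the breach timer
    return False


def within_baseline(value: float, baseline: float, tolerance: float = 0.10) -> bool:
    """True if value is within +/- tolerance (as a fraction) of baseline."""
    return abs(value - baseline) <= tolerance * abs(baseline)


# Error rate stays above 5% from t=10s through t=45s, so the experiment aborts.
sustained = [(0, 0.01), (10, 0.06), (20, 0.07), (30, 0.08), (45, 0.09)]
print(should_abort(sustained))          # True
print(within_baseline(210.0, 200.0))    # True
```

A brief spike that recovers resets the timer, so only a continuous breach longer than 30 seconds triggers the abort.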