Runbook: Incident Response

Mode: NEXUS-Micro | Duration: Minutes to hours | Agents: 3-8

Scenario

Something is broken in production. Users are affected. Speed of response matters, but so does doing it right. This runbook covers detection through post-mortem.

Severity Classification

| Level | Definition | Examples | Response Time |
|-------|------------|----------|---------------|
| P0 — Critical | Service completely down, data loss, security breach | Database corruption, DDoS attack, auth system failure | Immediate (all hands) |
| P1 — High | Major feature broken, significant performance degradation | Payment processing down, 50%+ error rate, 10x latency | < 1 hour |
| P2 — Medium | Minor feature broken, workaround available | Search not working, non-critical API errors | < 4 hours |
| P3 — Low | Cosmetic issue, minor inconvenience | Styling bug, typo, minor UI glitch | Next sprint |
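The classification logic above can be sketched as a small triage helper. Everything here (the function name, its parameters, and the 50% and 10x thresholds taken from the table) is illustrative, not part of any real tooling:

```python
# A sketch of the severity table as code; all names and inputs are
# hypothetical -- in practice these answers come from monitoring and triage.
def classify(service_down: bool, data_at_risk: bool,
             error_rate: float, latency_multiplier: float,
             cosmetic_only: bool) -> str:
    """Map triage answers to P0-P3, mirroring the table above."""
    if service_down or data_at_risk:
        return "P0"  # complete outage, data loss, or security breach
    if error_rate >= 0.5 or latency_multiplier >= 10:
        return "P1"  # major breakage or severe degradation
    if cosmetic_only:
        return "P3"  # styling bug, typo, minor UI glitch
    return "P2"      # minor feature broken, workaround available

RESPONSE_TIME = {"P0": "Immediate (all hands)", "P1": "< 1 hour",
                 "P2": "< 4 hours", "P3": "Next sprint"}
```

For example, a 10x latency spike with the service still up classifies as P1, which maps to a "< 1 hour" response target.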

Response Teams by Severity

P0 — Critical Response Team

| Role | Agent | Responsibility |
|------|-------|----------------|
| Incident Commander | Infrastructure Maintainer | Assess scope, coordinate response |
| Deployment/Rollback | DevOps Automator | Execute rollback if needed |
| Root Cause Investigation | Backend Architect | Diagnose system issues |
| UI Investigation | Frontend Developer | Diagnose client-side issues |
| User Communication | Support Responder | Status page updates, user notifications |
| Stakeholder Comms | Executive Summary Generator | Real-time executive updates |

P1 — High Response Team

  • Infrastructure Maintainer (Incident Commander)
  • DevOps Automator (Deployment support)
  • Relevant Developer Agent (Fix implementation)
  • Support Responder (User communication)

P2 — Medium Response

  • Relevant Developer Agent (Fix implementation)
  • Evidence Collector (Verify fix)

P3 — Low Response

  • Sprint Prioritizer (Add to backlog)

Incident Response Sequence

Step 1: Detection & Triage (0-5 minutes)

1. Trigger Received
   Alert from monitoring / User report / Agent detection
2. Infrastructure Maintainer Actions
   1. Acknowledge alert
   2. Assess scope and impact
      • How many users affected?
      • Which services are impacted?
      • Is data at risk?
   3. Classify severity (P0/P1/P2/P3)
   4. Activate appropriate response team
   5. Create incident channel/thread
3. Output
   Incident classification + response team activated

Step 2: Investigation (5-30 minutes)

Parallel Investigation:
  • Check system metrics (CPU, memory, network, disk)
  • Review error logs
  • Check recent deployments
  • Verify external dependencies
Output: Root cause identified (or narrowed to component)
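The four investigation tracks are independent, so they can run concurrently. A minimal sketch using a thread pool, where each check function is a placeholder returning made-up findings (real implementations would query your monitoring, log aggregator, deploy history, and vendor status pages):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder checks with illustrative return values; real versions would
# call out to monitoring, logging, CI/CD, and dependency status APIs.
def check_system_metrics():
    return ("metrics", "CPU/memory/network/disk nominal")

def check_error_logs():
    return ("logs", "5xx spike began 14:02 UTC")

def check_recent_deployments():
    return ("deploys", "release shipped 13:58 UTC")

def check_external_dependencies():
    return ("deps", "all vendors reporting healthy")

def investigate() -> dict:
    """Run all four investigation tracks in parallel; collect findings."""
    checks = [check_system_metrics, check_error_logs,
              check_recent_deployments, check_external_dependencies]
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        return dict(pool.map(lambda check: check(), checks))
```

The value of parallelism here is time-boxing: the slowest check bounds the whole step, instead of the sum of all four.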

Step 3: Mitigation (15-60 minutes)

Throughout Mitigation:
  • Support Responder: Update status page every 15 minutes
  • Executive Summary Generator: Brief stakeholders (P0 only)
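The runbook leaves the mitigation choice to the responders, but the most common P0/P1 mitigation is rolling back a suspect deployment. A hedged sketch of that decision; the baseline rate, the 5x multiplier, and the three-sample window are assumptions, not values from this runbook:

```python
# Illustrative rollback heuristic for the DevOps Automator role.
# BASELINE_ERROR_RATE and the spike criteria are assumed placeholders.
BASELINE_ERROR_RATE = 0.01  # assumed normal error rate (1%)

def should_roll_back(recent_deploy: bool, error_rates: list[float]) -> bool:
    """Roll back when a recent deploy coincides with a sustained error spike.

    error_rates: samples taken since the incident began, oldest first.
    A spike is "sustained" if the last three samples all exceed 5x baseline.
    """
    sustained_spike = len(error_rates) >= 3 and all(
        rate > 5 * BASELINE_ERROR_RATE for rate in error_rates[-3:]
    )
    return recent_deploy and sustained_spike
```

Requiring three consecutive elevated samples avoids rolling back on a transient blip, at the cost of a slightly slower trigger.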

Step 4: Resolution Verification (Post-fix)

Evidence Collector

  1. Verify fix resolves the issue
  2. Screenshot evidence of working state
  3. Confirm no new issues introduced

Infrastructure Maintainer

  1. Verify all metrics returning to normal
  2. Confirm no cascading failures
  3. Monitor for 30 minutes post-fix

API Tester

If API-related:
  1. Run regression on affected endpoints
  2. Verify response times normalized
  3. Confirm error rates at baseline
Output: Incident resolved confirmation
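The API Tester's "back to baseline" check can be sketched as a comparison against recorded baselines. The endpoints, baseline numbers, and 1.2x tolerance below are all illustrative assumptions:

```python
# Hypothetical post-fix verification; endpoints, baselines, and the
# tolerance factor are assumptions for illustration only.
BASELINES = {
    "/checkout": {"p95_ms": 250, "error_rate": 0.001},
    "/search":   {"p95_ms": 400, "error_rate": 0.005},
}

def verify_endpoint(endpoint: str, p95_ms: float, error_rate: float,
                    tolerance: float = 1.2) -> bool:
    """Pass if observed latency and errors are within tolerance of baseline."""
    base = BASELINES[endpoint]
    return (p95_ms <= base["p95_ms"] * tolerance
            and error_rate <= base["error_rate"] * tolerance)

def incident_resolved(measurements: dict) -> bool:
    """measurements: endpoint -> (p95_ms, error_rate) observed after the fix."""
    return all(verify_endpoint(ep, p95, err)
               for ep, (p95, err) in measurements.items())
```

A tolerance above 1.0 accounts for normal jitter; requiring exact baseline values would make the check flaky.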

Step 5: Post-Mortem (Within 48 hours)

1. Timeline Reconstruction
  • When was the issue introduced?
  • When was it detected?
  • When was it resolved?
  • Total user impact duration
2. Root Cause Analysis
  • What failed?
  • Why did it fail?
  • Why wasn’t it caught earlier?
  • 5 Whys analysis
3. Impact Assessment
  • Users affected
  • Revenue impact
  • Reputation impact
  • Data impact
4. Prevention Measures
  • What monitoring would have caught this sooner?
  • What testing would have prevented this?
  • What process changes are needed?
  • What infrastructure changes are needed?
5. Action Items
  • [Action] → [Owner] → [Deadline]
  • [Action] → [Owner] → [Deadline]
  • [Action] → [Owner] → [Deadline]
Output: Post-Mortem Report → Sprint Prioritizer adds prevention tasks to backlog
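The timeline reconstruction in item 1 reduces to three timestamps and some subtraction. A small sketch (the metric names are informal; the timestamps in the test are made up):

```python
from datetime import datetime

# Illustrative timeline math for the post-mortem report.
def timeline_metrics(introduced: str, detected: str, resolved: str) -> dict:
    """Compute time-to-detect, time-to-resolve, and total impact, in minutes.

    Arguments are ISO 8601 timestamps, e.g. "2024-05-01T13:58".
    """
    t0, t1, t2 = (datetime.fromisoformat(t)
                  for t in (introduced, detected, resolved))

    def minutes(delta):
        return delta.total_seconds() / 60

    return {
        "time_to_detect_min": minutes(t1 - t0),   # introduced -> detected
        "time_to_resolve_min": minutes(t2 - t1),  # detected -> resolved
        "total_impact_min": minutes(t2 - t0),     # total user impact duration
    }
```

Tracking these per incident gives the trend data (mean time to detect, mean time to resolve) that the prevention measures in item 4 should move.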

Communication Templates

Status Page Update (Support Responder)

[TIMESTAMP] — [SERVICE NAME] Incident

Status: [Investigating / Identified / Monitoring / Resolved]
Impact: [Description of user impact]
Current action: [What we're doing about it]
Next update: [When to expect the next update]
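The template above is easy to enforce in code so updates stay consistent under pressure. A minimal formatter; the field names mirror the template, and the service and impact strings in the usage are invented examples:

```python
from datetime import datetime, timezone

# Minimal formatter for the status-page template above. The four valid
# statuses come straight from the template; everything else is illustrative.
VALID_STATUSES = {"Investigating", "Identified", "Monitoring", "Resolved"}

def status_update(service: str, status: str, impact: str,
                  action: str, next_update: str) -> str:
    """Render one status-page update in the runbook's template."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (f"[{ts}] — {service} Incident\n\n"
            f"Status: {status}\n"
            f"Impact: {impact}\n"
            f"Current action: {action}\n"
            f"Next update: {next_update}")
```

Rejecting unknown statuses keeps the public page to the four well-understood states instead of ad-hoc wording.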

Executive Update (Executive Summary Generator — P0 only)

INCIDENT BRIEF — [TIMESTAMP]

SITUATION: [Service] is [down/degraded] affecting [N users/% of traffic]
CAUSE: [Known/Under investigation] — [Brief description if known]
ACTION: [What's being done] — ETA [time estimate]
IMPACT: [Business impact — revenue, users, reputation]
NEXT UPDATE: [Timestamp]

Escalation Matrix

Critical Escalation Points
| Condition | Escalate To | Action |
|-----------|-------------|--------|
| P0 not resolved in 30 min | Studio Producer | Additional resources, vendor escalation |
| P1 not resolved in 2 hours | Project Shepherd | Resource reallocation |
| Data breach suspected | Legal Compliance Checker | Regulatory notification assessment |
| User data affected | Legal Compliance + Executive Summary | GDPR/CCPA notification |
| Revenue impact > $X | Finance Tracker + Studio Producer | Business impact assessment |
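The time- and condition-based rows of the matrix can be encoded as a lookup so escalation is not left to memory mid-incident. The revenue row is omitted below because the runbook leaves its threshold as $X; everything else mirrors the table:

```python
# The escalation matrix above as code. Durations are in minutes; the
# revenue-impact row is intentionally left out (threshold unspecified).
def escalation_targets(severity: str, unresolved_min: int,
                       data_breach: bool = False,
                       user_data_affected: bool = False) -> list[str]:
    """Return everyone who should be pulled in, per the matrix above."""
    targets = []
    if severity == "P0" and unresolved_min >= 30:
        targets.append("Studio Producer")
    if severity == "P1" and unresolved_min >= 120:
        targets.append("Project Shepherd")
    if data_breach:
        targets.append("Legal Compliance Checker")
    if user_data_affected:
        targets.append("Legal Compliance + Executive Summary")
    return targets
```

Calling this from the incident channel on a timer (e.g. every 15 minutes) turns the matrix from documentation into an automatic nag.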

Speed matters, but so does documentation. Every incident is a learning opportunity to improve the system.