Runbook: Incident Response

Mode: NEXUS-Micro | Duration: Minutes to hours | Agents: 3-8

Scenario

Something is broken in production. Users are affected. Speed of response matters, but so does doing it right. This runbook covers detection through post-mortem.

Severity Classification

| Level | Definition | Examples | Response Time |
|-------|------------|----------|---------------|
| P0 — Critical | Service completely down, data loss, security breach | Database corruption, DDoS attack, auth system failure | Immediate (all hands) |
| P1 — High | Major feature broken, significant performance degradation | Payment processing down, 50%+ error rate, 10x latency | < 1 hour |
| P2 — Medium | Minor feature broken, workaround available | Search not working, non-critical API errors | < 4 hours |
| P3 — Low | Cosmetic issue, minor inconvenience | Styling bug, typo, minor UI glitch | Next sprint |
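The classification logic above can be sketched as a small triage helper. Everything here (the function name, its parameters, and the 50% and 10x thresholds taken from the table) is illustrative, not part of any real tooling:

```python
# A sketch of the severity table as code; all names and inputs are
# hypothetical -- in practice these answers come from monitoring and triage.
def classify(service_down: bool, data_at_risk: bool,
             error_rate: float, latency_multiplier: float,
             cosmetic_only: bool) -> str:
    """Map triage answers to P0-P3, mirroring the table above."""
    if service_down or data_at_risk:
        return "P0"  # complete outage, data loss, or security breach
    if error_rate >= 0.5 or latency_multiplier >= 10:
        return "P1"  # major breakage or severe degradation
    if cosmetic_only:
        return "P3"  # styling bug, typo, minor UI glitch
    return "P2"      # minor feature broken, workaround available

RESPONSE_TIME = {"P0": "Immediate (all hands)", "P1": "< 1 hour",
                 "P2": "< 4 hours", "P3": "Next sprint"}
```

For example, a 10x latency spike with the service still up classifies as P1, which maps to a "< 1 hour" response target.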

Response Teams by Severity

P0 — Critical Response Team

| Role | Agent | Responsibility |
|------|-------|----------------|
| Incident Commander | Infrastructure Maintainer | Assess scope, coordinate response |
| Deployment/Rollback | DevOps Automator | Execute rollback if needed |
| Root Cause Investigation | Backend Architect | Diagnose system issues |
| UI Investigation | Frontend Developer | Diagnose client-side issues |
| User Communication | Support Responder | Status page updates, user notifications |
| Stakeholder Comms | Executive Summary Generator | Real-time executive updates |

P1 — High Response Team

  • Infrastructure Maintainer (Incident Commander)
  • DevOps Automator (Deployment support)
  • Relevant Developer Agent (Fix implementation)
  • Support Responder (User communication)

P2 — Medium Response

  • Relevant Developer Agent (Fix implementation)
  • Evidence Collector (Verify fix)

P3 — Low Response

  • Sprint Prioritizer (Add to backlog)

Incident Response Sequence

Step 1: Detection & Triage (0-5 minutes)

1. Trigger Received
   Alert from monitoring / User report / Agent detection
2. Infrastructure Maintainer Actions
   1. Acknowledge alert
   2. Assess scope and impact
      • How many users affected?
      • Which services are impacted?
      • Is data at risk?
   3. Classify severity (P0/P1/P2/P3)
   4. Activate appropriate response team
   5. Create incident channel/thread
3. Output
   Incident classification + response team activated

Step 2: Investigation (5-30 minutes)

Parallel Investigation:
  • Check system metrics (CPU, memory, network, disk)
  • Review error logs
  • Check recent deployments
  • Verify external dependencies
Output: Root cause identified (or narrowed to component)
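The four investigation tracks are independent, so they can run concurrently. A minimal sketch using a thread pool, where each check function is a placeholder returning made-up findings (real implementations would query your monitoring, log aggregator, deploy history, and vendor status pages):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder checks with illustrative return values; real versions would
# call out to monitoring, logging, CI/CD, and dependency status APIs.
def check_system_metrics():
    return ("metrics", "CPU/memory/network/disk nominal")

def check_error_logs():
    return ("logs", "5xx spike began 14:02 UTC")

def check_recent_deployments():
    return ("deploys", "release shipped 13:58 UTC")

def check_external_dependencies():
    return ("deps", "all vendors reporting healthy")

def investigate() -> dict:
    """Run all four investigation tracks in parallel; collect findings."""
    checks = [check_system_metrics, check_error_logs,
              check_recent_deployments, check_external_dependencies]
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        return dict(pool.map(lambda check: check(), checks))
```

The value of parallelism here is time-boxing: the slowest check bounds the whole step, instead of the sum of all four.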

Step 3: Mitigation (15-60 minutes)

Throughout Mitigation:
  • Support Responder: Update status page every 15 minutes
  • Executive Summary Generator: Brief stakeholders (P0 only)
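The runbook leaves the mitigation choice to the responders, but the most common P0/P1 mitigation is rolling back a suspect deployment. A hedged sketch of that decision; the baseline rate, the 5x multiplier, and the three-sample window are assumptions, not values from this runbook:

```python
# Illustrative rollback heuristic for the DevOps Automator role.
# BASELINE_ERROR_RATE and the spike criteria are assumed placeholders.
BASELINE_ERROR_RATE = 0.01  # assumed normal error rate (1%)

def should_roll_back(recent_deploy: bool, error_rates: list[float]) -> bool:
    """Roll back when a recent deploy coincides with a sustained error spike.

    error_rates: samples taken since the incident began, oldest first.
    A spike is "sustained" if the last three samples all exceed 5x baseline.
    """
    sustained_spike = len(error_rates) >= 3 and all(
        rate > 5 * BASELINE_ERROR_RATE for rate in error_rates[-3:]
    )
    return recent_deploy and sustained_spike
```

Requiring three consecutive elevated samples avoids rolling back on a transient blip, at the cost of a slightly slower trigger.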

Step 4: Resolution Verification (Post-fix)

Evidence Collector

  1. Verify fix resolves the issue
  2. Screenshot evidence of working state
  3. Confirm no new issues introduced

Infrastructure Maintainer

  1. Verify all metrics returning to normal
  2. Confirm no cascading failures
  3. Monitor for 30 minutes post-fix

API Tester

If API-related:
  1. Run regression on affected endpoints
  2. Verify response times normalized
  3. Confirm error rates at baseline
Output: Incident resolved confirmation
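The API Tester's "back to baseline" check can be sketched as a comparison against recorded baselines. The endpoints, baseline numbers, and 1.2x tolerance below are all illustrative assumptions:

```python
# Hypothetical post-fix verification; endpoints, baselines, and the
# tolerance factor are assumptions for illustration only.
BASELINES = {
    "/checkout": {"p95_ms": 250, "error_rate": 0.001},
    "/search":   {"p95_ms": 400, "error_rate": 0.005},
}

def verify_endpoint(endpoint: str, p95_ms: float, error_rate: float,
                    tolerance: float = 1.2) -> bool:
    """Pass if observed latency and errors are within tolerance of baseline."""
    base = BASELINES[endpoint]
    return (p95_ms <= base["p95_ms"] * tolerance
            and error_rate <= base["error_rate"] * tolerance)

def incident_resolved(measurements: dict) -> bool:
    """measurements: endpoint -> (p95_ms, error_rate) observed after the fix."""
    return all(verify_endpoint(ep, p95, err)
               for ep, (p95, err) in measurements.items())
```

A tolerance above 1.0 accounts for normal jitter; requiring exact baseline values would make the check flaky.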

Step 5: Post-Mortem (Within 48 hours)

1. Timeline Reconstruction
  • When was the issue introduced?
  • When was it detected?
  • When was it resolved?
  • Total user impact duration
2. Root Cause Analysis
  • What failed?
  • Why did it fail?
  • Why wasn’t it caught earlier?
  • 5 Whys analysis
3. Impact Assessment
  • Users affected
  • Revenue impact
  • Reputation impact
  • Data impact
4. Prevention Measures
  • What monitoring would have caught this sooner?
  • What testing would have prevented this?
  • What process changes are needed?
  • What infrastructure changes are needed?
5. Action Items
  • [Action] → [Owner] → [Deadline]
  • [Action] → [Owner] → [Deadline]
  • [Action] → [Owner] → [Deadline]
Output: Post-Mortem Report → Sprint Prioritizer adds prevention tasks to backlog
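The timeline reconstruction in item 1 reduces to three timestamps and some subtraction. A small sketch (the metric names are informal; the timestamps in the test are made up):

```python
from datetime import datetime

# Illustrative timeline math for the post-mortem report.
def timeline_metrics(introduced: str, detected: str, resolved: str) -> dict:
    """Compute time-to-detect, time-to-resolve, and total impact, in minutes.

    Arguments are ISO 8601 timestamps, e.g. "2024-05-01T13:58".
    """
    t0, t1, t2 = (datetime.fromisoformat(t)
                  for t in (introduced, detected, resolved))

    def minutes(delta):
        return delta.total_seconds() / 60

    return {
        "time_to_detect_min": minutes(t1 - t0),   # introduced -> detected
        "time_to_resolve_min": minutes(t2 - t1),  # detected -> resolved
        "total_impact_min": minutes(t2 - t0),     # total user impact duration
    }
```

Tracking these per incident gives the trend data (mean time to detect, mean time to resolve) that the prevention measures in item 4 should move.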

Communication Templates

Status Page Update (Support Responder)

[TIMESTAMP] — [SERVICE NAME] Incident

Status: [Investigating / Identified / Monitoring / Resolved]
Impact: [Description of user impact]
Current action: [What we're doing about it]
Next update: [When to expect the next update]
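The template above is easy to enforce in code so updates stay consistent under pressure. A minimal formatter; the field names mirror the template, and the service and impact strings in the usage are invented examples:

```python
from datetime import datetime, timezone

# Minimal formatter for the status-page template above. The four valid
# statuses come straight from the template; everything else is illustrative.
VALID_STATUSES = {"Investigating", "Identified", "Monitoring", "Resolved"}

def status_update(service: str, status: str, impact: str,
                  action: str, next_update: str) -> str:
    """Render one status-page update in the runbook's template."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (f"[{ts}] — {service} Incident\n\n"
            f"Status: {status}\n"
            f"Impact: {impact}\n"
            f"Current action: {action}\n"
            f"Next update: {next_update}")
```

Rejecting unknown statuses keeps the public page to the four well-understood states instead of ad-hoc wording.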

Executive Update (Executive Summary Generator — P0 only)

INCIDENT BRIEF — [TIMESTAMP]

SITUATION: [Service] is [down/degraded] affecting [N users/% of traffic]
CAUSE: [Known/Under investigation] — [Brief description if known]
ACTION: [What's being done] — ETA [time estimate]
IMPACT: [Business impact — revenue, users, reputation]
NEXT UPDATE: [Timestamp]

Escalation Matrix

Critical Escalation Points
| Condition | Escalate To | Action |
|-----------|-------------|--------|
| P0 not resolved in 30 min | Studio Producer | Additional resources, vendor escalation |
| P1 not resolved in 2 hours | Project Shepherd | Resource reallocation |
| Data breach suspected | Legal Compliance Checker | Regulatory notification assessment |
| User data affected | Legal Compliance + Executive Summary | GDPR/CCPA notification |
| Revenue impact > $X | Finance Tracker + Studio Producer | Business impact assessment |
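The time- and condition-based rows of the matrix can be encoded as a lookup so escalation is not left to memory mid-incident. The revenue row is omitted below because the runbook leaves its threshold as $X; everything else mirrors the table:

```python
# The escalation matrix above as code. Durations are in minutes; the
# revenue-impact row is intentionally left out (threshold unspecified).
def escalation_targets(severity: str, unresolved_min: int,
                       data_breach: bool = False,
                       user_data_affected: bool = False) -> list[str]:
    """Return everyone who should be pulled in, per the matrix above."""
    targets = []
    if severity == "P0" and unresolved_min >= 30:
        targets.append("Studio Producer")
    if severity == "P1" and unresolved_min >= 120:
        targets.append("Project Shepherd")
    if data_breach:
        targets.append("Legal Compliance Checker")
    if user_data_affected:
        targets.append("Legal Compliance + Executive Summary")
    return targets
```

Calling this from the incident channel on a timer (e.g. every 15 minutes) turns the matrix from documentation into an automatic nag.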

Speed matters, but so does documentation. Every incident is a learning opportunity to improve the system.