Runbook: Incident Response
Mode: NEXUS-Micro | Duration: Minutes to hours | Agents: 3-8
Scenario
Something is broken in production. Users are affected. Speed of response matters, but so does doing it right. This runbook covers detection through post-mortem.
Severity Classification
| Level | Definition | Examples | Response Time |
|---|---|---|---|
| P0 — Critical | Service completely down, data loss, security breach | Database corruption, DDoS attack, auth system failure | Immediate (all hands) |
| P1 — High | Major feature broken, significant performance degradation | Payment processing down, 50%+ error rate, 10x latency | < 1 hour |
| P2 — Medium | Minor feature broken, workaround available | Search not working, non-critical API errors | < 4 hours |
| P3 — Low | Cosmetic issue, minor inconvenience | Styling bug, typo, minor UI glitch | Next sprint |
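The severity table can be read as a top-down rule list: the first level whose definition matches wins. The following is a minimal sketch of that reading, not a prescribed implementation. The `Signals` fields and the `classify` function are hypothetical names; the 50% error rate and 10x latency thresholds come from the P1 examples in the table.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Triage inputs. Every field name here is a hypothetical placeholder."""
    service_down: bool = False
    data_loss: bool = False
    security_breach: bool = False
    major_feature_broken: bool = False
    minor_feature_broken: bool = False
    error_rate: float = 0.0          # fraction of failing requests, 0.0 to 1.0
    latency_multiplier: float = 1.0  # current latency divided by baseline latency


def classify(s: Signals) -> str:
    """Return the highest severity level whose definition matches the signals."""
    if s.service_down or s.data_loss or s.security_breach:
        return "P0"
    if s.major_feature_broken or s.error_rate >= 0.5 or s.latency_multiplier >= 10:
        return "P1"
    if s.minor_feature_broken:
        return "P2"
    # Anything left over is treated as cosmetic or a minor inconvenience.
    return "P3"
```

Checking P0 conditions first means an incident that matches several rows is always assigned the most severe level.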
Response Teams by Severity
P0 — Critical Response Team
| Role | Agent | Responsibility |
|---|---|---|
| Incident Commander | Infrastructure Maintainer | Assess scope, coordinate response |
| Deployment/Rollback | DevOps Automator | Execute rollback if needed |
| Root Cause Investigation | Backend Architect | Diagnose system issues |
| UI Investigation | Frontend Developer | Diagnose client-side issues |
| User Communication | Support Responder | Status page updates, user notifications |
| Stakeholder Comms | Executive Summary Generator | Real-time executive updates |
P1 — High Response Team
- Infrastructure Maintainer (Incident commander)
- DevOps Automator (Deployment support)
- Relevant Developer Agent (Fix implementation)
- Support Responder (User communication)
P2 — Medium Response
- Relevant Developer Agent (Fix implementation)
- Evidence Collector (Verify fix)
P3 — Low Response
- Sprint Prioritizer (Add to backlog)
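The rosters above amount to a lookup from severity to agents. A minimal sketch of that lookup follows, assuming severity is passed as one of the strings "P0" to "P3"; `RESPONSE_TEAMS` and `team_for` are hypothetical names, and the agent names are copied from the lists above.

```python
# Severity level -> agents to activate, as listed in this runbook.
RESPONSE_TEAMS = {
    "P0": [
        "Infrastructure Maintainer",
        "DevOps Automator",
        "Backend Architect",
        "Frontend Developer",
        "Support Responder",
        "Executive Summary Generator",
    ],
    "P1": [
        "Infrastructure Maintainer",
        "DevOps Automator",
        "Relevant Developer Agent",
        "Support Responder",
    ],
    "P2": ["Relevant Developer Agent", "Evidence Collector"],
    "P3": ["Sprint Prioritizer"],
}


def team_for(severity: str) -> list[str]:
    """Return a copy of the roster for a severity, or raise on an unknown level."""
    if severity not in RESPONSE_TEAMS:
        raise ValueError(f"unknown severity: {severity!r}")
    return list(RESPONSE_TEAMS[severity])
```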
Incident Response Sequence
Step 1: Detection & Triage (0-5 minutes)
Infrastructure Maintainer Actions
- Acknowledge alert
- Assess scope and impact
  - How many users affected?
  - Which services are impacted?
  - Is data at risk?
- Classify severity (P0/P1/P2/P3)
- Activate appropriate response team
- Create incident channel/thread
Step 2: Investigation (5-30 minutes)
- Infrastructure Maintainer
- Backend Architect (P0/P1)
- DevOps Automator
Parallel Investigation:
- Check system metrics (CPU, memory, network, disk)
- Review error logs
- Check recent deployments
- Verify external dependencies
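Because the four checks are independent, they can run concurrently rather than one after another. The sketch below illustrates that idea with a thread pool from the standard library. The check functions are hypothetical stubs that return placeholder strings; real versions would query whatever monitoring, logging, and deployment tooling is in use.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stubs: each stands in for a real query against your tooling.
def check_system_metrics() -> str:
    return "placeholder: CPU, memory, network, disk"


def review_error_logs() -> str:
    return "placeholder: recent error log entries"


def check_recent_deployments() -> str:
    return "placeholder: deployments in the last 24 hours"


def verify_external_dependencies() -> str:
    return "placeholder: third-party service status"


CHECKS = {
    "system_metrics": check_system_metrics,
    "error_logs": review_error_logs,
    "recent_deployments": check_recent_deployments,
    "external_dependencies": verify_external_dependencies,
}


def investigate() -> dict[str, str]:
    """Run every check concurrently and collect the findings by name."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        futures = {name: pool.submit(fn) for name, fn in CHECKS.items()}
        return {name: future.result() for name, future in futures.items()}
```

Calling `investigate()` returns one finding per check, so the incident commander can review all four together.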
Step 3: Mitigation (15-60 minutes)
Decision Tree
Throughout Mitigation:
- Support Responder: Update status page every 15 minutes
- Executive Summary Generator: Brief stakeholders (P0 only)
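The 15-minute status page cadence can be turned into a list of due times once the incident start is known. This is a small sketch under that assumption; `status_update_times` is a hypothetical helper, and the timestamps in the usage note are made up for illustration.

```python
from datetime import datetime, timedelta


def status_update_times(start: datetime, end: datetime,
                        interval_minutes: int = 15) -> list[datetime]:
    """List the times a status page update is due, from start through end."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    step = timedelta(minutes=interval_minutes)
    due = []
    current = start
    while current <= end:
        due.append(current)
        current += step
    return due
```

For example, an incident opened at 12:00 and mitigated at 13:00 would yield five due times: 12:00, 12:15, 12:30, 12:45, and 13:00.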
Step 4: Resolution Verification (Post-fix)
Evidence Collector
- Verify fix resolves the issue
- Screenshot evidence of working state
- Confirm no new issues introduced
Infrastructure Maintainer
- Verify all metrics returning to normal
- Confirm no cascading failures
- Monitor for 30 minutes post-fix
API Tester
If API-related:
- Run regression on affected endpoints
- Verify response times normalized
- Confirm error rates at baseline
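One way to make "metrics returning to normal" and "monitor for 30 minutes" checkable is to require that samples span the full window and that none exceeds the baseline by more than a tolerance. The sketch below assumes a metric where lower is better (error rate or latency). The 10% tolerance and the name `fix_verified` are illustrative assumptions, not values from this runbook.

```python
MONITOR_WINDOW_MINUTES = 30  # post-fix monitoring period from this runbook


def fix_verified(samples: list[tuple[float, float]], baseline: float,
                 tolerance: float = 0.10) -> bool:
    """Check post-fix samples given as (minutes_since_fix, metric_value) pairs.

    Returns True only if the samples reach the end of the monitoring window
    and every value stays at or below baseline * (1 + tolerance).
    """
    if not samples:
        return False
    covers_window = max(minute for minute, _ in samples) >= MONITOR_WINDOW_MINUTES
    limit = baseline * (1 + tolerance)
    within_baseline = all(value <= limit for _, value in samples)
    return covers_window and within_baseline
```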
Step 5: Post-Mortem (Within 48 hours)
Workflow Optimizer leads post-mortem
1. Timeline Reconstruction
- When was the issue introduced?
- When was it detected?
- When was it resolved?
- Total user impact duration
2. Root Cause Analysis
- What failed?
- Why did it fail?
- Why wasn't it caught earlier?
- 5 Whys analysis
3. Impact Assessment
- Users affected
- Revenue impact
- Reputation impact
- Data impact
4. Prevention Measures
- What monitoring would have caught this sooner?
- What testing would have prevented this?
- What process changes are needed?
- What infrastructure changes are needed?
5. Action Items
- [Action] → [Owner] → [Deadline]
- [Action] → [Owner] → [Deadline]
- [Action] → [Owner] → [Deadline]
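The timeline questions reduce to three timestamps and the intervals between them. The following sketch computes those intervals. It assumes that "total user impact duration" runs from introduction to resolution, which may overstate impact if users were not affected immediately; `timeline_metrics` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def timeline_metrics(introduced: datetime, detected: datetime,
                     resolved: datetime) -> dict[str, timedelta]:
    """Derive post-mortem durations from the three timeline timestamps."""
    if not introduced <= detected <= resolved:
        raise ValueError("expected introduced <= detected <= resolved")
    return {
        "time_to_detect": detected - introduced,
        "time_to_resolve": resolved - detected,
        # Assumption: users were affected from the moment of introduction.
        "total_impact": resolved - introduced,
    }
```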
Communication Templates
Status Page Update (Support Responder)
Status Page Template
Executive Update (Executive Summary Generator — P0 only)
Executive Brief Template
Escalation Matrix
| Condition | Escalate To | Action |
|---|---|---|
| P0 not resolved in 30 min | Studio Producer | Additional resources, vendor escalation |
| P1 not resolved in 2 hours | Project Shepherd | Resource reallocation |
| Data breach suspected | Legal Compliance Checker | Regulatory notification assessment |
| User data affected | Legal Compliance + Executive Summary | GDPR/CCPA notification |
| Revenue impact > $X | Finance Tracker + Studio Producer | Business impact assessment |
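The time-based and data-related rows of the matrix can be expressed as a function that returns who to escalate to. This is a sketch only: `escalation_targets` and its parameters are hypothetical names, and the revenue row is left out because its threshold ($X) is not specified in this runbook.

```python
def escalation_targets(severity: str, minutes_open: float,
                       data_breach_suspected: bool = False,
                       user_data_affected: bool = False) -> list[str]:
    """Return the escalation targets that apply, in matrix order."""
    targets = []
    if severity == "P0" and minutes_open >= 30:
        targets.append("Studio Producer")
    if severity == "P1" and minutes_open >= 120:
        targets.append("Project Shepherd")
    if data_breach_suspected:
        targets.append("Legal Compliance Checker")
    if user_data_affected:
        targets.extend(["Legal Compliance", "Executive Summary"])
    # The revenue-impact row is omitted: its threshold is unspecified.
    return targets
```

An empty list means no escalation condition has been met yet.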
Speed matters, but so does documentation. Every incident is an opportunity to learn and to improve the system.
