
Claude Code Skills for Incident Management

Incidents are inevitable. How you respond to them is what separates mature engineering orgs from chaotic ones. These skills cover the full incident lifecycle: preparation through chaos engineering, response through playbooks and on-call processes, and learning through blameless postmortems and disaster recovery planning. The goal is to make incidents boring — handled by process, not heroics.

Published by ClaudeVault · 5 skills

Key takeaway

ClaudeVault's incident management skills give Claude Code structured workflows for every phase of the incident lifecycle — chaos engineering experiments with Gremlin or AWS FIS to find weaknesses before real failures do, runbook-backed playbooks for structured response, primary-secondary on-call rotations with escalation policies, blameless postmortems that quantify impact and assign action items, and disaster recovery plans with tested RTO and RPO targets.

At a glance

  • 5 skills covering chaos engineering, disaster recovery planning, incident playbooks, blameless postmortems, and on-call process design
  • Generates chaos engineering experiments for Gremlin, AWS Fault Injection Simulator, and Chaos Mesh, with guard rails that limit blast radius and automatic rollback triggers
  • Follows Google SRE blameless postmortem practice: held within 24 to 72 hours, impact quantified by duration, error rate, affected users, and revenue, with every action item assigned an owner and deadline
  • Designs primary-secondary on-call rotations where last week's primary becomes this week's secondary for context continuity, with escalation at 5, 15, and 30 minutes
  • Sets RTO and RPO targets by service criticality tier, from minutes for tier-one systems to days for tier-four, with recovery testing schedules that verify the plan actually works

When you reach for these skills

  • When incidents are resolved by whoever happens to be online rather than a structured rotation with clear escalation paths and documented playbooks

  • When postmortems are blame sessions that produce a list of who made mistakes instead of systemic improvements that prevent recurrence

  • When the team has never tested whether their backups actually restore or whether their disaster recovery plan works under time pressure

  • When production failures keep surprising the team because nobody has run controlled experiments to discover the system's breaking points

How these skills work together

A Claude Code incident management workflow moves from proactive preparation through reactive response to retrospective learning: each phase feeds the next, and the cycle only improves when all three are running.

  1. Design chaos experiments to discover weaknesses proactively

    Start with the chaos engineering designer. Claude creates controlled fault injection experiments — network latency, CPU exhaustion, disk pressure, availability zone loss — using Gremlin for multi-cloud environments or AWS FIS for AWS-only infrastructure. Every experiment has a hypothesis, blast radius limits, and an automatic rollback trigger.

  2. Build incident playbooks for structured response

    The incident playbook designer generates runbooks for the failure modes chaos engineering surfaced. Claude writes step-by-step response procedures tied to specific alert conditions, with decision trees for severity classification and communication templates for stakeholder updates.

  3. Set up on-call rotations with escalation policies

    Use the on-call process designer to build a primary-secondary rotation. Claude structures the schedule so last week's primary becomes this week's secondary for context continuity, sets escalation timers at 5, 15, and 30 minutes, and routes only high-severity alerts to wake-up pages — everything else gets bundled for business hours. A minimal sketch of these escalation timers appears after this list.

  4. Write blameless postmortems that drive systemic fixes

    After an incident, the postmortem writer generates the document within the 24-to-72 hour window while recall is fresh. Claude quantifies impact — duration, error rate, affected users, revenue — builds the timeline from logs and chat transcripts, identifies contributing factors without blame, and assigns every action item to a named owner with a deadline.

  5. Plan disaster recovery with tested RTO and RPO targets

    The disaster recovery planner classifies services into criticality tiers and sets recovery time and recovery point objectives for each tier. Claude generates the recovery procedure, schedules quarterly recovery drills, and builds the verification checklist that proves the plan works before it needs to.
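
To make the escalation timers in step 3 concrete, here is a minimal Python sketch of a staged escalation policy. The stage names and the policy structure are illustrative assumptions, not the output of any skill or the API of a paging tool.

```python
from dataclasses import dataclass, field
from datetime import timedelta

# Illustrative escalation policy: primary paged immediately, secondary after
# 5 minutes without acknowledgement, team lead after 15, engineering manager
# after 30. All names here are hypothetical placeholders.
@dataclass
class EscalationStage:
    notify: str        # who gets paged at this stage
    after: timedelta   # time the alert has gone unacknowledged

@dataclass
class EscalationPolicy:
    stages: list[EscalationStage] = field(default_factory=list)

    def who_to_page(self, unacknowledged_for: timedelta) -> list[str]:
        """Everyone whose stage has elapsed without an acknowledgement."""
        return [s.notify for s in self.stages if unacknowledged_for >= s.after]

high_severity = EscalationPolicy(stages=[
    EscalationStage("primary on-call", timedelta(minutes=0)),
    EscalationStage("secondary on-call", timedelta(minutes=5)),
    EscalationStage("team lead", timedelta(minutes=15)),
    EscalationStage("engineering manager", timedelta(minutes=30)),
])

print(high_severity.who_to_page(timedelta(minutes=17)))
# ['primary on-call', 'secondary on-call', 'team lead']
```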

Outcome

Chaos experiments that surface weaknesses before customers find them, playbooks that turn incidents into procedure instead of improvisation, on-call rotations with context continuity, postmortems that drive systemic improvement, and a disaster recovery plan tested against real RTO and RPO targets.

Compare the skills

Skill | Best for | Complexity | Primary use case
Chaos Engineering Designer | Proactive failure discovery | Advanced | Controlled fault injection with Gremlin, AWS FIS, or Chaos Mesh
Incident Playbook Designer | Structured incident response | Intermediate | Runbooks tied to alert conditions with severity classification
On-Call Process Designer | Rotation and escalation design | Intermediate | Primary-secondary schedules with PagerDuty or incident.io integration
Incident Postmortem Writer | Blameless retrospectives | Beginner | Impact quantification, timeline reconstruction, and action item assignment
Disaster Recovery Planner | Business continuity planning | Advanced | RTO/RPO targets by criticality tier with recovery testing schedules

Skills in this topic

Chaos Engineering Designer

Designs controlled chaos experiments with steady-state hypotheses and blast radius controls. Use when planning failure injection, testing resilience assumptions, or running game days. Keywords: Chaos Monkey, Litmus, Gremlin, fault tolerance.

Design controlled failure experiments that produce actionable learning.

Incident Postmortem Writer

Writes structured, blameless postmortems with timelines and root cause analysis. Use when documenting a resolved incident, conducting 5-Whys analysis, or generating action items from outage data. Keywords: incident review, systemic gaps, error budget.

Turn chaotic incident recollections into clear, actionable postmortems that prevent recurrence.

Disaster Recovery Planner

Designs disaster recovery plans with RTO/RPO targets and failover architecture. Use when planning for regional outages, choosing between active-active and warm standby, or scheduling DR drills. Keywords: failover, business continuity, multi-region.

Design DR plans as if the disaster will happen on a Friday evening when the senior engineer is on vacation. Every procedure must be executable by the least experienced person on the on-call rotation.

On-Call Process Designer

Designs sustainable on-call systems with rotation schedules, escalation policies, and handoff procedures. Use when formalizing an ad-hoc on-call rotation, addressing burnout from uneven page distribution, or setting up follow-the-sun coverage. Keywords: PagerDuty, Opsgenie, compensation, page budgets.

Design on-call systems where engineers can have a life outside work while still being available when production genuinely needs them.

Incident Playbook Designer

Creates incident response playbooks with severity classifications and communication templates. Use when designing runbooks for specific failure modes, defining escalation triggers, or standardizing incident communication. Keywords: MTTR, triage, status page.

Design playbooks that a stressed, sleep-deprived engineer can follow at 3 AM. Every step must be concrete and verifiable. "Check the database" is not a step.

Frequently asked questions

How do I run a blameless postmortem?

State the blameless principle at the start of the meeting, focus on systemic contributing factors rather than individual mistakes, and quantify impact with specific numbers — incident duration, error rate, affected user count, revenue impact, and support ticket volume. Hold the postmortem within 24 to 72 hours while recall is fresh, and assign every action item to a named owner with a deadline.

What is chaos engineering and should I run it in production?

Chaos engineering is controlled fault injection designed to discover weaknesses before real failures do. Start with experiments in staging — network latency, CPU exhaustion, dependency outages — and graduate to production once the team has confidence in guard rails and automatic rollback triggers. Gremlin supports multi-cloud and on-premises environments; AWS FIS covers AWS services at roughly 10 cents per minute per action.
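
To make the guard rails concrete, here is a tool-agnostic Python sketch of an experiment definition with a steady-state hypothesis, a blast radius limit, and an automatic abort condition. The field names and the example experiment are illustrative assumptions, not Gremlin or AWS FIS syntax.

```python
from dataclasses import dataclass

# Tool-agnostic sketch of a chaos experiment definition. Field names are
# illustrative; Gremlin and AWS FIS each have their own schemas.
@dataclass
class ChaosExperiment:
    hypothesis: str        # what we expect to stay true under the fault
    fault: str             # the failure being injected
    blast_radius: str      # hard limit on scope
    abort_when: str        # condition that triggers automatic rollback
    duration_minutes: int

checkout_az_loss = ChaosExperiment(
    hypothesis="p99 checkout latency stays under 800 ms if one AZ is lost",
    fault="block network traffic to one availability zone",
    blast_radius="staging only, at most 1 of 3 AZs, checkout service only",
    abort_when="error rate above 2% for 60 seconds, or any paging alert fires",
    duration_minutes=10,
)
```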

What is the best on-call rotation model?

For teams of six to eight engineers, a weekly primary-secondary rotation where last week's primary becomes this week's secondary provides context continuity without burnout. Follow-the-Sun works for nine-plus engineers across three or more time zones. Only high-severity alerts should trigger wake-up pages; lower-severity incidents get bundled for business hours to reduce alert fatigue.
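
As a rough illustration of the handoff pattern, the sketch below generates a weekly schedule where each week's primary becomes the next week's secondary. It is a standalone Python example, not a PagerDuty or incident.io configuration, and the team names are placeholders.

```python
# Weekly primary-secondary rotation: each week's primary becomes the next
# week's secondary, so context from active incidents carries over.
def rotation(engineers: list[str], weeks: int) -> list[tuple[str, str]]:
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week - 1) % len(engineers)]  # previous primary
        schedule.append((primary, secondary))
    return schedule

team = ["ana", "ben", "chi", "dev", "eli", "fay"]
for week, (primary, secondary) in enumerate(rotation(team, 4), start=1):
    print(f"week {week}: primary={primary}, secondary={secondary}")
# week 1: primary=ana, secondary=fay  (bootstrap week wraps around)
# week 2: primary=ben, secondary=ana
# week 3: primary=chi, secondary=ben
# week 4: primary=dev, secondary=chi
```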

AWS FIS vs Gremlin — which chaos engineering tool should I choose?

AWS FIS is native to AWS at roughly 10 cents per minute per action, with experiments scoped to EC2, ECS, EKS, and RDS. Gremlin supports AWS, GCP, Azure, and on-premises infrastructure with a broader attack library and enterprise features. Pick FIS for AWS-only environments with simple experiments, Gremlin for multi-cloud or hybrid infrastructure.

How do I set RTO and RPO targets for my services?

Classify services into criticality tiers based on revenue impact and user-facing importance. Tier one — payment processing, auth — typically needs minutes for both RTO and RPO. Tier four — internal tools, batch reports — can tolerate hours or days. The disaster recovery planner generates per-tier targets and schedules quarterly drills that verify recovery actually meets them.
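
Here is a minimal Python sketch of what the resulting tier map might look like. The specific targets and example services are assumptions for illustration; real values come from your own business impact analysis.

```python
from dataclasses import dataclass
from datetime import timedelta

# Illustrative criticality tiers. Targets and example services are
# placeholders, not recommendations.
@dataclass(frozen=True)
class Tier:
    name: str
    rto: timedelta            # how long the service may stay down
    rpo: timedelta            # how much data loss is tolerable
    examples: tuple[str, ...]

TIERS = (
    Tier("tier 1", rto=timedelta(minutes=15), rpo=timedelta(minutes=5),
         examples=("payment processing", "auth")),
    Tier("tier 2", rto=timedelta(hours=1), rpo=timedelta(minutes=30),
         examples=("search", "notifications")),
    Tier("tier 3", rto=timedelta(hours=8), rpo=timedelta(hours=4),
         examples=("analytics dashboards",)),
    Tier("tier 4", rto=timedelta(days=1), rpo=timedelta(days=1),
         examples=("internal tools", "batch reports")),
)

print(TIERS[0].name, "RTO:", TIERS[0].rto)  # tier 1 RTO: 0:15:00
```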

How do I reduce on-call alert fatigue?

Route only genuinely urgent conditions to wake-up pages and bundle everything else for business hours review. Tune alert thresholds so they fire on actual degradation, not normal variance. Deduplicate and suppress related alerts through PagerDuty or incident.io so one cascading failure produces one page, not thirty.
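
As a toy illustration of that routing rule, the sketch below pages once per deduplication key for high-severity alerts and queues everything else for business hours. The severity labels and the (service, failure) key are assumptions; in practice, deduplication and suppression live in the paging tool.

```python
from collections import defaultdict

# Toy alert router: one page per (service, failure) key for high-severity
# alerts, everything else queued for business-hours review.
def route(alerts: list[dict]) -> tuple[dict, list[dict]]:
    pages: dict[tuple[str, str], list[dict]] = defaultdict(list)
    business_hours: list[dict] = []
    for alert in alerts:
        if alert["severity"] == "high":
            pages[(alert["service"], alert["failure"])].append(alert)
        else:
            business_hours.append(alert)
    return pages, business_hours

alerts = [
    {"service": "checkout", "failure": "db-timeout", "severity": "high"},
    {"service": "checkout", "failure": "db-timeout", "severity": "high"},
    {"service": "search", "failure": "slow-query", "severity": "low"},
]
pages, queued = route(alerts)
print(len(pages), "page(s),", len(queued), "queued for business hours")
# 1 page(s), 1 queued for business hours
```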