Design an Incident Handling Runbook for Third-Party Outages (Cloudflare, AWS, X)
A reproducible SRE runbook to detect, mitigate, and communicate during Cloudflare, AWS, or X outages—templates, scripts, and checklists for teams.
When Cloudflare, AWS, or X goes down, your customers don't care whose fault it is; they care that your product is unavailable. This guide gives you a reproducible SRE runbook and checklist to detect, mitigate, and communicate during major third-party outages in 2026.
Third‑party outages are no longer rare anomalies. With highly distributed edge infrastructure, API ecosystems, and aggressive product velocity, teams need a tested runbook that’s executable under pressure. Below you’ll find a prioritized, practical incident response runbook plus ready‑to‑paste alert rules, communication templates, failover patterns, and a postmortem checklist that development teams can adopt today.
Why third‑party outages are a top risk in 2026
Late 2025 and early 2026 brought clusters of near-simultaneous outage reports across CDNs, cloud providers, and social platforms, highlighting a few trends:
- Higher service coupling: More products depend on multi‑vendor stacks (CDN + cloud + identity + analytics), increasing blast radius when one vendor hiccups.
- Edge and API proliferation: Edge compute and third‑party APIs lowered latency but raised failure modes that are hard to reproduce in dev.
- Automated escalation is mainstream: Teams now expect programmatic failover and status updates within minutes, not hours.
- AI‑assisted incident triage: In 2025 many SRE teams began augmenting human incident commanders with LLM agents for log summarization and draft communications. Use them, but keep human oversight — see guidance on secure desktop AI agents in creating a secure desktop AI agent policy.
Core SRE principles for third‑party outage runbooks
- Detect fast: Synthetic checks and end‑user telemetry are your early warning systems.
- Fail safe: Design safe degradation and deterministic failover — avoid cascading failures.
- Communicate clearly: Internal clarity + external transparency = reduced customer churn and calm stakeholders.
- Automate repeatable actions: Turn manual mitigations into tested scripts or IaC when safe.
- Learn quickly: Postmortems drive durable fixes and playbook updates.
Runbook overview: roles, phases, decision points
Primary roles (assign before you need them)
- Incident Commander (IC): Drives decisions, declares severity, assigns tasks.
- Communications Lead: Status page and stakeholder updates.
- Engineering Lead(s): Execute technical mitigations (DNS, routing, caching, feature toggles).
- On‑call Observability Owner: Run diagnostics, update alerting.
- Customer Success Liaison: Prepares customer‑facing Q&A.
Incident phases
- Detection & Triage
- Containment & Mitigation
- Communication
- Recovery & Validation
- Postmortem & Remediation
Step‑by‑step reproducible runbook
1) Detection — observability checklist
- Enable multi‑region synthetic checks for critical user journeys (login, payment, API requests). Run every 30–60s.
- Correlate end‑user errors (5xx rates), latency P95/P99, and client telemetry (RUM). If all three spike, escalate.
- Monitor third-party provider status pages and RSS feeds programmatically (see the status-poll sketch after the alert rule below). Treat a provider "degraded" status as a signal, not proof; reconcile it with your own telemetry.
- Alert example (Prometheus rule):
# Prometheus alert: high 5xx rate across prod
- alert: High5xxRate
  expr: sum(rate(http_requests_total{job="frontend",status=~"5.."}[2m])) / sum(rate(http_requests_total{job="frontend"}[2m])) > 0.03
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "High 5xx rate (>3%) in frontend"
    runbook: "https://internal/runbooks/third-party-outage"
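To make the provider-status signal programmatic, the sketch below polls Statuspage-style JSON summaries and logs the reported indicator. The endpoint URL and the use of jq are assumptions; confirm each provider's current status API before wiring this into alert enrichment.
#!/bin/bash
# Poll provider status endpoints and log the reported indicator (sketch).
# Endpoint paths are assumptions; verify them against each provider's docs.
PROVIDERS=(
  "cloudflare https://www.cloudflarestatus.com/api/v2/status.json"
)
for entry in "${PROVIDERS[@]}"; do
  name=${entry%% *}
  url=${entry#* }
  indicator=$(curl -fsS --max-time 10 "$url" | jq -r '.status.indicator' 2>/dev/null)
  echo "$(date -u +%FT%TZ) provider=$name indicator=${indicator:-unreachable}"
  # Anything other than "none" is a signal to correlate with your own telemetry.
done
Log the output alongside your alert timeline so the triage timestamps in step 2 are easy to reconstruct.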
2) Triage — initial quick checklist (first 10 minutes)
- IC declares incident if any of: sustained synthetic failure, >3% global 5xx for >2m, or critical external dependency degraded.
- Run a targeted test to confirm: curl, traceroute, and DNS checks from multiple regions.
- Check provider status APIs (Cloudflare, AWS, X) and Downdetector-style aggregators. Log timestamps.
- Decide whether to monitor only, execute mitigations, or failover (see decision matrix below).
Decision matrix: when to failover
- Failover if: primary provider outage impacts >50% of traffic for >5 minutes and automated failover has been tested.
- Mitigate without failover if: degraded service but partial functionality exists and failover risks data divergence.
- Monitor only if: anomaly is confined to a subset of noncritical regions or feature flags can gate affected features.
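To keep the failover decision objective, a quick impact estimate helps. The sketch below queries the Prometheus HTTP API for the same http_requests_total metric used in the alert rule above; the Prometheus URL, metric labels, and the 0.5 threshold are assumptions to adapt to your environment.
#!/bin/bash
# Estimate the share of frontend traffic currently failing, to feed the decision matrix.
# Assumes a Prometheus API at $PROM and the metric/labels from the alert rule above.
PROM=${PROM:-http://prometheus:9090}
QUERY='sum(rate(http_requests_total{job="frontend",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="frontend"}[5m]))'
ratio=$(curl -fsS --get "$PROM/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // "0"')
echo "error_ratio_5m=$ratio"
# Illustrative gate: a sustained ratio above 0.5 over five minutes supports failover.
if awk -v r="$ratio" 'BEGIN { exit (r > 0.5 ? 0 : 1) }'; then
  echo "DECISION: failover criteria met (confirm the failover path is pretested)"
else
  echo "DECISION: mitigate or monitor"
fi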
3) Containment & mitigation — technical playbook
Apply mitigations from lowest to highest risk. Automate reversible steps first.
- Rate-limit and prioritize traffic: Apply emergency rate limits or circuit breakers on nonessential endpoints. Use API gateway or edge rules to protect core flows.
- Enable degraded mode: Turn off noncritical features (analytics, recommendations) via feature flags to reduce load on failing third‑party APIs.
- Cache aggressively: Increase TTLs on CDN and application caches for read-heavy endpoints. Use stale-while-revalidate where supported (a quick header check follows the DNS note below).
- Switch CDN or multi-CDN routing: If Cloudflare is the affected provider and you have a multi-CDN setup, trigger automated traffic steering to the secondary CDN. Make sure TLS, WAF, and cache rules are aligned across providers. For multi-control plane patterns and routing at scale, see Edge-First Live Production patterns.
- DNS failover (with caution): Use health checks and low TTLs only if you have tested DNS failover before; resolver caching can delay cutover well past your TTL. Route 53 / Cloud DNS automation example (Terraform snippet):
# Terraform Route 53 failover records (conceptual)
resource "aws_route53_record" "www_primary" {
  zone_id        = var.zone_id
  name           = "www"
  type           = "A"
  ttl            = 60
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  records         = [aws_instance.app_primary.public_ip]
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "www_secondary" {
  zone_id        = var.zone_id
  name           = "www"
  type           = "A"
  ttl            = 60
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  records = [aws_instance.app_secondary.public_ip]
}
Note: Test DNS failover during low traffic windows. For customer‑facing SaaS, prefer HTTP‑level failover using CDN or global load balancing.
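For the caching step above, it helps to confirm what cache policy the edge is actually serving before and after you raise TTLs. The commands below are illustrative; the URL and header values are placeholders, and stale-if-error support varies by CDN.
# Inspect the cache policy served for a read-heavy endpoint (illustrative URL).
curl -sI https://yourapp.example.com/catalog | grep -iE 'cache-control|^age|cf-cache-status|x-cache'
# A degraded-mode policy might look like:
#   Cache-Control: public, s-maxage=600, stale-while-revalidate=300, stale-if-error=86400
# stale-if-error lets the edge keep serving cached responses while the origin
# or an upstream dependency is failing.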
4) Communication — internal and external templates
Clear, consistent communication reduces noise. Have the Communications Lead push updates on a fixed cadence (every 15 minutes for a major outage until things stabilize).
Internal status (Slack/Teams):
INCIDENT: Third‑Party CDN Outage (Cloudflare) — SEV2
Time: 2026-01-16T10:32Z
Status: Investigating
Impact: API latencies elevated globally; login failures ~12%
Action: IC declared. Engineering triaging CDN failover and increasing cache TTLs.
Next update: in 15 minutes
Runbook: https://internal/runbooks/third-party-outage
External status page template:
Title: Service degradation due to third‑party CDN outage
Started: 2026‑01‑16T10:30Z
Impact: Customers may experience intermittent errors and increased page load times. Core transactions are processing; noncritical features may be degraded.
Mitigation: We are working with the CDN provider and executing automated traffic steering. Next update: in 15 minutes.
Best practice: Post a single canonical update on your public status page and link to it from social channels. Avoid conflicting messages.
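To keep the 15-minute cadence low-friction, the internal update above can be posted from a script. The sketch below uses a Slack incoming webhook; the webhook URL is a placeholder and the message body should be edited per update.
#!/bin/bash
# Post the internal status update to Slack via an incoming webhook (sketch).
SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL:?set your incoming webhook URL}
MSG=$(cat <<'EOF'
INCIDENT: Third-Party CDN Outage (Cloudflare) - SEV2
Status: Investigating
Impact: API latencies elevated globally; login failures ~12%
Next update: in 15 minutes
Runbook: https://internal/runbooks/third-party-outage
EOF
)
curl -fsS -X POST -H 'Content-type: application/json' \
  --data "$(jq -n --arg text "$MSG" '{text: $text}')" \
  "$SLACK_WEBHOOK_URL"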
5) Recovery & validation
- Formalize a recovery checklist before restoring full functionality: roll back emergency feature flags slowly, reduce cache TTLs gradually, and re-enable background jobs once core flows are green.
- Validate with synthetic checks and real user monitoring: require three consecutive green cycles across regions before declaring recovery (a validation-loop sketch follows this list).
- Document exact timestamps for mitigation start, failover actions, and full recovery for SLA calculations and postmortem.
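A minimal sketch of the three-green-cycles gate, assuming the same /health endpoint used by the synthetic monitor in the cheat sheet below; run it from each regional runner rather than a single location.
#!/bin/bash
# Require N consecutive green health checks before declaring recovery (sketch).
URL=${URL:-https://yourapp.example.com/health}
REQUIRED=3
green=0
while [ "$green" -lt "$REQUIRED" ]; do
  status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$URL")
  if [ "$status" = "200" ]; then
    green=$((green + 1))
    echo "$(date -u +%FT%TZ) green cycle $green/$REQUIRED"
  else
    green=0
    echo "$(date -u +%FT%TZ) check failed (HTTP $status), counter reset"
  fi
  sleep 60
done
echo "Recovery validated: $REQUIRED consecutive green cycles"
Record the final timestamp from this loop as the recovery time for SLA calculations and the postmortem.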
6) Postmortem & action items
Use a blameless postmortem template. Capture:
- Timeline with timestamps
- Root cause (primary and contributing factors)
- Mitigations executed and their effectiveness
- Actionable remediation with owners and deadlines
Close the loop: at least one action item per incident must be scheduled within 7 days or explicitly reprioritized during sprint planning.
Automations & runbook as code
Make the runbook executable:
- Store runbook steps in a git repo and require PRs for changes. Version the runbook.
- Encode reversible actions as scripts and add tests. Example: a script that toggles a feature flag via your LaunchDarkly/Flagsmith API (a sketch follows this list). For automating safe partner flows and onboarding, see AI-assisted automation patterns.
- Integrate provider status APIs to enrich alerts — but never rely solely on them.
- Use incident simulation and chaos engineering (multi‑CDN failover drills, simulated API failures) quarterly to validate the runbook.
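Below is a LaunchDarkly-style sketch of the flag-toggle script mentioned above. The project key, flag key, environment, and patch path are placeholders; verify the endpoint and payload against your flag provider's current API documentation (Flagsmith and others differ) and keep the action reversible.
#!/bin/bash
# Toggle a degraded-mode feature flag via the flag provider's REST API (sketch).
# All identifiers below are placeholders; confirm the API shape with your provider.
LD_API_TOKEN=${LD_API_TOKEN:?set an API token with write access}
PROJECT=default
FLAG=recommendations-enabled
ENVIRONMENT=production
DESIRED=${1:-false}   # usage: ./toggle_flag.sh false   (disable the feature)

curl -fsS -X PATCH \
  -H "Authorization: $LD_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data "[{\"op\": \"replace\", \"path\": \"/environments/$ENVIRONMENT/on\", \"value\": $DESIRED}]" \
  "https://app.launchdarkly.com/api/v2/flags/$PROJECT/$FLAG"
Pair the script with a test that asserts the resulting flag state, and log every toggle in the incident channel.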
Cheat sheet: Tactical commands and snippets
Quick network & DNS checks
# from a regional runner
curl -I https://yourapp.example.com
dig +short CNAME yourapp.example.com @8.8.8.8
traceroute -m 30 yourapp.example.com
Simple synthetic monitor (bash)
#!/bin/bash
# simple health check
URL=https://yourapp.example.com/health
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$URL")
if [ "$STATUS" != "200" ]; then
  echo "HEALTH_FAIL $STATUS $(date -u)"
  # call alert webhook here
else
  echo "HEALTH_OK $(date -u)"
fi
Checklist: Incident handling for third‑party outages (printable)
- Confirm anomaly with synthetic checks and RUM.
- IC declared + roles assigned.
- Run targeted diagnostics (curl/dig/traceroute) from multiple regions.
- Programmatically fetch provider status pages and log results.
- Mitigate: enable degraded mode, increase cache TTLs, apply rate limits.
- If pretested: trigger CDN/edge failover or DNS failover per decision matrix.
- Post canonical updates every 15 minutes until stabilized.
- Validate recovery with three green synthetic cycles across regions.
- Run blameless postmortem and schedule remediation items within 7 days.
Advanced strategies and 2026 predictions
- Multi‑control plane resilience: In 2026 expect more teams to deploy multi‑CDN + multi‑cloud patterns for critical surface area. This pushes the need for automated routing policies and unified observability. See multi-control plane patterns in the micro-regions & edge-first hosting analysis.
- Runbooks integrated with AI assistants: LLMs will draft incident timelines and triage suggestions, but organizations must validate the outputs and enforce guardrails (no automated failover without human approval unless explicitly tested). Guidance on safe AI agent policies is available at creating a secure desktop AI agent policy.
- Shift left for third‑party failures: Design for graceful degradation in the earliest architecture reviews rather than tacking it on later.
- Regulatory attention: As outages affect financial and healthcare services, expect stricter reporting obligations in some jurisdictions — keep rigorous logs.
Actionable takeaways
- Always test failover paths under low traffic. Untested failover is a risk.
- Program synthetic checks and provider status pulls — detection is faster when you combine signals. For serverless scheduling and observability workflows, see Calendar Data Ops.
- Keep communications simple: canonical status page + timed updates.
- Convert manual mitigations to safe, reversible automation and treat the runbook as code. Offline-first edge strategies and resilient field apps are covered in offline-first edge nodes.
- Run blameless postmortems and force at least one actionable remediation within a week.
Final checklist (one‑page)
- IC declared: Y/N
- Impact assessed (regions/users):
- Provider status checked: Cloudflare / AWS / X
- Mitigations executed: cache / feature flags / failover
- External status posted and linked
- Recovery validated by synthetic tests
- Postmortem scheduled
Call to action
If you run production services, adopt this runbook as code today. Fork the sample scripts and alert rules into your incident repo, schedule a failover drill this quarter, and run a blameless postmortem template at the end of the drill. Need a ready‑to‑use runbook template with Slack and status page integrations? Download the companion runbook kit and automated snippets from our repo or get in touch for a hands‑on resilience review.
Related Reading
- Postmortem: What the Friday X/Cloudflare/AWS Outages Teach Incident Responders
- Chaos Engineering vs Process Roulette: Using 'Process Killer' Tools Safely for Resilience Testing
- Micro‑Regions & the New Economics of Edge‑First Hosting in 2026
- Creating a Secure Desktop AI Agent Policy: Lessons from Anthropic’s Cowork