Downtime: Causes and Prevention
Downtime is costly. Understanding what causes it is the first step toward preventing it.
Costs
Financial: Downtime drives direct revenue loss and operational expense; impact varies by business model (source: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html).
Reputation: Users abandon slow or unavailable sites; slower load times increase bounce probability (source: https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/page-load-time-statistics/). SEO can also be affected by poor availability and performance.
Operational: Lost productivity, a heavier support burden, and stress on the team.
Causes
- Hardware Failures
- Drives crash, RAM fails, CPUs overheat, power supplies die, and network cards go bad.
Prevent: Redundant hardware, health monitoring, adequate cooling, spare parts, and cloud infrastructure.
- Software Bugs
- Bad deploys, failed database migrations, OS updates, plugin conflicts, and memory leaks.
Prevent: Testing, staging environments, canary releases, rollbacks, and performance monitoring.
- Traffic Spikes and DDoS
- Viral content, marketing campaigns, attacks, bot traffic, and sales events.
Prevent: Auto-scaling, CDNs, DDoS protection, traffic monitoring, and capacity planning.
- Database Issues
- Crashes, data corruption, slow queries, full disks, and exhausted connection pools.
Prevent: Regular backups, query optimization, replication, monitoring, and connection pooling.
- Network and DNS
- ISP outages, DNS failures, bad routing, cable cuts, and configuration errors.
Prevent: Multiple DNS providers, geographic failover, network monitoring, multiple ISPs, and configuration audits.
- Security
- Malware, ransomware, SQL injection, XSS, and brute-force attacks.
Prevent: Timely updates, strong authentication, a web application firewall (WAF), security audits, and staff training.
- Human Error
- Accidental deletions, bad configuration changes, deploys to the wrong environment, query mistakes, and firewall misconfiguration.
Prevent: Change management, automated deployments, reviews, documentation, and training.
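- Third-Party Services
- Payment provider outages, CDN issues, cloud disruptions, rate limits, and expired certificates.
Prevent: Diversified providers, graceful degradation, monitoring of provider status, backup payment options, and automatic certificate renewal.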
- Resource Exhaustion
- Full disks, exhausted memory, CPU pinned at 100%, saturated bandwidth, and maxed-out database connections.
Prevent: Resource monitoring, alerts, auto-scaling, cleanup jobs, and optimization.
- Environmental
- Power outages, natural disasters, construction, severe weather, and instability.
Prevent: Redundant power, geographic distribution, recovery plans, multi-cloud deployment, and backups.
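Several of the causes above (full disks, exhausted memory, saturated connections) come down to a resource crossing a limit before anyone notices. A minimal sketch of a threshold check for disk usage follows. The function names and the 80% and 90% thresholds are assumptions chosen for illustration, not recommended values.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of space in use on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def severity(value: float, warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Map a usage percentage to a level (thresholds are illustrative assumptions)."""
    if value >= critical_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    pct = disk_used_percent("/")
    print(f"disk used: {pct:.1f}% -> {severity(pct)}")
```

The same `severity` helper could be applied to memory or connection counts, as long as the value is expressed as a percentage of the limit.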
Prevention
Prevention works in layers: infrastructure, monitoring, automated response, and maintenance.
Infrastructure
Redundancy:
Primary data center (US-East)
├── Load balancers (2x)
├── Web servers (3x)
├── Database cluster (master + 2 slaves)
└── Backup power (UPS + generator)
Secondary data center (US-West)
├── Standby
├── Real-time replication
└── Automatic failover
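The failover decision implied by this layout can be sketched in a few lines. This is an illustrative model only: the `Site` class, its fields, and the minimum-node rule are assumptions, and real failover usually happens at the DNS or load balancer layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    healthy_nodes: int  # nodes currently passing health checks
    min_nodes: int      # hypothetical minimum needed to serve traffic


def choose_active(primary: Site, secondary: Site) -> Optional[str]:
    """Prefer the primary site; fail over only when it cannot serve traffic."""
    if primary.healthy_nodes >= primary.min_nodes:
        return primary.name
    if secondary.healthy_nodes >= secondary.min_nodes:
        return secondary.name
    return None  # neither site is viable: this is a full outage


if __name__ == "__main__":
    east = Site("us-east", healthy_nodes=1, min_nodes=2)
    west = Site("us-west", healthy_nodes=3, min_nodes=2)
    print(choose_active(east, west))
```

Preferring the primary whenever it is viable keeps the standby as a fallback only. A production setup would also need a rule for failing back without flapping between sites.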
Monitoring
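exit1.dev: 1-minute checks from global locations with intelligent alerts.
Uptime: 1-minute checks from multiple locations covering HTTP/HTTPS, SSL, and DNS.
Performance: Response times, page load, database queries, and CDN/static assets.
Business: User journeys, APIs, payments, and search.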
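To make the uptime and performance checks concrete, here is a minimal sketch of a single HTTP probe using only the Python standard library. It is not how exit1.dev or any particular service implements its checks. The `ProbeResult` type, the `classify` rules, and the 2-second slowness threshold are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    status: Optional[int]        # HTTP status code, or None if no response arrived
    elapsed_s: float             # wall-clock time spent on the request
    error: Optional[str] = None  # connection or timeout error, if any


def probe(url: str, timeout_s: float = 10.0) -> ProbeResult:
    """Fetch `url` once and record the status code and response time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
        return ProbeResult(status, time.monotonic() - start)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        return ProbeResult(exc.code, time.monotonic() - start)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        return ProbeResult(None, time.monotonic() - start, str(exc))


def classify(result: ProbeResult, slow_after_s: float = 2.0) -> str:
    """Turn a probe result into up/degraded/down (rules are assumptions)."""
    if result.status is None or result.status >= 500:
        return "down"
    if result.status >= 400 or result.elapsed_s > slow_after_s:
        return "degraded"
    return "up"
```

Calling `classify(probe("https://example.com"))` once a minute from a scheduler would approximate a basic single-location uptime check.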
Automated Response
Scale on traffic, fail over to backups, remove unhealthy servers from rotation, and warm caches.
Alerts: Immediate notification for critical issues, with escalation.
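Removing unhealthy servers is commonly done by counting consecutive failed health checks. The sketch below shows the idea under stated assumptions: the `BackendPool` class and the threshold of three failures are hypothetical, and a real load balancer would provide this behavior itself.

```python
from typing import Dict, Iterable, List


class BackendPool:
    """Track consecutive health-check failures for each backend."""

    def __init__(self, backends: Iterable[str], fail_threshold: int = 3) -> None:
        self.fail_threshold = fail_threshold
        self.failures: Dict[str, int] = {name: 0 for name in backends}

    def record(self, backend: str, ok: bool) -> None:
        """A success resets the counter; a failure increments it."""
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def in_rotation(self) -> List[str]:
        """Backends that have not reached the failure threshold."""
        return [b for b, n in self.failures.items() if n < self.fail_threshold]


if __name__ == "__main__":
    pool = BackendPool(["web-1", "web-2", "web-3"])
    for _ in range(3):
        pool.record("web-2", ok=False)
    print(pool.in_rotation())  # web-2 has been removed
```

Requiring several consecutive failures avoids pulling a server out of rotation because of a single dropped check, and one successful check puts it back.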
Maintenance
Security updates, database optimization, log cleanup, and health checks.
Recovery tests: Monthly failover drills, backup restores, network failover, and communication practice.
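A backup restore test is only useful if the restored data is verified. One simple approach, sketched here as an assumption rather than a complete procedure, is to compare a SHA-256 checksum of the original file with that of the restored copy. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)
```

For a database, a checksum of the dump file only confirms the file survived intact. Loading the dump into a scratch instance and running a few queries is a stronger test.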
Checklist
- Redundancy
- Monitoring
- Backups
exit1.dev
Detects issues with 1-minute checks, global coverage, and alerting, and provides data for spotting patterns.
Alerting: Multi-location checks, retries, contextual alerts, and integrations.
Coverage: Websites, APIs, SSL, and DNS.
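Multi-location checks and retries both exist to reduce false alarms. The sketch below shows one common way to combine them: retry each location, then alert only when more than a set fraction of locations still fail. It is a generic illustration and not a description of exit1.dev's actual logic. The function names, retry count, and quorum value are assumptions.

```python
from typing import Callable, Dict


def check_with_retries(check: Callable[[], bool], retries: int = 2) -> bool:
    """Run `check` up to retries + 1 times; succeed on the first passing attempt."""
    for _ in range(retries + 1):
        if check():
            return True
    return False


def should_alert(
    location_checks: Dict[str, Callable[[], bool]],
    retries: int = 2,
    quorum: float = 0.5,
) -> bool:
    """Alert only when more than `quorum` of locations fail after retries."""
    if not location_checks:
        raise ValueError("at least one location is required")
    failed = [
        name
        for name, check in location_checks.items()
        if not check_with_retries(check, retries)
    ]
    return len(failed) / len(location_checks) > quorum
```

With three locations and the default quorum, a failure seen from only one location does not trigger an alert, while a failure seen from two or more does.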
Response Plan
Prepare: Document systems, contacts, communication channels, and roles.
Detect: Alerts, user reports, notifications, and escalation.
Respond: Assess the impact, mobilize the team, update the status page, and investigate.
Recover: Confirm service is restored, monitor performance, communicate the resolution, and analyze what happened.
Learn: Root cause analysis, improvements, training, and monitoring enhancements.
Measure
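Availability: Uptime percentage, mean time to detect (MTTD), mean time to recovery (MTTR), and incidents per month.
Performance: Average response time, load trends, error rate, and user experience scores.
Business: Revenue lost, customer satisfaction, support tickets, and retention.
Improve: Monthly reviews, quarterly assessments, annual plans, and evaluations.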
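The availability metrics above can be computed directly from incident timestamps. The following sketch assumes a hypothetical `Incident` record and measures both MTTD and MTTR from the moment the incident started. Some teams measure MTTR from detection instead, so treat the definitions here as one possible convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime   # when the outage began
    detected: datetime  # when monitoring or a user noticed it
    resolved: datetime  # when service was restored


def uptime_percent(incidents: List[Incident], period: timedelta) -> float:
    """Share of the period without an outage, assuming incidents do not overlap."""
    down = sum((i.resolved - i.started for i in incidents), timedelta())
    return 100.0 * (1 - down / period)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time from incident start to detection."""
    return sum((i.detected - i.started for i in incidents), timedelta()) / len(incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from incident start to resolution (one common convention)."""
    return sum((i.resolved - i.started for i in incidents), timedelta()) / len(incidents)


if __name__ == "__main__":
    t0 = datetime(2025, 1, 10, 12, 0)
    incidents = [Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=30))]
    print(f"{uptime_percent(incidents, timedelta(days=30)):.3f}%")
    print(mttd(incidents), mttr(incidents))
```

A single 30-minute outage in a 30-day month already puts uptime at roughly 99.93%, which shows how little room a 99.9% target leaves.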
Conclusion
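Prevention beats reaction.
exit1.dev provides the foundation: 1-minute, global, intelligent monitoring. Combined with planning, response, and learning, it supports high availability.
Goal: Minimize the frequency and duration of downtime, and learn from every incident.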
Monitor with exit1.dev and catch problems before they become outages.
Sources
- AWS Well-Architected: Reliability Pillar — https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
- Google SRE Book: Managing Incidents — https://sre.google/sre-book/managing-incidents/
- Cloudflare Learning: What is a DDoS attack? — https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/
Recommended Free Monitoring Resources
- Free Uptime Monitor Checklist – Step-by-step actions to configure a free uptime monitor that catches incidents fast.
- Best Free Uptime Monitoring Tools (2025) – Compare the strongest free uptime monitor platforms and when to upgrade.
- Free Website Monitoring Tools 2025 Guide – Evaluate which free website monitor fits your stack and alerting needs.
- Free Website Monitoring for Developers – See how engineering teams automate alerts, SLO tracking, and reporting with a free website monitor.
Morten Pradsgaard is the founder of exit1.dev — the free uptime monitor for people who actually ship. He writes no-bullshit guides on monitoring, reliability, and building software that doesn't crumble under pressure.