Real-Time Alerts: Don't Delay
Delays cost money. Real-time alerting fixes problems while they're still small.
Why Immediate Detection Matters
E-commerce: the payment gateway fails at 2:00 PM. With 5-minute checks you detect it at 2:05 and fix it by 2:25 — 25 minutes of downtime and roughly $50k in lost sales. With 1-minute checks you detect at 2:01 and fix by 2:21 — 21 minutes of downtime, saving about $8k at the same rate of roughly $2k per minute.
SaaS: a database issue starts at 10:30. With 5-minute checks you detect it at 10:35 and restart by 10:37 — 7 minutes of downtime and 200 support tickets. With 1-minute checks you detect at 10:31 and restart by 10:33 — 3 minutes of downtime and 80 tickets.
API: 500 errors go undetected for 8 minutes and cascade into every dependent service. Real-time detection stops the cascade before it spreads.
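The arithmetic is worth making explicit. A back-of-the-envelope sketch in Python, using the e-commerce numbers above (a 20-minute fix and roughly $2k of sales lost per minute; both figures are illustrative):

# Worst-case downtime cost as a function of check interval.
# Detection can take up to one full check interval; the fix starts after detection.
def downtime_cost(check_interval_min, fix_time_min=20, cost_per_min=2_000):
    return (check_interval_min + fix_time_min) * cost_per_min

print(downtime_cost(5))  # 50000 -> $50k with 5-minute checks
print(downtime_cost(1))  # 42000 -> $42k with 1-minute checks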
Benefits
Quick Response
- Notified within seconds, not minutes
- Issues contained before they escalate
- The right people mobilized immediately
- Proactive communication with customers
Less Downtime
- Faster fixes
- Fewer users affected
- Customer trust preserved
- Revenue losses reduced
Better UX
- Services stay available
- Experience stays seamless
- Performance maintained
- Customer confidence built
Team Boost
- Focus on solving, not searching
- Less firefighting stress
- Better on-call work-life balance
- Decisions driven by data
Alert Types
Availability
HTTP status and timeout alerts:
// Availability alert: trigger on the first 5xx response or a 30-second timeout
const availability = {
  trigger: {
    statusCode: [500, 502, 503, 504], // server-side failure codes
    consecutiveFailures: 1,           // alert on the first failure, no waiting
    timeout: 30000                    // request timeout in milliseconds
  },
  notification: {
    channels: ['slack', 'email', 'sms'],
    severity: 'critical',
    escalation: {
      afterMinutes: 5,                // escalate if unacknowledged after 5 minutes
      toTeam: 'on-call'
    }
  }
};
Also monitor DNS resolution and connectivity, SSL certificate validity and expiry, and CDN health.
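Certificate expiry is easy to check ahead of time. A minimal sketch using the Python standard library; the hostname and the 14-day threshold are placeholders:

import socket
import ssl
import time

def days_until_cert_expiry(hostname, port=443):
    # Open a TLS connection and read the certificate's notAfter field
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return int((ssl.cert_time_to_seconds(cert['notAfter']) - time.time()) / 86400)

if days_until_cert_expiry('api.example.com') < 14:
    print('WARNING: SSL certificate expires in under 14 days')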
Performance
Response time alerts:
# Warn at 2 seconds, go critical at 5 seconds, after 3 consecutive slow checks
response_time_alerts:
  warning_threshold: 2000ms
  critical_threshold: 5000ms
  measurement_window: 3
  actions:
    warning:
      - notify_chat
      - log_issue
    critical:
      - page_engineer
      - auto_scale
      - notify_stakeholders
Resource alerts: CPU, memory, disk, and database connection pools.
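A minimal sketch of those resource checks on a single host, assuming the third-party psutil package; the limits are made up:

import psutil

THRESHOLDS = {'cpu': 85.0, 'memory': 90.0, 'disk': 90.0}  # assumed limits, in percent

def check_resources():
    readings = {
        'cpu': psutil.cpu_percent(interval=1),
        'memory': psutil.virtual_memory().percent,
        'disk': psutil.disk_usage('/').percent,
    }
    # Return one message per breached threshold
    return [f"{name} at {value:.0f}% (limit {THRESHOLDS[name]:.0f}%)"
            for name, value in readings.items() if value >= THRESHOLDS[name]]

for breach in check_resources():
    print('ALERT:', breach)  # hand off to your notifier of choice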
Business Logic
User journeys: registration, payment flow, search, third-party APIs.
Custom metrics: cart abandonment, login success rate, file uploads, integration health.
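As a sketch of a custom business-metric alert, here is a login success rate check; the 95% floor and the 50-attempt minimum are assumptions:

def login_success_alert(successes, attempts, min_rate=0.95):
    # Skip windows with too little traffic to judge
    if attempts < 50:
        return None
    rate = successes / attempts
    if rate < min_rate:
        return f"Login success rate {rate:.1%} is below {min_rate:.0%} for the last window"
    return None

print(login_success_alert(successes=430, attempts=500))  # fires: 86.0% < 95%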
Setup
Channels
Multi-channel notifications:
class Notifier:
    def __init__(self):
        # One sender per channel; severity decides how wide the fan-out is
        self.channels = {
            'slack': SlackNotifier(),
            'discord': DiscordNotifier(),
            'email': EmailNotifier(),
            'sms': SMSNotifier(),
            'webhook': WebhookNotifier()
        }

    def send(self, alert, severity='medium'):
        if severity == 'critical':
            # Critical: hit every channel immediately
            for channel in self.channels.values():
                channel.send_immediate(alert)
        elif severity == 'high':
            self.channels['slack'].send(alert)
            self.channels['email'].send(alert)
            self.channels['webhook'].send(alert)
        elif severity == 'medium':
            self.channels['slack'].send(alert)
            self.channels['email'].send(alert)
        else:
            # Low severity: chat only
            self.channels['slack'].send(alert)
Team routing: route by time of day, alert type, team expertise, and escalation level.
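A sketch of that routing in Python; the team names and the 09:00-17:00 UTC window are assumptions:

from datetime import datetime, timezone

ROUTES = {'database': 'dba-team', 'payment': 'payments-team'}  # alert type -> owning team

def route(alert_type, now=None):
    now = now or datetime.now(timezone.utc)
    # Outside business hours everything goes to the on-call rotation
    if not (9 <= now.hour < 17):
        return 'on-call'
    return ROUTES.get(alert_type, 'platform-team')

print(route('payment'))  # payments-team during the day, on-call otherwise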
Frequency and Timing
Group related alerts to cut noise:
const grouping = {
  groupingWindow: 300000, // group alerts that arrive within 5 minutes
  shouldGroup: (newAlert, groups) => {
    // Fold the new alert into an existing group for the same service and alert type
    return groups.find(group =>
      group.service === newAlert.service &&
      group.alertType === newAlert.alertType &&
      (Date.now() - group.lastAlert) < grouping.groupingWindow
    );
  },
  createMessage: (alerts) => ({
    title: `${alerts.length} alerts for ${alerts[0].service}`,
    summary: alerts.map(a => a.message).join('\n'),
    severity: Math.max(...alerts.map(a => a.severity)), // highest numeric severity wins
    actions: ['View', 'Ack All', 'Escalate']
  })
};
Escalation chain: 0 minutes primary on-call, 5 minutes secondary, 15 minutes engineering manager, 30 minutes executive.
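The same chain as a small lookup, so the escalation logic stays boring and testable:

ESCALATION = [(0, 'primary on-call'), (5, 'secondary on-call'),
              (15, 'engineering manager'), (30, 'executive')]

def who_to_page(minutes_unacknowledged):
    # Everyone who should have been paged by now
    return [who for after, who in ESCALATION if minutes_unacknowledged >= after]

print(who_to_page(16))  # ['primary on-call', 'secondary on-call', 'engineering manager']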
Content
Make every alert actionable:
{
  "alert": {
    "title": "API Down - Payments",
    "severity": "critical",
    "timestamp": "2024-01-15T14:30:00Z",
    "service": {
      "name": "Payment API",
      "url": "https://api.example.com/payments",
      "environment": "prod"
    },
    "issue": {
      "status_code": 500,
      "response_time": "timeout",
      "error_message": "Server Error",
      "affected_users": "150+"
    },
    "context": {
      "recent_deployments": "v2.1.3 2 hours ago",
      "traffic_pattern": "normal",
      "dependencies": ["db", "redis", "gateway"]
    },
    "suggested_actions": [
      "Check logs",
      "Verify db",
      "Rollback v2.1.2",
      "Contact gateway"
    ],
    "runbook": "https://docs.company.com/runbooks/payment-api-issues",
    "dashboard": "https://monitor.company.com/payment-api"
  }
}
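A sketch of pushing the inner alert object above into a chat channel through a Slack incoming webhook (the webhook URL is a placeholder; incoming webhooks accept a JSON body with a text field):

import json
import urllib.request

def post_to_slack(webhook_url, alert):
    # Condense the actionable payload into a short message with links to act on
    text = (f"*{alert['title']}* ({alert['severity']})\n"
            f"Runbook: {alert['runbook']}\n"
            f"Dashboard: {alert['dashboard']}")
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({'text': text}).encode(),
        headers={'Content-Type': 'application/json'},
    )
    urllib.request.urlopen(req, timeout=10)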
Quickstart with exit1.dev
exit1 add https://api.myapp.com/health \
  --name "API Health" \
  --interval 60 \
  --timeout 30 \
  --expected-status 200 \
  --alert-on-failure \
  --alert-on-recovery

exit1 alert add-channel slack \
  --webhook-url "https://hooks.slack.com/..." \
  --channel "#alerts" \
  --severity critical,high

exit1 alert add-channel discord \
  --webhook-url "https://discord.com/api/webhooks/..." \
  --severity critical

exit1 alert add-channel email \
  --addresses "team@company.com,oncall@company.com" \
  --severity high,medium
Advanced
Predictive Alerting
Trend-based alerts:
// Fire a warning before the hard threshold is breached
const predictive = {
  metric: 'response_time',
  analysis_window: '15_minutes',  // history used to fit the trend
  prediction_window: '5_minutes', // how far ahead to project
  trigger: {
    trend_direction: 'increasing',
    trend_slope: 0.2,             // minimum slope to treat as a trend
    confidence_threshold: 0.8
  },
  action: {
    message: "Response time trending up - issue brewing",
    severity: 'warning',
    suggested_actions: [
      'Check resources',
      'Review deploys',
      'Monitor spikes'
    ]
  }
};
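One way to get that trend signal is a least-squares slope over a sliding window of response times; the sample data and the 20 ms-per-minute threshold are made up:

def trend_slope(samples):
    # Least-squares slope of samples against their index (units per sample)
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

window = [180, 195, 210, 240, 260, 310]  # response times in ms, one sample per minute
if trend_slope(window) > 20:             # slope here is about 25 ms/minute
    print('WARNING: response time trending up - investigate before it breaches the SLO')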
Context-Aware Alerting
Business-hours-aware severity adjustment:
import datetime

class AwareAlerting:
    def __init__(self):
        self.business_hours = {
            'start': 9,
            'end': 17,
            'timezone': 'UTC',
            'weekdays_only': True
        }

    def is_business_hours(self, dt):
        # Monday=0 ... Sunday=6; weekends never count as business hours
        if self.business_hours['weekdays_only'] and dt.weekday() >= 5:
            return False
        return self.business_hours['start'] <= dt.hour < self.business_hours['end']

    def adjust_severity(self, base_severity, timestamp):
        # Interpret the timestamp in the configured UTC timezone
        dt = datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc)
        if self.is_business_hours(dt):
            # More users are affected during the day, so bump severity one level
            severity_map = {
                'low': 'medium',
                'medium': 'high',
                'high': 'critical'
            }
            return severity_map.get(base_severity, base_severity)
        elif base_severity in ['low', 'medium']:
            return base_severity  # off-hours: delay escalation of minor issues
        return base_severity
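Usage is a one-liner; a medium alert raised during weekday business hours comes back as high:

import time

alerting = AwareAlerting()
print(alerting.adjust_severity('medium', time.time()))  # 'high' in business hours, 'medium' otherwise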
Conclusion
Real-time alerting isn't optional. Implement it intelligently.
exit1.dev gives you the foundation: 1-minute checks, global monitoring, and intelligent alerts. Add sensible configuration, escalation policies, and continuous improvement, and it becomes your first line of defense.
The goal: the right alerts, at the right time, with the right information. That reduces stress, boosts efficiency, and keeps your sites running.
Try exit1.dev here. Alerts that help, not overwhelm.
Sources
- Google SRE Book: Monitoring Distributed Systems — https://sre.google/sre-book/monitoring-distributed-systems/
- AWS Well-Architected: Reliability Pillar — https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
Recommended Free Monitoring Resources
- Free Uptime Monitor Checklist – Step-by-step actions to configure a free uptime monitor that catches incidents fast.
- Best Free Uptime Monitoring Tools (2025) – Compare the strongest free uptime monitor platforms and when to upgrade.
- Free Website Monitoring Tools 2025 Guide – Evaluate which free website monitor fits your stack and alerting needs.
- Free Website Monitoring for Developers – See how engineering teams automate alerts, SLO tracking, and reporting with a free website monitor.
Morten Pradsgaard is the founder of exit1.dev — the free uptime monitor for people who actually ship. He writes no-bullshit guides on monitoring, reliability, and building software that doesn't crumble under pressure.