DNSSEC Rollover Gone Wrong: Lessons from High-Profile Failures
DNSSEC is unforgiving. A small mistake in key rollover can take down domains for hours or days. It's happened to some of the most sophisticated organizations in the world.
These case studies show what goes wrong and why. Learn from their expensive mistakes.
The NASA Outage (2012)
What Happened
NASA's websites became unreachable for users with DNSSEC-validating resolvers. The outage lasted several hours.
The Cause
During a routine DNSSEC key rollover:
- NASA published a new DNSKEY
- Updated the DS record at the parent zone
- Removed the old DNSKEY too quickly
- Cached DS records still pointed to the old key
- The chain of trust broke
- Validating resolvers returned SERVFAIL
The Timeline
- New keys published
- DS updated (but not propagated everywhere)
- Old keys removed
- Users with cached old DS couldn't validate new keys
- Outage begins
- Emergency rollback initiated
- Old keys restored
- Service gradually restored as caches updated
Key Lesson
Timing is critical. You can't remove old keys until the new DS has propagated everywhere AND all cached copies of the old DS have expired.
Safe rollover requires:
- Publish the new key
- Wait for the DNSKEY TTL, so resolvers that cached the old DNSKEY RRset have had time to pick up the new one
- Update the DS at the parent
- Wait for DS propagation plus the old DS record's TTL
- THEN remove the old key
This takes weeks, not hours.
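If you'd rather check the timing than eyeball it, here's a minimal sketch using Python with dnspython (an assumption; the zone name, resolver address, and the new key's tag are placeholders). It confirms the parent is serving a DS for the new key and reads the DS TTL so you know the minimum wait before the old key can go.

```python
# Sketch: before removing the old DNSKEY, confirm the parent is serving
# a DS for the NEW key and read the DS TTL to size the waiting period.
# Assumes dnspython is installed; zone, resolver IP, and key tag are placeholders.
import dns.resolver

ZONE = "example.com"
NEW_KEY_TAG = 12345  # key tag of the newly published key (placeholder)

resolver = dns.resolver.Resolver()
resolver.nameservers = ["8.8.8.8"]  # any recursive resolver you trust

# DS records are published by the parent zone but fetched like any other RRset.
ds_answer = resolver.resolve(ZONE, "DS")
ds_tags = {rr.key_tag for rr in ds_answer}
ds_ttl = ds_answer.rrset.ttl

print(f"DS key tags at parent: {ds_tags}, DS TTL: {ds_ttl}s")

if NEW_KEY_TAG not in ds_tags:
    raise SystemExit("New DS not visible yet - do NOT remove the old key.")

# Even once the new DS is visible here, stale DS records may still sit in
# other resolvers' caches. Conservative rule: wait at least the DS TTL
# (plus a generous margin) after the parent change before pulling the old key.
print(f"Earliest safe removal: at least {ds_ttl} seconds after the parent "
      "update, plus margin.")
```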
The Slack DNS Incident (2021)
What Happened
Slack became unreachable for many users behind DNSSEC-validating resolvers.
The Cause
During DNS infrastructure changes:
- DNSSEC signatures became stale
- Automated re-signing process failed
- Signatures expired
- Validating resolvers returned SERVFAIL
- Mixed user experience (some could access, some couldn't)
The Complexity
Slack uses complex DNS infrastructure:
- Multiple DNS providers
- Traffic management
- Geographic distribution
- Failover systems
One component's signing process failed, but not all queries went through that path. Result: intermittent, hard-to-diagnose failures.
Key Lesson
Complex infrastructure needs comprehensive monitoring. When you have multiple DNS paths, you need to monitor DNSSEC status on all of them.
Also: Automated signing can fail silently. Just because you set up automation doesn't mean it's working.
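A rough sketch of that kind of check, in Python with dnspython (assumed available): query each provider's authoritative servers directly and flag any path that answers without RRSIGs. The provider names and IPs below are placeholders.

```python
# Sketch: ask each DNS provider's authoritative server directly and check
# that answers still carry RRSIGs. Provider names and IPs are placeholders.
import dns.message
import dns.query
import dns.rdatatype

ZONE = "example.com"
PROVIDER_SERVERS = {
    "provider-a": "192.0.2.53",
    "provider-b": "198.51.100.53",
}

for provider, server in PROVIDER_SERVERS.items():
    # want_dnssec=True sets the DO bit so signatures come back in the answer.
    query = dns.message.make_query(ZONE, dns.rdatatype.SOA, want_dnssec=True)
    response = dns.query.udp(query, server, timeout=5)

    signed = any(rrset.rdtype == dns.rdatatype.RRSIG for rrset in response.answer)
    print(f"{provider} ({server}): {'signed OK' if signed else 'NO RRSIG - investigate'}")
```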
The Comcast DNSSEC Validation Issue (2015)
What Happened
Comcast customers suddenly couldn't access many websites. The ISP's validating resolvers were returning SERVFAIL for a large number of domains.
The Cause
This wasn't a rollover failure - it was a validation implementation bug:
- Comcast enabled stricter DNSSEC validation
- Many domains had subtle DNSSEC issues (previously ignored)
- Stricter validation = sudden failures
- Millions of customers affected
The Domino Effect
Domains that "worked fine" suddenly broke because:
- Expired signatures (that non-validating resolvers ignored)
- Missing NSEC records
- Algorithm mismatches
- Clock skew issues
Comcast's resolvers started enforcing the spec strictly. Spec-violating domains stopped resolving.
Key Lesson
Your DNSSEC might be broken and you don't know it. Non-validating resolvers hide problems. When a major ISP enables validation, suddenly your problems are visible.
Test with validating resolvers, not just your ISP's default.
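Here's a minimal sketch of that kind of test with Python and dnspython (an assumption; DNSViz gives a much deeper picture). It asks Google's public resolver, which validates DNSSEC, and treats SERVFAIL or a missing AD bit as a warning sign. The domain is a placeholder.

```python
# Sketch: query a validating public resolver and check the AD flag.
# SERVFAIL from a validating resolver is the classic symptom of broken DNSSEC.
import dns.flags
import dns.resolver

DOMAIN = "example.com"  # placeholder

resolver = dns.resolver.Resolver()
resolver.nameservers = ["8.8.8.8"]        # Google Public DNS validates DNSSEC
resolver.use_edns(0, dns.flags.DO, 1232)  # ask for DNSSEC records
resolver.flags = dns.flags.RD | dns.flags.AD

try:
    answer = resolver.resolve(DOMAIN, "A")
    if answer.response.flags & dns.flags.AD:
        print(f"{DOMAIN}: resolved and validated (AD bit set)")
    else:
        print(f"{DOMAIN}: resolved but NOT validated - check your chain of trust")
except dns.resolver.NoNameservers as exc:
    # Raised when every nameserver fails, e.g. answers SERVFAIL.
    print(f"{DOMAIN}: SERVFAIL - likely a DNSSEC validation failure ({exc})")
```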
The .gov DNSSEC Incident (2014)
What Happened
Multiple .gov domains became unreachable for DNSSEC-validating users.
The Cause
The .gov registry had DNSSEC signing issues:
- Scheduled key rollover began
- Automation had a bug
- Some domains weren't properly signed
- Partial outage across .gov
When the registry itself has DNSSEC problems, all domains under it are affected.
The Scope
- Affected federal government websites
- Taxpayer services
- Citizen information portals
- Government communications
Key Lesson
TLD-level failures affect everyone. If you're depending on a TLD with DNSSEC enabled, you're depending on their operational competence.
Nothing you can do if the registry breaks. This is an accepted risk of DNSSEC.
The Swedish TLD (.se) Incident (2009)
What Happened
The entire .se TLD stopped resolving correctly. Every .se domain was affected.
The Cause
During routine DNSSEC operations:
- New zone file was generated
- File contained corruption
- Zone was signed including corrupted data
- DNSSEC signatures were valid (signed corrupted data correctly)
- But NS records were wrong
- Entire TLD resolution failed
The Scope
- Every .se domain affected
- Millions of Swedish websites
- Swedish businesses, government, services
- Lasted several hours
Key Lesson
DNSSEC doesn't validate semantic correctness. It proves data wasn't tampered with. It doesn't prove the data is right.
Garbage in, cryptographically-signed garbage out.
Zone validation must happen before signing, not after.
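A sketch of the kind of pre-signing sanity check that catches this class of problem, using Python and dnspython (an assumption; production pipelines typically also run named-checkzone or equivalent). The file path and origin are placeholders.

```python
# Sketch: sanity-check a zone file BEFORE it is signed. A syntax error makes
# from_file() raise; we also refuse to sign if the apex is missing SOA or NS.
# Path and origin are placeholders.
import dns.rdatatype
import dns.zone

ZONE_FILE = "db.example.com"
ORIGIN = "example.com."

zone = dns.zone.from_file(ZONE_FILE, origin=ORIGIN, relativize=False)

soa = zone.get_rdataset(ORIGIN, dns.rdatatype.SOA)
ns = zone.get_rdataset(ORIGIN, dns.rdatatype.NS)

if soa is None or ns is None:
    raise SystemExit("Apex is missing SOA or NS records - refuse to sign.")

print(f"Zone looks sane: SOA serial {soa[0].serial}, {len(ns)} NS records. OK to sign.")
```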
The Root Key Rollover (2018)
What Happened
ICANN performed the first-ever root zone KSK rollover - changing the ultimate trust anchor for all of DNSSEC.
The Risk
If this went wrong:
- Every DNSSEC-enabled domain could become unreachable
- From validating resolvers worldwide
- Internet-wide impact
The Preparation
ICANN spent years preparing:
- Multiple delays to assess readiness
- Extensive communication with operators
- Fallback procedures developed
- Monitoring infrastructure deployed
The Result
The rollover succeeded with minimal issues:
- A few resolvers failed to update their trust anchors
- Some minor outages reported
- Overall: success
Key Lesson
Operational excellence is possible for DNSSEC, but requires:
- Extensive planning
- Clear communication
- Fallback procedures
- Patience (ICANN delayed multiple times to get it right)
Common Patterns in DNSSEC Failures
Analyzing these incidents reveals patterns:
Pattern 1: Timing Violations
Most common failure mode:
- Removing keys too early
- Not waiting for propagation
- Ignoring TTLs and caching
Rule: When in doubt, wait longer.
Pattern 2: Automation Failures
Automated signing is great until:
- Cron job stops running
- Permissions change
- Disk fills up
- Process gets killed
Rule: Monitor your automation. Alert on signing failures.
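One concrete way to catch that failure mode is to watch how close your RRSIGs are to expiry. A hedged sketch with Python and dnspython (an assumption); the zone, server IP, and seven-day threshold are placeholders to adapt.

```python
# Sketch: alert when RRSIGs are close to expiry - the typical symptom of
# re-signing automation that has quietly stopped. Values are placeholders.
import time

import dns.message
import dns.query
import dns.rdatatype

ZONE = "example.com"
SERVER = "192.0.2.53"              # one of your authoritative servers
ALERT_THRESHOLD = 7 * 24 * 3600    # warn when less than 7 days remain

query = dns.message.make_query(ZONE, dns.rdatatype.SOA, want_dnssec=True)
response = dns.query.udp(query, SERVER, timeout=5)

now = time.time()
for rrset in response.answer:
    if rrset.rdtype != dns.rdatatype.RRSIG:
        continue
    for rrsig in rrset:
        remaining = rrsig.expiration - now  # expiration is a POSIX timestamp
        if remaining < ALERT_THRESHOLD:
            print(f"ALERT: RRSIG on {rrset.name} expires in {remaining / 3600:.1f} hours")
        else:
            print(f"OK: RRSIG on {rrset.name} valid for {remaining / 86400:.1f} more days")
```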
Pattern 3: Incomplete Testing
Testing with non-validating resolvers misses:
- Expired signatures
- Chain of trust issues
- Algorithm problems
Rule: Test with validating resolvers. Use DNSViz.
Pattern 4: Coordination Gaps
Complex environments need coordination:
- Multiple DNS providers
- Registry and registrar
- Internal teams
Rule: Document dependencies. Communicate changes.
Pattern 5: Cascading Failures
One DNSSEC issue leads to:
- Some users affected
- Difficult diagnosis (works for some, not others)
- Slow recognition of problem
- Extended outage
Rule: Have runbooks for partial outages.
Preventing Rollover Disasters
Before Rollover
- Verify current state is healthy
  - All signatures valid
  - Chain of trust intact
  - No existing issues
- Document the plan
  - Specific steps and timing
  - Who does what
  - Success criteria
  - Rollback procedure
- Communicate
  - Internal teams aware
  - Monitoring enhanced
  - On-call briefed
During Rollover
- Follow the timeline exactly
  - Don't rush
  - Don't skip steps
  - Wait for propagation
- Monitor continuously
  - Validation status
  - Error rates
  - User reports
- Have rollback ready
  - Know exactly how to undo
  - Keep old keys available
  - Test rollback in staging if possible
After Rollover
- Verify from multiple locations
  - Different resolvers
  - Different ISPs
  - Different countries
- Monitor for delayed issues
  - Cache expirations
  - Edge cases
  - Late reporters
- Document what happened
  - Update runbooks
  - Share learnings
  - Improve for next time
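The "verify from multiple locations" step can be partly scripted. A sketch with Python and dnspython (an assumption): resolve the domain through several public resolvers that validate DNSSEC and flag any failure. It complements, but doesn't replace, checks from genuinely different networks and a DNSViz run.

```python
# Sketch: after a rollover, resolve the domain through several validating
# public resolvers and flag failures. The domain is a placeholder.
import dns.flags
import dns.resolver

DOMAIN = "example.com"
PUBLIC_RESOLVERS = {
    "Google": "8.8.8.8",
    "Cloudflare": "1.1.1.1",
    "Quad9": "9.9.9.9",
}

for name, ip in PUBLIC_RESOLVERS.items():
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]
    resolver.use_edns(0, dns.flags.DO, 1232)
    resolver.flags = dns.flags.RD | dns.flags.AD
    try:
        answer = resolver.resolve(DOMAIN, "A")
        validated = bool(answer.response.flags & dns.flags.AD)
        print(f"{name}: resolved, validated={validated}")
    except Exception as exc:  # SERVFAIL, timeout, etc.
        print(f"{name}: FAILED ({type(exc).__name__})")
```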
When to Abort
Signs you should stop and rollback:
- SERVFAIL from validating resolvers
- DNSViz showing chain of trust errors
- User reports of unreachable site
- Monitoring alerts for DNS failures
- Any unexpected behavior
Better to roll back and try again than to push through a broken rollover.
The Bottom Line
DNSSEC failures at major organizations prove that:
- DNSSEC is operationally complex
- Even experts make mistakes
- Timing and coordination are critical
- Monitoring is essential
- Rollback capability is mandatory
The organizations that recovered fastest had:
- Clear runbooks
- Good monitoring
- Quick escalation paths
- Pre-planned rollback procedures
Learn from their failures. Your rollover doesn't have to be a disaster.
Recommended Reading
- DNSSEC Explained - Basics
- Common DNSSEC Misconfigurations - Detection
- Is DNSSEC Worth It? - Risk/benefit analysis
Morten Pradsgaard is the founder of exit1.dev — the free uptime monitor for people who actually ship. He writes no-bullshit guides on monitoring, reliability, and building software that doesn't crumble under pressure.