How to Handle a Website Outage: Troubleshooting and Restoring Availability
John Locke outlines a methodical, layered approach to diagnosing website outages. He begins at the DNS level, checking domain registration via whois and name server responses using dig, then moves into network routing, hardware, and database infrastructure. Each stage includes practical tools like ping, traceroute, and recovery consoles to isolate faults.
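The layer-by-layer idea can be illustrated with a minimal sketch. It is not taken from Locke's article: it swaps the command-line tools for Python standard-library calls (name resolution, a TCP connect, an HTTPS request), and the names first_failing_layer and build_checks are hypothetical. It assumes the site is served over HTTPS on the given port.

```python
import socket
import urllib.request
from typing import Callable, List, Optional, Tuple

# A check is a (layer name, callable) pair; the callable raises on failure.
Check = Tuple[str, Callable[[], None]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first layer that fails, else None."""
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # any error means this layer is the suspect
            print(f"{name}: FAILED ({exc})")
            return name
        print(f"{name}: ok")
    return None


def build_checks(host: str, port: int = 443, timeout: float = 5.0) -> List[Check]:
    """Build DNS, TCP, and HTTP checks for one host (hypothetical helper)."""

    def dns() -> None:
        # Rough stand-in for a dig lookup: does the name resolve at all?
        socket.getaddrinfo(host, port)

    def tcp() -> None:
        # Rough stand-in for ping/traceroute: can a connection be opened?
        with socket.create_connection((host, port), timeout=timeout):
            pass

    def http() -> None:
        # urlopen raises HTTPError on 4xx/5xx responses, URLError on transport errors.
        with urllib.request.urlopen(f"https://{host}:{port}/", timeout=timeout):
            pass

    return [("dns", dns), ("tcp", tcp), ("http", http)]
```

Calling first_failing_layer(build_checks("example.com")) stops at the lowest layer that fails, which mirrors the idea of ruling out DNS before looking at routing or the server itself. It does not check domain registration; that still needs whois.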
The article highlights common causes like expired domains, misconfigured name servers, routing failures, server hardware faults, and resource exhaustion from crawlers or database deadlocks. One unexpected but insightful callout: “unruly crawlers” (e.g. from AI services) have recently caused more downtime than hardware issues.
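One way to spot an unruly crawler is to count requests per user agent in the web server's access log. The sketch below is an illustration rather than anything from the article: it assumes the common "combined" log format, where the user agent is the last double-quoted field on each line, and top_user_agents is a hypothetical name.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: combined log format, so the user agent is the final quoted field.
_UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def top_user_agents(lines: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent user agents with their request counts."""
    counts: Counter = Counter()
    for line in lines:
        match = _UA_AT_END.search(line)
        if match:  # skip lines that do not end in a quoted field
            counts[match.group(1)] += 1
    return counts.most_common(n)
```

Passing an open log file works because a file object yields lines. A single agent far ahead of the rest is a candidate for rate limiting or blocking.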
Locke’s advice is concrete and actionable: monitor disk utilisation, test layers sequentially, tune databases, and, if necessary, migrate VPS instances. The post excels as a clear, developer-focused troubleshooting guide. However, it says little about preventative design such as load balancing, automated failover, or redundant infrastructure.
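The advice to monitor disk utilisation can be sketched with the standard library alone. The function names and the 90% threshold below are assumptions chosen for illustration, not values from the article.

```python
import shutil


def disk_utilisation(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against pseudo-filesystems reporting no size
        return 0.0
    return usage.used / usage.total


def is_disk_nearly_full(path: str = "/", threshold: float = 0.9) -> bool:
    """True when utilisation is at or above threshold (0.9 is an assumed default)."""
    return disk_utilisation(path) >= threshold
```

A check like this could run from a scheduled job and raise an alert before a full disk takes the database or web server down.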