The Aftermath of AWS's Crippling Database Glitch
Increased error rates, health check failures, connectivity issues, and DNS resolution problems wreaked havoc on sites, services, and apps.
Welcome to the Cloud Database Report. I’m John Foley, a long-time tech journalist, who also worked in strategic comms at Oracle, IBM, and MongoDB. Connect with me on LinkedIn.
Another day, another global computer system failure. This one started with Amazon’s DynamoDB cloud database.
Late in the day on Sunday, Oct. 19, AWS began experiencing operational issues in its US-East-1 region, whose epicenter is in northern Virginia. The glitch spread like a bad cold on an airplane. Websites, online services, and apps were severely hampered throughout Monday, the start of the business week.
Early estimates put the potential cost of the snafu in the hundreds of billions of dollars. Why so much? According to the New York Times, the global brands impacted included Canva, Coinbase, Hulu, Lloyds Bank, McDonald’s, Microsoft 365, Slack, Venmo, WhatsApp, and Zoom. Also affected were airlines, restaurants, financial services, and e-payment/point-of-sale systems. Even Internet-connected smart home devices were on the blink.
The last time the world experienced such a far-reaching service outage related to a software glitch was CrowdStrike’s infamous Content Validator bug in July 2024. At the time, it was described as the largest IT failure ever. This one may be worse.
AWS provided a detailed accounting of the cause and impact, which it described as happening in three phases over 14 hours. In AWS’s own words:
First, between 11:48 PM PT on October 19 and 2:40 AM PT on October 20, Amazon DynamoDB experienced increased API error rates in the N. Virginia (us-east-1) Region.
Second, between 5:30 AM and 2:09 PM on October 20, Network Load Balancer (NLB) experienced increased connection errors for some load balancers in the N. Virginia (us-east-1) Region. This was caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs.
Third, between 2:25 AM and 10:36 AM on October 20, new EC2 instance launches failed and, while instance launches began to succeed from 10:37 AM, some newly launched instances experienced connectivity issues which were resolved by 1:50 PM.
The sequence of events was initially triggered by a Domain Name System (DNS) resolution error, according to AWS. DNS misconfigurations have long been associated with cloud service failures.
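For developers wondering what that kind of failure looks like from the outside, here is a minimal sketch, using only Python's standard library, that checks whether a regional DynamoDB endpoint still resolves before sending traffic its way. The fallback order is purely illustrative, not AWS guidance.

```python
import socket

# Regional DynamoDB endpoints, in order of preference.
# The fallback order here is illustrative, not an AWS recommendation.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.us-west-2.amazonaws.com",
]

def first_resolvable(endpoints):
    """Return the first endpoint whose hostname still resolves via DNS."""
    for host in endpoints:
        try:
            socket.getaddrinfo(host, 443)  # raises socket.gaierror on DNS failure
            return host
        except socket.gaierror:
            continue  # name resolution failed; try the next region
    return None

if __name__ == "__main__":
    endpoint = first_resolvable(ENDPOINTS)
    print(endpoint or "No regional endpoint resolved")
```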
The little bit of good news: By day’s end on Monday, AWS said it had fully recovered.
The reaction
Industry experts responded as they always do when these things happen: with explanations of the root causes, warnings about over-reliance on too few service providers and single points of failure, and reminders about the need for resiliency in data and network architectures.
Here’s a roundup of some of the after-the-fact commentary:
“Today’s AWS outage is a wake-up call for every business: If your stack depends on a single region or has hard dependencies on services that do, you’re prone to losing business continuity in the event of this class of cloud provider outages.” - Spencer Kimball, CEO Cockroach Labs, on LinkedIn.
“This is about ensuring the uptime of a critical service, no matter what happens, not ‘if it happens.’ We call this approach ultra-resilience. Be aware of the classes of failures that can hit you, and be deliberate about building resilience into your architecture.” - Karthik Ranganathan, co-CEO Yugabyte, on LinkedIn.
“There’s been a massive amount of centralization in terms of our dependence on a small number of cloud providers. When they go down, so much of what we depend on goes down.” - Prof. David Choffnes, director of Northeastern University’s Cybersecurity and Privacy Institute, in Northeastern Global News.
“The bottom line is clear: Public cloud remains the best option for scalable infrastructure if you invest in resilience upfront—or correct existing deployments if necessary. Don’t let fear steer you toward costly or ineffective alternatives; instead, double down on architecture, process discipline and transparent partnerships with your providers.” - Lydia Leong, Distinguished VP Analyst, Gartner.
“Failures increasingly trace to [data] integrity. Corrupted data, failed validation or, in this case, broken name resolution that poisoned every downstream dependency. Until we better understand and protect integrity, our total focus on uptime is an illusion.” - Davi Ottenheimer, VP with Inrupt, a provider of enterprise wallet infrastructure, in Wired.
“The only solution is to start thinking about resiliency at the Board level.” - Mehdi Daoudi, CEO of Catchpoint Systems, in IT Brew.
The need for 9s
How does this sh*t continue to happen? Even the most steadfast experts on system and network reliability admit the occasional blip is nearly unavoidable. Blame it on the complexity of large-scale, global systems that support millions of people and devices, billions of transactions, and exabytes of data each minute of the day.
Yugabyte’s Ranganathan puts it this way: “We can only aspire to reduce the probability of failure, not eliminate it. However small the chance of failure, given enough scale and exposure, there will inevitably be one.”
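That point is just arithmetic. With a per-operation failure probability p and n operations, the chance of at least one failure is 1 - (1 - p)^n, and at cloud scale it approaches certainty. The numbers below are hypothetical:

```python
# Hypothetical numbers: a one-in-a-billion failure chance per operation
# becomes a near-certainty across ten billion operations a day.
p = 1e-9             # assumed probability that any single operation fails
n = 10_000_000_000   # assumed operations handled per day

p_at_least_one_failure = 1 - (1 - p) ** n
print(f"P(at least one failure per day) = {p_at_least_one_failure:.5f}")  # ~0.99995
```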
True enough, but there are steps technologists can take to minimize system failures and outages, if not completely eliminate them. A common measure of system availability is the so-called “nines.” Four nines (99.99%) of availability translates into about 52 minutes of downtime per year. Five nines (99.999%) equals about 5 minutes of downtime per year. Six nines (99.9999%) brings it down to 31.5 seconds of downtime per year.
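Those downtime numbers fall straight out of the percentages. Here is the arithmetic, assuming a 365-day year:

```python
# Downtime budget per year for a given number of "nines" (365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines in (4, 5, 6):
    unavailability = 10 ** -nines                   # e.g., 4 nines -> 0.0001
    downtime_min = unavailability * MINUTES_PER_YEAR
    print(f"{nines} nines: {downtime_min:.2f} minutes "
          f"({downtime_min * 60:.1f} seconds) of downtime per year")
# 4 nines: 52.56 min, 5 nines: 5.26 min, 6 nines: 31.5 seconds
```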
Amazon DynamoDB is available to customers with a 99.99% SLA and, when Global Tables are implemented (i.e. where database tables are replicated across multiple regions), there’s a 99.999% SLA. But those SLAs are for the managed DynamoDB services that AWS provides to its customers, and presumably don’t apply to AWS’s own mega outage.
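For readers curious what “implementing Global Tables” involves, here is a minimal boto3 sketch that adds a second-region replica to an existing table. The table name is hypothetical, and the code assumes the table already meets the Global Tables prerequisites (such as DynamoDB Streams enabled); treat it as an illustration rather than a production script.

```python
import time

import boto3

# Hypothetical table name; assumes the table already satisfies the
# Global Tables prerequisites (e.g., DynamoDB Streams enabled).
TABLE = "orders"

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Add a replica in a second region, turning the table into a Global Table.
# Writes in either region replicate asynchronously to the other.
ddb.update_table(
    TableName=TABLE,
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)

# Wait until the new replica reports ACTIVE before relying on it.
while True:
    table = ddb.describe_table(TableName=TABLE)["Table"]
    replicas = {r["RegionName"]: r.get("ReplicaStatus") for r in table.get("Replicas", [])}
    if replicas.get("us-west-2") == "ACTIVE":
        break
    time.sleep(10)
```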
You do pay a premium for the higher availability levels, of course. According to Perplexity, five 9s can cost ten times as much as four 9s. But then, if the cost of downtime is millions of dollars per hour, the extra cost may be worth it.
No doubt, CIOs, CTOs, and business leaders will use this latest cloud-service mishap to reassess their own level of readiness. They will go back to the drawing board on resiliency, backup, data distribution & replication, availability zones, contingency planning, and disaster recovery. And rightly so, because this may have been AWS’s latest and biggest service outage, but it won’t be the last.
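For what that drawing-board work can look like in practice, here is a hedged sketch of a client-side read that prefers a primary region and fails over to a Global Tables replica when the primary is unreachable. The regions, table, and key are hypothetical; a real deployment would add health checks, backoff, and deliberate write routing.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

# Hypothetical table replicated to both regions via Global Tables.
TABLE = "orders"
REGIONS = ["us-east-1", "us-west-2"]  # primary first, replica second

def get_item_with_failover(key):
    """Try the primary region first; fall back to the replica on failure."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2,
                          retries={"max_attempts": 1}),
        )
        try:
            return client.get_item(TableName=TABLE, Key=key)
        except (ClientError, ConnectTimeoutError, EndpointConnectionError) as err:
            last_error = err  # note the failure and try the next region
    raise last_error

# Usage: item = get_item_with_failover({"order_id": {"S": "12345"}})
```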



