A Year of Database Glitches, Failures & Outages
AWS, Google Cloud, Microsoft, Oracle, Slack, and Snowflake didn't escape unscathed.
Welcome to the Cloud Database Report! I’m John Foley, a long-time tech journalist who also worked in strategic comms at Oracle, IBM, and MongoDB. Connect with me on LinkedIn. Please subscribe (free or paid) to support my work.
Where are all the self-healing databases?
We’ve been hearing about databases that autonomously repair themselves for more than 20 years. It was back in 2004 that the Journal of Intelligent Information Systems published research on the design of a self-healing database system. That same year, Oracle published a seminal white paper, “The Self-Healing Database.”
It goes to show that the tech industry doesn’t always move at lightning speed. Here it is 2025, and we’re still waiting for the full promise of self-detecting, self-diagnosing, self-fixing databases. And without them, database gotchas are costing businesses billions of dollars in downtime.
Q4 provided two high-profile examples of what can go wrong. On Oct. 19, a glitch with AWS’s DynamoDB service disrupted digital business for global brands including Canva, Coinbase, Hulu, Lloyds Bank, and McDonald’s. See my full analysis here.
Then, on Nov. 18, Cloudflare experienced a disruption that froze websites and services, including X and ChatGPT. Cloudflare blamed the failure on a change in database configuration that caused its services — core CDN and security services, Turnstile, Workers KV, Dashboard, and more — to topple like dominoes.
Cloudflare’s postmortem was painful to read. The company initially thought it was under cyberattack, then figured out the failure was internal. Intra-system dependencies and unrelated problems (e.g., its Status Page happened to go on the blink) further confused the situation. It took eight hours to repair and recover.
There were other miscues throughout the year.
In January, Microsoft Azure networking issues affected Azure Databricks, Azure Database for PostgreSQL, Azure SQL Database, and other services.
In February, Slack suffered a 9-hour service outage related to database sharding.
In April, a hospital group was forced to switch to paper-based patient records following a maintenance mishap involving Oracle, according to CNBC.
In June, it was Google Cloud’s turn with a 7-hour outage that affected various products, including some of its data and database services.
In July, a coding agent wiped out a live database.
In November, border crossings in war-torn Ukraine were temporarily halted due to a technical issue with a new database.
In December, Snowflake responded to two separate database-related outages.
It’s not always clear which database platforms were involved in these misadventures, but a little digging sometimes gets at it. The New Stack article on the Slack outage (link above) surmises that it might trace back to MySQL/Vitess. Forensics are often required.
The existing HA playbook isn’t enough
IT leaders recognize that database-related glitches pose a risk to their business operations. In a recent survey by Yugabyte, high availability (HA) and resilience were the #1 priority — cited by 71% of respondents — when technologists were asked what database attributes were most important to their modernization plans.
The existing resiliency playbook includes a handful of familiar strategies: HA architecture (clustering, replication, failover); distribution among zones and regions; backup and recovery; monitoring and observability; and change and release management.
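To make the failover piece of that playbook concrete, here’s a minimal sketch in Python of the health-check-and-promote loop that HA managers automate. The check_primary and promote_replica helpers are hypothetical stand-ins, not any particular product’s API:

```python
import time

FAILURE_THRESHOLD = 3    # consecutive failed probes before we fail over
PROBE_INTERVAL_SECS = 5  # how often the watchdog checks the primary

def check_primary(host: str) -> bool:
    """Hypothetical health probe. A real check would open a connection
    and run a trivial query (e.g. SELECT 1) with a short timeout."""
    return True  # stubbed out so the sketch is self-contained

def promote_replica(replica: str) -> None:
    """Hypothetical promotion step. A real implementation would call a
    cluster manager (Patroni, Orchestrator, a cloud API, and so on)."""
    print(f"promoting {replica} to primary")

def watchdog(primary: str, replica: str) -> None:
    """Fail over only after several consecutive failures, so one dropped
    probe doesn't trigger an unnecessary, and risky, promotion."""
    failures = 0
    while True:
        failures = 0 if check_primary(primary) else failures + 1
        if failures >= FAILURE_THRESHOLD:
            promote_replica(replica)
            return
        time.sleep(PROBE_INTERVAL_SECS)
```

Real HA managers layer fencing, quorum, and split-brain protection on top of this loop, which is exactly the kind of added complexity discussed below.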
But those things obviously aren’t enough, given the outages we continue to witness and experience. The AWS and Cloudflare incidents are head-scratchers because one would think those companies know how to avoid such gaffes.
Highest possible availability?
All of the leading database vendors, and built-for-resiliency specialists like Cockroach Labs and Yugabyte, offer solutions meant to tamp down system outages: distributed databases, managed services, different types of replication, partitioning/sharding, clustering.
In one recent example, Oracle announced the general availability (GA) of Oracle Globally Distributed Autonomous Database, which it describes as follows:
“Oracle’s Globally Distributed Autonomous Database splits a single logical database into multiple physical databases (called shards) that are distributed across multiple data centers, availability domains, or regions. Faults in one shard do not affect others, enhancing overall availability. Automatic replication of shards across domains or regions provides protection from outages. The Oracle Globally Distributed Autonomous Database runs on fault-tolerant Exadata infrastructure for the highest possible availability.”
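Setting Oracle’s specifics aside, the core sharding idea fits in a few lines of Python. This is a generic illustration with made-up shard names; real routers use consistent hashing or a shard directory rather than a bare modulo:

```python
from zlib import crc32

# Hypothetical shard endpoints; the names and regions are made up.
SHARDS = ["shard-us-east", "shard-eu-west", "shard-ap-south"]

def shard_for(key: str) -> str:
    """Hash-based routing: each key deterministically lands on one shard,
    so a fault in one shard affects only its slice of the keyspace."""
    return SHARDS[crc32(key.encode()) % len(SHARDS)]

print(shard_for("customer:42"))  # always routes to the same shard
```

Replicating each shard across domains or regions, as the description above notes, is what underwrites that “highest possible availability” claim.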
“Highest possible availability” sounds good until the 0.001% chance of failure happens, the slim-but-real gap allowed by a “five 9s” (99.999%) service-level agreement.
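The arithmetic behind the nines is worth spelling out. A quick back-of-the-envelope calculation:

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours

for label, availability in [("three 9s", 0.999),
                            ("four 9s", 0.9999),
                            ("five 9s", 0.99999)]:
    downtime_min = HOURS_PER_YEAR * (1 - availability) * 60
    print(f"{label} ({availability:.3%}): {downtime_min:.1f} min/year")
```

Five 9s still allows about 5.3 minutes of downtime a year. By that math, Cloudflare’s eight-hour incident burned through roughly 90 years’ worth of a five-9s budget.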
We can’t expect perfection, but we must do better. AI-enabled self-healing databases are the next great hope.
AI to the rescue
Vendors are introducing AI capabilities to increase the resiliency and availability of their DBMSs. There are various ways that’s happening, as you can see from Perplexity.ai’s response to my query below.
Query: What are some examples of how AI capabilities are being used within database management systems to make them more resilient and highly available?
Response: AI is being embedded into modern databases to predict failures, automate recovery, and continuously tune systems so they stay online and performant with minimal human intervention.
Predictive failure detection
ML models analyze metrics such as latency, error rates, log patterns, and resource usage to forecast node or cluster problems before they cause outages.
Autonomous self-healing and repair
Autonomous databases combine health monitoring with machine learning to trigger automatic repair actions such as instance restarts, resource rebalancing, or configuration changes when risk thresholds are crossed.
Intelligent failover, backup, and recovery
AI systems can choose optimal backup windows and retention policies based on observed workloads and risk patterns, reducing the chance that restores impact peak usage. In distributed cloud databases, AI helps orchestrate replication and automated failover across regions or availability zones to maintain service during node, zone, or even regional failures.
AI-driven anomaly detection for resilience
Generative and discriminative models are used to detect anomalous workload behavior or access patterns that might signal misconfigurations, partial failures, or security incidents before they propagate. These models help maintain both availability and data integrity by flagging and sometimes isolating suspicious nodes or traffic without taking the entire cluster offline.
Adaptive performance and workload management
ML-based query optimizers learn from past executions to choose better plans, reduce resource contention, and avoid pathological plans that can stall or crash nodes under load. AI agents continuously tune indexes, caching, and placement of data/partitions, which can mitigate hot spots and improve resilience under skewed or bursty traffic.
> End of Perplexity response <
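To ground the first two of those ideas, predictive detection and automated repair, here is a minimal sketch in Python, assuming a stream of per-query latency samples. The rolling z-score test and the repair hook are illustrative stand-ins, not any vendor’s implementation:

```python
from collections import deque
from statistics import mean, stdev

WINDOW = 60        # recent samples that form the baseline
MIN_SAMPLES = 10   # don't judge until a baseline exists
Z_THRESHOLD = 4.0  # standard deviations that count as anomalous

class LatencyWatch:
    """Baseline recent query latency, flag outliers, and trigger a
    (hypothetical) repair hook before a slowdown becomes an outage."""

    def __init__(self) -> None:
        self.samples: deque[float] = deque(maxlen=WINDOW)

    def observe(self, latency_ms: float) -> None:
        if len(self.samples) >= MIN_SAMPLES:
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (latency_ms - mu) / sigma > Z_THRESHOLD:
                self.repair(latency_ms, mu)
        self.samples.append(latency_ms)

    def repair(self, latency_ms: float, baseline: float) -> None:
        # Stand-in for a real action: restart an instance, rebalance
        # load, or roll back a config change, as the response describes.
        print(f"anomaly: {latency_ms:.0f} ms vs. baseline {baseline:.0f} ms")

watch = LatencyWatch()
for ms in [12, 11, 13, 12, 14, 11, 12, 13, 12, 11, 12, 480]:
    watch.observe(ms)  # the 480 ms spike trips the detector
```

Production systems replace the z-score with learned models over many more signals, but the shape is the same: build a baseline, detect the deviation, act before it cascades.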
‘Added complexity’
Why does resilience continue to be such a pain in the DBMS derriere? In a word, complexity.
“The use of software-based resiliency strategies alongside physical failover/redundancy is undoubtedly contributing to overall improvements in availability,” according to a report by the Uptime Institute. “However, the added complexity brings its own challenges and can blur lines of responsibility for failures, complicating root cause analysis and outage classification.”
There’s more data, new data types, and more devices and AI agents generating data, so databases keep growing in size. By my estimate, there may be 10x, 100x, or even 1,000x more data to manage than when those early self-healing papers appeared.
Also, there are infrastructure dependencies — hardware, network, power supply. Other risks include software bugs, data or file corruption, and cyberattacks. And human error is a factor all too often. “The failure of staff to follow procedures has become an even greater cause of outages,” finds the Uptime Institute.
The conundrum is that even as complexity rises, so does the need for greater resiliency. We can only hope that Self-Healing Databases 2.0 do not take another 20 years to get here.