Distributing Databases and Data Globally
Beyond cloud zones and regions, a vision for 'regionless' data.
Welcome to the Cloud Database Report. I’m John Foley, a long-time tech journalist who worked in strategic comms at Oracle, IBM, and MongoDB. This blog and free newsletter are independent, unsponsored, and not affiliated with my current role as VP with Method Communications.
A couple years ago, I predicted that distributed data would become a business-level issue and strategy in more organizations. I believe we’ve reached that point.
Spreading data across cloud zones, regions, and data centers matters for lots of reasons: regulatory requirements around data residency and localization, disaster recovery/resilience, well-conceived data architecture (including multi-cloud), and low-latency app performance, to name a few.
All of this has been coming up a lot lately. Oracle is building 100 new cloud data centers, according to Data Center Dynamics. Why? Its customers need to process, manage, and store data in many places around the world. “Demand is enormous,” said Larry Ellison on Oracle’s Q2 2024 earnings call.
‘Regionless’ data architecture
And then there’s Cockroach Labs with its distributed SQL database. Distributing databases and distributing data are not one and the same, but they’re related. Database design, cloud architecture (hybrid, multi-cloud), application workloads, speed/latency, and replication all come into play in both cases.
Cockroach Labs’ target customers are chief architects who design and build distributed data infrastructure. When I met with Cockroach Labs CEO Spencer Kimball last October, we talked about multi-cloud and multi-region deployments, the challenges and opportunities of working with global customers, and data localization.
“It’s not easy for customers,” Kimball said. “You’re talking about a much larger area of concern—data sovereignty, domiciling, network costs, latencies.”
Kimball recounted being in a meeting where Microsoft CEO Satya Nadella described CockroachDB as not just multi-region, but “regionless”—a powerful idea. “So a customer doesn’t think of the world as regions anymore,” Kimball said. “It is an experience where those kinds of boundaries and the difficulties associated with them have been transcended.”
Few if any companies have reached this kind of distributed data Shangri-La, but it would be a strategic advantage for those that do. For global businesses, Kimball says, seamless data distribution across regions is “important to their businesses and their strategies.”
For more, see my interview with Spencer Kimball below.
Dynamic data movement
Informatica was a startup when I covered databases for InformationWeek magazine, beginning in the 1990s. Back then, Informatica was focused on data warehouses, ETL (extract, transform, load), and master data management.
It still does all that, but Informatica has transformed itself into a full-fledged cloud platform company. Its AI-powered Intelligent Data Management Cloud platform offers capabilities such as data integration, cloud connectivity, data governance, and dynamic data movement, all of which are essential for managing distributed data.
Check out Informatica’s Modern Data Architecture Center, with resources for anyone trying to get their arms around data management and distribution. It includes reference architectures for data mesh, data fabric, cloud data warehouses, data lakes, etc.
As all of this illustrates, data architecture can be pretty complex. Which is why automation, AI/ML, intelligent cloud services, and managed cloud services have become so essential—to simplify the underlying complexity. Or at least some of it.
How it’s done
Here’s a real-world example from Figma, a popular platform for collaborative design with a demanding data environment.
The databases team at Figma recently published an excellent blog post detailing a 9-month project to horizontally shard its Postgres database. (Sharding is the process of breaking a database into pieces to run across multiple servers. For more, see this FAQ on sharding from AWS.)
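To make the idea of sharding a bit more concrete, here is a minimal Python sketch of hash-based routing, where a row's key decides which shard it lives on. To be clear, this is an illustration only and not how Figma did it: the names `shard_for` and `ShardedStore` are hypothetical, and plain in-memory dictionaries stand in for separate database servers.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() so the mapping
    stays the same across processes and restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store; each dict stands in for one database server."""

    def __init__(self, num_shards: int) -> None:
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value: object) -> None:
        # Route the write to the single shard that owns this key.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str) -> object:
        # Reads go to the same shard, so no other shard is touched.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"file-{i}", {"owner": f"user-{i % 10}"})

# Show how many rows landed on each shard.
print([len(shard) for shard in store.shards])
```

Even this toy version hints at the hard part: changing the number of shards changes where most keys map, so real systems need a plan for moving data, which is part of why a project like Figma's takes months.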
Figma tackled this project because its database stack had grown 100X. Software engineer Sammy Steele, author of the post, explains how the team arrived at its go-ahead plan. They considered CockroachDB, TiDB, Google Cloud Spanner, and Vitess—all distributed database platforms—but ultimately decided to stick with their existing database, Amazon RDS for Postgres. They use AWS’s multi-zone capability, RDS Multi-AZ, for geo distribution and durability across availability zones.
I asked Steele why Figma decided to stick with RDS after evaluating some of the other options out there.
“Even with all these ‘managed’ distributed database solutions out there, when you get to a reasonable scale, engineers still end up having to develop a lot of expertise in their specific managed database of choice,” Steele said via email. “Things like latency and reliability profiles also differ drastically across different solutions. This makes migration overhead a lot higher between systems than people would expect.”
For example, Figma engineers had to dig into the source code while using RDS to understand and debug things like CPU spikes and connection pooling. “You’d think a managed solution ‘just works’ but at scale that is almost never true.”
Yep. As I said above, automation aims to simplify underlying complexity—but your IT team may still need to bring its own expertise to the party.
You can read Figma’s full blog post here: “How Figma’s databases team lived to tell the scale.”
Shout out to the Figma databases team for sharing their experience and learnings!
Further reading
What I’ve tried to illustrate here is that there can be many moving pieces when it comes to distributing databases and data. Global data centers, DBMS, data integration, sharding, and replication can all be aspects of a distributed strategy and architecture. This post really just scratches the surface; there are many ways to do it.
For more, some related resources:
“What is a distributed database and how do they work?” - Cockroach Labs
“Globally distributed database” - Oracle
“What is data replication?” - IBM
Watch for my upcoming blog post on a few new database startups.
And connect with me on LinkedIn.