A Million Times More Data. Yes, It's Coming.
Data architects must plan for exponential growth as data stores swell from petabytes to exabytes and beyond.
There was a time when a terabyte was considered “big data.” I once wrote a cover story for InformationWeek magazine titled “Towering Terabytes” that explored the emergence of terabyte-size data warehouses. That was 26 years ago.
These days, a terabyte (TB) isn’t a big deal. You can buy a 1TB storage drive for $20. Many organizations now manage petabytes (PB) of data, and some have exabytes (EB). An exabyte is a million times more than a terabyte.
AWS, Google Cloud, Microsoft, and Oracle all talk about exabyte-size data stores, and they’re showing up in more places. For example, Apple reportedly had 8EB of data in Google Cloud in 2021. And AWS has a data migration service, called Snowmobile, for moving exabytes—a 45-foot shipping container on wheels to physically transport the data from a customer’s data center to the AWS cloud.
That brings me to my golden rule of database scalability: Expect—and prepare for—a million times more data over the course of your career.
I’ve written about this trend before. I’ve actually witnessed 1,000,000x data growth—from terabytes to petabytes to exabytes—over the past 25 years or so. See the article below on Microsoft’s Scalability Day in 1997 for the full story.
I have no doubt that large enterprises will continue to experience comparable data growth, if not more. In fact, “we’re in the midst of it,” Chris Gladwin told me recently. Gladwin is Co-founder and CEO of Ocient, a startup that specializes in hyperscale data.
Petabytes and exabytes are being generated today by a new generation of devices, apps, and workloads: streaming data, unstructured data, geospatial data, machine learning, IoT, stock exchanges, telemetry, and much more.
Gladwin points to 5G cellular as the motherlode of big and bigger data. The total number of 5G subscriptions is expected to grow from 1 billion today to 5 billion by the end of 2028. “It causes the scale of everything in networking to increase 20x to 50x in one big jump,” he says.
Data units scale in multiples of 1,000: kilobytes, megabytes, gigabytes, terabytes, petabytes, exabytes, zettabytes, and yottabytes, each unit 1,000 times larger than the last. According to IDC, worldwide data is on track to grow to 175 zettabytes by 2025.
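To put those prefixes in perspective, here is a minimal sketch of the unit arithmetic. The 175ZB figure is IDC's projection quoted above; everything else is plain decimal (SI) unit math.

```python
# Decimal (SI) data units: each step is 1,000x the previous one.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * 1000 ** (UNITS.index(unit) + 1)

# An exabyte really is a million terabytes.
print(to_bytes(1, "EB") / to_bytes(1, "TB"))    # 1,000,000.0

# IDC's 2025 projection, expressed as a count of 1TB drives.
print(to_bytes(175, "ZB") / to_bytes(1, "TB"))  # 175,000,000,000 drives
```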
3 flavors of hyperscale
I caught up with Gladwin recently to get an update on Ocient, which is carving a niche at the high end of the data warehousing market. We last spoke 15 months ago when the Chicago-based company was still an early-stage startup working with a few early adopters.
Ocient has now advanced to growth stage, having just announced 171% year-over-year growth (though actual revenue was not disclosed). A few updates:
Ocient’s Hyperscale Data Warehouse is now available in the AWS Marketplace and the Google Cloud Marketplace.
Customers include Basis Technologies, a provider of workflow automation and business intelligence software for marketing and advertising; and MediaMath, a leader in ad-tech solutions.
Ocient released Version 21 of its platform in January. New capabilities include geospatial data analytics, secondary indexing, and machine learning data models.
The Ocient data warehouse is now available three ways: as a fully managed service on AWS or Google Cloud; hosted by Ocient in the OcientCloud; and on premises. Here’s an earlier post I wrote with some additional context.
Trillions of records
What’s most interesting about Ocient is its singular focus on extreme-size data warehouses—analytical workloads with trillions of records.
The starting point for an Ocient data warehouse is half a petabyte. On the AWS Marketplace, Ocient’s standard setup is 900TB of data storage and 240 CPU cores, at a list price of $625,000 for 12 months.
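As a rough way to read that list price, here is a back-of-envelope sketch using only the figures above (list price, 900TB, 240 cores, 12 months); actual pricing terms will of course vary.

```python
# Back-of-envelope on the AWS Marketplace listing quoted above.
list_price_usd = 625_000   # 12-month term
storage_tb     = 900
cpu_cores      = 240
months         = 12

print(list_price_usd / storage_tb / months)   # ~$57.87 per TB per month
print(list_price_usd / cpu_cores / months)    # ~$217 per CPU core per month
```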
Ocient is architected for performance at scale. The company’s Compute Adjacent Storage Architecture connects NVMe solid-state drives directly to CPUs over the PCIe bus. The original design point was a system that could perform 10 billion inserts a second, support a quadrillion rows in a table, and return query results in a few seconds.
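To get a feel for those design-point numbers, a quick back-of-envelope sketch. The insert rate and row count come from the paragraph above; the 100-byte row width is my own assumption, purely for illustration.

```python
# Rough scale check on the design-point figures quoted above.
inserts_per_second = 10_000_000_000          # 10 billion inserts/sec
rows_in_table      = 1_000_000_000_000_000   # one quadrillion rows

seconds = rows_in_table / inserts_per_second
print(seconds / 3600)   # ~27.8 hours to ingest a quadrillion rows at that rate

# Assuming a hypothetical 100-byte row, a quadrillion rows is ~100PB of raw data.
bytes_per_row = 100
print(rows_in_table * bytes_per_row / 1000**5)   # 100.0 petabytes
```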
Gladwin estimates there are only 500 to 1,000 enterprises that currently need this kind of hyperscale data warehouse, though I would expect that number to grow as more businesses look to analyze their ever-growing data estates. Typical use cases for such heavy-duty analysis are digital ad auctions, stock-ticker trending, and telecom network traffic. It’s also telling that In-Q-Tel, a VC firm with ties to U.S. intelligence agencies, is among Ocient’s investors.
Booming bits
Moore’s Law is another way to think about what I sometimes call “booming data.”
We’re all familiar with Moore’s Law, the idea that the number of transistors on a microprocessor doubles every 18 to 24 months, with a correlated increase in compute performance. Guess what: Gladwin says data must be able to scale at much the same pace.
And that seems to be what’s happening. By my rough calculation, data growth has been tracking loosely with Moore’s Law since 1997, the year Microsoft demoed 1TB on SQL Server. Here’s the math: a million-fold increase takes about 20 doublings, and 20 doublings at 18 months apiece is 30 years. So if a terabyte of data had started doubling every 18 months in 1997, it would reach an exabyte around 2027.
In other words, 1TB was enterprise scale in 1997, and 1EB will be enterprise scale within the next few years. That sounds about right to me.
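As a sanity check on that arithmetic, here is a minimal sketch of the doubling math, assuming a constant 18-month doubling period (a simplification, obviously).

```python
import math

# How many doublings does it take to grow a terabyte into an exabyte?
growth_factor = 1_000_000              # 1EB / 1TB
doublings = math.log2(growth_factor)   # ~19.9 doublings

# At one doubling every 18 months, that's about 30 years.
years = doublings * 1.5
print(doublings, years, 1997 + years)  # ~19.9 doublings, ~29.9 years, ~2027
```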
It’s an inexact science, no doubt. The point I really want to make is that, while it’s common to talk about data volume growing 2x or 10x, data strategists with a long-term view should be thinking about 1,000x or 1,000,000x.
Because database history shows us that’s what happens. Already, storage vendors are thinking beyond exabytes to zettabytes.
Bottom line: Envision architectures and databases that can scale years into the future.