AI to Drive 'Significant Increase' in Data Storage Requirements
The need for storage and memory is expected to grow even as the CapEx ratio declines, according to this researcher
Hello and welcome to the Cloud Database Report. I’m John Foley, a long-time tech journalist, with 18 years at InformationWeek, who then worked in strategic communications at Oracle, IBM, and MongoDB. I invite you to subscribe, share, comment, and connect with me on LinkedIn.
The gist:
AI will drive a significant increase in data center storage capacity, according to ASU researcher Zhichao Cao
High-performance storage in AI data centers is primarily for unstructured data
Compared to GPUs, the cost of storage is negligible
Model training requires ultra-fast storage
Inferencing demands low-latency access to precomputed models and reference data
Disaggregation: what’s old is new again
We hear a lot about data processing in newly designed AI data centers with 100,000-plus advanced GPUs. But where and how will the many petabytes of AI-generated data be stored?
The separation of data processing and data storage goes back 50 years — to client/server, Amazon S3, data warehouses, etc. Will the next-gen AI data centers under construction today reflect that same architectural philosophy?
To get an answer to that question, I reached out to Zhichao Cao, an assistant professor at Arizona State University, who was recently awarded a five-year grant by the National Science Foundation to research the design of key-value stores (a widely used NoSQL database type) in disaggregated data centers. In other words, Professor Cao is an expert on the separation of processing and storage in a modern context.
Prior to his current work, Cao was a research scientist at Facebook, where he worked on RocksDB, an open source key-value database that is optimized for fast storage.
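RocksDB belongs to the family of log-structured merge (LSM) tree stores: writes first land in an in-memory memtable, which is periodically flushed to sorted, immutable files on disk. The toy Python sketch below illustrates that general idea only; it is not RocksDB's actual API or implementation.

```python
import bisect

class TinyLSM:
    """Toy LSM-style key-value store: an in-memory memtable that
    flushes to sorted, immutable 'runs', echoing RocksDB's design."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}       # mutable in-memory writes
        self.runs = []           # sorted, immutable (key, value) lists
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Persist the memtable as a sorted run; newest runs go first.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search runs newest-to-oldest, binary search within each run.
        for run in self.runs:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Real LSM engines add compaction, bloom filters, and a write-ahead log on top of this skeleton, which is what makes them well suited to the fast NVMe storage discussed below.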
His work caught my attention in an article titled “AI Drives Need for Real Data Storage Innovations.” Following is our interview, conducted by email.
Q&A with Professor Cao
Q: We see a lot of CapEx investment in new data centers with GPUs for AI training, inference, and model development. But what about data storage? Will the storage ‘footprint’ in these AI data centers be more, or possibly less, than in traditional data centers?
Cao: If we compare the total capacity of storage devices (especially high-performance storage devices like NVMe SSDs) and memory deployed in AI data centers, we observe a significant increase. Unlike traditional data centers, high-performance storage in AI data centers is primarily used for unstructured data, such as images, videos, training data, and logs. These are managed through distributed key-value stores, disaggregated storage systems, or distributed file systems to support training and inference workloads.
However, the CapEx ratio for storage and memory has significantly decreased in AI data centers, mainly due to the extremely high cost of GPUs. Considering the price difference — $20K to $30K for a high-end GPU compared to approximately $500 for a 4–8 TB NVMe SSD or $200 to $300 for 64GB of DDR5 memory — the expenditure on storage and memory becomes almost negligible.
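Cao's point about the shrinking CapEx ratio can be made concrete with a quick back-of-the-envelope calculation using the price figures he quotes. The server configuration below (8 GPUs, 8 SSDs, 2 TB of DRAM) is a hypothetical example, not a real bill of materials.

```python
# Back-of-the-envelope CapEx split, using the figures quoted above.
gpu_cost = 25_000   # midpoint of $20K-$30K per high-end GPU
ssd_cost = 500      # one 4-8 TB NVMe SSD
dram_cost = 250     # midpoint of $200-$300 per 64 GB DDR5 module

# Hypothetical 8-GPU training server with 8 SSDs and 2 TB of DRAM.
gpus = 8 * gpu_cost                  # $200,000
storage = 8 * ssd_cost               # $4,000
memory = (2048 // 64) * dram_cost    # 32 modules -> $8,000

total = gpus + storage + memory
share = (storage + memory) / total
print(f"storage+memory share of server CapEx: {share:.1%}")
```

Under these assumptions, storage and memory together come to under 6% of the server's cost, which is why the storage ratio looks negligible even as absolute capacity grows.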
Q: Do you expect AI to drive growth/demand for data storage? I’m hypothesizing that AI development and apps will generate more data, but I have not found industry experts who can confirm that. I've asked Salesforce, Oracle, and Google Cloud, but they did not have insights on this question.
Cao: I expect AI to drive significant growth in data storage demand. At the current stage, generative AI models like ChatGPT primarily focus on text generation, with most training data also being text-based, so there is not yet an explicit demand for more storage capacity.
However, as AI continues to evolve, the generation of images and even videos — such as OpenAI’s Sora — will become more prevalent. This shift will significantly increase the need for high-performance storage solutions capable of handling large-scale photo and video datasets. Fast and efficient storage will be essential not only for collecting and managing training data but also for storing and retrieving the vast amounts of generated multimedia content.
Q: In the work that you do, how might AI change how cloud hyperscalers (AWS, Google Cloud, Microsoft Azure, Oracle) plan and architect for data storage?
Cao: My research focuses on redesigning and optimizing data systems, including key-value stores and database storage engines, for disaggregated data centers to support a wide range of applications, including big data and AI. Traditionally, data centers relied on monolithic servers to handle various workloads. However, at different stages of AI applications — such as data collection, preprocessing, training, inferencing, and management — the demands for compute, memory, and storage resources vary significantly.
With the rapid advancement of AI, cloud hyperscalers are increasingly adopting disaggregated architectures to improve resource utilization and efficiency. Storage disaggregation, in particular, has become a standard practice among cloud providers, enabling fine-grained heterogeneous storage solutions — ranging from HDD-based archival storage to high-performance NVMe SSDs for AI training workloads. For example, AI model training requires ultra-fast storage to keep GPUs fully utilized, while inferencing workloads demand low-latency access to precomputed models and reference data. Moreover, as resource disaggregation evolves, managing and deploying applications in such environments becomes more complex.
My research also explores how AI can enhance data storage management efficiency. For instance, we investigated the use of large language models (LLMs) like ChatGPT for automatic key-value store configuration and tuning (Best Paper Award at ACM HotStorage 2024). Cloud hyperscalers might also invest in and explore AI-driven optimizations and management to make modern data storage more adaptive, intelligent, and efficient.
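To make the LLM-tuning idea concrete, here is a schematic of the loop such a system might run: describe workload statistics in a prompt, ask an LLM for candidate options, then validate before applying. This is not the method from Cao's HotStorage paper; `llm_suggest` is a stub standing in for a real model call, and the validation whitelist is an assumption for illustration (though `write_buffer_size` and `max_background_jobs` are genuine RocksDB options).

```python
import json

def llm_suggest(prompt: str) -> str:
    """Stub standing in for a real LLM API call. Returns a canned
    JSON answer so the tuning loop below is runnable."""
    return json.dumps({"write_buffer_size": 128 * 1024 * 1024,
                       "max_background_jobs": 8,
                       "made_up_option": "ignore me"})

# Only accept options we recognize, with the expected type.
ALLOWED = {"write_buffer_size": int, "max_background_jobs": int}

def tune(workload_stats: dict) -> dict:
    """One round of LLM-assisted tuning: prompt with workload stats,
    parse the reply, and keep only known, well-typed options."""
    prompt = ("Suggest RocksDB options as JSON for this workload: "
              + json.dumps(workload_stats))
    suggested = json.loads(llm_suggest(prompt))
    return {k: v for k, v in suggested.items()
            if k in ALLOWED and isinstance(v, ALLOWED[k])}
```

The validation step matters: an LLM can hallucinate option names or unsafe values, so suggestions must be filtered and benchmarked before being applied to a production store.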
Market validation
My thanks to Zhichao Cao for sharing his insights. They confirm my line of thought that AI will be a catalyst for new data types, including multimodal, and much more data.
As I have noted in previous posts, enterprise data stores are forecast to more than double over the next two years. So there’s a growing consensus that more AI will lead to more data.
Note: I will have a follow-on post in the days ahead on Prof. Cao’s research into LLM inferencing and memory/storage, a topic that has come to the forefront in the wake of DeepSeek’s performance. So, more to come.
Further reading
“Can Modern LLMs Tune and Configure LSM-based Key-Value Stores?,” ASU
“DeepSeek AI Will Increase Data Storage and Make It More Accessible,” by Thomas Coughlin, Forbes
“AI Data Explained: Vectors, Multimodal, Agentic, Synthetic,” Cloud Database Report