Report: OpenAI Is Shopping for 5 Exabytes of Data Storage
How much data does it generate in its R&D? ChatGPT explains.
Hello and welcome to the Cloud Database Report. I’m John Foley, a long-time tech journalist (including 18 years at InformationWeek) who went on to work in strategic comms at Oracle, IBM, and MongoDB. I invite you to subscribe, share, comment, and connect with me on LinkedIn.
OpenAI has been in discussions to purchase billions of dollars’ worth of data storage hardware and software, and to build its first data center to house the gear. It’s in the market for up to 5 exabytes of storage capacity, roughly a thousand times more data than you might find in a large enterprise.
This is according to reporting by Anissa Gardizy of The Information. The article, “OpenAI Discusses Building Its First Data Center for Storage,” requires a subscription.
OpenAI is looking to deploy as much as 5 exabytes of storage in the data center, according to The Information. An exabyte is a thousand petabytes, and the petabyte is itself a common yardstick for data volume in large enterprises.
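To make the scale concrete, here’s a quick back-of-the-envelope calculation. The 30 TB drive capacity is an illustrative assumption on my part, not a detail from The Information’s reporting:

```python
# Back-of-the-envelope: how big is 5 exabytes?
# Uses decimal (SI) units: 1 EB = 1,000 PB = 1,000,000 TB.

EB = 10**18          # bytes in an exabyte
PB = 10**15          # bytes in a petabyte
TB = 10**12          # bytes in a terabyte

capacity = 5 * EB    # the reported upper bound of OpenAI's shopping list

print(f"{capacity / PB:,.0f} petabytes")        # 5,000 petabytes
print(f"{capacity / TB:,.0f} terabytes")        # 5,000,000 terabytes

# Hypothetical: if filled with 30 TB hard drives (an assumed size),
# that's roughly 167,000 drives before any redundancy or overhead.
drive_size = 30 * TB
print(f"{capacity / drive_size:,.0f} drives")   # 166,667 drives
```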
“Such a deal would make OpenAI one of the world’s biggest storage customers overnight and would reflect its desire to have more control over data that are critical to developing artificial intelligence,” Gardizy writes, citing three unnamed sources.
Big picture: OpenAI’s storage data center would be part of the $500 billion Stargate AI infrastructure project announced in January with Oracle and SoftBank.
A deal has not been finalized, and OpenAI may just be looking for negotiating leverage, The Information notes. Even so, the mere fact that the company is shopping at this scale underscores just how data-hungry AI has become.
“Data storage is critical to developing complex AI models,” The Information sums up.
The AI data continuum
Regular readers of the Cloud Database Report shouldn’t be surprised that OpenAI is shopping for storage. We’ve seen this coming.
“I’m curious about how much new data will be generated by AI systems and applications, and I believe it’s going to be a big multiple,” I wrote last August. “Maybe 10, 100, or even 1,000 times more data within the next few years. Today’s petabytes could be tomorrow’s AI-fueled exabytes.”
And so it is.
Until now, much of the industry discussion has focused on how AI training consumes vast amounts of data from across the Web. But it’s becoming increasingly clear that AI systems and apps also generate data at unprecedented scale.
We don’t know what kinds of data OpenAI needs to store and manage or how it’s being used. So, I asked ChatGPT the following question: “In addition to using data for training and inference, how much data does OpenAI create in its research and development?”
Following is ChatGPT’s full response (in quotation marks):
“OpenAI generates a significant amount of data during its research and development process, beyond just the data used for training and inference. This data is crucial for refining and improving AI models. Here’s a breakdown of the types of data OpenAI generates as part of its R&D:
1. Experimental Data
Model performance data: During model development, OpenAI conducts various experiments to test how well the models perform on different tasks, benchmarks, and datasets. The data generated includes the outputs from the models, performance metrics (accuracy, precision, recall), and error analysis.
Model evaluations: OpenAI tests models on specific tasks (e.g., translation, summarization, question answering) and records detailed data on how the models perform, identifying weaknesses and areas for improvement.
2. Internal Testing Data
Quality Assurance (QA) and Stress Testing: OpenAI runs many internal tests to ensure that models function as expected in real-world scenarios. These tests might involve generating and analyzing model outputs for various input prompts. The data from these tests includes logs of model outputs, system performance, and feedback from testers.
3. Feedback Data
User feedback: OpenAI collects feedback from users who interact with its products, such as ChatGPT. This data helps identify problems or limitations in the models and provides insights into real-world use cases.
Human feedback for fine-tuning: OpenAI employs a technique called Reinforcement Learning from Human Feedback (RLHF), where human evaluators rate the quality of model outputs. This data is used to fine-tune models and improve performance on more subjective tasks like generating creative content or answering ethically sensitive questions.
4. Data from Collaborative Research
Partnerships and collaborations: OpenAI collaborates with academic institutions, research organizations, and companies to develop new techniques in AI. Data generated from these collaborations might include shared datasets, experimental results, and findings from joint research papers.
Public datasets: When OpenAI conducts research, it often releases datasets to the public or collaborates with the open-source community to make data available. For example, datasets related to NLP (Natural Language Processing) tasks or image recognition might be shared to foster wider advancements in the field.
5. Model Training Data and Logs
Training runs: OpenAI runs many experiments with different training configurations (hyperparameters, architectures, etc.). The logs and data from these runs are often stored and analyzed to determine the optimal configurations for specific tasks or models.
Dataset creation and curation: OpenAI also creates or curates specialized datasets to train models for particular tasks. For example, they may curate datasets that focus on language understanding, coding, or image generation.
6. Monitoring and Deployment Data
Deployment and operational data: Once models are deployed, OpenAI monitors their performance in real-time. This includes collecting data on how the models are being used, how quickly they respond, and whether they make errors or encounter biases. Such operational data is used to improve the overall performance of the models.
Data from system logs: Logs from running models in production (such as on the OpenAI API) are generated during normal operations. These include metadata about the inputs and outputs as well as performance statistics for system health and debugging.
7. Research Publications and Analysis
Scientific papers and models: Data and findings from OpenAI's research are often published in papers. These include experimental data, model architectures, algorithms, and case studies that others in the AI community can use to advance their own work.
8. Synthetic Data Generation
Generating synthetic data: OpenAI sometimes creates synthetic data to augment existing datasets or to simulate situations that would be difficult to test with real-world data. This synthetic data is particularly useful in situations where real data is scarce or sensitive (such as medical data).
9. Data for Ethics and Safety Research
Bias detection and mitigation: OpenAI also generates data to study the ethical implications of AI, such as datasets designed to test models for biases, fairness, and alignment with human values. This data helps OpenAI understand how AI models can be made safer and more ethically responsible.
Conclusion
In total, OpenAI generates a large volume of data throughout its research and development process. This data serves to enhance the performance, robustness, and safety of AI models, while also ensuring that they align with ethical standards and provide real-world value. The scale of this data collection and generation depends on the scope of the research, the models being trained, and the feedback and real-world usage that OpenAI receives.
OpenAI carefully manages and utilizes this data to continually improve its models and to innovate in the AI space.”
End quote.
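ChatGPT’s list keeps coming back to logs and artifacts: model outputs, training-run telemetry, production metadata, checkpoints. As a thought experiment, here’s a rough sketch of how fast that kind of operational data could pile up. Every number in it is a hypothetical assumption of mine, not a figure from OpenAI or The Information:

```python
# Thought experiment: where could exabytes of R&D data come from?
# Every number below is a hypothetical assumption for illustration,
# not a figure from OpenAI or The Information.

PB = 10**15

# 1. Text logs from inference (inputs, outputs, metadata)
requests_per_day = 1_000_000_000     # assumed daily requests
tokens_per_request = 2_000           # assumed prompt + response
bytes_per_token = 4                  # assumed text + metadata overhead
log_bytes_per_year = (requests_per_day * tokens_per_request
                      * bytes_per_token * 365)
print(f"Inference logs: ~{log_bytes_per_year / PB:.0f} PB/year")   # ~3 PB/year

# 2. Model checkpoints from experimental training runs
params = 10**12                      # assumed 1-trillion-parameter model
checkpoint_bytes = params * 2        # 2 bytes per parameter (bf16): 2 TB each
checkpoints_per_year = 10_000        # assumed, across many runs
checkpoint_total = checkpoint_bytes * checkpoints_per_year
print(f"Checkpoints: ~{checkpoint_total / PB:.0f} PB/year")        # ~20 PB/year

# Under these assumptions, text logs and checkpoints land in the tens of
# petabytes per year; multimodal training data (video especially) and
# synthetic data generation are presumably what push totals toward exabytes.
```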
Zettabytes are next
So the picture is becoming clearer: OpenAI, and no doubt many other “frontier” AI companies, need storage on both ends of the continuum, input and output. I’m sure the rise of agentic AI will add fuel to the fire.
Two big unanswered questions: How much new data will AI generate? And how fast? Beyond exabytes, there are zettabytes and yottabytes.
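For a sense of what those prefixes mean: a zettabyte is a thousand exabytes, and a yottabyte is a thousand zettabytes. Here’s a small, purely illustrative projection showing how quickly an assumed growth rate climbs that ladder; the doubling rate is my assumption, not a forecast:

```python
# Purely illustrative: compound growth up the storage-unit ladder.
# The annual doubling rate is an assumption, not a forecast.

EB, ZB = 10**18, 10**21

storage = 5 * EB        # start at the 5 EB figure from The Information's report
growth_per_year = 2.0   # assumed: storage demand doubles annually

for year in range(1, 11):
    storage *= growth_per_year
    if storage >= ZB:
        print(f"Year {year}: {storage / ZB:.1f} ZB")
    else:
        print(f"Year {year}: {storage / EB:,.0f} EB")

# Doubling annually, 5 EB crosses the zettabyte threshold in year 8.
```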
For more on AI data, see my previous posts below.
From August 2024:
From February 2025: