AI Data Explained: Vectors, Multimodal, Agentic, Synthetic
New data types, complex interdependencies, and specialized models are emerging
Hello and welcome to our many new subscribers! I’m John Foley, a long-time tech journalist, including 18 years at InformationWeek, who then worked in strategic comms at Oracle, IBM, and MongoDB. I invite you to subscribe, share, comment, and connect with me on LinkedIn.
The gist:
This post is about the composition of data going into AI systems and generated by them
LLMs are increasingly becoming multimodal
AI agents and reasoning are data intensive
Some AI uses fake data but it’s not as good as the real thing
AI is driving even greater levels of data volume
I’ve used GenAI to assist in compiling this post, where noted
The word “data” traces its modern meaning — “storable and transmittable information in computing” — to 1946, according to Perplexity. So we’ve been dealing with computer data for ~79 years.
How is data changing in the fast-emerging world of artificial intelligence?
I’m talking about AI data here. The more common phrase is “data and AI,” but I’m marrying the two in reference to interdependencies that I will explain below.
Let’s start by putting some chalk lines on the playing field. Here’s what Anthropic’s Claude offers as some of the most common AI data types:
Structured - databases, spreadsheets, time series, numerical arrays
Unstructured - text, images, audio, video, sensor
Semi-structured - JSON, XML, email, log files
Multimodal - combined text/image, audio/visual, mixed sensor data
Synthetic - AI-created content, simulated environments, augmented datasets
Metadata - tags, timestamps, categories, relations
That’s a start, but incomplete. Claude seems to have overlooked vectors, which are one of the hottest new AI data types. Vector embeddings — text, images, and other kinds of data represented as long arrays of numbers — are being adopted by more organizations for use cases such as similarity search, recommendations, and image matching.
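To make the idea concrete, here’s a minimal sketch of similarity search over embeddings. The vectors below are tiny hand-made stand-ins; in practice, embeddings come from a model and run to hundreds or thousands of dimensions.

```python
# Minimal sketch of vector similarity search. The catalog items and
# three-dimensional vectors are illustrative stand-ins; real embeddings
# are produced by an embedding model and are far higher-dimensional.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

catalog = {
    "red sneaker":  [0.9, 0.1, 0.0],
    "blue sneaker": [0.8, 0.3, 0.1],
    "wool sweater": [0.1, 0.2, 0.9],
}

def most_similar(query_vec, catalog):
    """Return the catalog item whose embedding is closest to the query."""
    return max(catalog, key=lambda name: cosine_similarity(query_vec, catalog[name]))
```

A vector database does essentially this at scale, with indexes (such as HNSW) that avoid comparing the query against every stored vector.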
From traditional to advanced multimodal
Much of what’s happening involves processing and combining data types in new ways, which brings us to multimodal data (see list above).
Multimodal data and databases have been around for many years. I’ve been writing about them since at least the mid-1990s when Informix introduced its Universal Server, an object-relational database management system designed to support many different data types. (Informix was acquired by IBM in 2001.)
Today, large language models (LLMs) are both consuming and generating multimodal data. The term being used for these next-gen applications and use cases is “advanced multimodal.” Here’s how Claude describes it:
What makes multimodal data “advanced” often involves:
Complex interdependencies between different modes
High dimensional data requiring sophisticated processing
Real-time streaming of multiple data types
Need for specialized AI/ML models to handle different modes simultaneously
Challenges in alignment and synchronization across modes
Requirements for specialized storage and processing infrastructure
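One item on that list, alignment and synchronization across modes, is easy to sketch. The dataclass and the 50-millisecond tolerance below are hypothetical illustrations, not taken from any particular system.

```python
# Illustrative sketch of one multimodal challenge: keeping modalities
# synchronized. The MultimodalFrame structure and the 50 ms tolerance
# are hypothetical examples, not a real system's schema.
from dataclasses import dataclass

@dataclass
class MultimodalFrame:
    text: str          # e.g., a transcript snippet
    audio_ts_ms: int   # capture timestamp of the audio chunk
    video_ts_ms: int   # capture timestamp of the video frame

def is_aligned(frame, tolerance_ms=50):
    """Treat modes as synchronized if their timestamps differ by <= tolerance."""
    return abs(frame.audio_ts_ms - frame.video_ts_ms) <= tolerance_ms

frame = MultimodalFrame(text="hello", audio_ts_ms=1000, video_ts_ms=1030)
```

Drift beyond the tolerance is the kind of problem that forces the specialized streaming and storage infrastructure the list mentions.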
Increasingly, such capabilities are manifest in the latest LLMs. For example, at the release of Gemini 2.0 in December, Google CEO Sundar Pichai referred to “new advances in multimodality, like native image and audio output.”
OpenAI’s new Sora generates video from text, images, and video. Other multimodal models include Microsoft’s Kosmos and Google DeepMind’s Perceiver IO.
The quick take: There’s a rich, varied, dynamic blending of data inputs and outputs in AI that, in addition to its awesome potential for creative work, can be a powerful tool in the hands of data engineers, data scientists, and everyone else.
Reasoning & agentic loops
The next big thing is agentic data — the data that powers AI agents and gets generated by them. As more tech teams build and deploy AI agents, they need data pipelines that feed the agents so they can perform their tasks accurately and effectively, while simultaneously learning and adapting. I think of agentic data as what gives agents their smarts, as well as the data that agents generate through their actions and interactions, which is sure to have considerable business value.
Agentic data tends to be highly complex. I will use Salesforce as an example because I’m familiar with its Agentforce strategy. Underlying Salesforce agents is the company’s Data Cloud, which aggregates data from hundreds of sources. Recent enhancements include native processing of audio and video; a semantic data model that enables agents and humans to use data consistently; and real-time data activations.
There’s also a Salesforce reasoning engine called Atlas that enables agents to learn, develop self-awareness, and reason, along with “advanced retrievers” that dig deep within sources for detailed info. The reasoning engine processes data in an agentic loop, a continuous cycle of evaluation and refinement.
Here’s how Salesforce describes its agentic loop:
“The Atlas Reasoning Engine breaks down the initial prompt into smaller tasks, evaluating at each step and proposing a plan for how to proceed. For example, if an agent is handling a customer inquiry, it identifies the intent, searches for relevant data, creates a plan of action, and evaluates the effectiveness of the action. If unsatisfactory, the agent continues to adapt and refine the plan by asking for additional information. This ensures that the initial prompt can be completed accurately.” — Salesforce
In other words, agentic data is a function of continuous processing and analysis. The end result is an autonomous, self-driving AI agent. It looks simple to the user or customer, but the underlying data alchemy is highly sophisticated.
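The plan-act-evaluate-refine cycle Salesforce describes can be sketched in a few lines. To be clear, this is not the Atlas Reasoning Engine; the function names and the crude keyword-based evaluation are illustrative stand-ins for what a real agent does with tool calls and model-based judgment.

```python
# Generic sketch of an agentic loop (plan -> act -> evaluate -> refine).
# NOT Salesforce's Atlas engine: all function bodies here are toy
# stand-ins for real planning, tool execution, and evaluation.

def plan(goal, context):
    """Break the goal into a next step, given what we've learned so far."""
    return f"search for data about: {goal} {' '.join(context)}".strip()

def act(step):
    """Pretend to execute the step; a real agent would call tools or APIs."""
    return f"results for [{step}]"

def evaluate(result, goal):
    """Crude success check: did the result address the goal at all?"""
    return goal in result

def agentic_loop(goal, max_iterations=5):
    context = []
    for _ in range(max_iterations):
        step = plan(goal, context)
        result = act(step)
        if evaluate(result, goal):
            return result            # satisfactory: stop and answer
        context.append(result)       # unsatisfactory: refine with new info
    return None                      # give up after the iteration budget
```

Note the loop generates data as a side effect — every step, result, and evaluation is a record — which is exactly why agentic systems are so data intensive.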
Synthetic data like a snow machine
One of the great ironies of AI is that, while we thought the world was awash in too much data — petabytes, exabytes, zettabytes, and eventually yottabytes — the opposite is true. There’s actually not enough high-quality data to advance LLMs to the next level.
So the people building frontier models are now using computer-generated synthetic data to satisfy this insatiable appetite. Synthetic data is like a snow machine on a ski slope; it makes up where the real world falls short.
Artificial data is being used to train some AI models, but it has limitations. It may introduce bias or inaccuracies, and it lacks the complexity and nuance of authentic data. That’s why synthetic data is often refined with human expertise — for example, through reinforcement learning from human feedback (RLHF), in which human ratings steer a model toward better outputs.
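A toy example shows both the appeal and the limitation. The generator below fits simple statistics to a small “real” dataset and samples new records from them. Real synthetic-data pipelines (GANs, diffusion models, LLM generation) are far more sophisticated, but the constraint is the same: the output can only reflect patterns present in the source data.

```python
# Toy synthetic-data generator: fit basic statistics to a small "real"
# dataset, then sample new records from those statistics. The age data
# is invented for illustration.
import random
import statistics

real_ages = [34, 41, 29, 52, 38, 45, 31, 47]

def make_synthetic(real, n, seed=0):
    """Sample n synthetic values from a Gaussian fit to the real data."""
    rng = random.Random(seed)     # fixed seed for reproducibility
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    return [round(rng.gauss(mu, sigma)) for _ in range(n)]

synthetic_ages = make_synthetic(real_ages, 1000)
```

Any bias or gap in `real_ages` — say, no one under 29 — is faithfully reproduced a thousand times over, which is the core risk the paragraph above describes.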
Data stores to double
This brings me to a universal truth — the fact that data volumes keep growing year after year. AI will compound that. The chart below, created by ChatGPT, shows the trend line over the past 79 years, going back to when the term “data” came into vogue. What you see here isn’t a projection, but an approximation of what has already happened.
Now, AI is fueling more data growth, much like the Internet and mobile devices did in previous tech waves. I have floated the idea that data volumes could grow by 10x, 100x, or more, driven by the rise of vector data and new AI applications. For evidence, just look at the AI infrastructure buildout underway, with computer clusters of 100,000+ GPUs, and think about the end result of all that processing — more data.
My original blog post, below, lays out a few of the reasons I believe this will happen, including hefty increases in CapEx spending by the hyperscalers and forecasts for a strong rebound in data-storage spending.
Since I wrote that post, there has been even more evidence of rising demand for data storage driven by AI. ITPro reports that data stores for large enterprises are expected to more than double from 150 petabytes to more than 300 petabytes by the end of 2026. See, “AI is causing a data storage crisis for enterprises.”
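As a back-of-the-envelope check on the ITPro figure: doubling from 150 petabytes to 300 petabytes implies a steep annual growth rate. The two-year horizon below is my assumption, based on the end-of-2026 target.

```python
# Sanity check on the ITPro projection: what annual growth rate turns
# 150 PB into 300 PB? The two-year horizon is my assumption.

def implied_annual_growth(start, end, years):
    """Compound annual growth rate implied by growing start -> end over years."""
    return (end / start) ** (1 / years) - 1

rate = implied_annual_growth(150, 300, 2)
print(f"{rate:.1%}")  # prints 41.4% -- the yearly growth needed to double in two years
```

That ~41% a year is well above historical enterprise storage growth, which is consistent with the “storage crisis” framing in the headline.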
Storage vendor Seagate’s stock price has been on the upswing on the expectation of strong demand, according to this contributed article on the Forbes site.
All of this AI-generated data will require new thinking about data storage architecture. My next post will delve into the design and optimization of data storage in AI data centers.
Deeper reading
“The Rise of Agentic Data Generation,” by Maxime Labonne @ Hugging Face
“R1 is reasoning for the masses,” by Charlie Guo, Artificial Ignorance
“The Agentic Era: After the Dawn, Here’s What to Expect,” by Silvio Savarese, Chief Scientist with Salesforce AI Research