Data infrastructure and AI infrastructure are different disciplines

I lead architecture for a platform that is half data infrastructure and half AI infrastructure, and the most useful design decision we made was refusing to pretend those are the same discipline. They optimize for different things, they fail in different ways, they evolve at different speeds, and a platform that blurs them inherits the worst properties of both.

Data infrastructure is a discipline of correctness over time. Ingestion, storage, transformation, semantics, serving: the values it optimizes are durability, consistency, governance, and the ability to answer the same question the same way next year. Its artifacts are schemas, contracts, and lineage. Its enemies are silent corruption and drift. When data infrastructure fails, it fails quietly and compounds: a wrong number propagates through every downstream system that trusted it, and the cost of the failure grows with the distance from its source.

AI infrastructure is a discipline of iteration under nondeterminism. Retrieval, agent execution, evaluation, inference: the values it optimizes are iteration speed, measurability, and graceful behavior when the component at the center of the system is probabilistic. Its artifacts are eval sets, prompts, and model configurations, all of which have a shelf life measured in months. Its enemies are unmeasured regression and false confidence. When AI infrastructure fails, it fails loudly and individually: a wrong answer to a user, visible immediately, traceable to a single interaction.

Those profiles are close to opposites. One side wants stability and resists change; the other side is rebuilt every time a better model ships. One side’s correctness is binary and provable; the other side’s correctness is statistical and argued from evaluation. Treating them as one discipline produces familiar pathologies. Teams apply data-engineering change control to prompts and model upgrades, and iteration dies under process designed for schema migrations. Or, more often, teams apply AI-engineering looseness to data, and the foundation rots: pipelines with no contracts, semantics that live in a prompt, lineage that ends at the context window.
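The difference between binary and statistical correctness can be made concrete. The following is a minimal sketch, not our platform's code: the column contract, `EvalCase`, and the 0.02 tolerance are all hypothetical names and values chosen for illustration. The data-side check either holds or it does not; the AI-side check is a pass rate that only means something relative to a baseline.

```python
from dataclasses import dataclass
from typing import Callable

# Data side: a hypothetical column contract. A row satisfies it or it does not.
REQUIRED_COLUMNS = {"order_id": int, "amount_cents": int, "currency": str}


def satisfies_contract(row: dict) -> bool:
    """Binary check: every required column is present with the declared type."""
    return all(
        name in row and isinstance(row[name], expected)
        for name, expected in REQUIRED_COLUMNS.items()
    )


# AI side: correctness is a pass rate over an eval set, judged against a baseline.
@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str


def pass_rate(answer: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases where the system's answer matches the expected one."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for c in cases if answer(c.question).strip() == c.expected)
    return passed / len(cases)


def regressed(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the candidate falls more than `tolerance` below baseline."""
    return candidate < baseline - tolerance
```

The asymmetry shows up in the return types: `satisfies_contract` returns a verdict, while `pass_rate` returns a number that still has to be argued about.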

The architectural consequence is a clean line. On our platform, roughly fifteen open-source data and AI systems had to become one coherent product, and the only way that holds together is contract-first boundaries: every subsystem behind an explicit API surface, one identity flowing through every layer, no subsystem reaching into another’s internals. The line is what keeps the fast-moving half from destabilizing the slow-moving half. It is also what keeps subsystems swappable, which in the AI half is a survival requirement, because the best component for any AI job changes every couple of quarters.
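A contract-first boundary can be sketched in a few lines. This is an illustration under stated assumptions, not our actual API: `Identity`, `Retriever`, the two implementations, and the `reader` role are hypothetical names. The point is that the caller is written against the contract, identity is an explicit argument at every boundary, and the implementation behind the contract can be swapped without touching the caller.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Identity:
    """The caller's identity, passed explicitly across every boundary."""
    user_id: str
    roles: frozenset[str]


class Retriever(Protocol):
    """The contract: callers depend on this surface, never on an implementation."""

    def search(self, identity: Identity, query: str, k: int) -> list[str]: ...


class KeywordRetriever:
    """One hypothetical implementation: naive substring match."""

    def __init__(self, docs: list[str]) -> None:
        self._docs = docs

    def search(self, identity: Identity, query: str, k: int) -> list[str]:
        # A real implementation would also filter results by identity.
        return [d for d in self._docs if query.lower() in d.lower()][:k]


class PrefixRetriever:
    """A second hypothetical implementation, swappable behind the same contract."""

    def __init__(self, docs: list[str]) -> None:
        self._docs = docs

    def search(self, identity: Identity, query: str, k: int) -> list[str]:
        return [d for d in self._docs if d.lower().startswith(query.lower())][:k]


def retrieve_context(retriever: Retriever, identity: Identity, query: str) -> list[str]:
    """Caller code: written against the Protocol, so either retriever works."""
    if "reader" not in identity.roles:
        raise PermissionError(f"{identity.user_id} may not read documents")
    return retriever.search(identity, query, k=3)
```

Replacing `KeywordRetriever` with `PrefixRetriever` changes one constructor call and nothing else, which is the property that matters when the best component changes every couple of quarters.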

The interesting part is what has to cross the line.

Three things cross, and they are the hardest problems on the platform. Semantics cross: the AI half is only useful if it understands what the data means, and meaning lives in the data half, in schemas and metrics and business definitions. An agent that queries a table it does not semantically understand produces fluent nonsense. Identity crosses: a user’s authority has to follow their request from the chat interface down through the agent, into the query engine, and back, or governance is theater. And provenance crosses in the other direction: every answer the AI half produces has to carry pointers back into the data half, to the page or the row it came from, or the answer cannot be audited and therefore cannot be trusted.
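The three crossings can be shown in one small function. This is a sketch with hypothetical names and toy in-memory data (`CATALOG`, `TABLES`, `answer_metric`, and the `orders` table are all assumptions made for illustration): the metric definition comes from the data half, the caller's identity is checked before any row is read, and the answer carries references back to the rows it was computed from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    user_id: str
    allowed_tables: frozenset[str]


@dataclass(frozen=True)
class MetricDefinition:
    """Semantics: defined and owned in the data half."""
    name: str
    table: str
    column: str
    description: str


@dataclass(frozen=True)
class SourceRef:
    """Provenance: a pointer back to one row in the data half."""
    table: str
    row_id: int


@dataclass(frozen=True)
class Answer:
    text: str
    sources: tuple[SourceRef, ...]


# Toy stand-ins for a semantic catalog and a warehouse.
CATALOG = {
    "revenue": MetricDefinition(
        "revenue", "orders", "amount_cents", "sum of order amounts, in cents"
    )
}
TABLES = {"orders": [{"id": 1, "amount_cents": 1200}, {"id": 2, "amount_cents": 800}]}


def answer_metric(identity: Identity, metric_name: str) -> Answer:
    # Semantics cross: refuse to answer about a metric with no definition.
    metric = CATALOG.get(metric_name)
    if metric is None:
        raise KeyError(f"no definition for metric {metric_name!r}")
    # Identity crosses: the user's authority is checked at the data boundary.
    if metric.table not in identity.allowed_tables:
        raise PermissionError(f"{identity.user_id} may not read {metric.table}")
    rows = TABLES[metric.table]
    total = sum(row[metric.column] for row in rows)
    # Provenance crosses back: every contributing row is cited.
    sources = tuple(SourceRef(metric.table, row["id"]) for row in rows)
    return Answer(f"{metric.name} = {total} ({metric.description})", sources)
```

An answer without `sources` would be just as fluent, and that is the problem: nothing in the text alone distinguishes an audited number from an invented one.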

Notice what those three have in common. None of them is owned by either half. Each one is a thread that must run unbroken through both, and each one breaks at the seam if the seam is sloppy. This is why I think of the boundary as the place where the platform’s real engineering happens. Anyone can stand up a vector store next to a warehouse. The work is making meaning, authority, and evidence survive the crossing.

The discipline split also turns out to be an organizational claim. The two halves reward different instincts: data infrastructure rewards conservatism, AI infrastructure rewards experimentation, and an engineer whose instincts suit one half tends to be miscast in the other. Drawing the line in the architecture lets you draw it in ownership too, and then deliberately staff the seam.

Fifteen years ago the equivalent line ran between databases and applications, and the platforms that won were the ones that made the boundary precise and the crossing cheap. The same pattern is repeating with data and AI.