Timothy Wong

From ML Models to AI Analytics Systems: What Transfers, and What Doesn't

ML platform discipline transfers to AI analytics systems — feature stores become semantic layers, drift detection still applies. The serving layer is where the bar goes up, and eval is the piece without a shared standard.

April 20, 2026

A month ago I wrote that the hard part in analytics isn’t building dashboards - it’s turning insight into decisions at scale. That’s what I called agentic analytics.

That post prompted the same technical question from several finance and data leaders: how do you actually build these systems in production? And is it really any different from a traditional BI system?

One of my teams has been building AI analytics systems for the past couple of months. A combination of the BI and ML platform playbooks transfers - especially ML concepts like drift detection, champion/challenger deployment, shadow testing, and continuous evaluation. The break starts earlier than most people expect - in how you represent business meaning, manage context, and handle retrieval. But the biggest divergence is at the serving layer. And eval is the part no one has a shared standard for yet.

Three layers, two of them familiar

Most AI analytics stacks I’ve looked at have the same three layers we had for ML:

→ Data ingestion and processing
→ Transformation and abstraction
→ Serving

Layer one: data. Raw events, transactions, logs, documents. The same warehouses and pipelines feed both. But AI analytics systems pull more on unstructured artifacts, metadata, and retrieval indices - so even when the source data is shared, the ingestion and indexing patterns expand.

Layer two: transformation. This is where you create consistency and make data digestible. Same purpose as before, but the outputs change.

For ML, you build feature stores. Structured features with defined schemas, distributions, and freshness guarantees.

For AI analytics systems, you build a semantic layer, skills, and a company knowledge base. Metric definitions, dimension hierarchies, step-by-step analysis patterns, golden query examples, terminology and business rules.

Same investment. Different artifacts. We had some of this before - most teams just didn’t treat it as non-negotiable. When humans interpreted every answer, a loose semantic layer was fine. AI cannot carry context in its head. It needs context built, maintained, and governed. Otherwise it confidently gives wrong answers.
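
To make that concrete, here’s a minimal sketch of what one governed semantic-layer artifact might look like. The shape and field names are illustrative, not a reference schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One governed semantic-layer artifact (illustrative fields)."""
    name: str              # canonical name the AI must resolve to
    expression: str        # SQL expression against governed tables
    grain: str             # level the metric is defined at
    owner: str             # business owner accountable for the definition
    version: int           # bumped on every change
    synonyms: tuple = ()   # terminology the AI should map to this metric

net_revenue = MetricDefinition(
    name="net_revenue",
    expression="SUM(gross_revenue) - SUM(refunds) - SUM(discounts)",
    grain="order",
    owner="finance",
    version=3,
    synonyms=("revenue", "net rev"),
)
```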

Two things are genuinely new at this layer.

Version control is non-negotiable. A metric definition that shifts without tracking is a silent quality bug. A skill updated without versioning breaks golden queries downstream. Every artifact - metrics, skills, knowledge - needs change history, rollback, and diff review. Same discipline as code. Most teams ship this layer in Notion docs and Slack threads. That doesn’t scale.
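
What that discipline looks like mechanically, reusing the MetricDefinition above - every change produces a reviewable diff and a new version, no silent shifts. The change-log store here is a stand-in:

```python
import dataclasses
import difflib
from datetime import datetime, timezone

CHANGE_LOG = []  # in practice: a reviewed, append-only store, not a module global

def bump_metric(metric: MetricDefinition, new_expression: str, author: str) -> MetricDefinition:
    """Record a reviewable diff before the new definition takes effect."""
    diff = "\n".join(difflib.unified_diff(
        [metric.expression], [new_expression],
        fromfile=f"{metric.name}@v{metric.version}",
        tofile=f"{metric.name}@v{metric.version + 1}",
        lineterm="",
    ))
    CHANGE_LOG.append({"metric": metric.name, "author": author, "diff": diff,
                       "at": datetime.now(timezone.utc).isoformat()})
    return dataclasses.replace(metric, expression=new_expression,
                               version=metric.version + 1)
```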

Non-technical users have to contribute. This is the shift most teams underestimate. When the semantic layer was features for ML models, only data scientists and engineers touched it. Now finance owns metric definitions. Ops owns playbook logic. Product owns terminology. The people who know the business rules need to edit them.

That means non-technical owners need contribution interfaces they can actually use. Natural language is one obvious answer - it fits how humans describe business rules and how AI consumes context. The governed source of truth may still be structured, but the authoring surface can’t require engineering fluency. Democratised contribution with governance on quality - not a slogan, a design requirement.
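
One way that can look - loudly hypothetical, and eliding the natural-language-to-structured extraction step - is a proposal queue: anyone can author a change in plain language, but only a reviewed, structured version reaches the source of truth:

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """A change authored in natural language, pending structured review."""
    author: str
    raw_text: str       # e.g. "revenue should exclude intercompany transfers"
    parsed: dict        # structured form from an extraction step (assumed, e.g. an LLM)
    status: str = "pending_review"

REVIEW_QUEUE: list = []

def submit(raw_text: str, author: str, extract) -> None:
    """Anyone can propose; nothing merges into the governed store unreviewed."""
    REVIEW_QUEUE.append(Proposal(author, raw_text, extract(raw_text)))
```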

Layer three: serving. This is where ML models and AI analytics systems diverge. Not a little. A lot.

An AI analytics system is really a harness wrapped around a model - runtime, tools, context, memory, guardrails, eval loop. The model is the engine. The harness is the system. That’s where the work is now.
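
A schematic of that harness - every interface here (next_action, retrieve, passes) is an illustrative stand-in, not any real framework’s API:

```python
def run_harness(question, next_action, tools, retrieve, passes):
    """One turn: fetch governed context, let the model act, gate the answer."""
    context = retrieve(question)   # context comes from the governed layer, per query
    steps = []
    while True:
        action = next_action(question, context, steps)  # model plans: tool call or answer
        if action["type"] == "tool":
            result = tools[action["name"]](**action["args"])
            steps.append((action, result))  # the trajectory is what you observe and eval
        else:
            answer = action["content"]
            # guardrail: refuse rather than ship an unvalidated answer
            return answer if passes(answer, steps) else "REFUSED: failed validation gate"
```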

The ultimate problem at this layer: making sure the AI picks the right context, at the right time, for the right scenario. You can have a well-governed semantic layer, versioned skills, and clean knowledge base - but if the system can’t fetch the right pieces at inference time, none of it matters. As you scale across the company, context window limits and memory management become real constraints. The governed context exists. Getting it into the model when it’s needed is the engineering problem.

The serving layer is where the bar goes up

Six dimensions to think about:

Runtime. An ML model is a function. Input goes in, prediction comes out. The harness around an AI analytics system has tool access to the warehouse, internal APIs, and company systems. It plans, calls tools, reasons across steps. The failure mode shifts from “wrong prediction” to “wrong action” - a bad query, a misinterpreted metric, a report that tells leadership something untrue.

Observability. ML observability tracks feature drift, prediction drift, input distribution shifts. AI analytics observability has to track context drift, tool usage drift, and execution-path drift. Why did the system pick this table over that one? Why did it interpret “revenue” as GMV in one session and net revenue in another? Traditional ML monitoring was insufficient for this - which is why a new wave of observability primitives emerged over the past two years.
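
A sketch of what execution-path drift monitoring can reduce to: compare the distribution of choices the system makes (tables joined, tools called, metrics resolved) across time windows. Total variation distance is one simple choice of distance metric, assumed here:

```python
from collections import Counter

def usage_distribution(events):
    """events: list of tool/table names the system chose in a window."""
    counts = Counter(events)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def drift(baseline, current):
    """Total variation distance between two usage distributions (0 = identical)."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)

# e.g. which table the system sourced revenue from, last week vs this week
last_week = usage_distribution(["fct_orders"] * 90 + ["stg_payments"] * 10)
this_week = usage_distribution(["fct_orders"] * 60 + ["stg_payments"] * 40)
assert drift(last_week, this_week) > 0.25   # flag: execution-path drift
```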

Backtesting. For ML, you replay historical predictions against ground truth. AI analytics systems often need session-level replay or controlled-environment evaluation - with tool calls sandboxed, context reconstructed, reasoning paths evaluated. It’s an order of magnitude harder. The tooling for this is still maturing.
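
A minimal sketch of the sandboxing piece, assuming you capture query/result pairs from the live session: replay hits the recording instead of the warehouse, and a divergent reasoning path surfaces immediately as a miss:

```python
class SandboxedWarehouse:
    """Replays recorded results instead of hitting the live warehouse."""

    def __init__(self, recorded: dict):
        self.recorded = recorded   # {normalized_sql: result} captured from the session

    def run(self, sql: str):
        key = " ".join(sql.split()).lower()   # normalize whitespace and case
        if key not in self.recorded:
            # the replayed reasoning path diverged from the original session
            raise KeyError(f"unrecorded query during replay: {key[:80]}")
        return self.recorded[key]
```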

Deployment patterns. Champion/challenger, A/B testing, shadow deployment, canary releases - the concepts transfer directly. The measurement doesn’t. With a model, success metrics are usually well-defined. With an AI analytics system, what’s the success metric? Correctness? Latency? User trust? All of the above, and none of them are clean.
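
The routing itself is the easy, transferable part - a sketch with deterministic session bucketing and an inline shadow comparison (in production the shadow call would be async; what to score in the log is the open question):

```python
import hashlib

def assign_arm(session_id: str, challenger_pct: int = 10) -> str:
    """Deterministic traffic split: the same session always gets the same arm."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

def serve(question, session_id, champion, challenger, log):
    arm = assign_arm(session_id)
    answer = (challenger if arm == "challenger" else champion)(question)
    if arm == "champion":
        # shadow: also run the challenger and log both for offline comparison
        log.append({"q": question, "champion": answer, "shadow": challenger(question)})
    return answer
```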

Context retrieval. This is the dimension most teams underestimate. The system has to select the right context from the right source at the right time - the correct metric definition, the relevant skill, the appropriate business rules for this specific query. At small scale this is manageable. As you compound across teams, domains, and use cases, context window limits force hard tradeoffs. You can’t fit everything in. The system has to decide what’s relevant, and that decision is itself a failure surface. A well-governed context layer that the model can’t reach at inference time is worse than no context layer - because you think you’re covered when you’re not.
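
A sketch of the tradeoff, assuming per-artifact relevance scoring and token counting exist: greedy packing under the window budget, with the dropped artifacts logged - because the selection is itself a failure surface:

```python
def pack_context(candidates, budget_tokens, score, count_tokens):
    """Greedy selection: highest-relevance artifacts first, until the budget is spent."""
    chosen, dropped, used = [], [], 0
    for artifact in sorted(candidates, key=score, reverse=True):
        cost = count_tokens(artifact)
        if used + cost <= budget_tokens:
            chosen.append(artifact)
            used += cost
        else:
            dropped.append(artifact)   # log these: what the model never saw
    return chosen, dropped
```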

Eval. This is the one I’ll spend the most time on, because it’s where we’ve invested the most and it’s still the hardest piece without a shared standard.

Why eval is the hardest unsolved piece

The core question looks simple: does the system give the same answer to the same question, consistently?

It doesn’t. And there’s no shared standard yet for evaluating correctness, reasoning quality, and business usefulness end-to-end. This is an industry-hard problem.

Five reasons eval is harder for AI analytics systems than for ML models:

Consistency. The same question asked twice can produce different SQL, different joins, different interpretations. Non-determinism compounds across multi-step reasoning. You can reduce it with temperature controls and strict prompts, but you cannot eliminate it.
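
Measuring it is at least mechanical. A sketch: ask the same question repeatedly and report the agreement rate of the modal answer. The normalize step does real work here and is an assumption - exact-match on raw text would understate agreement:

```python
from collections import Counter

def consistency(ask, question, runs=10,
                normalize=lambda a: " ".join(a.split()).lower()):
    """Agreement rate of the most common answer across repeated runs."""
    answers = Counter(normalize(ask(question)) for _ in range(runs))
    return answers.most_common(1)[0][1] / runs   # 1.0 = perfectly consistent
```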

Correctness without a single ground truth. Some subcomponents have clear ground truth - SQL validity, metric agreement with the semantic layer, citation correctness, constraint adherence. The overall analytic judgment often doesn’t. “What’s driving the revenue miss in Q3” has no single correct answer. Two good analysts would write two different reports. Eval has to grade reasoning, not just outputs.
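
The checkable subcomponents can be automated. One example - does the generated SQL even compile against the schema - sketched here with SQLite standing in for the warehouse (a real warehouse needs its own dialect parser):

```python
import sqlite3

def sql_is_valid(sql: str, schema_ddl: str) -> bool:
    """Ground-truth subcheck: does the SQL plan against the schema at all?"""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_ddl)                   # build the schema, no data needed
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")    # plans without executing
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

assert sql_is_valid("SELECT SUM(amount) FROM orders",
                    "CREATE TABLE orders (amount REAL, created_at TEXT);")
assert not sql_is_valid("SELECT amout FROM orders",   # misspelled column
                        "CREATE TABLE orders (amount REAL, created_at TEXT);")
```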

Reasoning quality, not final answer. A system can arrive at the right number through wrong logic. That’s worse than a wrong number, because it’s undetectable. Eval has to inspect the path, not just the destination.

LLM-as-judge has its own biases. Using a model to evaluate a model is the pragmatic shortcut. It also introduces judge drift, judge preference, and a whole new layer of subjective calibration. You end up building eval for your eval.
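
The pragmatic mitigation is to calibrate the judge against a small human-labeled set - literally eval for your eval. A sketch, with judge as any callable returning a verdict comparable to the human label:

```python
def judge_calibration(judge, labeled_set):
    """Agreement rate of the LLM judge with human labels on a held-out set."""
    agree = sum(judge(item["question"], item["answer"]) == item["human_label"]
                for item in labeled_set)
    return agree / len(labeled_set)   # track over time to catch judge drift
```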

Golden datasets go stale faster. ML eval sets can degrade under concept drift, but they usually last months. Golden queries for AI analytics systems can go stale much faster - when the semantic layer shifts, a new data source comes online, or the business asks a new kind of question. Different rate, different mechanism.
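
One mitigation is mechanical: pin each golden query to the versions of the artifacts it depends on, and flag it the moment any dependency moves. A sketch, reusing the versioned-metric idea from above:

```python
def stale_goldens(goldens, current_versions):
    """A golden query is suspect once any artifact it depends on has changed.

    goldens: [{"query": ..., "pinned": {"net_revenue": 3, ...}}, ...]
    current_versions: {"net_revenue": 4, ...}
    """
    return [g for g in goldens
            if any(current_versions.get(name, v) != v
                   for name, v in g["pinned"].items())]
```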

Here’s what we’re building to close the gap:

→ Validation quality gates before answers reach end users
→ Golden query examples as few-shot references, versioned with the semantic layer
→ Guardrails that make the system refuse rather than hallucinate when confidence is low (sketched below)
→ Feedback loops where the most-asked queries improve context automatically
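
The guardrail item from that list, sketched: an answer ships only when every validation gate passes and confidence clears a threshold. The threshold value here is an assumption to be tuned, not a recommendation:

```python
def gated_answer(answer, checks: dict, confidence: float, threshold: float = 0.8):
    """Refuse rather than hallucinate when the system can't stand behind the answer."""
    if all(checks.values()) and confidence >= threshold:
        return answer
    failed = [name for name, ok in checks.items() if not ok] or ["low_confidence"]
    return f"REFUSED: cannot answer reliably (failed: {', '.join(failed)})"
```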

None of this is solved. We ship it, watch it, fix it, ship again. That’s the state of the art.

The head start most teams don’t realize they have

If you’ve built production ML platforms before, you already understand feature stores, drift detection, champion/challenger patterns, shadow deployments. Most of that transfers. The layer one and layer two problems are the same problems you solved five years ago - with two genuinely new requirements: version control across semantic artifacts, and contribution interfaces for non-technical owners.

The serving layer is where the object being governed gets broader and less deterministic. The discipline is familiar, but the surface area expands - runtime becomes tool-using, observability becomes trajectory-level, eval becomes system-level rather than model-level.

The principles transfer. The bar is higher. And eval is still the piece without a shared standard.