Trusting AI: evaluation as engineering discipline

Image by Gerd Altmann from Pixabay

For decades, software quality has been a solved organizational problem, or at least a well-understood one. Teams write tests. Tests run automatically. When a change breaks something, the pipeline catches it before it reaches production. This discipline, built up painfully over thirty years of software engineering practice, is why modern development teams can ship multiple times per day with confidence. The feedback loop is fast, the failure modes are visible and the quality bar is enforced continuously rather than checked at the end.

AI breaks this entirely. When a software system returns the wrong value, you see it immediately. When an AI system returns a plausible-sounding but subtly wrong answer, it looks identical to a correct one. Your monitoring dashboard shows acceptable latency. Your error rate is fine. Your users are quietly receiving incorrect outputs and you have no instrument that tells you so. This is the central problem of AI quality assurance and it’s more consequential than most organizations have recognized. Braintrust, which operates in this space, cites industry estimates of nearly two billion dollars in annual losses due to undetected LLM failures and quality issues in production. These losses are largely invisible precisely because the failures are.
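
To see how a failure hides in plain sight, consider a toy sketch in Python (the response object and thresholds are invented for illustration): every signal an infrastructure dashboard tracks looks healthy, while the one field that matters goes unchecked.

```python
# What a typical dashboard sees: the request "succeeded".
response = {"status": 200, "latency_ms": 240,
            "output": "A plausible-sounding but subtly wrong answer."}

def infra_healthy(r: dict) -> bool:
    # Checks only the signals infrastructure monitoring tracks.
    return r["status"] == 200 and r["latency_ms"] < 1000

print(infra_healthy(response))  # True, yet the output is wrong.
# Nothing here inspects response["output"]: to these checks, a correct
# answer and a subtly incorrect one are indistinguishable.
```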

The underlying issue is that traditional software testing assumes deterministic behavior. You define an input, you specify the expected output, you verify the result. This works because the same input always produces the same output. AI systems are fundamentally probabilistic. The same prompt, sent twice, can return different responses. A prompt change that improves performance on one class of inputs can silently degrade performance on another. What passes every test in development can behave differently in production, where the distribution of real user inputs differs from any test set you constructed in advance. You can’t unit-test your way to a reliable AI product.
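
The contrast is easy to make concrete. In the sketch below, `summarize` and `grade` are hypothetical callables standing in for a model call and a quality rubric; the point is the shape of the check, not the specifics.

```python
# Deterministic software: one input, one expected output, exact match.
def add(a: int, b: int) -> int:
    return a + b

def test_add() -> None:
    assert add(2, 3) == 5  # the same input always gives the same output

# Probabilistic AI: the same prompt can return different responses,
# so exact-match assertions are meaningless. Instead, sample several
# completions and score each against a rubric.
def eval_summarizer(summarize, grade, document: str,
                    samples: int = 5, threshold: float = 0.8) -> bool:
    scores = [grade(document, summarize(document)) for _ in range(samples)]
    # The verdict is statistical: mean quality over repeated samples
    # must clear a bar, rather than matching a golden string.
    return sum(scores) / len(scores) >= threshold
```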

This isn’t a problem that will resolve itself as models improve. If anything, it intensifies as AI systems become more capable and organizations deploy them in more consequential workflows. A medical documentation system that occasionally introduces errors, a customer service agent that gives inconsistent policy advice, a financial analysis tool that drifts subtly as the underlying model is updated – these aren’t theoretical risks; they’re operational realities for organizations that have deployed AI without building the evaluation infrastructure to detect and correct them.

A new category of tooling has emerged to address this gap. The field is sometimes called LLMOps, borrowing from the MLOps tradition, and it’s developing rapidly. What distinguishes the leading platforms is a shared philosophical approach: Evaluation must be treated as a first-class engineering discipline, not an afterthought. This means defining what “good” looks like before you ship, running systematic experiments against representative datasets, catching regressions automatically as part of the development pipeline and monitoring production continuously for quality drift, not just latency and errors.
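
In code, that philosophy reduces to a recognizable pattern. What follows is a minimal sketch, assuming a hypothetical `task` function and `scorer`; the real platforms add versioning, tracing and tooling on top, but the core loop of a versioned dataset, a scorer and a threshold that fails the build looks like this.

```python
import json
import sys

def run_eval(task, scorer, dataset_path: str, threshold: float) -> None:
    """Run `task` over a versioned dataset of cases and fail the
    build if mean quality drops below the agreed bar."""
    with open(dataset_path) as f:
        # One JSON object per line: {"input": ..., "expected": ...}
        cases = [json.loads(line) for line in f]

    scores = [scorer(case["expected"], task(case["input"])) for case in cases]
    mean_score = sum(scores) / len(scores)
    print(f"{len(cases)} cases, mean score {mean_score:.3f}")

    # A quality regression behaves like a failing unit test: it
    # blocks the merge instead of quietly reaching production.
    if mean_score < threshold:
        sys.exit(1)
```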

Braintrust is among the most mature examples of this approach. Used by product teams at companies such as Notion, Stripe and Zapier, it’s built around the idea that AI quality assurance should mirror the practices software teams already apply to code: version control for prompts, automated regression detection, side-by-side comparison of model or prompt variants against consistent datasets. The impact is measurable in concrete terms. Notion reported moving from resolving three quality issues per day to thirty after adopting systematic evaluation workflows. That tenfold improvement reflects not just better tooling but a fundamentally different relationship with AI quality: one in which problems are discovered and fixed systematically rather than surfaced by unhappy users.
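
The side-by-side comparison idea is simple enough to sketch generically (this is the underlying pattern, not Braintrust's actual SDK): every variant is scored against the same dataset, so any difference in score is attributable to the change itself.

```python
def compare_variants(variants: dict, cases: list, scorer) -> dict:
    """Score every prompt or model variant against the same dataset,
    so the comparison isolates the change itself."""
    results = {}
    for name, run in variants.items():  # run: input -> output
        scores = [scorer(case["expected"], run(case["input"]))
                  for case in cases]
        results[name] = sum(scores) / len(scores)
    return results

# Usage sketch:
#   compare_variants({"prompt_v1": run_v1, "prompt_v2": run_v2},
#                    cases, scorer)
# A score drop for v2 surfaces the regression before the new prompt
# ships, not after users complain.
```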

Arize takes a complementary approach, oriented less toward the development loop and more toward production monitoring. It inherits a strong background in traditional ML observability, including drift detection, bias monitoring and performance tracking, and has extended these capabilities to language model deployments. The key insight Arize operationalizes is that AI systems can degrade silently over time, as the distribution of real-world inputs shifts, upstream models are updated or the system is used in ways its designers didn’t anticipate. Catching this kind of slow degradation requires continuous monitoring of output quality, not just infrastructure health. For enterprises in regulated industries, Arize also provides the audit trails and compliance controls that governance teams require before they can approve AI in production workflows.
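
The mechanics of catching slow degradation can be illustrated with a generic statistical check (a sketch of the idea, not Arize's API): compare a quality signal in a recent window against a reference window from when the system was known to be healthy.

```python
from statistics import mean, stdev

def drift_alert(reference: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean of a quality signal sits more
    than `z_threshold` standard errors from a healthy reference window."""
    standard_error = stdev(reference) / (len(recent) ** 0.5)
    z = abs(mean(recent) - mean(reference)) / standard_error
    return z > z_threshold

# The monitored signal could be a rubric score, an embedding distance
# or a policy-compliance grade: quality, not latency or error rate.
```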

Coval represents a different and particularly interesting architectural approach. Founded on the premise that AI agent testing should borrow from how the autonomous-vehicle industry validates self-driving systems, it provides simulation and evaluation infrastructure for AI agents before they’re deployed to real users. The analogy is precise and useful: You don’t wait for a self-driving car to encounter an edge case on a real road to discover that it handles the situation poorly; you simulate millions of scenarios in advance, stress-test the system against edge cases and validate behavior before anything reaches production. Applied to AI agents, this means testing not just individual model responses but complete multi-step workflows, across diverse simulated conditions, at a scale and speed that human testers can’t match.
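
A toy version of that simulation loop, with `agent`, `perturb` and `check_outcome` as hypothetical stand-ins, makes the difference from single-response testing visible: the unit under test is the whole workflow and the verdict is an end-to-end pass rate.

```python
import itertools

def simulate(agent, check_outcome, scenarios: list, perturbations: list,
             runs_per_case: int = 3) -> float:
    """Run the complete multi-step workflow across every scenario and
    perturbation combination and report the end-to-end pass rate."""
    outcomes = []
    for scenario, perturb in itertools.product(scenarios, perturbations):
        for _ in range(runs_per_case):  # repeated runs capture nondeterminism
            transcript = agent(perturb(scenario))  # full multi-step run
            outcomes.append(check_outcome(scenario, transcript))
    return sum(outcomes) / len(outcomes)
```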

What connects these platforms is a deeper economic principle that organizations are beginning to recognize. Evaluation infrastructure doesn't merely catch errors; it generates data. Every evaluation run produces a labelled set of inputs and outputs, with quality scores attached. That labelled data can be used to improve the next version of the model or prompt. Better evaluation produces better improvement signals. Better improvement signals produce better AI. The organization that builds rigorous evaluation infrastructure isn't just reducing errors in the short term; it's compounding its ability to improve AI quality over time. The organization that skips evaluation isn't just accepting current errors; it's forgoing the data it needs to learn.
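
A minimal sketch of that flywheel, building on the hypothetical eval harness above: each scored example is appended to a dataset that later feeds improvement.

```python
import json

def log_eval_record(path: str, case: dict, output: str, score: float) -> None:
    """Append every scored example to a growing labelled dataset."""
    with open(path, "a") as f:
        f.write(json.dumps({"input": case["input"], "output": output,
                            "score": score}) + "\n")

# Over time this file becomes improvement signal: high-scoring records
# can seed few-shot examples or fine-tuning data, while low-scoring
# ones become new regression cases for the next evaluation run.
```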

This compounding effect creates the same kind of structural separation that has appeared in other areas of AI-driven competition. As I described in the context of compliance and data infrastructure, the advantage accrues not from a single capability but from an architecture that improves continuously. Organizations that invest in evaluation early develop operational expertise, labelled datasets and quality benchmarks that are difficult to replicate quickly. The gap between organizations that treat AI evaluation as an engineering discipline and those that rely on ad hoc testing widens with every deployment cycle.

The McKinsey 2026 AI Trust Maturity Survey captures where most organizations currently sit. Despite significant investment in AI deployment, only about one-third of organizations report meaningful maturity in AI governance and agentic AI oversight. Governance and control structures are lagging systematically behind technical deployment, across every region and industry. This isn’t a temporary imbalance; it reflects the fact that organizations have prioritized getting AI into production over building the infrastructure to know what their AI is actually doing.

The parallel to test-driven development in software engineering is instructive. When automated testing was first proposed as a discipline, it was widely resisted as overhead. Teams argued that writing tests slowed development down, that experienced engineers didn’t need them, that the cost wasn’t justified. Over time, the evidence proved otherwise. Organizations that adopted automated testing shipped faster, with fewer defects and more predictable quality. Testing stopped being overhead and became the mechanism by which software teams maintained velocity as systems grew more complex. Today, a team that ships without automated tests is considered reckless.

The same transition is now underway for AI. Organizations that treat evaluation as overhead will find themselves constrained by the quality problems they can’t see and can’t fix systematically. Organizations that treat it as infrastructure will find themselves able to deploy AI faster, improve it more rapidly and build the trust with customers and regulators that determines adoption in consequential domains. The direction of travel isn’t in doubt; what varies is how quickly different organizations recognize it.

Ultimately, the evaluation infrastructure question is a trust question. AI systems that can’t demonstrate their own reliability can’t be given responsibility for consequential decisions. As AI moves from experimental features into core workflows, from supporting human decisions to making them, the ability to prove that a system behaves as intended, consistently and over time, becomes the condition for deployment rather than a nice-to-have. Trust, in this context, isn’t a feeling; it’s an engineering output. And like most engineering outputs, it requires deliberate investment, proper tooling and continuous measurement to produce. To end with John Ruskin: “Quality is never an accident; it’s always the result of intelligent effort.”

Want to read more like this? Sign up for my newsletter at jan@janbosch.com or follow me on janbosch.com/blog, LinkedIn (linkedin.com/in/janbosch) or X (@JanBosch).