Here's a number that should concern every enterprise leader: according to McKinsey's State of AI report, 88% of enterprises now deploy AI in some capacity. Only 6% report capturing real business value from those deployments. That's an 82-point gap between having AI and benefiting from it.
The usual explanations for this gap focus on strategy, talent, and data quality. Those are real factors. But there's a structural problem that rarely gets discussed: the enterprise AI ecosystem has a massive blind spot in its lifecycle coverage. The tools, frameworks, and practices that surround AI systems today cluster overwhelmingly in the early phases - building and shipping. The phase where AI systems are actually running in production, serving real customers, and generating real regulatory exposure? That phase is largely unwatched.
Two research organizations - the RAND Corporation and Forrester Research - arrived at the same conclusion from different directions: continuous monitoring of AI systems in production is a critical gap that existing tools do not adequately address. Each reached that finding through its own independent analysis, which makes the convergence harder to dismiss.
What the AI lifecycle actually looks like
RAND's AI Security Framework organizes the enterprise AI lifecycle into five stages: Design → Develop → Deploy → Operate → Retire. This isn't a theoretical model - it reflects how AI systems move through organizations. Teams design AI capabilities, develop the models and integrations, deploy them to production, operate them at scale, and eventually retire them when they're replaced or obsolete.
The problem becomes visible when you map existing tooling to each stage:
| Phase | What happens | Tooling maturity |
|---|---|---|
| Design | Threat modeling, requirements, use case definition | Mature - frameworks, consultancies, AI strategy tools |
| Develop | Model training, prompt engineering, evaluation | Mature - DeepEval, Ragas, promptfoo, Weights & Biases |
| Deploy | Model serving, API integration, infrastructure setup | Mature - observability (LangSmith, Langfuse, Arize), cloud platforms |
| Operate | Continuous production monitoring, behavioral governance, compliance evidence | Gap - no purpose-built tooling for behavioral assurance |
| Retire | Decommission, data archival, model sunset | Nascent - ad hoc processes |
The first three phases have attracted billions in venture funding and produced dozens of well-established tools. The Operate phase - the only phase where AI systems are affecting real business outcomes, real customers, and real regulatory obligations - has almost no purpose-built tooling.
Why the Operate phase is different
The Operate phase presents a fundamentally different challenge from the phases that precede it. During Design, Develop, and Deploy, teams work with known inputs and controlled environments. They evaluate models against benchmarks, test integrations in staging, and validate that the system works as intended before going live.
Once an AI system is in production, the conditions change. The system is now processing real-world inputs that are messier and more varied than any test suite. Model providers - OpenAI, Anthropic, Google, Meta, Mistral, and others - push updates that can subtly alter behavior, sometimes without advance notice. The distribution of inputs the system sees may shift over weeks and months. And the regulatory clock is ticking: every hour the system runs, it's generating compliance obligations the enterprise needs to be able to evidence.
This is where AI behavioral drift emerges. Not catastrophic failures that trigger alerts in your infrastructure monitoring. Gradual shifts in how the AI system responds to the same types of inputs - shifts that are invisible to observability tools because the system is technically healthy even as its outputs are changing in business-meaningful ways.
Consider an insurance claims triage model that begins routing water damage claims to the wrong adjuster team because its classification boundary has gradually shifted. Response times are normal. Error rates look fine. Observability dashboards show green. But 15% of water damage claims are now being misclassified - and nobody knows until a customer complaint surfaces weeks later.
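As a rough sketch of how a shift like this could be caught, a monitoring job might periodically compare the misroute rate in a recent window against the approved baseline with a two-proportion z-test. The function name and the claim counts below are illustrative assumptions, not figures from a real deployment:

```python
import math

def misroute_rate_shift(baseline_errors, baseline_n, window_errors, window_n):
    """Two-proportion z-test: has the misclassification rate moved
    above its approved baseline? Returns (z, one-sided p-value)."""
    p_base = baseline_errors / baseline_n
    p_win = window_errors / window_n
    pooled = (baseline_errors + window_errors) / (baseline_n + window_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / window_n))
    z = (p_win - p_base) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided: rate increased
    return z, p_value

# Hypothetical numbers: 2% misroutes across 5,000 baseline claims,
# 15% across 400 claims in the latest window.
z, p = misroute_rate_shift(100, 5000, 60, 400)
```

A statistical test like this fires on the rate shift itself, independent of latency, error counts, or any other infrastructure signal - which is exactly the class of change observability dashboards miss.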
What existing tools miss
The natural question is: don't observability tools already cover this? They don't, and the distinction matters.
Observability platforms like LangSmith, Langfuse, and Arize are excellent at what they do. They capture traces, measure latency, log token usage, and surface errors. They answer the question: what did the AI system do?
What they don't answer is the harder question: is the AI system still doing what we approved it to do?
This distinction matters because AI systems can drift in ways that are entirely invisible to infrastructure-level monitoring. The system responds. Response times are normal. No errors are thrown. But the meaning of the outputs - the classifications, the recommendations, the risk scores - has shifted from the behavioral baseline the enterprise approved at deployment.
Detecting this kind of drift requires a different approach entirely: continuously evaluating the system's outputs against defined behavioral dimensions using methods designed specifically for detecting gradual change, not just measuring performance against static benchmarks.
Traditional software testing assumes deterministic outputs - the same input produces the same output. AI systems are non-deterministic by design. Two identical inputs can produce different but equally valid outputs. This means you can't test AI the way you test traditional software. You need statistical methods that track behavioral distributions over time, detecting when those distributions shift beyond acceptable boundaries.
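One statistical method of this kind - a sketch, not any vendor's actual implementation - is the population stability index, which scores how far a recent window of output categories has drifted from the approved baseline distribution. The categories and the rule-of-thumb thresholds below are assumptions for illustration:

```python
import math

def population_stability_index(baseline_counts, window_counts):
    """PSI between a baseline output distribution and a recent window.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 worth watching,
    > 0.25 significant shift."""
    b_total = sum(baseline_counts.values())
    w_total = sum(window_counts.values())
    psi = 0.0
    for category in baseline_counts:
        b = max(baseline_counts[category] / b_total, 1e-6)  # avoid log(0)
        w = max(window_counts.get(category, 0) / w_total, 1e-6)
        psi += (w - b) * math.log(w / b)
    return psi

# Hypothetical claims-decision outputs: each individual response is
# still valid, but the distribution has quietly shifted.
baseline = {"approve": 700, "escalate": 200, "deny": 100}
drifted  = {"approve": 550, "escalate": 330, "deny": 120}
psi = population_stability_index(baseline, drifted)
```

Note that no single output in the drifted window is "wrong" in isolation; only the distribution-level comparison reveals that behavior has moved.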
What RAND found
RAND Corporation's AI Security Guide, published in February 2026 and funded by the U.S. Department of State, provides the most comprehensive lifecycle framework for AI security published to date. The guide maps security controls across all five lifecycle phases and includes an interactive risk assessment tool.
What's significant for enterprises is where RAND identifies the gaps. The Design, Develop, and Deploy phases have established security practices - threat modeling, secure coding, adversarial testing, model verification. The Operate phase guidance calls for continuous monitoring for drift, logging and alerting, governance for model updates, and incident response. But the guide itself is advisory - it tells organizations what they should be doing in the Operate phase without providing the tooling to actually do it.
This is the structural gap: RAND's framework assumes operational monitoring capabilities exist at the enterprise level. For most organizations, they don't. The tools that do exist are built for earlier lifecycle phases and repurposed imperfectly for production use.
Why Forrester reached the same conclusion
Independently of RAND, Forrester Research published the AEGIS framework - Agentic AI Enterprise Guardrails for Information Security - in 2025. AEGIS approaches the problem from a different angle (enterprise agentic AI security) but arrives at a convergent finding.
AEGIS introduces the principle of "continuous assurance" as a core requirement across its six security domains: GRC, IAM, Data Security, Application Security, Threat Management, and Zero Trust. The framework explicitly calls for ongoing behavioral validation of AI systems in production - not just point-in-time assessments at deployment.
Forrester's second AEGIS report added a regulatory crosswalk mapping 39 controls to NIST AI RMF, ISO 42001, EU AI Act, and HITRUST. The controls that map to production monitoring and behavioral validation are among the most underserved - meaning the enterprise tooling ecosystem has not caught up with what regulators and frameworks already expect.
When two independent research organizations - one government-funded, one private-sector - both identify the same gap from different analytical starting points, that's a signal worth paying attention to.
RAND identifies the gap from a security lifecycle perspective: the Operate phase lacks purpose-built tooling. Forrester identifies it from an enterprise governance perspective: continuous assurance is a stated principle that existing tools don't deliver. Both point to the same structural absence in the enterprise AI ecosystem.
The regulatory dimension
This gap isn't just a tooling problem. It's a compliance problem - and regulators aren't waiting for the tooling ecosystem to catch up.
The NAIC Model Bulletin on AI - now adopted in more than 24 U.S. states - requires insurance companies to maintain governance frameworks that include ongoing monitoring of AI systems used in coverage decisions, claims handling, and underwriting. The Federal Reserve's SR 11-7 guidance mandates model risk management, including ongoing validation, for banks. The EU AI Act's enforcement timeline is underway, with high-risk AI obligations that include continuous post-market surveillance.
All of these frameworks assume that enterprises have the capability to continuously monitor AI behavior in production and produce evidence that those systems remain within approved parameters. The assumption is ahead of reality. Most enterprises today can tell you what their AI systems did (observability) but cannot prove that their AI systems are still behaving as approved (behavioral assurance).
When a regulator asks "how do you know your AI claims triage system is performing within approved parameters?", the answer can't be "we checked it when we deployed it." The answer needs to be an auditable evidence package showing continuous monitoring results, behavioral scores tracked over time, drift events detected and resolved, and attestations for specific reporting periods.
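As a rough illustration of what such a package could contain - the field names and structure here are assumptions, not a regulatory schema - the monitoring record for a reporting period might be assembled into a single attestation document:

```python
import json

def build_evidence_package(system_name, period_start, period_end,
                           behavioral_scores, drift_events):
    """Assemble a regulator-facing attestation document (illustrative
    structure only; no standard schema is implied)."""
    open_events = [e for e in drift_events if e["status"] != "resolved"]
    package = {
        "system": system_name,
        "reporting_period": {"start": period_start, "end": period_end},
        "behavioral_scores": behavioral_scores,  # per-dimension time series
        "drift_events": drift_events,            # detected and tracked to closure
        "attestation": {
            "within_approved_parameters": not open_events,
            "open_drift_events": len(open_events),
        },
    }
    return json.dumps(package, indent=2)

# Hypothetical system and scores, for illustration only.
doc = build_evidence_package(
    "claims-triage-v3", "2026-01-01", "2026-03-31",
    behavioral_scores={"routing_correctness": [0.94, 0.93, 0.94]},
    drift_events=[{"id": "DE-001", "status": "resolved"}],
)
```

The design point is that the artifact is a self-contained document, not a dashboard login: something a compliance officer can attach to a response without exporting screenshots.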
Why this gap persists
If the gap is so clear, why hasn't it been filled? Three structural reasons:
The Operate phase is harder than it looks. Pre-deployment evaluation is a bounded problem - you have a known dataset, controlled conditions, and a clear pass/fail threshold. Production behavioral monitoring is unbounded - inputs are unpredictable, baselines shift, and you need statistical methods that distinguish meaningful behavioral change from normal AI output variance. You can't just run your deployment test suite on a schedule and call it monitoring.
The buyers are different from the builders. The people who buy AI monitoring tools for the Operate phase - CISOs, Chief Risk Officers, compliance leaders - are not the same people who buy development and deployment tools. They have different vocabularies, different procurement processes, different success criteria. A tool built for ML engineers doesn't serve a compliance officer's needs, even if the underlying data is related.
Cloud providers have a conflict of interest. AWS, Azure, and GCP all offer monitoring capabilities for AI services running on their platforms. But asking a cloud provider to independently validate the behavior of AI systems they host creates a "grading your own homework" problem. The same conflict applies to model providers themselves - OpenAI monitoring the behavioral drift of GPT-4o deployments, or Google assessing whether Gemini outputs have shifted from approved baselines, means the entity being monitored is also the entity reporting on its own performance. Regulators are already skeptical of self-reported compliance. Independent, third-party behavioral monitoring is what regulatory frameworks actually require.
What closing the gap requires
Closing the AI lifecycle gap isn't about adding features to existing tools. Observability platforms are not behavioral assurance platforms - they're solving a different problem with different methods for different buyers. The Operate phase needs purpose-built capabilities:
Continuous behavioral baselines. Not one-time benchmarks, but living baselines that capture how an AI system behaves across defined dimensions - accuracy, tone, compliance adherence, routing correctness, boundary respect - and track those dimensions over time.
Statistical drift detection. Methods designed specifically for identifying gradual behavioral change in non-deterministic systems. Traditional threshold-based alerting catches catastrophic failures. Behavioral drift requires statistical process control techniques that accumulate evidence of directional change and trigger alerts only when the cumulative signal crosses meaningful thresholds.
Compliance evidence generation. Automated production of regulator-ready evidence packages that map directly to NAIC, SR 11-7, EU AI Act, NIST AI RMF, and Forrester AEGIS framework requirements. Not dashboards - documents that a compliance officer can submit in response to a regulatory inquiry.
Platform independence. Monitoring that works across AI providers, because enterprises increasingly use multiple models from multiple providers. And monitoring that is independent of the provider being monitored, because regulatory credibility requires it.
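The statistical process control idea behind the drift detection capability above can be sketched with a one-sided CUSUM over a per-window behavioral score. The target, slack, and threshold values below are illustrative assumptions:

```python
def cusum_drift_alerts(scores, target, k, h):
    """One-sided CUSUM on a per-window behavioral score.
    target: approved baseline mean; k: slack (allowance) absorbing
    normal run-to-run variance; h: decision threshold. An alert fires
    only when cumulative evidence of decline exceeds h."""
    s = 0.0
    alerts = []
    for i, score in enumerate(scores):
        s = max(0.0, s + (target - score) - k)  # accumulate downward deviation
        if s > h:
            alerts.append(i)
            s = 0.0  # reset after raising an alert
    return alerts

# Hypothetical weekly scores: stable at baseline, then a slow decline.
# No single week looks alarming; the accumulated signal does.
scores = [0.90, 0.90, 0.90, 0.90, 0.90,
          0.88, 0.85, 0.80, 0.75, 0.70, 0.65]
alerts = cusum_drift_alerts(scores, target=0.90, k=0.02, h=0.30)
```

Unlike a fixed threshold, the CUSUM stays silent through ordinary variance and fires once small, same-direction deviations add up - the gradual-shift pattern the article describes.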
The enterprise that deploys AI without continuous behavioral monitoring is making the same bet as the enterprise that deploys software without security monitoring. It might work for a while. But the risks compound with every passing day, and the regulatory expectation of continuous assurance isn't going away.
Is your AI still behaving as approved?
AnchorDrift provides continuous AI behavioral assurance for regulated enterprises. We're onboarding customers now.
Book a Discovery Call