The Geometry of Trust: Runtime Verification for AI Agents

Every generated response is a computation that has never been executed before and whose properties are unknown until it exists. The appropriate operational model is: generate, verify, trust, then execute.

01 — The Assumption That Broke

Enterprise AI deployments are almost universally built on an assumption that AI systems can be governed the way deterministic software has been governed for thirty years. Test before you ship. Gate through a deployment pipeline. Trust once it passes.

Traditional software earns that trust because its logic is explicit and its outputs, given identical inputs, are predictable. Testing can, in principle, approach completeness — which is why we trust banking systems, flight control software, and the infrastructure running global commerce.

AI agents work differently. Their outputs shift with retrieved context, with the tools they are handed, and with the precise phrasing of the instructions they receive. Two identical queries can produce materially different outputs. The same model, given the same prompt, may behave differently after a retrieval step surfaces different documents. This is the nature of probabilistic computation, and it means pre-deployment testing cannot fully cover runtime behavior — no evaluation suite can anticipate every combination of user input, retrieved context, tool output, and model reasoning that arises in production.

Organizations still operating on the old model are accumulating risk with each deployment. Outputs are trusted not because they have been verified, but because no architectural mechanism exists to verify them. Agentic systems take real-world actions with nothing between generation and execution.

02 — What Verification Requires

If pre-deployment testing cannot cover the full distribution of runtime behavior, the question becomes: what kind of verification can?

The answer requires asking a different question. Traditional software testing asks whether a function produced the correct output, which is tractable because the output is fixed given the input. For probabilistic systems, answering the equivalent question (is this output correct?) is slow, expensive, and domain-dependent. It requires semantic judgment, which means routing every consequential output through a second probabilistic component.

The tractable question is: were the conditions under which this output was produced ones the system can defend? This question does not require knowing whether the answer is right. It requires reading the computation itself.

A language model generating a response leaves a measurable trace at every step. The distribution of probability mass across the vocabulary at each token reveals whether the model was generating with conviction or uncertainty. The position of an incoming request in embedding space reveals whether it resembles inputs the system has handled reliably. The stability of outputs across multiple samples reveals whether the model is operating near the edge of its reliable capability. These are properties of the computation — not the content — and they are available without any domain knowledge.

This is the shift that makes verification viable at production scale: from grading outputs after the fact to reading the probability cloud as it forms. The output is what the system produced. The computation is the evidence that it was produced under sound conditions.

03 — The Geometry Principle

AI models, as they compute, leave a geometric trace. Token probability distributions. Embedding vectors in high-dimensional space. Entropy surfaces across generation steps. These are structural properties of the computation itself, and they are measurable without knowing anything about the subject matter.

Cosine similarity between a user request and a tool description does not require knowing what either one means; it requires only measuring the angle between their embedding vectors. Whether the probability distribution over the vocabulary is sharp or flat at a given generation step is a geometric fact. Whether two outputs agree when generated at non-zero temperature is a consistency measurement that applies equally to a legal contract and a customer support reply.

The model performs semantic understanding when it encodes language into embeddings. The verification framework performs geometry on the resulting representations. This separation makes verification portable, fast, and independent of domain-specific knowledge.
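The cosine measurement described above can be sketched in a few lines. This is a minimal illustration, not a production routine: the three-dimensional vectors are made-up stand-ins for real embeddings, and the variable names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
request_vec = [0.9, 0.1, 0.0]
balance_tool_vec = [0.8, 0.2, 0.1]
catalog_tool_vec = [0.0, 0.2, 0.9]

sim_balance = cosine_similarity(request_vec, balance_tool_vec)
sim_catalog = cosine_similarity(request_vec, catalog_tool_vec)
```

With these toy values the request sits much closer to the first tool than to the second, and nothing in the computation depends on what either text means.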

This principle produces three scores, kept separate by design. Compressing them into one number, as many deployed verification systems do, produces a signal too blunt to act on: it cannot distinguish high-confidence/high-risk (requires a human gate) from low-confidence/low-risk (proceed with a caveat).

Score	Question	What it measures
Confidence	Did the system know what it didn’t know?	The shape of the probability surface during generation — moment by moment as the model selected tokens. A property of generation, not content. Low confidence doesn’t mean wrong; it means the system cannot strongly vouch for it.
Correctness	Does the output follow from its premises?	Internal consistency, contextual grounding against retrieved context, structural validity, instruction adherence. This is deductive correctness only: establishing factual truth requires a ground-truth referent external to the system.
Risk	What happens downstream if this is wrong?	How far the input sits from the validated operating envelope, whether the proposed action touches high-consequence surfaces, the blast radius if the response is flawed. A property of position relative to validated experience.

A response can be high-confidence and high-risk — the system is certain, but the action it proposes is consequential enough to require approval. A response can be low-confidence and low-risk — uncertainty warrants a caveat, but nothing dangerous proceeds. Keeping the scores separate is what gives the downstream system enough information to make that distinction.
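A small sketch makes the information loss concrete. The `Scores` type, the 0-to-1 scale, the `collapsed` formula, and the example values are all assumptions made for illustration; the article does not prescribe a scale or a way of combining scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    confidence: float   # 0.0 to 1.0, higher = generated with more conviction
    correctness: float  # 0.0 to 1.0, higher = better grounded and more consistent
    risk: float         # 0.0 to 1.0, higher = more consequential if wrong


def collapsed(s: Scores) -> float:
    """Hypothetical single 'trust' number: mean of confidence, correctness, inverted risk."""
    return (s.confidence + s.correctness + (1.0 - s.risk)) / 3.0


# Two very different situations (illustrative values).
certain_but_consequential = Scores(confidence=0.9, correctness=0.9, risk=0.9)
unsure_but_harmless = Scores(confidence=0.3, correctness=0.7, risk=0.1)

trust_a = collapsed(certain_but_consequential)
trust_b = collapsed(unsure_but_harmless)
```

Under this assumed formula both profiles collapse to essentially the same value, although the first calls for a human gate and the second only for a caveat.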

The critical anomaly. High Confidence + Low Correctness is the most dangerous failure mode in production AI systems. The model is certain. The output is wrong. Safety classifiers pass it because it violates no policy. Anomaly detectors pass it because the model is behaving normally. Only a system that measures correctness independently of confidence can surface it — which is why collapsing the three scores into one creates an architectural blind spot.

04 — The Signal Pipeline

The three scores are computed from five categories of signal, each derived from a different stage of the request lifecycle, ordered by when they become available and by computational cost. That ordering is the mechanism that makes comprehensive verification viable without adding prohibitive latency to every request.

Category 1 — Input Signals. Computed before a single token is generated. Cheapest signals, most leverage — they set the risk tier that governs everything downstream.

Signal	What it does
Distribution Proximity	Embeds the input and compares it against the eval corpus via cosine similarity. Produces a coverage confidence score: how well does tested behavior cover this class of input? Under 5ms at scale.
Intent Classification	Probability distribution over the intent taxonomy. High entropy is as informative as the winning class — it signals the model will struggle to disambiguate. Maps to risk tier: retrieval vs. data modification vs. financial transaction.
Adversarial Pattern Detection	Runs on both raw input and a canonicalized version (normalized unicode, stripped formatting, decoded encodings). A high score overrides all other signals. False negatives are significantly more costly than false positives.
Sensitive Data Detection	NER for PII categories, regex for structured patterns. Presence is a hard Risk contribution regardless of other scores — any mishandling carries regulatory consequence.
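The raw-plus-canonicalized check for adversarial patterns might look roughly like the following. It is a sketch under stated assumptions: the two regular expressions are a deliberately tiny, hypothetical pattern list, the set of invisible characters is an illustrative subset, and decoding of encoded payloads (mentioned in the table) is left out.

```python
import re
import unicodedata

# Zero-width and directional-control characters sometimes used to hide text
# (illustrative subset, not an exhaustive list).
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u202a-\u202e]")

# Hypothetical patterns; a real detector would be far broader.
_SUSPICIOUS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]


def canonicalize(text: str) -> str:
    """Normalize unicode (NFKC), drop invisible characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = _INVISIBLE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def is_suspicious(raw: str) -> bool:
    """Flag the input if either the raw or the canonical form matches a pattern."""
    candidates = (raw, canonicalize(raw))
    return any(p.search(c) for p in _SUSPICIOUS for c in candidates)
```

Checking both forms matters because obfuscation such as full-width letters or zero-width characters defeats a pattern applied to the raw text alone.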

Category 2 — Output Metadata Signals. Derived from the generation process itself, available immediately post-generation. Purest geometric signals in the pipeline — they cost nothing beyond what the model already produces.

Signal	What it does
Token Log Probabilities	Token-by-token view of the model’s conviction. A rolling window flagged below a calibrated threshold. Low-probability tokens in factual claims are weighted more heavily than in filler phrases.
Token Entropy	Spread of uncertainty at each generation step. Correlated with log probability but not identical: a token can be low-probability in a low-entropy distribution (the model strongly preferred a different token), or the top-ranked choice in a high-entropy distribution (only slightly preferred over many close alternatives).
Self-Consistency Across Samples	Generate 3–5 responses at non-zero temperature; compare factual claims, structure, values, conclusions. High variance is a direct Confidence signal. Most expensive metadata check — applied selectively at Tier 3.
Stop Reason / Truncation	A truncated response is never a correct response. Deterministic completeness check.
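The difference between token log probability and token entropy can be illustrated with two toy distributions. The four-token vocabularies and their probabilities are invented for the example; real APIs typically expose only the top few alternatives per step, so entropy computed from them is an approximation.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def chosen_logprob(probs: Sequence[float], chosen: int) -> float:
    """Natural-log probability of the token that was actually emitted."""
    return math.log(probs[chosen])


# Hypothetical distributions; index 0 is the emitted token in both cases.
sharp_but_off_peak = [0.02, 0.96, 0.01, 0.01]  # model strongly preferred token 1
nearly_flat = [0.28, 0.24, 0.24, 0.24]         # emitted token barely ahead of the rest

off_peak_entropy = entropy_bits(sharp_but_off_peak)
flat_entropy = entropy_bits(nearly_flat)
off_peak_logprob = chosen_logprob(sharp_but_off_peak, 0)
flat_logprob = chosen_logprob(nearly_flat, 0)
```

In the first case the emitted token has a very low log probability even though the distribution is sharp; in the second the emitted token is the top choice of a distribution that is close to uniform. Tracking both signals separates the two situations.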

Category 3 — Output Content Signals. Require interpreting what the model produced. More expensive, but reach dimensions metadata cannot: deductive validity, structural conformance, policy compliance.

Signal	What it does
Factual Grounding (NLI)	Extracts atomic claims; runs entailment checks against retrieved context. Contradicted claims are a hard failure signal; ungrounded claims are candidate hallucinations. NLI is itself geometric — entailment measured in embedding space.
Policy Compliance	Runs output against a machine-readable policy specification. The enforcement mechanism is domain-agnostic; the specification is the domain-specific layer. A natural language policy document cannot enforce itself at runtime.
Structure Validation	Schema validation, value range checks, function signature verification for structured outputs. A malformed function call or invalid JSON is deterministically verifiable — no probabilistic judgment required.
Semantic Coherence	Internal consistency of the output’s reasoning chain, independent of factual content.
Safety Classification	Explicit constraint and safety policy violations. Off-the-shelf classifiers apply here.
Completeness	Whether the output addresses the full scope of the request.
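Structure validation is the most mechanical of the content signals. The sketch below uses only the standard library and a hand-rolled schema format; the `transfer_funds`-style fields, their ranges, and the function name are hypothetical, and a real deployment would more likely rely on an established schema validator.

```python
import json
from typing import Any

# Hypothetical schema: field name -> (expected type, minimum, maximum).
TRANSFER_SCHEMA = {
    "account_id": (str, None, None),
    "amount": (float, 0.01, 10_000.0),
}


def validate_structured_output(raw: str, schema=TRANSFER_SCHEMA) -> list:
    """Return a list of violations; an empty list means the output is well-formed."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    for name, (expected, lo, hi) in schema.items():
        if name not in data:
            errors.append(f"missing field: {name}")
            continue
        value = data[name]
        # Accept ints where a float is expected, but never bools.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            errors.append(f"wrong type for {name}")
            continue
        if lo is not None and value < lo:
            errors.append(f"{name} below minimum {lo}")
        if hi is not None and value > hi:
            errors.append(f"{name} above maximum {hi}")
    errors.extend(f"unexpected field: {k}" for k in data if k not in schema)
    return errors
```

Every branch is deterministic, which is why this check can run at the lowest tier without any probabilistic judgment.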

Category 4 — Execution Trace Signals. For multi-step agents, the execution trace carries signals invisible in the final output alone.

Signal	What it does
Tool Call Validation	Checks each call against a registry of intended use cases, valid argument ranges, and dangerous patterns — before execution. Highest-value signal for agentic systems: it gates action rather than reviewing it after the fact.
Reasoning Coherence	Compares stated reasoning against subsequent actions via embedding-space comparison. An agent that states it will look up an account balance and then calls a product catalog API is geometrically misaligned — detectable without domain knowledge.
Retrieval Relevance	Whether retrieved context is actually relevant to the query that triggered retrieval.
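Tool call validation before execution can be sketched as a lookup against a registry. Everything named here is an assumption for illustration: the `ToolSpec` shape, the tool names, the intent labels, and the argument ranges. Dangerous-pattern checks from the table are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    allowed_intents: frozenset                      # intents this tool is meant to serve
    arg_ranges: dict = field(default_factory=dict)  # argument name -> (min, max)


# Hypothetical registry of tools and their intended use.
REGISTRY = {
    "get_balance": ToolSpec(allowed_intents=frozenset({"retrieval"})),
    "transfer_funds": ToolSpec(
        allowed_intents=frozenset({"financial_transaction"}),
        arg_ranges={"amount": (0.01, 10_000.0)},
    ),
}


def validate_tool_call(tool: str, args: dict, intent: str) -> list:
    """Check a proposed call before it runs; return the reasons to refuse it."""
    spec = REGISTRY.get(tool)
    if spec is None:
        return [f"unregistered tool: {tool}"]
    problems = []
    if intent not in spec.allowed_intents:
        problems.append(f"{tool} is not intended for intent '{intent}'")
    for name, (lo, hi) in spec.arg_ranges.items():
        value = args.get(name)
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            problems.append(f"missing or non-numeric argument: {name}")
        elif not lo <= value <= hi:
            problems.append(f"{name}={value} outside [{lo}, {hi}]")
    return problems
```

An empty list means the call may proceed to the enforcement decision; any entry is a reason to stop before the side effect happens.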

Category 5 — Input-Output Relationship Signals. Neither input nor output signals capture the fidelity of the mapping between them.

Signal	What it does
Instruction Adherence	Whether explicit constraints stated in the input are satisfied in the output.
Relevance Alignment	Semantic similarity between the input’s core intent and the output’s content. A response that is topically related but answers a different question scores low here — catching failures all other categories miss.

Signal-to-Score Mapping

Signal	Conf.	Corr.	Risk
Category 1 — Input
Distribution Proximity	●	—	◐
Intent Classification	○	—	●
Adversarial Detection	—	—	●
Sensitive Data	—	—	●
Category 2 — Output Metadata
Token Log Probabilities	●	—	—
Token Entropy	●	—	—
Self-Consistency (N samples)	●	◐	—
Stop Reason / Truncation	◐	◐	—
Category 3 — Output Content
Factual Grounding (NLI)	◐	●	—
Semantic Coherence	—	●	—
Policy Compliance	—	◐	●
Structure Validation	—	●	—
Safety Classification	—	—	●
Completeness	—	●	—
Category 4 — Execution Trace
Tool Call Validation	○	●	●
Reasoning Coherence	—	●	◐
Retrieval Relevance	◐	●	—
Category 5 — Input-Output
Instruction Adherence	—	●	—
Relevance Alignment	—	●	—

● high weight · ◐ medium · ○ low · — no contribution
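One way to turn the mapping into a number is a weighted mean over whichever signals were computed. The sketch below does this for the Confidence column only. The numeric values assigned to the legend symbols, the snake_case signal names, and the assumption that each signal is already normalized to 0 to 1 (higher is better) are not part of the table.

```python
# Numeric values for the legend symbols are assumptions, not part of the table.
HIGH, MEDIUM, LOW = 1.0, 0.5, 0.25

# Confidence column of the mapping, restricted to signals that contribute.
CONFIDENCE_WEIGHTS = {
    "distribution_proximity": HIGH,
    "intent_classification": LOW,
    "token_log_probabilities": HIGH,
    "token_entropy": HIGH,
    "self_consistency": HIGH,
    "stop_reason": MEDIUM,
    "factual_grounding": MEDIUM,
    "tool_call_validation": LOW,
    "retrieval_relevance": MEDIUM,
}


def weighted_score(signals: dict, weights: dict) -> float:
    """Weighted mean of the signals that were actually computed for this request.

    Signals skipped by tiering are simply absent and do not drag the score down.
    """
    used = {name: w for name, w in weights.items() if name in signals}
    total = sum(used.values())
    if total == 0.0:
        raise ValueError("no contributing signals available")
    return sum(signals[name] * w for name, w in used.items()) / total


# Example: only two metadata signals were computed for this request.
confidence = weighted_score(
    {"token_log_probabilities": 0.8, "stop_reason": 0.2},
    CONFIDENCE_WEIGHTS,
)
```

Normalizing by the weights of the signals present keeps scores comparable across tiers, at the cost of making a score built from few signals look as authoritative as one built from many; a fuller design might report signal coverage alongside the score.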

05 — Cost-Proportional Tiering

Running every check on every request is not viable. A system that adds 500ms to every interaction will be bypassed the first time someone complains about response time.

The architecture handles this through tiered verification: cheap, deterministic input signals set a tier on every request, and progressively more expensive checks apply only at the tier that warrants them. The tier is determined geometrically — distribution proximity and intent classification together place the request on the risk surface, and verification depth scales accordingly.

Tier	Budget	Trigger	Checks applied
1	< 70ms	Close distribution, benign intent, no adversarial signals	Safety classification, structure validation, stop reason
2	< 200ms	Moderately novel input, data modification, sensitive data present	All Tier 1 + factual grounding, policy compliance, instruction adherence, relevance alignment, completeness
3	< 500ms	Far from validated distribution, high-risk intent, adversarial patterns	All Tier 2 + semantic coherence, self-consistency sampling, full execution trace analysis

For the highest-risk requests, a human approval gate replaces the latency budget entirely.
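The tier table can be read as a small routing function over the input signals. This is a sketch only: the `InputSignals` fields, the intent labels, the treatment of financial transactions as the high-risk intent, and every numeric threshold are placeholders that would need calibration against labeled data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputSignals:
    proximity: float          # similarity to the eval corpus, 0.0 to 1.0
    intent: str               # "retrieval", "data_modification", "financial_transaction"
    adversarial: float        # detector score, 0.0 to 1.0
    has_sensitive_data: bool


def assign_tier(s: InputSignals) -> int:
    """Pick the verification tier; all thresholds here are placeholder values."""
    if s.adversarial >= 0.5 or s.proximity < 0.6 or s.intent == "financial_transaction":
        return 3
    if s.proximity < 0.85 or s.intent == "data_modification" or s.has_sensitive_data:
        return 2
    return 1


# Latency budgets in milliseconds, taken from the tier table.
BUDGET_MS = {1: 70, 2: 200, 3: 500}

routine = InputSignals(proximity=0.95, intent="retrieval", adversarial=0.0, has_sensitive_data=False)
routine_budget = BUDGET_MS[assign_tier(routine)]
```

Checking the Tier 3 conditions first means a single strong trigger is enough to escalate, so a request is never under-verified because its other signals look benign.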

If Tier 3 is consuming a disproportionate share of requests, the system is operating outside its validated envelope. The right response is expanding the eval corpus — not relaxing the verification thresholds.

06 — The Enforcement Engine

Three scores computed from nineteen signals matter only if they translate into decisions. The enforcement engine maps the three-score matrix to one of five hard runtime actions on every request: EXECUTE, EXECUTE + CAVEAT, GATE, ESCALATE, or BLOCK.

Condition	Action
High Confidence · High Correctness · Low Risk	EXECUTE
High Confidence · High Correctness · High Risk	GATE
High Confidence · Low Correctness · Any Risk	BLOCK
Low Confidence · Any Correctness · Low Risk	EXECUTE + CAVEAT
Low Confidence · Any Correctness · High Risk	ESCALATE
Any Confidence · Low Correctness · High Risk	BLOCK
Policy Violation Detected · Any Scores	BLOCK
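The table above can be expressed as a short decision function. Two assumptions are made here and should be read as such. First, a single 0.5 cut-off separates "high" from "low" on every score, which is a placeholder. Second, the rows for low confidence with high risk (ESCALATE) and low correctness with high risk (BLOCK) overlap when both apply; this sketch resolves the overlap by letting the stricter action, BLOCK, win.

```python
from enum import Enum


class Action(Enum):
    EXECUTE = "EXECUTE"
    EXECUTE_WITH_CAVEAT = "EXECUTE + CAVEAT"
    GATE = "GATE"
    ESCALATE = "ESCALATE"
    BLOCK = "BLOCK"


def decide(confidence: float, correctness: float, risk: float,
           policy_violation: bool = False, threshold: float = 0.5) -> Action:
    """Map the three scores to an action; overlapping rows resolve to the stricter one."""
    high_conf = confidence >= threshold
    high_corr = correctness >= threshold
    high_risk = risk >= threshold

    if policy_violation:
        return Action.BLOCK
    if not high_corr and (high_conf or high_risk):
        return Action.BLOCK  # confidently wrong, or wrong where the stakes are high
    if not high_conf:
        return Action.ESCALATE if high_risk else Action.EXECUTE_WITH_CAVEAT
    return Action.GATE if high_risk else Action.EXECUTE
```

Keeping the policy check first mirrors the last row of the table: a detected violation blocks regardless of how the scores look.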

The GATE action warrants attention. It holds the request pending human approval or secondary verification. It applies when a request is correct and well-reasoned but consequential enough that unilateral execution is inappropriate. This is bounded autonomy in practice: the system does not refuse to act, it requires authorization proportional to consequence before acting.

The BLOCK path on high-confidence, low-correctness responses is the most important one to instrument for post-hoc analysis. When the system catches itself being confidently wrong, and that pattern recurs, the source is either a systematic model failure or a gap in the eval corpus — both of which require attention at the architecture level, not the prompt level.

Governance becomes operational. A policy document that prohibits certain uses of AI has no mechanism to enforce that prohibition at inference time. A review board that meets quarterly cannot respond to a policy violation that occurs at two in the morning on a Tuesday. The verification layer enforces policy on every output, at inference speed, continuously. This moves governance from a periodic organizational exercise to a property of the system itself. Both are necessary. Only the second scales.

07 — For Practitioners

The full pipeline is not something you build in a sprint. A minimum viable verification layer that covers the most critical failure modes is achievable with six signals and tiered routing, and it gives you a foundation that extends naturally as the system grows.

Begin with intent classification and adversarial detection on the input side. These two signals determine how much downstream verification everything else receives, and whether any downstream verification is bypassed entirely by a manipulation attempt. Both run in under 20ms and can be added to an existing pipeline without architectural change.

Add token log probabilities and stop reason from the output metadata. Log probabilities are available from most model APIs at no additional cost and carry the strongest single Confidence signal in the pipeline. Stop reason is a deterministic completeness check — a truncated response is never a correct response.
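The two metadata checks in the paragraph above can be sketched together. The window size, the log-probability floor, and the function names are placeholder assumptions. The stop-reason strings are examples only: the field name and its values differ between providers, so the set of "complete" reasons would need to match the API in use.

```python
from typing import Sequence


def min_window_mean(logprobs: Sequence[float], window: int = 5) -> float:
    """Lowest mean log probability over any run of `window` consecutive tokens."""
    if not logprobs or window < 1:
        raise ValueError("need at least one token and a positive window")
    window = min(window, len(logprobs))
    running = sum(logprobs[:window])
    lowest = running
    for i in range(window, len(logprobs)):
        running += logprobs[i] - logprobs[i - window]  # slide the window by one token
        lowest = min(lowest, running)
    return lowest / window


def metadata_check(logprobs: Sequence[float], stop_reason: str,
                   floor: float = -2.0,
                   complete_reasons: tuple = ("stop", "end_turn")) -> str:
    """Return "truncated", "low_confidence", or "ok" for one response."""
    if stop_reason not in complete_reasons:
        return "truncated"
    if min_window_mean(logprobs) < floor:
        return "low_confidence"
    return "ok"
```

Using the worst window rather than the overall mean keeps a short stretch of uncertain tokens from being averaged away by a long confident preamble.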

Add safety classification and structure validation on the output side. Safety classifiers are available off the shelf. Structure validation requires only a schema definition and applies wherever the system produces JSON, function calls, or SQL.

These six signals address prompt injection, truncated outputs, unsafe content, schema errors, and the confidence floor below which outputs should carry a caveat rather than execute. From this base: add factual grounding when you introduce RAG, tool call validation when you introduce agents, and self-consistency sampling for the highest-stakes requests in Tier 3.

Each addition should be justified by a specific failure mode it addresses, validated against labeled data, and measured for its effect on score quality. The pipeline improves as it runs — every production failure that gets reviewed and labeled becomes a test case that tightens the next version.

On LLM-as-judge. Using a second model call to evaluate the first model’s output is flexible but slow, expensive, and adds a second probabilistic component to a path that should be as deterministic as possible. This architecture treats LLM-as-judge as a fallback for signals that cannot be reduced to geometric operations. When it becomes the primary verification mechanism, that is a sign that geometric verification has not yet been designed in.

08 — For CXOs

Foundation models are commoditizing. The capability gap between frontier models and capable open-source alternatives is narrowing. In that environment, the organizations that pull ahead will be the ones that can deploy AI reliably at scale — not the ones with the most capable model, but the ones whose systems behave predictably, comply with policy, and earn the confidence of the people and institutions depending on them.

Runtime verification is the infrastructure that makes that possible. Without it, expanding agent autonomy means accumulating exposure with each deployment. With it, autonomy can be extended in proportion to demonstrated reliability — the system earns broader scope by performing well within a narrower one, and the verification layer provides the evidence that performance is real.

The practical implication for AI investment decisions: the model is not the scarce resource. The operational layer is. Procurement, fine-tuning, and prompt engineering are well-understood. The capability to verify outputs at runtime, enforce policy continuously, and gate autonomous action based on measured risk is where the actual differentiation will compound over the next several years.

The underlying principle is correctness-by-governance rather than correctness-by-construction. Trust earned on every transaction, verification built into the architecture, governance enforced at inference speed. Organizations that get this right will extend AI autonomy as reliability is demonstrated, not as a leap of faith.

The three-score verification pipeline described here is one pillar of the broader Agentic Runtime Governance framework. The Runtime Governance Checklist provides a structured starting point for assessing your current architectural posture.