An AI system is not governable because it has a governance policy. It is governable because its architecture makes governance decisions observable and enforceable. This checklist covers twelve architectural decisions that determine whether a production AI system can be governed effectively.
Use Case Classification
1. Is every LLM use case classified by output criticality and reversibility?
Governance depth should be proportional to consequence. A classification decision made before pipeline design prevents both over-verification (latency drag on low-stakes use cases) and under-verification (insufficient controls on high-stakes ones). See the LLM Risk Tier Matrix for the classification framework.
2. Is tier assignment documented and attached to the use case, not the model?
Tier is a property of what happens downstream of the output, not of the model’s capability. A model upgrade does not change the tier. Document the tier in the use case specification and review it whenever the use case scope changes.
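One way to make items 1 and 2 concrete is to store the tier on the use case record and keep the model identifier as a separate, replaceable field. The sketch below is illustrative only: the three-level `Tier` scale, the `assign_tier` rule, and the `UseCaseSpec` fields are assumptions made for the example, not the contents of the LLM Risk Tier Matrix.

```python
from dataclasses import dataclass, replace
from enum import IntEnum


class Tier(IntEnum):
    # Hypothetical three-level scale; real levels come from the tier matrix.
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3


def assign_tier(high_criticality: bool, reversible: bool) -> Tier:
    """Illustrative rule only: critical and irreversible ranks highest."""
    if high_criticality and not reversible:
        return Tier.TIER_3
    if high_criticality or not reversible:
        return Tier.TIER_2
    return Tier.TIER_1


@dataclass(frozen=True)
class UseCaseSpec:
    name: str
    downstream_action: str
    tier: Tier      # property of the use case
    model_id: str   # implementation detail, free to change


refund = UseCaseSpec(
    name="refund-approval",
    downstream_action="issue refund to customer",
    tier=assign_tier(high_criticality=True, reversible=False),
    model_id="model-a",
)

# A model upgrade yields a new spec with the same tier.
upgraded = replace(refund, model_id="model-b")
```

Because the tier is computed from the downstream action and not from `model_id`, swapping the model cannot silently change the governance requirements.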
Verification Architecture
3. Is there a verification step between generation and action for every Tier 2+ use case?
A monitoring dashboard is not a verification step, and neither is logging. Verification is a gate: an assessment of generation conditions that must pass before the output proceeds to an action. The gate must sit in the code path between generation and action.
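A minimal sketch of a gate that sits in the code path might look like the following. The names `gated_action`, `Verdict`, and `VerificationFailed` are hypothetical, and `non_empty` is only a placeholder check standing in for a real assessment of generation conditions.

```python
from dataclasses import dataclass
from typing import Callable


class VerificationFailed(Exception):
    """Raised when an output is blocked at the gate."""


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def gated_action(
    output: str,
    verify: Callable[[str], Verdict],
    act: Callable[[str], str],
) -> str:
    """Run `act` only if `verify` passes; otherwise raise."""
    verdict = verify(output)
    if not verdict.passed:
        # The action is unreachable from here, so a failed check blocks it.
        raise VerificationFailed(verdict.reason)
    return act(output)


def non_empty(output: str) -> Verdict:
    # Placeholder check; a real gate would assess generation conditions.
    ok = bool(output.strip())
    return Verdict(passed=ok, reason="ok" if ok else "empty output")
```

The point of the structure is that the only route to `act` runs through `verify`. A dashboard or log that observes the output after the action has fired does not have this property.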
4. Does verification measure generation conditions, not just output characteristics?
Output inspection catches visible policy violations. It misses the failure mode that matters most: an output that looks right but was produced under conditions that made wrong answers more likely. Verification should include coverage, confidence (generation entropy), and consistency (cross-sample stability).
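As a rough illustration of two of these signals, the sketch below computes mean token entropy from per-token probability distributions and cross-sample consistency as agreement with the most common answer. It assumes the serving stack exposes per-token probabilities and allows several samples per prompt. Coverage is treated as a score supplied by the caller because its definition is system-specific, and all thresholds are placeholder values, not recommendations.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass


def mean_token_entropy(token_distributions: list[list[float]]) -> float:
    """Average Shannon entropy (nats) over per-token probability distributions."""
    if not token_distributions:
        return 0.0
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def consistency(samples: list[str]) -> float:
    """Fraction of samples that match the most common (normalized) answer."""
    if not samples:
        return 0.0
    normalized = [s.strip().lower() for s in samples]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


@dataclass(frozen=True)
class GenerationConditions:
    coverage: float      # supplied by the caller; definition is system-specific
    entropy: float
    consistency: float


def conditions_pass(
    c: GenerationConditions,
    *,
    min_coverage: float = 0.8,     # placeholder threshold
    max_entropy: float = 1.0,      # placeholder threshold
    min_consistency: float = 0.7,  # placeholder threshold
) -> bool:
    return (
        c.coverage >= min_coverage
        and c.entropy <= max_entropy
        and c.consistency >= min_consistency
    )
```

Exact string matching is a crude consistency measure; free-form outputs would likely need a semantic comparison instead.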
5. Is there an escalation path when verification fails?
Failed verification should not be a silent log entry. It should trigger a defined response: human review for high-stakes cases, fallback to a safer response for lower-stakes cases, or circuit-break for cases where no safe fallback exists. Every escalation path should be tested before it is needed.
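The three responses described above can be encoded as an explicit decision function so that no failure falls through to a silent log entry. This is a sketch; the `Escalation` enum and the `choose_escalation` signature are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Optional


class Escalation(Enum):
    HUMAN_REVIEW = "human_review"
    FALLBACK = "fallback"
    CIRCUIT_BREAK = "circuit_break"


def choose_escalation(high_stakes: bool, safe_fallback: Optional[str]) -> Escalation:
    """Map a failed verification to exactly one defined response."""
    if high_stakes:
        return Escalation.HUMAN_REVIEW
    if safe_fallback is not None:
        return Escalation.FALLBACK
    # Lower stakes, but nothing safe to say or do: stop the pipeline.
    return Escalation.CIRCUIT_BREAK
```

Because the function is pure and has only a handful of input combinations, every path can be exercised in a unit test well before an incident requires it.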
Observability
6. Does every output that triggers an action produce a log entry with generation conditions?
When an incident occurs, the reconstruction question is “which layer failed to catch this?” That question requires logs capturing layer state at the time of the incident: coverage score, entropy measurement, consistency result, policy assessment. Output logs alone cannot answer this.
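A structured log record along these lines is one way to capture layer state at the moment of the decision. The field names below are assumptions for illustration, not a prescribed schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationLogEntry:
    output_id: str
    timestamp: float
    action: str
    coverage_score: float
    entropy: float
    consistency: float
    policy_passed: bool
    gate_passed: bool


def new_entry(action: str, coverage_score: float, entropy: float,
              consistency: float, policy_passed: bool,
              gate_passed: bool) -> GenerationLogEntry:
    """Stamp the record with a fresh id and the current time."""
    return GenerationLogEntry(
        output_id=str(uuid.uuid4()),
        timestamp=time.time(),
        action=action,
        coverage_score=coverage_score,
        entropy=entropy,
        consistency=consistency,
        policy_passed=policy_passed,
        gate_passed=gate_passed,
    )


def to_log_line(entry: GenerationLogEntry) -> str:
    """Serialize as one JSON object per line for downstream querying."""
    return json.dumps(asdict(entry), sort_keys=True)
```

With one record per triggered action, "which layer failed to catch this?" becomes a query over fields instead of a reconstruction from output text.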
7. Can you reconstruct the full context available to the model at the time of any production output?
Prompt reconstruction is required for incident analysis. The exact prompt, including all retrieved context, conversation history, and injected system instructions, must be recoverable. If this context cannot be reconstructed after the fact, incident analysis will be incomplete.
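One possible approach is to persist a snapshot of the prompt components at generation time, along with a content hash that can be written to the output's log entry. The sketch below assumes the three components named above can be captured as strings; `PromptSnapshot` and `restore` are hypothetical names.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptSnapshot:
    system_instructions: str
    retrieved_context: tuple[str, ...]
    conversation_history: tuple[str, ...]

    def serialize(self) -> str:
        # sort_keys gives a stable byte representation for hashing.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)

    def digest(self) -> str:
        """SHA-256 of the serialized snapshot, suitable for a log field."""
        return hashlib.sha256(self.serialize().encode("utf-8")).hexdigest()


def restore(serialized: str) -> PromptSnapshot:
    """Rebuild the snapshot from its stored JSON form."""
    data = json.loads(serialized)
    return PromptSnapshot(
        system_instructions=data["system_instructions"],
        retrieved_context=tuple(data["retrieved_context"]),
        conversation_history=tuple(data["conversation_history"]),
    )
```

Storing the digest next to the output makes it possible to confirm later that a restored snapshot is byte-for-byte the one the model saw.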
Action Boundaries
8. Is there an explicit inventory of every action the AI system can trigger?
The action inventory is the surface area of the system’s real-world effect. Every entry should have a documented tier, verification requirement, and escalation path. New actions should require review before being added.
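In code, an inventory can be a registry that the dispatch path consults, so that an unregistered action cannot run at all. The following is a sketch under assumed names (`ActionEntry`, `register`, `trigger`); the `reviewed_by` field is a hypothetical stand-in for whatever review process applies.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ActionEntry:
    name: str
    tier: int
    verification: str   # documented verification requirement
    escalation: str     # documented escalation path
    reviewed_by: str    # hypothetical review sign-off
    handler: Callable[[str], str]


ACTION_INVENTORY: dict[str, ActionEntry] = {}


def register(entry: ActionEntry) -> None:
    """Add an action to the inventory; reject unreviewed or duplicate entries."""
    if not entry.reviewed_by:
        raise ValueError(f"action {entry.name!r} has not been reviewed")
    if entry.name in ACTION_INVENTORY:
        raise ValueError(f"action {entry.name!r} is already registered")
    ACTION_INVENTORY[entry.name] = entry


def trigger(name: str, payload: str) -> str:
    """Dispatch only actions that appear in the inventory."""
    entry = ACTION_INVENTORY.get(name)
    if entry is None:
        raise KeyError(f"action {name!r} is not in the inventory")
    return entry.handler(payload)
```

The inventory then doubles as documentation: listing `ACTION_INVENTORY` gives the system's full action surface with tier, verification requirement, and escalation path for each entry.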
9. Are irreversible actions separated from reversible ones by an explicit architectural boundary?
Irreversible actions should require stricter verification than reversible ones. If the architecture does not make this distinction structurally — if both are reachable through the same code paths with the same verification requirements — the distinction exists only in intent.
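One way to make the boundary structural is to give irreversible actions their own executor, which demands proof that the stricter verification ran. In this sketch, `StrictApproval`, `strict_verify`, and both executor classes are hypothetical. Requiring a named approver is an assumption about what "stricter" means, not a requirement stated above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StrictApproval:
    """Evidence that strict verification passed for one output."""
    output_id: str
    approver: str


def strict_verify(output_id: str, checks_passed: bool, approver: str) -> StrictApproval:
    # Assumed stricter bar: automated checks plus a named approver.
    if not checks_passed or not approver:
        raise PermissionError("strict verification failed")
    return StrictApproval(output_id=output_id, approver=approver)


class ReversibleExecutor:
    def run(self, action: Callable[[], str], checks_passed: bool) -> str:
        if not checks_passed:
            raise PermissionError("standard verification failed")
        return action()


class IrreversibleExecutor:
    def run(self, action: Callable[[], str], approval: StrictApproval) -> str:
        # A bare boolean is not accepted; only a StrictApproval object is.
        if not isinstance(approval, StrictApproval):
            raise PermissionError("irreversible actions require a StrictApproval")
        return action()
```

With separate entry points, code that reaches an irreversible action through the reversible path fails at the boundary instead of relying on reviewer intent.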
Governance Process
10. Who owns verification architecture?
If verification ownership defaults to the team that builds prompts and selects models, it has no independent owner. Verification is reliability infrastructure. It belongs to the team that owns incident response, with its own budget and charter.
11. Is the eval set updated continuously with production inputs that expose gaps?
The gap between the eval set and the production distribution is where failures occur. The eval set should grow continuously based on coverage signals from production. A static eval set has a half-life: it becomes less representative as production inputs drift away from it.
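A simple version of this feedback loop promotes low-coverage production inputs into the eval set. The sketch below assumes each production sample carries a coverage score in [0, 1]; the `ProductionSample` type, the `grow_eval_set` function, and the 0.5 threshold are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionSample:
    input_text: str
    coverage: float  # hypothetical score in [0, 1]; lower means less covered


def grow_eval_set(
    eval_inputs: set[str],
    samples: list[ProductionSample],
    min_coverage: float = 0.5,  # placeholder threshold
) -> list[str]:
    """Add poorly covered, not-yet-seen inputs to the eval set; return what was added."""
    added: list[str] = []
    for sample in samples:
        if sample.coverage < min_coverage and sample.input_text not in eval_inputs:
            eval_inputs.add(sample.input_text)
            added.append(sample.input_text)
    return added
```

Inputs added this way still need expected outputs or grading criteria before they are useful as eval cases, so the returned list is best treated as a labeling queue.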
12. Is there a defined process for responding to use case scope changes?
A use case that starts as informational can become operational as the product evolves. The tier should be reviewed whenever the downstream action scope changes. Tier changes should trigger a verification architecture review.