We ran this study because we kept hearing the same phrase from operators: “We trust our model.” That sentence — unexamined — is the root cause of most AI deployment failures.
The Question: Trust is not a binary
A model can be trusted to draft but not to approve. Trusted with customer data from Germany but not from California. Trusted on Monday morning with a human in the loop; not trusted at 2 a.m. on an automated pipeline. Most teams treat trust as all-or-nothing. We built a four-level trust taxonomy to make these distinctions legible, task by task.
The four-level trust taxonomy
An Evidence-based task is verifiable against a known source. An Autonomous task requires the system to act independently on incomplete information. Once a team classifies its tasks, the oversight requirements become obvious: an evidence task needs a source link; an autonomous task needs a gate. The taxonomy does not tell you what to decide — it tells you that a decision is required.
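One way to picture the classify-then-oversee step is a small lookup from trust level to required control. The Python sketch below is illustrative only and is not taken from the benchmark repository. It covers just the two levels described above (the remaining levels are not modeled), and the names `TrustLevel`, `Task`, `REQUIRED_OVERSIGHT`, `missing_oversight`, `audit`, and the example task names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TrustLevel(Enum):
    # Only the two levels named in the text; the others are deliberately omitted.
    EVIDENCE_BASED = "evidence-based"
    AUTONOMOUS = "autonomous"


# Assumed mapping: the control the text associates with each level.
REQUIRED_OVERSIGHT = {
    TrustLevel.EVIDENCE_BASED: "source_link",
    TrustLevel.AUTONOMOUS: "gate",
}


@dataclass
class Task:
    name: str
    level: TrustLevel
    # Controls already in place for this task (empty by default).
    oversight: frozenset = frozenset()


def missing_oversight(task):
    """Return the required control if the task lacks it, else None."""
    required = REQUIRED_OVERSIGHT[task.level]
    return None if required in task.oversight else required


def audit(tasks):
    """Map each task name to the control it is still missing."""
    gaps = {}
    for task in tasks:
        gap = missing_oversight(task)
        if gap is not None:
            gaps[task.name] = gap
    return gaps


if __name__ == "__main__":
    tasks = [
        Task("summarize_contract", TrustLevel.EVIDENCE_BASED,
             frozenset({"source_link"})),
        Task("issue_refund", TrustLevel.AUTONOMOUS),
    ]
    print(audit(tasks))  # {'issue_refund': 'gate'}
```

Like the taxonomy itself, the sketch decides nothing. It only flags where a decision about oversight is still open.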
The Data: The 40% problem
The most striking finding was not the 6% figure. It was the 40%. Two in five teams deploying AI agents in production workflows have no review process at all, whether informal or documented. They ship and watch.
That is not a governance failure. It is an awareness failure: these teams do not know what to review.
The study is survey-based, with n=214 teams across Europe and North America. Participation was voluntary, which introduces self-selection bias toward more aware teams, so the real numbers are likely worse. Healthcare and finance are underrepresented. Model confidence in these findings: 73%.
The Decision: Why open source
The benchmark is available on GitHub. We made that decision on the first day. A trust framework that lives inside a consulting deck is not a trust framework — it is a sales tool. For the taxonomy to be useful it has to be auditable, forkable, and improvable by people who disagree with our initial categories.
The repository contains the benchmark rubric, a scoring spreadsheet, and three worked case studies. Fork it, adapt it, challenge our categories. If you find a task that does not fit this taxonomy, that is a research contribution — open an issue.
Every gap in this study is named, not hidden — confidence levels, sample bias, and underrepresented sectors are all on the record. That is the standard we hold our own research to, because it is the standard we ask your AI systems to meet.