Ainary

Building in Public · Research

State of AI Agent Trust 2026

Why we published our trust research open source — and what the data says about the gap between deployment and governance.

We ran this study because we kept hearing the same phrase from operators: “We trust our model.” That sentence — unexamined — is the root cause of most AI deployment failures.

6% have a documented trust framework
25% use informal review only
>40% have no process whatsoever

The Question · Trust is not a binary

A model can be trusted to draft but not to approve. Trusted with customer data from Germany but not from California. Trusted on Monday morning with a human in the loop; not trusted at 2 a.m. on an automated pipeline. Most teams apply trust all-or-nothing. We built a four-level trust taxonomy to make these distinctions legible, per task.

The four-level trust taxonomy

Evidence-based · Verifiable against a source. Low trust required.
Interpretive · Pattern recognition. Moderate oversight.
Judgment-dependent · Values or policy tradeoffs. Human in the loop.
Autonomous · Acts on incomplete information. Gate required.

An Evidence-based task is verifiable against a known source. An Autonomous task requires the system to act independently on incomplete information. Once a team classifies its tasks, the oversight requirements become obvious: an evidence task needs a source link; an autonomous task needs a gate. The taxonomy does not tell you what to decide — it tells you that a decision is required.
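The mapping from trust level to oversight requirement can be sketched in a few lines of Python. This is an illustrative sketch, not the rubric from the repository: the names `TrustLevel`, `Task`, `REQUIRED_OVERSIGHT`, and `missing_oversight` are assumptions made for this example, the oversight labels paraphrase the four levels above, and the task classifications in the demo are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TrustLevel(Enum):
    EVIDENCE_BASED = "evidence-based"
    INTERPRETIVE = "interpretive"
    JUDGMENT_DEPENDENT = "judgment-dependent"
    AUTONOMOUS = "autonomous"


# Oversight per level, paraphrased from the taxonomy above (labels are assumptions).
REQUIRED_OVERSIGHT = {
    TrustLevel.EVIDENCE_BASED: "source link",
    TrustLevel.INTERPRETIVE: "moderate oversight",
    TrustLevel.JUDGMENT_DEPENDENT: "human in the loop",
    TrustLevel.AUTONOMOUS: "gate",
}


@dataclass(frozen=True)
class Task:
    name: str
    level: TrustLevel


def missing_oversight(
    tasks: list[Task], controls_in_place: dict[str, set[str]]
) -> list[tuple[str, str]]:
    """Return (task name, required oversight) for every task that lacks it."""
    gaps = []
    for task in tasks:
        needed = REQUIRED_OVERSIGHT[task.level]
        if needed not in controls_in_place.get(task.name, set()):
            gaps.append((task.name, needed))
    return gaps


if __name__ == "__main__":
    # Hypothetical classifications, for illustration only.
    tasks = [
        Task("draft customer reply", TrustLevel.INTERPRETIVE),
        Task("approve refund", TrustLevel.JUDGMENT_DEPENDENT),
        Task("nightly pipeline action", TrustLevel.AUTONOMOUS),
    ]
    controls = {"draft customer reply": {"moderate oversight"}}
    for name, needed in missing_oversight(tasks, controls):
        print(f"{name}: needs {needed}")
```

The sketch only flags that an oversight decision is missing for a task. It does not decide whether the task should run, which matches the role the taxonomy plays above.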

The Data · The 40% problem

The most striking finding was not the 6% figure. It was the 40%. More than 40% of teams deploying AI agents in production workflows have no review process of any kind, neither informal nor documented. They ship and watch.

That is not a governance failure. It is an awareness failure: these teams do not know what to review.

The study is survey-based, with n=214 teams across Europe and North America. Voluntary participation introduces self-selection bias toward more aware teams, which means the real numbers are likely worse. Healthcare and finance are underrepresented. Model confidence in these findings: 73%.

The Decision · Why open source

The benchmark is available on GitHub. We made that decision on the first day. A trust framework that lives inside a consulting deck is not a trust framework — it is a sales tool. For the taxonomy to be useful it has to be auditable, forkable, and improvable by people who disagree with our initial categories.

The repository contains the benchmark rubric, a scoring spreadsheet, and three worked case studies. Fork it, adapt it, challenge our categories. If you find a task that does not fit this taxonomy, that is a research contribution — open an issue.

Every gap in this study is named, not hidden — confidence levels, sample bias, and underrepresented sectors are all on the record. That is the standard we hold our own research to, because it is the standard we ask your AI systems to meet.

Want this taxonomy applied to your actual workflows?

A structured pilot labels your tasks, names your gaps, and gives you a governance baseline in weeks, not quarters.

Research published open source by Ainary. Follow the work on Finite Matters, Florian Ziesche's newsletter.

About me: I'm Florian, former startup CEO. I raised millions for my cloud computer vision startup in Munich. Now I build AI systems that let me work like a ten-person team. More on LinkedIn or Substack.
