Air Canada lost a tribunal case in 2024 because its chatbot told a customer something that was not true.
The airline’s defense was that the chatbot was “a separate legal entity that is responsible for its own actions.” The British Columbia Civil Resolution Tribunal called this “a remarkable submission” and ordered Air Canada to pay the customer $812.02 in damages, interest, and fees.
Whatever process Air Canada had in place, it did not catch the error before the customer saw it.
Bad AI output is still your output.
This is the part nobody wants to say out loud. “Human in the loop” is not a review policy. It is a phrase ops teams use to make leadership feel comfortable when they have not actually designed how AI gets reviewed.
Why the phrase fails as a policy
A policy tells someone what to do on Monday morning. “Human in the loop” tells you nothing.
It does not say who reviews. It does not say when. It does not say what they are looking for. It does not say what triggers an escalation. It does not say what stays on the human’s desk versus what gets sent automatically.
The result: every team has a slightly different version of the loop, and most of those versions are theater.
I have seen this pattern in the AI workflows I have built and tested. The workflow gets set up. The team is told to “review the output.” Nobody defines what review means. Once the workflow starts seeing real volume, people handle it differently. Some skim and click approve. Others review hard at first and then speed up. An obvious error eventually ships. The team blames the AI. The AI was doing exactly what it was told.
What the research shows about review
Human factors researchers have a name for this. It is called automation bias, and the numbers are not subtle.
In a 1999 study published in the International Journal of Human-Computer Studies, researchers Skitka, Mosier, and Burdick measured what happened when humans reviewed automated decisions. People given an automated aid missed 41% of critical events when the automation failed to flag them. The control group, working without automation, missed only 3%.
Adding a “human in the loop” did not make people better at catching errors. It made them measurably worse in that study. The presence of the aid created a complacency that working without it did not.

A follow-up study from the same research team in 2000 found that accountability lowered rates of automation bias. That matters for ops teams. Review works better when someone expects to justify the decision, not just click approve.
The finding holds up outside the lab. A 2012 systematic review in the Journal of the American Medical Informatics Association found that clinical decision support systems often improve overall performance but can introduce new errors when users over-rely on them.
This is why “human in the loop” without further definition is not just vague. It implies a level of safety the data does not support.
Three questions that turn vague oversight into a real review model
The fix is not more reviewers. It is review designed to match the workflow.
For every AI-touching workflow your team runs, ask three questions.
1. Can you take it back?
A draft email saved to a folder is reversible. A sent email is not. A CRM field updated overnight in a batch is reversible if you catch it the next morning. A refund issued to a customer is not. A workflow that produces something you can undo deserves a different review model than one that produces something you cannot.
2. Does someone outside your company see it?
An internal Slack summary affects nobody but your team. A response sent to a customer affects your brand, your support metrics, and potentially your legal exposure.
Same model. Different blast radius.
3. What does it touch?
Money. Identity. Regulated data. Contractual terms. Personnel decisions. These sit in different territory than tagging a ticket or summarizing a meeting. The review burden should match the worst plausible damage.
These three questions are not the only ones that matter. They are the three ops teams skip most often. Answering them takes ten minutes per workflow. The result is a review model you can actually staff and audit.
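If you want to make the triage concrete, here is a minimal sketch in Python. The tier names and the mapping are my own illustration under the assumptions above, not a standard; the point is that three yes/no answers are enough to assign a review model.

```python
from dataclasses import dataclass

@dataclass
class WorkflowProfile:
    reversible: bool   # Can you take it back after it runs?
    external: bool     # Does someone outside the company see it?
    sensitive: bool    # Money, identity, regulated data, contracts, personnel?

def review_tier(profile: WorkflowProfile) -> str:
    """Map the three answers to an illustrative review tier."""
    if profile.sensitive:
        return "approval-gated, cross-functional second reviewer"
    if profile.external and not profile.reversible:
        return "approval-gated, single named reviewer"
    if profile.external or not profile.reversible:
        return "pre-send review against a written checklist"
    return "sampling-only: spot-check a fixed share each week"

# An internal meeting summary versus a customer refund
print(review_tier(WorkflowProfile(reversible=True, external=False, sensitive=False)))
print(review_tier(WorkflowProfile(reversible=False, external=True, sensitive=True)))
```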
What review looks like at four workflow tiers
Once you have answered the questions, you can map the workflow to a review model. Here is the matrix I would hand to a team on day one.

One note on two-person approval. Same-team double-review tends to collapse into one reviewer trusting the other. Cross-functional approval catches more because the second reviewer is checking for something different. Ops plus Finance catches exposure an Ops reviewer will miss. Ops plus Legal catches language an Ops reviewer will miss. That is the deal desk lesson most teams skip when they copy approval chains from elsewhere.
This is not theoretical. AWS published a healthcare example on April 8, 2026 that uses the same logic inside one agent: looking up a patient’s name can run without approval, retrieving vitals or medical conditions requires human authorization, and patient discharge requires external supervisor approval.
Same agent. Different gates.
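To show what “same agent, different gates” can look like in practice, here is a minimal sketch of per-action gating. It is not the AWS implementation or API; the action names and gate levels are assumptions that mirror the example described above.

```python
# Illustrative only: the gating pattern, not the AWS API.
APPROVAL_GATES = {
    "lookup_patient_name": "none",               # can run without approval
    "retrieve_vitals": "human_approval",         # requires human authorization
    "discharge_patient": "supervisor_approval",  # requires external supervisor sign-off
}

def execute(action: str, approvals: set) -> str:
    """Run an action only if its gate has been satisfied."""
    gate = APPROVAL_GATES.get(action, "human_approval")  # unknown actions default to review
    if gate != "none" and gate not in approvals:
        return f"BLOCKED: {action} is waiting on {gate}"
    return f"RAN: {action}"

print(execute("lookup_patient_name", approvals=set()))
print(execute("discharge_patient", approvals={"human_approval"}))
```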
The mistake most teams make is applying the same review model to every workflow. That mistake is expensive in two directions. It bottlenecks low-risk work that should be sampling-only. It under-reviews high-risk work that should be approval-gated.
The trap of reviewing everything
There is a reason “review everything” is not the safe default. It feels safe. It is not.
When humans review high volumes of mostly-correct AI output, they get bored. Bored reviewers become complacent reviewers. Complacent reviewers approve faster than they read. The 41% omission rate from the 1999 study did not happen because the reviewers were unqualified. It happened because they trusted the automation to flag the things that mattered, and the automation did not.
Designing review to match the workflow is not about doing less review. It is about doing review the human can actually perform with attention. A reviewer who checks five things on three high-stakes outputs per day will catch more than a reviewer who skims fifty low-stakes ones.
Review attention is a finite resource. Spend it where the worst plausible error is biggest. Sample everywhere else.
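If “sample everywhere else” sounds hand-wavy, here is a minimal sketch of what it can mean day to day: pull a small random slice of low-stakes outputs for full review. The 5% rate is an assumption, not a number from the research above; pick one you can actually staff.

```python
import random

def daily_sample(output_ids, rate=0.05, seed=None):
    """Pull a small random slice of low-stakes outputs for full human review."""
    rng = random.Random(seed)
    k = max(1, round(len(output_ids) * rate))  # always review at least one
    return rng.sample(output_ids, k)

# 200 low-stakes outputs, 5% get a full review today
print(daily_sample([f"ticket-{i}" for i in range(200)], seed=7))
```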
What teams get wrong
One review policy for every workflow.
A meeting summary and a refund do not deserve the same checklist. Uniform rules bottleneck low-risk work and under-review high-risk work at the same time.
No named owner.
If “the team” reviews something, nobody reviews it. Name a role. Put it in the workflow doc. Make it a Monday-morning-clear responsibility.
Treating a glance as a control.
If the reviewer does not know what they are checking, what counts as a stop, or when to escalate, nothing is actually being controlled. Write the checks down.
Starting AI rollout on the highest-stakes workflow.
Teams want to show impact fast, so they point AI at the workflow with the highest stakes. That is backwards. Start where errors are cheap to fix. Earn the right to move up.
Where to start this week
Pick one workflow.
High volume, low stakes. The kind where errors are caught and fixed before they ship.
Answer the three questions.
Be honest about reversibility, visibility, and sensitivity.
Name the reviewer.
Not “the team.” A role, a person, a Monday-morning-clear owner.
Write the checklist.
Four to six items. Specific enough to actually use. “Does the amount match the approved range?” beats “Review the output.”
Define the stop rule.
What sends this to a human, manager, or specialist immediately, regardless of AI confidence? One way to write the checklist and stop rule down is sketched after these steps.
Run it for one week.
Then look at what the reviewer actually caught versus what they let through. Adjust the checklist.
Log overrides.
If humans keep fixing the same issue, the workflow is telling you where the prompt, template, source logic, or guardrail is weak.
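Here is a minimal sketch of what the written-down version can look like if you keep the review doc and the override log next to the workflow. The workflow, role, checks, and file name are illustrative assumptions, not a template you have to copy.

```python
import csv
from datetime import date

# One workflow's review doc, written down (workflow, role, and checks are illustrative).
REVIEW_CONFIG = {
    "workflow": "vendor invoice summary",
    "reviewer_role": "AP specialist",   # a role, not "the team"
    "checklist": [
        "Does the amount match the approved range?",
        "Is the vendor name spelled exactly as in the contract?",
        "Are the payment terms the ones on file?",
        "Is anything marked as estimated rather than extracted?",
    ],
    "stop_rules": [                     # escalate immediately, regardless of AI confidence
        "Amount outside the approved range",
        "Vendor not found in the contract system",
    ],
}

def log_override(path, reviewer, issue, fix):
    """Append one row per human correction so repeated issues become visible."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), reviewer, issue, fix])

log_override("overrides.csv", "AP specialist", "amount out of range", "sent back for re-extraction")
```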
You will know it is working when you can hand the document to a new team member and they can do the review without asking what “review” means. That is the bar.
A vague “human in the loop” survives leadership scrutiny because it sounds responsible. A defined review model survives a tribunal.
Sources
Research: Skitka, Mosier & Burdick (1999), Does automation bias decision-making? International Journal of Human-Computer Studies · Skitka, Mosier & Burdick (2000), Accountability and automation bias, International Journal of Human-Computer Studies · Goddard, Roudsari & Wyatt (2012), Automation bias: a systematic review of frequency, effect mediators, and mitigators, Journal of the American Medical Informatics Association
Legal: Moffatt v. Air Canada, 2024 BCCRT 149
Operational guidance: AWS Machine Learning Blog, Human-in-the-loop constructs for agentic workflows in healthcare and life sciences (April 8, 2026)
Until next week,

@OpsJzn
AI should mean fewer steps, not more tools.
