Tasks requiring a college education get a 12x speedup from AI. Tasks requiring a high school education get 9x (Anthropic Economic Index, January 2026).

That ratio should change how every ops team prioritizes its AI roadmap. The conventional playbook says start simple: automate field updates, route tickets by keyword, trigger SLA timers. Then work your way up to the complex stuff. Anthropic's data, drawn from millions of real Claude conversations across five reports, says the opposite. The biggest productivity gains come from the hardest judgment calls, not the easiest rule-based tasks.

Here is what that means for the ticket queue you manage, the CRM you administer, and the deals you route.

Why the hardest work benefits most

The Anthropic Economic Index is the most granular public dataset on actual AI usage ever released. Starting in February 2025, Anthropic has published five reports analyzing millions of real Claude conversations, mapped against the US Department of Labor's O*NET database of nearly 20,000 work tasks. This is not survey data. This is not a vendor projection. This is what people actually do with AI, measured at scale.

The January 2026 report introduced task-level success rates and complexity metrics for the first time. Here is the part that should change your priorities: complex tasks generate larger productivity gains than simple ones, even though AI succeeds slightly less often on them (66% success rate on college-level tasks versus 70% on the simplest tasks requiring less than a high school education, as of January 2026).

A field experiment by Maršál and Perkowski at the National Bank of Slovakia reinforces this from a different angle. In a test of 101 staff members working with GPT-4o, AI delivered a +58% performance improvement on non-routine tasks compared to +24% on routine tasks (CEPR VoxEU, July 2025). Lower-skill workers got the biggest quality improvements. Higher-skill workers got the biggest time savings.

Two different research teams, two different methodologies, the same conclusion: the biggest gains do not come from automating the work your current tools already handle. They come from the judgment calls that previously required your most experienced team member.

Where AI makes your team actively worse

The gains are real. The failures are equally real, and more dangerous because they look like successes.

The BCG/Harvard Business School study tested 758 consultants on realistic tasks using GPT-4. The study was originally released in 2023 and formally published in Organization Science in March 2026 after peer review. On tasks within AI's competence boundary, consultants using AI completed 12.2% more tasks, 25.1% faster, at more than 40% higher quality. On a single task outside that boundary, consultants using AI were 19 percentage points less likely to produce correct solutions than those working without AI (Dell'Acqua et al., Organization Science, 2026).

The AI did not give obviously wrong answers. It gave confidently wrong answers that humans failed to catch. That is the failure mode every ops team needs to internalize: AI does not flag its own mistakes. It presents them with the same polish as its best work.

Anthropic's own reliability data confirms the pattern. Claude succeeds on 67% of Claude.ai conversations and on 49% of API calls (Anthropic Economic Index, January 2026). The gap exists partly because API calls tend to be single-turn automated requests with no human correction loop. The success rate represents Claude's self-assessment, validated against user feedback, and Anthropic acknowledges it likely overstates true capability because users avoid tasks they expect to fail.

For ops teams, the platform gap matters. If you are embedding AI into an automated pipeline with no human in the loop, you are operating at the 49% success rate, not the 67% rate. Multi-turn conversations where a person iterates with AI outperform fire-and-forget automation. Design accordingly.
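One way to encode that principle is an iterate-until-approved loop instead of a single fire-and-forget call. This is a minimal sketch only; `draft_fn` and `review_fn` are hypothetical hooks standing in for your model call and your human reviewer, not a real library.

```python
# Minimal human-correction loop: keep iterating with the model until a
# reviewer accepts, instead of trusting the first single-turn answer.
# draft_fn and review_fn are hypothetical hooks, not a real library API.

def iterate_with_review(task: str, draft_fn, review_fn, max_rounds: int = 3):
    """draft_fn(task, feedback) -> draft; review_fn(draft) -> (ok, feedback)."""
    feedback = None
    for _ in range(max_rounds):
        draft = draft_fn(task, feedback)
        ok, feedback = review_fn(draft)
        if ok:
            return draft
    return None  # out of rounds: escalate rather than ship unreviewed output
```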

The broader market data is even more sobering. Among Salesforce customers surveyed by IBM, only 33% of AI initiatives are meeting ROI targets, and 53% cite poor data availability and quality as the top barrier to agentic AI adoption (IBM State of Salesforce Report, 2025-2026). Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, and inadequate risk controls (Gartner, June 2025).

That last number is the one that matters most. AI performance is capped by your data quality, not by model capability. A 12x speedup means nothing if the CRM data feeding the model has duplicate records, inconsistent field values, and six different spellings of the same company name.

How to split your workflows: rules versus judgment

The research points to a specific architecture for ops teams. Rule-based automation as the structural backbone. AI judgment layered on top for tasks that require pattern recognition across unstructured data.

I reviewed 10-15 deal submissions daily on VMware's Deal Desk. The approval routing was pure rules. Discount over 40%, send to Sales Director. That part never needed AI. The part that ate my time was reading deal notes and figuring out whether a discount request was a competitive situation, a renewal risk, or a rep gaming the system. That is judgment work. That is where a 58% productivity improvement actually changes your week.

This matters more than it sounds. The tools are different. The failure modes are different. And the cost of getting it wrong goes in opposite directions.

Rule-based automation (Zapier, Make, n8n, native CRM workflows) handles deterministic processes: if deal size exceeds $100K, route to VP approval. If ticket contains keyword "billing," assign to billing queue. If SLA timer hits 4 hours, escalate. These are your if/then conditions. They execute at 99%+ reliability because the logic is explicit. Traditional automation already handles this work. Pointing AI at it is a downgrade.
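As a concrete sketch, here is what that rules layer looks like in plain Python. The thresholds, queue names, and record shapes are illustrative, not a prescribed schema.

```python
# Deterministic routing: explicit if/then logic, no model involved.
# Thresholds, queue names, and record shapes are illustrative.

from dataclasses import dataclass

@dataclass
class Deal:
    amount: float
    discount_pct: float

@dataclass
class Ticket:
    body: str
    hours_open: float

def route_deal(deal: Deal) -> str:
    if deal.amount > 100_000:
        return "vp_approval"
    if deal.discount_pct > 40:
        return "sales_director"
    return "auto_approve"

def route_ticket(ticket: Ticket) -> str:
    if "billing" in ticket.body.lower():
        return "billing_queue"
    if ticket.hours_open >= 4:
        return "escalation_queue"
    return "general_queue"
```

Every branch is auditable, and a wrong answer is a logic bug you can find, not a model behavior you have to probe.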

AI judgment handles the work that rules cannot express: reading a deal loss reason note and classifying whether the real issue was pricing, timing, or competitor feature gaps. Scanning 50 support tickets and identifying the three that signal churn risk based on tone, not keywords. Detecting that "IBM," "International Business Machines," and "IBM Corp" are the same account. These tasks require contextual reasoning across unstructured data, and they represent the +58% productivity improvement zone from the Maršál and Perkowski experiment.
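Here is a hedged sketch of that judgment layer using the Anthropic Python SDK. The model name, label set, and prompt wording are my assumptions, not a prescribed setup; anything the model returns outside the label set falls back to a human.

```python
# Judgment layer: classify a free-text deal loss note into a fixed label
# set. Requires the Anthropic Python SDK (pip install anthropic) and an
# ANTHROPIC_API_KEY in the environment. Model name, labels, and prompt
# wording are illustrative assumptions.

from anthropic import Anthropic

client = Anthropic()

LABELS = {"pricing", "timing", "competitor_feature_gap"}

def classify_loss_reason(note: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: substitute your model
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                "Classify the real reason this deal was lost. "
                f"Reply with exactly one of: {', '.join(sorted(LABELS))}.\n\n"
                f"Deal note: {note}"
            ),
        }],
    )
    label = response.content[0].text.strip().lower()
    return label if label in LABELS else "needs_human_review"
```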

Customer service representatives already show 70.1% observed AI exposure, the second-highest of any occupation in Anthropic's data (Massenkoff and McCrory, March 2026). Data entry keyers reach 67.1%. Business sales and outreach automation at least doubled its share of enterprise API traffic between November 2025 and February 2026 (Anthropic Economic Index, March 2026). The infrastructure wave is already here. The question is whether your team is pointing it at the right tasks.

Where to start

Audit your judgment calls first, not your field updates. List the tasks on your team where an experienced person makes a call that a new hire could not: ticket escalation decisions, deal risk assessments, data quality reviews, exception handling. Those are your highest-ROI AI targets.

Check your data before you check your model. If 53% of Salesforce customers cite data availability and quality as their top barrier to agentic AI, assume your data has the same problems until you prove otherwise. Run a deduplication pass. Audit your free-text fields for consistency. AI amplifies whatever it reads, clean or messy.
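A first pass at that deduplication problem can be as simple as normalize-then-group. This sketch only surfaces merge candidates; the alias map and suffix rules are illustrative, and real CRM dedup needs fuzzy matching plus human review before any merge.

```python
# Surface duplicate account names by normalizing spellings and grouping.
# ALIASES and SUFFIXES are illustrative; review candidates before merging.

import re
from collections import defaultdict

ALIASES = {"international business machines": "ibm"}
SUFFIXES = re.compile(r"\b(inc|corp|llc|ltd|co)$")

def normalize(name: str) -> str:
    key = re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()
    key = SUFFIXES.sub("", key).strip()
    return ALIASES.get(key, key)

def dedup_candidates(names: list[str]) -> dict[str, list[str]]:
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    # Only keys with more than one spelling are merge candidates.
    return {k: v for k, v in groups.items() if len(v) > 1}

print(dedup_candidates(["IBM", "IBM Corp", "International Business Machines"]))
# {'ibm': ['IBM', 'IBM Corp', 'International Business Machines']}
```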

Build the human checkpoint. The BCG jagged frontier finding was peer-reviewed, published in Organization Science in March 2026, and remains the best experimental evidence on what happens when AI works outside its competence boundary. Models have improved since the original GPT-4 experiment, but the core problem has not changed: AI does not announce when it crosses into territory where it will hurt more than it helps. Your workflow needs a review gate between AI output and final action, especially for any task where a wrong answer has consequences.
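The checkpoint itself does not need to be elaborate. A minimal sketch, assuming a simple high-stakes action list; the action names, the queue, and commit() are stand-ins for whatever your stack actually uses.

```python
# Review gate: AI output reaches the system of record only after approval
# for any action with real consequences. Action names, the queue, and
# commit() are stand-ins for your actual stack.

from queue import Queue

HIGH_STAKES = {"close_ticket", "apply_discount", "merge_accounts"}
review_queue: Queue = Queue()

def commit(action: str, payload: dict) -> None:
    print(f"committed {action}: {payload}")  # stand-in for your CRM write

def gate(action: str, ai_output: dict) -> None:
    if action in HIGH_STAKES:
        review_queue.put((action, ai_output))  # human approves before commit
    else:
        commit(action, ai_output)  # low-stakes: let it through
```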

Keep rules for rules, judgment for judgment. Your Zapier workflows, SLA triggers, and round-robin assignments already work. Do not replace them with AI to seem innovative. Layer AI on top for the classification, prioritization, and pattern recognition work that rules cannot reach.

The data across every major AI usage study points the same direction. The biggest productivity multiplier lives in your hardest work, not your simplest. The teams that figure this out first will not just work faster. They will make better decisions at the speed their competitors are still processing tickets.

Sources

AI Usage & Productivity: Anthropic Economic Index, January 2026 (Economic Primitives) · Anthropic Economic Index, March 2026 (Learning Curves) · Massenkoff & McCrory, "Labor Market Impacts of AI," March 2026

Task Performance & Experiments: Maršál & Perkowski, National Bank of Slovakia field experiment, CEPR VoxEU, July 2025 · Dell'Acqua et al., "Navigating the Jagged Technological Frontier," Organization Science, March 2026 (BCG/Harvard Business School)

AI Adoption & ROI: IBM State of Salesforce Report 2025-2026 · Gartner agentic AI predictions, June 2025

Until next week,

@OpsJzn

AI should mean fewer steps, not more tools.
