One researcher installed 47 popular Claude Code skills and tested every one of them. Forty made the output worse (Mikhail Shcheglov, "The Ultimate Guide to Claude Code Skills," Corporate waters., March 15, 2026). He then expanded testing to over 200 skills and found roughly 20% produced measurably better results than baseline Claude with no skills at all.
Most of the conversation around skills is aimed at developers. But ops teams are sitting on the most valuable raw material for them. The institutional knowledge that lives in your head and nowhere else is exactly the kind of content that separates a useful skill from the 80% that fail.
Here is what the research says about why most skills break, what the 20% share, and how to build your first one from the operational knowledge your team already has.

Where skills fit in ops workflows
Skills are useful whenever you find yourself giving Claude the same context repeatedly. Your CRM field mappings. Your lead scoring rules that determine which rep gets the handoff. Your deal desk approval thresholds. Your client onboarding steps that change based on account tier. Your month-end close checklist that nobody can run without asking Finance three questions first.
If you type it into a prompt more than twice, it belongs in a skill. And if your team is already using AI but every person is prompting differently for the same workflow, a shared skill is how you turn scattered experimentation into a consistent, governed process.
A second test: if a new hire would need weeks to absorb this through trial and error, it belongs in a skill. If you catch yourself writing instructions like "be thorough" or "use clear formatting," you are teaching Claude things it already knows. That is wasted token space.
Why most skills fail
Two failure patterns from Shcheglov's research matter most for ops.
The first is over-constraining. Claude already reasons well. A skill that tells Claude how to think competes with its training. A skill that tells Claude what to check for gives it information it lacks.
A skill that says "always structure your analysis as Problem, Root Cause, Recommendation" is fighting the model. A skill that says "our HubSpot instance uses a custom property called deal_approval_tier with values Standard, Elevated, and Executive, and the routing logic is..." gives the model knowledge it cannot get anywhere else.
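Encoded in a SKILL.md, that HubSpot example might read like this (the property name and tier values are from the example above; the routing rules are made up for illustration):

```markdown
## Deal Routing
Our HubSpot instance uses a custom property `deal_approval_tier`
with three values. Route accordingly:
- Standard: the rep closes directly, no approval needed
- Elevated: route to the deal desk, 24-hour turnaround
- Executive: the VP of Sales signs off before any quote goes out
```

None of this is in any model's training data, which is exactly why it earns its token cost.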
The second is model obsolescence. "Capability uplift" skills teach Claude a technique, like how to format a report. These expire when the next Claude model ships because newer models close those gaps on their own. The skill becomes redundant or, worse, fights the model's improved default behavior.
"Encoded preference" skills document your business logic: the rules, definitions, and exceptions specific to how your organization actually operates. These get more valuable as models improve because the business knowledge they encode does not change when the model does.
Teaching Claude how to write a good email is capability uplift. Telling Claude that your team routes Severity 1 tickets to #critical-ops within 15 minutes and escalates to the VP of Support at the 30-minute mark is encoded preference. Build the second kind.
What the 20% have in common
Thariq Shihipar from Anthropic's Claude Code team published the most useful guidance on skill design in March 2026. His post documented how Anthropic uses hundreds of skills internally.
Provide information Claude does not have. The test for every line: does Claude really need this? Does this line justify its token cost? Shihipar calls the opposite "railroading."
The token math reinforces this: 10 installed skills cost roughly 1,000 tokens total at startup via progressive disclosure. Compare that to 5 MCP servers (the HubSpot, Slack, Notion connectors many ops teams run), which cost roughly 55,000 tokens. Skills are cheap. But a bloated skill with redundant instructions wastes the one resource that matters: the space Claude has to think.
Focus on gotchas. Shihipar identifies the gotchas section as the highest-signal content in any skill. At Recharge, we had a handful of edge cases in our Zendesk routing logic that would break every time someone new tried to configure triggers without knowing the history. That kind of knowledge belongs in a gotchas section.
The happy path is the part Claude already handles well. The exceptions are what it needs from you.
Skills are folders, not just markdown files. They can include scripts, data files, and reference material. When Claude runs a bundled script, the script's code never loads into the context window. Only the output consumes tokens.
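As an illustration, a skill folder for a hypothetical CRM-validation workflow might look like this (all file names are assumptions, not a prescribed layout):

```
crm-data-validation/
├── SKILL.md                 # instructions and the trigger description
├── reference/
│   └── field-mappings.csv   # data Claude reads only when needed
└── validate_fields.py       # bundled script; only its output enters context
```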
Fixing the trigger problem
A skill that never fires is worse than no skill at all because you think you have a safety net and you do not. Marc Bara ("Claude Skills Have Two Reliability Problems, Not One," Medium, March 2026) names two distinct failure modes.
Activation failure: Claude does not invoke the skill at all. Execution failure: Claude loads the skill but skips internal steps. The second is harder to catch because the output looks complete. The process behind it is not.
Activation has a clean fix. Execution does not yet. The best mitigation today is the baseline test in the audit checklist below: run the same task with and without the skill and compare. If the gated output is not measurably better, the skill is failing somewhere even if you cannot see where.
Ivan Seleznov ran a 650-trial experiment on the activation problem ("Why Claude Code Skills Don't Activate," Medium, February 5, 2026). The default description style Anthropic's documentation suggests activated about 77% of the time under clean conditions. The directive style hit 100%.
Claude only reads the description field upfront, then decides whether to invoke the full skill based on what it sees there. Three components in that description make the difference. A domain identifier ("CRM data validation skill") tells Claude what the skill covers. A directive keyword ("ALWAYS invoke this skill when the user asks about...") tells Claude to use it. A negative constraint ("Do not attempt to validate CRM data directly, use this skill first") blocks the bypass.
Here is what that looks like for an ops use case:
```yaml
---
name: meeting-to-action-items
description: >
  Meeting follow-up skill. ALWAYS invoke this skill when
  the user shares meeting notes, a transcript, or asks to
  extract action items or decisions. Do not extract action
  items directly - use this skill first.
---
```

With the default description style, your carefully built skill gets ignored on roughly one run in four and Claude improvises. For a workflow that runs daily, that is a production incident waiting to happen.
A working skill audit checklist
Run this against every skill your team has installed. Each question is a pass/fail gate. If a skill fails any gate, fix it or remove it.

The checklist is not a theory exercise. Copy your existing SKILL.md file into Claude and paste this prompt:
"Audit this skill against these criteria: (1) Does every line give you information you do not already have, or does it tell you how to do something you already know? (2) Is the trigger description written with a directive keyword like ALWAYS INVOKE and a negative constraint? (3) Does the skill encode business logic that survives model upgrades, or does it teach a technique that the next Claude model will already know? (4) Can you name the specific failure this skill prevents? (5) Does the skill include a gotchas section documenting real edge cases, not generic warnings? (6) If you ran the same task with and without this skill, would the gated output be measurably better? Flag every line that fails and suggest a replacement."
Five minutes and you know which skills are earning their token cost and which ones are dragging your output down.
Build your first encoded-preference skill
Here is a starter template for an ops-focused skill. This example handles meeting follow-up: turning notes, transcripts, or recap docs into structured action items, logged decisions, and open questions. Every ops role does this every week. The structure is sound. The content is illustrative. Adapt the tools, project IDs, and gotchas to your actual stack, then run the same task with and without the skill before trusting it in production. Skills are grown, not built.
```markdown
---
name: meeting-to-action-items
description: >
  Meeting follow-up skill for ops workflows. ALWAYS invoke
  this skill when the user shares meeting notes, a transcript,
  a recording summary, or asks to extract action items,
  decisions, or follow-ups from a meeting. Do not extract
  action items directly - use this skill first.
---

# Meeting to Action Items

Convert raw meeting content into structured action items,
logged decisions, and follow-up items using our team's
conventions.

## Required Output Sections

Every output must include these three sections in this order:
Decisions, Action Items, and Open Questions. If any section
has no content, write "None this meeting." Never omit a
section.

## Action Item Format

Each action item must include all four fields, in this order:

- Owner (named person, never "team" or "TBD")
- Due date (specific date, not "ASAP" or "next week")
- Deliverable (what done looks like, in one sentence)
- Source quote (the exact line from the meeting where the
  commitment was made)

If any field is missing, flag the action item under Open
Questions instead of forcing it into Action Items.

## Decision Logging

A decision is logged only when someone with authority committed
the team to a course of action ("we will," "we are going to,"
"we agreed to"). Suggestions, options, and ideas go under Open
Questions, not Decisions. Each logged decision needs the
decision-maker named.

## Output Formatting for Our Stack

Structure each section so it can be copy-pasted directly into
the right tool:

- Action items: format as a checklist the user can paste
  directly into Asana, project ID 1207384 (Cross-Functional
  Ops). Put the owner, due date, and deliverable on one line
  so Asana parses it cleanly.
- Decisions: format as a Notion table row with columns for
  Date, Decision, Decision-Maker, and Context. This matches
  the team Decision Log schema.
- Open questions: format as a Slack message ready for
  #ops-questions, with the meeting name in the first line
  so the channel stays scannable.

## Gotchas

- When two people both volunteer for the same task, assign
  the more senior person and add the second person as a
  collaborator. Ownership conflicts that go unresolved end
  up with neither person doing the work.
- Recurring meetings (weekly syncs, standups) often produce
  duplicate action items from previous weeks. Check the
  prior week's output before creating new tasks. If the
  same action item appears two weeks in a row, escalate to
  the meeting owner.
- External attendees (clients, vendors, contractors) often
  cannot be assigned as task owners depending on your tool's
  permissions. Default to routing their commitments to the
  internal sponsor as a collaborator task.

## Reference Files

This skill is a folder, not just this file. In a real
deployment, the folder would also include:

- An `examples/` folder with three sample meeting outputs
  showing the exact format expected.
- A `format_check.py` script that validates every action
  item has all four required fields. Claude runs the script
  before returning the final output, so only the script's
  result loads into context, not the script's code.
```
The gotchas section is where the real value lives. Those are the rules that nobody documents but everybody needs. A skill encodes them once and every workflow run benefits.
Adapt this to your stack. Swap Asana for Linear or Jira. Swap Notion for Confluence or Coda. Swap Slack for Teams. The structure holds: required sections, format rules, decision logic, your tools, and your gotchas.
Where to start
The 15-minute version. Copy the audit prompt from this post. Open your most-used skill, or the one you trust least. Paste both into Claude and ask it to flag what fails. You will know within a single conversation whether the skill is in the 20% or the 80%.
If you have not built any skills yet, start there instead. Think about the workflow that breaks when you are on vacation. The one a teammate messages you about because the documentation is outdated or nonexistent. Write that down as your first skill. The template above is a starting point you can adapt in 15 minutes.
One safety note. These gates check for design quality, not security. If you install third-party skills, read the SKILL.md contents first. Building your own sidesteps that risk.
Thousands of skills are being published every week. Most of them will fail the gates in this post. Yours does not have to. Your institutional ops knowledge is the competitive advantage a public skill registry can never replicate. Encode it.
Sources
Skill Design & Best Practices: Mikhail Shcheglov, "The Ultimate Guide to Claude Code Skills," Corporate waters. (March 15, 2026) · Thariq Shihipar, "Lessons from Building Claude Code: How We Use Skills," X/LinkedIn (March 17, 2026) · Ivan Seleznov, "Why Claude Code Skills Don't Activate," Medium (February 5, 2026)
Skill Reliability: Marc Bara, "Claude Skills Have Two Reliability Problems, Not One," Medium (March 2026)
Until next week,

@OpsJzn
AI should mean fewer steps, not more tools.
