Models Become Commodity. The Harness Wins.
Prologue: Same Model. Different Magic. Why?
Everyone uses the same Claude Opus 4.6. Everyone uses the same GPT-5.4.
So why does one engineer ship hundreds of thousands of lines of code per month, while another is still copy-pasting prompts into a chat window?
The answer isn't the model. The model is tap water now. The real difference is the plumbing — what we now call the harness.
LangChain didn't touch the model. They rewrote the harness. Their Terminal-Bench 2.0 score jumped from 52.8% to 66.5%.[^1] Same model, different game. This is the 2026 playbook.
I spent the last week studying a talk from a builder who shipped 50+ agents in a year. Halfway through, I realized I'd been spending my time on the wrong layer.
I – We Were All Sinners
Remember the prompt engineering era of 2022 and 2023?
"The AI is fine. You're the one who wrote a bad prompt."
That was the vibe. If the answer was wrong, it was the user's fault. We all sharpened prompts in Notepad like a chef sharpens knives.
Then context engineering arrived. RAG. Memory. Tool calling. The system prompt and tooling started absorbing what used to be the user's burden. We entered an era where you could "throw something rough at it" and still get something decent back.
But if we'd stopped there, Claude Code, Manus, and Codex would not exist.
Because context engineering had one fatal weakness: rigidity. Set it once, you're done. Static prompts. Static indexes. Static evaluation sets. Consistent results — but not adaptive ones.
II – The Trap Where 95% Becomes 60%
The 2026 user no longer expects an instant answer the moment they hit Enter.
We launch Claude Code, fire up Codex in the background, and go do something else. The longer the AI works alone, the more we like it. That's the appeal.
Then the arithmetic shows up.
A model with 95% per-step accuracy → about 60% success after 10 steps.[^2]
This isn't magic. It's just 0.95 to the 10th power. Compounding errors. That's why even the #1 model on Terminal-Bench 2.0 hovers around 65%.[^2]
```mermaid
graph TD
    subgraph Problem["🔴 Long-Run Accuracy Cliff"]
        A["Step 1: 95%"] --> B["Step 5: 77%"]
        B --> C["Step 10: 60%"]
    end
    subgraph Solution["🟢 Harness Self-Correction"]
        D["Step 1: 95% + Rule Patch"] --> E["Step 5: 92%"]
        E --> F["Step 10: 88%"]
    end
```
The accuracy cliff of long-running tasks — that's the reason context engineering ran out of road. We needed a system that could correct itself between steps.
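If you want to feel the cliff, the math fits in a few lines of Python. The static numbers mirror the diagram above; the half-point-per-step improvement on the self-correcting side is an illustrative assumption, not a measured figure.

```python
import math

def run_success(step_accuracies):
    """A run succeeds only if every step succeeds: multiply per-step accuracies."""
    return math.prod(step_accuracies)

# Static harness: every step stays at 95%.
static = run_success([0.95] * 10)

# Self-correcting harness (assumed): each patched rule nudges later
# steps up by half a point, capped at 99% per step.
improving = run_success([min(0.99, 0.95 + 0.005 * i) for i in range(10)])

print(f"static 10 steps:    {static:.0%}")     # ~60%
print(f"improving 10 steps: {improving:.0%}")  # ~75%
```

Even a modest per-step gain compounds in the other direction.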
III – The Harness Is a Feedback Loop
Martin Fowler's April 2026 framing of harness engineering reduces to one word: feedback.[^3]
| Era | Flow | What gets fixed |
|---|---|---|
| Prompt | Feed-Forward | The user fixes the prompt |
| Context | Feed-Forward + Thumbs-up | The engineer fixes the RAG/tools |
| Harness | Feed-Forward + Self-Correcting Feedback | The system fixes its own rules |
See the shift?
In the old eras, when the answer was wrong, we fixed the answer. In the harness era, when the answer is wrong, we fix the way the answer gets generated.
Concrete example. You tell the AI to plan an outdoor workshop on the third Friday of April. It rains. The AI silently decides to cancel. You wanted it relocated indoors.
The old playbook: rewrite the answer. The harness playbook: rewrite the rule.
"If rain or any disruption occurs, do not cancel — relocate indoors or ask me first."
Next time, the same mistake doesn't happen. The system compounds. This is what self-evolving means.
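What does "rewrite the rule" look like in practice? Here's a minimal sketch, assuming nothing more than a plain-markdown rules file that gets prepended to every run. The file name and helper functions are hypothetical, not any particular product's API.

```python
from pathlib import Path

RULES_FILE = Path("rules.md")  # hypothetical personal rules file

def patch_rule(lesson: str) -> None:
    """Append one rule learned from a failure; the file compounds across runs."""
    with RULES_FILE.open("a", encoding="utf-8") as f:
        f.write(f"- {lesson}\n")

def build_system_prompt(task: str) -> str:
    """Every run starts from the accumulated rules, so a patched rule changes
    how the next answer gets generated, not just the last answer."""
    rules = RULES_FILE.read_text(encoding="utf-8") if RULES_FILE.exists() else ""
    return f"Standing rules:\n{rules}\nTask: {task}"

# After the rained-out workshop:
patch_rule("If rain or any disruption occurs, do not cancel. Relocate indoors or ask me first.")
print(build_system_prompt("Plan the outdoor workshop for the third Friday of April."))
```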
IV – Progressive Disclosure: Spend Tokens Like Money
Here's the trap everyone falls into. "I have 1 million tokens, let me fill all of them."
Wrong.
Look at Opus 4.6 on the MRCR v2 benchmark. At 200K tokens, you lose about 4 points of accuracy. Fill it to 1M, and you lose about 14.[^4] By the article's own numbers, that works out to roughly 1.25 points of accuracy lost for every extra 100K tokens.
That's why a smart harness doesn't fill the context. It opens the context only when needed.
This is Progressive Disclosure. Even with skills, the agent reads only the frontmatter first. "Do I actually need to load this body?" the model asks itself. Context is always an expensive resource.
| Token Usage | Accuracy | Meaning |
|---|---|---|
| 200K | ~96% | 🟢 Sweet spot |
| 500K | ~90% | 🟡 Caution |
| 1M | ~86% | 🔴 14% loss |
If consistency is your goal, expand Always-On. If versatility is your goal, push everything to On-Demand. That's the first decision in any organizational harness design.
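As a concrete illustration of progressive disclosure, here's a sketch that reads only a skill file's frontmatter up front and loads the body on demand. The file layout (YAML frontmatter over a markdown body) follows the common skills convention; the relevance check is a crude stand-in for the model asking itself "do I actually need this body?"

```python
from pathlib import Path

def read_frontmatter(skill_path: Path) -> str:
    """Always-on: read only the cheap frontmatter, never the body."""
    text = skill_path.read_text(encoding="utf-8")
    if text.startswith("---"):
        return text.split("---", 2)[1].strip()  # between the first two '---' fences
    return ""

def load_skill(skill_path: Path, task: str) -> str | None:
    """On-demand: spend the expensive tokens only if the skill looks relevant."""
    summary = read_frontmatter(skill_path).lower()
    if any(word in summary for word in task.lower().split()):
        return skill_path.read_text(encoding="utf-8")  # full body, full cost
    return None  # stay cheap
```

Which skills sit in the always-on bucket versus the on-demand bucket is exactly the consistency-versus-versatility decision above.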
V – Forcing Team Skills Will Break You
By now you're thinking, "Let's build team-wide skills and roll them out to everyone."
It collapses inside a week. The speaker who delivered the source talk tried exactly that, and gave up after seven days.[^4]
The reason is brutal in its simplicity. A harness is fundamentally a tool to reduce my work. Force someone else's harness on me, and I do the work twice — once my way, once your way.
That's why Anthropic's enterprise skill governance splits into three tiers.[^5]
| Tier | Owner | Change freedom |
|---|---|---|
| 🧑 Personal Skill | The individual | Unlimited |
| 👥 Team Skill | Team consensus | Moderate |
| 🏢 Company Asset | Governance + Approval | Strict |
There is no master skill. It starts as a personal tool that reduces your work, becomes a team asset when the team agrees, and only becomes a company asset after governance reviews it. Bottom-up survives. Top-down dies.
💭 Questions to Sit With
- How can self-evolving systems sync team skills across people without ballooning governance overhead in fast-moving feedback environments?
- As markdown-based knowledge becomes the substrate of AI behavior, how does its adaptability reshape the user interface, and what does that mean for accuracy?
- When harness engineering becomes a governance framework, how does individual skill assessment shift in environments where token accumulation drags accuracy down?
Drop your answers in the comments.
Conclusion: Model Is Commodity. Harness Is Identity.
"Models will become commodities. The way they operate will become the competitive advantage."
That single line is the rule of the 2026 AI game.
The model you use is the same model your colleague uses. And your competitor. The differentiation isn't inside the model — it's outside, in the harness wrapped around it.
Two things you can start today.
- Open a personal rules file. Write down a mistake your AI made yesterday. One line of markdown is enough.
- Open a feedback channel for your skills. When something fails, is there a place to record it?
Without these two, your AI a year from now will behave exactly like it did on day one.
"In the era where models are commodities, the shape of the plumbing we build is our identity."
If this resonated, share it with one friend who works with AI.
Footnotes
[^1]: Evaluating Deep Agents CLI on Terminal Bench 2.0 | LangChain Blog
[^2]: Terminal-Bench Hard Benchmark Leaderboard | Artificial Analysis
[^3]: Harness Engineering for Coding Agent Users | Martin Fowler
[^4]: Source talk transcript, "Harness Engineering" track, 2026 Korean AI conference.
[^5]: Scaling Enterprise AI with Anthropic Agent Skills | Techkraft Inc