Part 1 · Agent Harness Engineering

Stop Blaming the Model

Why your AI agent is dumber than it should be, whether it's writing code, answering customers, or taking voice orders, and the discipline that actually fixes it.

May 21, 20268 minute readAI Engineering

Two stories. Same lesson. Different worlds.

Story one: the coding agent. In late August 2025, a three-person team at OpenAI sat down in front of an empty Git repository. Five months later, they shipped a beta product — about a million lines of TypeScript across 1,500 merged pull requests. Not a single line was written by a human. The engineers weren't coding. They were designing the environment that let an AI agent code reliably.

Story two: the e-commerce chatbot. A mid-size online retailer launched an AI shopping assistant powered by GPT-4o. The model handled product questions beautifully in testing. In production, it recommended out-of-stock items 30% of the time, hallucinated discount codes that didn't exist, and once told a frustrated customer to "check Amazon" for faster shipping. The model wasn't the problem — it genuinely understood products, pricing, and intent. The problem was everything around it: no real-time inventory check, no approved-promotions list, no guardrail against recommending competitors.

Both teams discovered the same thing. The model was never the bottleneck. The environment around the model was.

That environment now has a name: the harness. And designing it is the most underrated skill in AI engineering right now — whether you're building a coding agent, a customer-service chatbot, or a voice assistant that takes dinner orders.


What's an agent? A chatbot answers one question and stops. An AI agent runs in a loop: it thinks, calls a tool (search for a product, check inventory, run a command, look up an order), observes what came back, thinks again, and keeps going until a goal is met. Claude Code and Cursor are agents that write software. But the same pattern powers customer service bots, voice commerce assistants, and shopping copilots. The loop is the same. The tools change.


The equation

Here's the line that's quietly reshaping how serious teams build AI products, from Viv Trivedy:

Agent = Model + Harness. If you're not the model, you're the harness.

The model is the thing you already know — GPT-5, Claude Opus, Gemini. It generates tokens. You can swap it.

The harness is everything else. In a coding agent, that's the tool execution, the sandbox, the filesystem. In an e-commerce chatbot, that's the product search API, the inventory check, the promotions validation list, the return-policy rules, the escalation logic that hands off to a human when the customer is angry enough. In a voice assistant, that's the speech pipeline, the order confirmation loop, the payment-processing guardrails.

Everything that isn't the model is the harness. That is a lot.

The horse

The metaphor is older than software. A horse is powerful. Hook it to nothing, and it stands in a field. Strap a harness on it — the straps, the reins, the bridle — and you can plow a quarter-acre.

The model is the horse. The harness is what turns raw power into useful work. The discourse treating AI like a horse race ("is GPT-5 better than Claude Opus?") is watching the wrong race.

Three levels, not one

There are three layers of engineering around a model, each wrapping the one below:

Level 1 — Prompt engineering. Craft the instructions — system prompts, few-shot examples, chain-of-thought. You're optimizing what the model reads.

Level 2 — Context engineering. Control what goes into the context window and when — which product details to retrieve, how to compress conversation history. If prompt engineering is writing a good email, context engineering is deciding which attachments to include.

Level 3 — Harness engineering. The full application infrastructure: when context loads, which tools are available, which actions are permitted (can the bot issue a refund over $200?), how failures get recovered, how state persists across sessions, how the agent talks to users across channels. This is where the majority of production effort goes — and what this series is about.

Each level is necessary. None is sufficient alone. A brilliant prompt inside a weak harness is a demo. A decent prompt inside a strong harness is a product.

Three numbers that should change how you think

All three come from public reports in early 2026. In every case, the model didn't change:

01 — Vercel collapsed fifteen tools into one bash command. Task time dropped from 274 seconds to 77. Success rate jumped from 80% to 100%. Token usage fell 37%. Same model. The whole improvement came from the harness.

02 — A team moved from #33 to #5 on Terminal Bench by changing only the harness. Same model, same benchmark. Better context management, smarter tool design, tighter hooks.

03 — A fintech team stripped their framework stack and rebuilt a hand-rolled harness in three weeks. They had LlamaIndex, MCP, and a multi-stage RAG pipeline. In production: 14-second p95 latency and 30–40% irrelevant retrieval. They rebuilt with plain Python. Latency dropped to 3.2 seconds. Retrieval relevance climbed from 60% to 89%.

These numbers are from coding agents and fintech. But the same pattern plays out in every domain. An e-commerce team that redesigned their chatbot harness — adding a real-time inventory check, a promotions validation hook, and a competitor-mention guardrail — saw first-contact resolution rate jump from 45% to 78% and CSAT climb twelve points. Same model, same product catalog, same customers.

The "skill issue" reframe

The default reaction when an agent does something dumb is to blame the model. Wait until GPT-6. Once they fix instruction-following, this will work.

Harness engineering rejects that. Almost every agent failure is legible:

→ Your shopping chatbot recommended an out-of-stock item. That's a missing inventory-check tool in the harness.

→ Your chatbot gave a customer a 50% discount it wasn't authorized to give. That's a missing hook that validates promotions against an approved list.

→ Your voice assistant processed the same order twice when the caller repeated "yes." That's a missing idempotency check in the tool execution layer.

→ Your chatbot forgot the customer said "I'm allergic to nuts" three turns ago. That's a context management failure — the allergy info got pushed out of the window.

→ Your coding agent ran rm -rf on something important. That's a missing hook.

→ Your agent kept "finishing" with broken answers. That's a missing verification step in the loop.

None of those are model problems. They're configuration problems. The team at HumanLayer puts it bluntly:

"The model is probably fine. It's just a skill issue."

Where the skill lives is in the harness.

The ratchet

Mitchell Hashimoto — the engineer who first named the discipline — describes harness engineering as a ratchet:

Anytime you find an agent makes a mistake, you take the time to engineer a solution so the agent never makes that mistake again.

Then you don't take that solution out.

Think about what this means for a customer-service chatbot. Week one, the bot offers a refund above the order value — you add a hook that caps refunds at the order total. Week two, it recommends a product from a discontinued line — you add a catalog-freshness check. Week three, it tries to cancel an order that's already shipped — you add a fulfillment-status gate. Each customer complaint becomes a permanent fix in the harness. By week eight, the bot is quietly excellent — not because the model got smarter, but because the harness absorbed forty specific failures.

Every rule should be traceable to something that went wrong. If a rule isn't there because something broke, it's noise.

What this means for you

If you're new to building agents — whether you're building a coding agent, a customer-service chatbot, a voice assistant, or a shopping copilot — the practical takeaway is small and freeing:

You don't need to wait for a smarter model.

The teams shipping useful agents right now aren't waiting either. They're staring at three things: their context window, their tool list, and their feedback loops. They're treating the harness as a real engineering artifact — versioned in Git, instrumented, evaluated, iterated.

That's the whole game.


Coming up — Part 2: Inside the harness — the six pieces of scaffolding that turn a model into an agent.

Sources: Mitchell Hashimoto, Addy Osmani, HumanLayer, OpenAI, Vikas Sah, Bhavishya Pandit, and the awesome-harness-engineering list.