Part 2 · Agent Harness Engineering

Inside the Harness

The six pieces of scaffolding that turn a model into an agent, with examples from coding agents and e-commerce chatbots alike.

June 4, 20269 minute readAI Engineering

In Part 1, we landed on the equation: Agent = Model + Harness. We saw coding agents and e-commerce chatbots arrive at the same conclusion — the model was never the bottleneck. Today we open the scaffolding up.

There are six pieces inside. Every one exists because the model can't do something on its own.

Quick recap: An AI agent runs in a loop — think, call a tool, observe, think again. The model generates tokens. The harness is everything else: tools, context, memory, hooks, sandboxes, sub-agents. Whether it writes code, answers customers, or takes voice orders — same pattern.


The backbone: the loop

Every agent is a loop — the model receives the current state, reasons about what to do, calls a tool, observes the result, and repeats. The technical name is ReAct (Reasoning and Acting).

A concrete example. A customer messages: "I ordered a blue sweater last week and it arrived in the wrong size. Can I exchange it?"

— The model reasons it needs order details. It calls check_order_status. — Tool returns: Order #4827, blue sweater, size M, delivered Tuesday. — It reads the return policy from store_policies.md — exchanges within 30 days, tags attached. — It reasons the exchange is valid, calls initiate_exchange, confirms with the customer, and stops.

Four turns. Same pattern as a coding agent fixing a test — different tools.

The loop can spin forever if you let it. One practitioner watched a chatbot spend $12 in API calls searching for a product removed from the catalog, generating increasingly creative queries for 45 turns. The harness enforces hard limits: maximum iterations, a wall-clock timeout, a cost ceiling. That's harness work, not model work.

The design principle

Every component in a harness exists because the model can't do something on its own. As models get better, some pieces become obsolete. The gap closes. The piece comes out.

Can't remember past its context window? → filesystem — Can't affect the world? → tools and a sandbox — Can't keep its head straight as the window fills? → context management — Doesn't know your store policies? → memory files — Can't be trusted not to issue a $500 refund? → hooks — Can't hold a task too big for one head? → sub-agents


01. The filesystem

A model can only operate on what fits inside its context window. Without a place to put things down, the window fills, the model gets dumber, the task fails.

For coding agents, the filesystem is where the agent reads code, commits via Git, and rolls back mistakes. For e-commerce chatbots, it's where session transcripts persist so returning customers don't re-explain their problem; where order histories are logged; where progress files track multi-step returns across sessions.

What breaks without it: the agent forgets the moment its context fills up. Multi-session work is impossible.

02. Tools, sandbox, and the tool cliff

A model alone can't do anything. To affect the world, it needs tools: functions the model can ask the harness to call.

For coding agents, the power tool is bash. For e-commerce chatbots, tools are: search_products, check_inventory, check_order_status, process_return, apply_discount. Each maps to a real backend API.

The tool cliff: 10 tools performs well, 30 noticeably degrades, 107 is total failure. Attention fragmentation, not context exhaustion, is the bottleneck. GitHub Copilot addressed this with a routing model that triages 40 tools down to 13 per request.

Tools also need a sandbox. For coding agents: Docker or a remote VM. For chatbots: the agent should hit a staging endpoint, not production, and every tool call should be idempotent or reversible.

What's an MCP server? MCP — Model Context Protocol — is the standard for plugging external tools into an agent. A Stripe MCP for payments, a Shopify MCP for inventory, a Zendesk MCP for tickets. Treat MCP servers like npm packages: only install what you actually use.

What breaks without it: the model can think but can't act. Or it can act, but there's nothing stopping it from charging a customer twice.

03. Context management

Models get measurably worse at reasoning as their context window fills up. It's structural — the longer the context, the more diluted any one instruction becomes.

For coding agents, context rot means forgetting the goal after 30 tool calls. For e-commerce chatbots, it means forgetting a customer said "I'm allergic to nuts" five turns ago.

The KV cache matters economically: $0.30/M cached tokens vs. $3.00/M uncached — roughly 10x. That ratio determines whether your chatbot costs $500/month or $5,000/month at scale.

Good context management means: append-only (never reorder older context), deterministic serialization (json.dumps(data, sort_keys=True)), no second-precision timestamps in the system prompt, compaction (summarize old turns when full), and recitation (maintain a ticket_state.md the agent re-reads before each action).

Three memory layers the harness orchestrates:

Layer 1 — Filesystem (long-term). Progress files, session transcripts, order history. Survives across sessions.

Layer 2 — RAM (short-term). Conversation history, tool results, intermediate state. Fast but volatile.

Layer 3 — Context window (active). What the model sees right now. Critical rule: always flush state to disk before discarding it from the window.

What breaks without it: halfway through a long task, the agent gets lost or quietly declares it's done when it isn't.

04. Memory and skills

"Memory" reduces to: what do you put into context, and when?

For coding agents, it's CLAUDE.md or AGENTS.md. For e-commerce chatbots, it's store_policies.md — return windows, shipping tiers, escalation triggers, approved discount codes, tone guidelines.

Two hard-won lessons: keep it short (under 60 lines — ETH Zurich found LLM-generated config files actively hurt performance and cost 20%+ more tokens), and earn each line (every rule traceable to a real failure).

Skills are markdown files loaded only when relevant. For e-commerce: returns-skill.md loads when the customer mentions a return, sizing-skill.md for fit questions. Small system prompt, relevant context.

What breaks without it: the agent ignores your store policies and learns nothing from last week's complaints.

05. Hooks

Hooks are where harness engineering stops being polite and starts being deterministic.

"Telling an agent 'follow our coding standards' in a prompt is probabilistic compliance. Wiring a linter that blocks the PR when standards are violated is a deterministic constraint." — Augment Code

For e-commerce: telling the chatbot "don't recommend competitors" in the prompt is probabilistic — the model might forget on turn 12. A hook that scans the outgoing response for competitor names before it reaches the customer is deterministic. It can't forget.

Four hook lifecycle points that matter:

Pre-tool: validates before an API fires. Cap refunds, check discount codes, enforce dollar thresholds. — Post-tool: verify API responses. Did the inventory check return stale data? — Pre-response: scan the outgoing message for PII, competitor mentions, or hallucinated promises before the customer sees it. — On-stop: when the agent tries to finish, verify the output is consistent with store policy. If not, bounce it back (back-pressure).

Permission tiers: Allow (runs silently: check order status, search products) → Deny (blocked unconditionally: delete accounts, override pricing) → Ask (requires human approval: refunds over $100, non-standard discounts).

# Hook: block refunds above order value. Block unapproved discounts.
def before_tool_call(name, args, order_ctx):
    if name == "process_return":
        if args["refund_amount"] > order_ctx["order_total"]:
            raise BlockedAction("Refund exceeds order value.")
    if name == "apply_discount":
        if args["code"] not in APPROVED_CODES:
            raise BlockedAction("Unapproved discount code.")

The operating principle: success is silent, failures are verbose. The deeper pattern is back-pressure: hooks that force the agent to keep working until its output is verified. The agent can't declare itself done until the harness agrees.

What breaks without it: refunds it shouldn't, discounts that don't exist, competitor mentions, broken output — because the rules were advisory, not enforced.

06. Sub-agents

Sub-agents are not "a product-search agent and a returns agent that collaborate." One team built five specialized agents; orchestration overhead added 800ms per handoff. Collapsing to one agent cut task time from 45 seconds to 18.

Sub-agents are for context control. A sub-agent runs a discrete task and returns a short summary. HumanLayer calls this a context firewall. For e-commerce: a sub-agent does a complex product comparison across 50 items and returns only the top 3, keeping the parent's context clean.

The rule: always start with one well-harnessed agent. Only reach for sub-agents when you have evidence that a single context can't hold the task.

What breaks without it: by turn 40, the context is mostly catalog data and the agent can't remember whether the customer wanted the blue one or the green one.


Five findings that look wrong until you measure them

1. More tools is worse, not better. 10 tools = perfect. 107 = total failure.

2. LLM-generated config files actively hurt. ETH Zurich: LLM-written ones cost 20%+ more tokens for worse results.

3. Maxing reasoning everywhere is dumber than mixing. "Reasoning sandwich" (xhigh on planning, high on execution) beats xhigh everywhere: 66.5% vs. 53.9%.

4. Don't scrub failed attempts. The next attempt makes the exact same mistake. Error evidence is signal.

5. A single complex task can cost $5–$15 in API calls. For a 20-person team: $1,000–$3,000/day. Prompt caching reduces it 40–60%, but only with careful harness engineering.


Coming up — Part 3: Building your first agent harness in Python — an e-commerce customer-service bot in about a hundred lines.

Sources: Mitchell Hashimoto, Addy Osmani, HumanLayer, OpenAI, Vikas Sah, Bhavishya Pandit, and the awesome-harness-engineering list.