The most-asked question we got at AU buyer meetings this quarter was not “which model?” It was “what is the harness?” — and how much of it the customer has to own.
The definition that finally stuck
Through 2024 and early 2025, “agent” meant whatever the speaker wanted it to mean — a model call with tools, a flow chart of nodes, a chatbot with a vague mandate to “do work”. The term the industry converged on was harness. Martin Fowler, Anthropic’s engineering blog and OpenAI’s Codex team all published variants of the same picture in the second half of 2025 and into 2026. The formula they agreed on:
Agent = Model + Harness.
The model contributes intelligence — token prediction conditioned on a prompt. The harness contributes everything else an agent needs to do useful work: a running orchestration loop, a curated set of tools, a way to remember things between sessions, context discipline so the window doesn’t blow up, a verification signal so wrong answers don’t silently ship, approval gates for sensitive actions, observability so an operator can debug a bad run, and routing so the right model gets the right task at the right price.
Two practical consequences fall out of the formula. First, models are interchangeable in a way most organisations underestimate. We swap the underlying model on customer workflows every quarter — Claude Opus to GPT-5.5 to DeepSeek V4-Pro to Gemini 3.1 — and most of the harness keeps running. Second, the harness is where almost every real failure mode lives. Hallucinated answers usually trace back to thin context, not a bad model. Stalled queues trace back to brittle tool wiring. Bills that triple overnight trace back to a routing layer that never got built.
The eight components, what they actually do
Different sources count between seven and eleven components. Eight is the smallest set that names the load-bearing pieces without collapsing distinctions that matter in production.
1. The orchestration loop
Almost every working harness in 2026 runs a ReAct loop: reason → act via a tool → observe the result → loop until done or budget exhausted. This is the heartbeat. Without it the model is a one-shot text completion. With it the model becomes the planner of a sequence of real-world actions. The loop is also where the cost lives — every iteration is another inference call.
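In code, the heartbeat is small. Here is a minimal sketch of the loop, not any vendor’s implementation; `call_model` and `dispatch_tool` are stand-ins you would wire to your own inference client and tool registry:

```python
from typing import Any, Callable

def run_agent(
    task: str,
    call_model: Callable[[list[dict]], dict],    # your inference client
    dispatch_tool: Callable[[str, dict], str],   # your tool registry (component 2)
    max_iterations: int = 10,                    # the cost budget: each pass is an inference call
) -> str:
    """Reason -> act -> observe, until done or the budget runs out."""
    context: list[dict[str, Any]] = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        step = call_model(context)                            # reason: plan the next action
        if step.get("type") == "final_answer":
            return step["content"]                            # done: no tool call needed
        observation = dispatch_tool(step["tool"], step["arguments"])   # act
        context.append({"role": "assistant", "content": repr(step)})
        context.append({"role": "tool", "content": observation})      # observe, then loop
    return "budget exhausted before the task completed"
```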
2. The tool registry
Tools are the model’s actuators. File read and write, shell or bash execution, HTTP calls, database queries, the typed adapters for CRM, helpdesk and email systems the customer actually uses. The registry decides which tools exist for which agents, validates arguments, dispatches the call, and feeds the result back into the context. Tool design is where you encode policy: a tool that can’t send email without an approval step is safer than a prompt that says “please don’t”.
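A sketch of the shape, with hypothetical tool and argument names. The point is that validation and policy live in the registry, not in the prompt:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    handler: Callable[..., str]
    required_args: set[str]
    needs_approval: bool = False   # policy is a property of the tool, not a polite request

@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def dispatch(self, name: str, args: dict) -> str:
        tool = self.tools.get(name)
        if tool is None:
            return f"error: no tool named {name!r} for this agent"
        missing = tool.required_args - args.keys()
        if missing:
            # the error string goes back into context so the model can retry correctly
            return f"error: missing arguments {sorted(missing)}"
        if tool.needs_approval:
            # hand off to the approval queue (component 6) instead of executing
            return f"{name} queued for human approval; not executed"
        return tool.handler(**args)
```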
3. Context engineering
The model can only act on what is in its window. Context engineering is the disciplined work of deciding what gets injected on each turn — the system prompt, the playbook, the relevant document chunks, the conversation so far, the tool outputs — and what gets stripped. Anthropic’s work on long-running agents named the patterns: compaction (summarise old turns when the window gets full), observation masking (hide tool noise the model doesn’t need), and just-in-time retrieval (keep cheap identifiers in context, load the heavy content only when called).
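A compressed sketch of what per-turn assembly looks like under those three patterns. The character budget stands in for a real token budget, and `summarize` is a hypothetical call out to a cheap model:

```python
MAX_CHARS = 40_000   # crude stand-in for a token budget

def assemble_context(system_prompt: str, playbook: str, history: list[dict],
                     retrieved_chunks: list[str], summarize) -> str:
    # just-in-time retrieval: only the chunks relevant to this turn get injected
    window = [system_prompt, playbook, *retrieved_chunks]
    # observation masking: keep a stub of bulky tool output, hide the noise
    trimmed = [
        turn if turn["role"] != "tool" or len(turn["content"]) < 1_000
        else {**turn, "content": turn["content"][:1_000] + " [masked]"}
        for turn in history
    ]
    # compaction: fold the oldest turns into a summary once over budget
    while (sum(len(t["content"]) for t in trimmed)
           + sum(map(len, window)) > MAX_CHARS and len(trimmed) > 10):
        oldest, trimmed = trimmed[:10], trimmed[10:]
        trimmed.insert(0, {"role": "system", "content": summarize(oldest)})
    window.extend(f"{t['role']}: {t['content']}" for t in trimmed)
    return "\n\n".join(window)
```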
4. Memory
Each session starts with no recollection of the last. Memory is how knowledge from one run carries into the next. Most harnesses ship a memory-file convention — a per-workflow playbook the model is encouraged to update on every run — plus a more structured store for facts: customer preferences, prior decisions, edge cases that should never be repeated. The pattern matters more than the technology. An agent that can’t edit its own playbook is going to make the same mistake on Tuesday it made on Monday.
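The memory-file convention fits in a dozen lines. A sketch, assuming a per-workflow markdown playbook on disk; the file path is illustrative:

```python
from datetime import datetime, timezone
from pathlib import Path

PLAYBOOK = Path("playbooks/inbox-triage.md")   # illustrative per-workflow path

def load_playbook() -> str:
    """Injected into the context at the start of every run."""
    return PLAYBOOK.read_text() if PLAYBOOK.exists() else "# Playbook\n(no entries yet)"

def append_lesson(lesson: str) -> None:
    """Exposed to the agent as a tool, so Monday's mistake gets written down
    instead of repeated on Tuesday. Timestamped so an operator can review the diff."""
    PLAYBOOK.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    with PLAYBOOK.open("a") as f:
        f.write(f"\n- [{stamp}] {lesson}")
```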
5. Verification
This is the component most early agent stacks skipped. Verification means giving the agent a feedback signal it can act on: a test suite that runs after a code change, a schema validator that catches malformed JSON, a second-pass classifier that scores the output of the first one, a hook that re-prompts on failure with the specific error. Without verification, every wrong answer ships. With it, the agent gets to self-correct inside the loop.
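The check-and-retry pattern in miniature. `call_model` is a stand-in again; the validator here demands parseable JSON with one required key, but any signal the model can act on works the same way:

```python
import json
from typing import Callable

def verified_generate(prompt: str, call_model: Callable,
                      validate: Callable[[str], str | None],
                      max_attempts: int = 3) -> str:
    """Generate, verify, and re-prompt with the specific error on failure."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        output = call_model(messages)
        error = validate(output)
        if error is None:
            return output              # verified: only now does the answer ship
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "user",
                         "content": f"That failed verification: {error}. Fix it and retry."})
    raise RuntimeError("output failed verification on every attempt")

def validate_json_with_status(output: str) -> str | None:
    """Minimal example validator: parseable JSON with a 'status' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as e:
        return f"invalid JSON: {e}"
    return None if "status" in data else "missing required key 'status'"
```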
6. Approval gates
Some actions are reversible. Some are not. Sending a customer email is not reversible. Posting to a public channel is not reversible. Wiring money is very much not reversible. A working harness routes those actions through a queue that a named human signs off on, and lets reversible work proceed unattended. The approval gate is the load-bearing piece of every claim about AI being “safe” to put in front of customers.
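In sketch form, a gate is a few lines in front of the dispatcher. The action names are illustrative; the property that matters is that irreversible work cannot execute without a queued, named sign-off:

```python
from dataclasses import dataclass

IRREVERSIBLE = {"send_email", "post_public", "wire_funds"}   # illustrative action names

@dataclass
class PendingAction:
    tool: str
    args: dict
    approver: str | None = None   # a named human, filled in at sign-off

approval_queue: list[PendingAction] = []

def gated_dispatch(tool: str, args: dict, registry) -> str:
    if tool in IRREVERSIBLE:
        approval_queue.append(PendingAction(tool, args))
        return f"{tool} queued for human approval; nothing was sent"
    return registry.dispatch(tool, args)   # reversible work proceeds unattended
```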
7. Observability
When the queue stops clearing at 11pm on a Tuesday, somebody has to be able to answer “what did the agent see, what did it decide, which tool returned what?” Observability is the structured trace of every step — prompts, outputs, tool calls, latencies, costs — plus the dashboards and alerting that surface incidents before the customer notices. It is also the substrate evals run against.
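The substrate is unglamorous: one structured record per step. A sketch, with a JSONL file standing in for a real trace store:

```python
import json
import time
import uuid

def trace_step(run_id: str, kind: str, payload, latency_ms=None, cost_usd=None) -> None:
    """One record per step: what the agent saw, what it decided, what each
    tool returned, plus the latency and cost the dashboards aggregate."""
    record = {"run_id": run_id, "ts": time.time(), "kind": kind,
              "payload": payload, "latency_ms": latency_ms, "cost_usd": cost_usd}
    with open("traces.jsonl", "a") as f:   # stand-in for a real trace store
        f.write(json.dumps(record) + "\n")

run_id = str(uuid.uuid4())
trace_step(run_id, "prompt", "triage the overnight inbox")
trace_step(run_id, "tool", {"name": "fetch_inbox", "result_count": 42}, latency_ms=180)
```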
8. Routing
Routing is what the 2026 model release cadence forces. DeepSeek V4-Flash, Claude Haiku 4.5, GPT-5-mini and Gemini 3.0 Flash are now cheaper per token than 2024 hobby-tier models — and good enough for high-volume classification, triage and summarisation. Reserve frontier models for the work that needs them. Our routing layer makes per-task model selection a configured rule, not a procurement cycle every time a new release ships. (We wrote up the V4 routing decision separately — DeepSeek V4: when to route to it, when to skip it.)
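The mechanics are deliberately boring: a lookup table in configuration. The model identifiers below are illustrative strings, not actual API names:

```python
# per-task routing as configuration, not a procurement cycle
ROUTES = {
    "classification": "deepseek-v4-flash",   # high-volume, cheap-tier work
    "triage":         "claude-haiku-4.5",
    "summarisation":  "gpt-5-mini",
    "default":        "frontier-model",      # reserved for the work that needs it
}

def pick_model(task_type: str) -> str:
    return ROUTES.get(task_type, ROUTES["default"])

# when a new release ships, rerouting a task category is a one-line change:
ROUTES["triage"] = "gemini-3.0-flash"
```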
The four shapes a harness shows up as
Most operators have already touched one of these without naming it. The shapes matter because they decide how much of the runtime the customer ends up owning.
- Coding harnesses. The terminal- and IDE-based assistants that write, run and debug software — Anthropic’s Claude Code and Agent SDK, OpenAI’s Codex, Cursor, Cline, Aider. Tight loop, filesystem primary, heavy reliance on a test suite for verification. This is the shape that proved the formula works.
- DIY frameworks. Code libraries that let an engineering team assemble its own harness. You still ship the runtime, the memory store, the eval suite, the observability dashboards and the approval UX. Powerful and slow. Pays off when the runtime is your product.
- Low-code agent builders. Drag-and-drop platforms that let a competent engineer ship a first workflow in a day. The harness still has to be operated; the platform just decides which surfaces you operate it through. Most of these create a permanent “somebody has to babysit this” role.
- Managed harnesses. The runtime is operated for you as a service. You employ the work, not the platform. This is the shape Rebotify ships — one named AI employee, one workflow, live in 48 hours, weekly tuning included.
Build versus buy: where the decision actually lives
Almost every conversation we have with operators eventually arrives here. The question is rarely “should we use AI?” The question is “who runs the harness?”
Build your own if shipping the runtime is your product. If you are a dev-tools company, an agent platform, or a vertical SaaS that needs the harness as a moat, you should own every layer — and you should staff a team that knows what ReAct, compaction, observation masking and eval drift mean by next Tuesday.
Buy a self-serve framework if you have an internal engineering team with appetite for the maintenance load. A platform will let a competent engineer ship a first workflow this month. The hidden cost is the operating role: somebody on payroll now owns the prompts, the eval set, the tool wiring, the on-call when the queue stops clearing, and the migration the next time the model lineup changes.
Hire a managed harness if you want to employ the work, not maintain the runtime. This is the wedge Rebotify is built on. We own the model selection, the prompts, the tool wiring, the memory store, the eval coverage, the approval queues, the observability, and the weekly tuning. You own the workflow, the approvals, and the business outcome — one named AI employee, one queue, live in 48 hours.
Inside our harness, briefly
We do not publish a full architecture diagram — the configuration is per-customer by design — but the named pieces are public. The orchestration loop runs each AI employee through a planning step, a tool step, and a check step, with sub-agent delegation for long-running work. Tool registries are scoped per employee: an inbox-triage employee never gets shell access; a contract-review employee never gets send-email access. Context lives in the customer’s region — Sydney for AU buyers, US-East for North America. Memory is a versioned playbook the operator and the employee both edit, with a weekly review on every workflow. Evals run on a frozen golden set per workflow, with regression alerts when a model swap moves accuracy outside tolerance. Approval gates are non-optional on customer-facing, financial, legal and policy-sensitive actions. Observability is the trace store plus the human-readable weekly report we hand back to you.
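Not our production schema, but the shape of per-employee scoping is easy to show. Every field in this illustrative sketch maps to a sentence above; the names and paths are hypothetical:

```python
INBOX_TRIAGE_EMPLOYEE = {
    "tools": ["fetch_inbox", "classify_ticket", "draft_reply"],  # no shell access
    "region": "ap-southeast-2",               # Sydney: context stays in the buyer's region
    "playbook": "playbooks/inbox-triage.md",  # versioned; operator and employee both edit
    "eval_set": "golden/inbox-triage-v3.jsonl",  # frozen; regression alerts on model swaps
    "approval_required": ["send_reply"],      # customer-facing actions are always gated
}
```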
The point is not that this is novel. The point is that this is the load-bearing work, and somebody has to do it whether the customer hires us or staffs the role internally.
Four questions to ask before signing anything
If you are evaluating an AI vendor in 2026 and the word “harness” has not come up by the second call, ask these four. The answers separate the vendors that will own outcomes from the ones that will hand you a configuration screen and walk away.
- Who owns the runtime when the underlying model gets swapped?
- Who owns the eval set when accuracy drifts after a tool change?
- Who owns the approval queue when a customer-facing reply needs sign-off at 7am?
- Who owns the incident at 11pm on a Tuesday?
Somebody always owns it. The failure mode is discovering, six months in, that the queue is half-tuned, the eval set never got written, and the agent has been quietly shipping the wrong things for three weeks because nobody on either side was the named owner.
The offer
One workflow. One named AI employee. Live in 48 hours. The harness is on us — the orchestration loop, the tool wiring, the memory store, the evals, the approval queues, the observability, the weekly tuning. Flat or per-task pricing. Cancel any time.
Tell us the queue that already costs the most time. Read how managed AI works or see pricing.
— The Rebotify team
HOW THIS WAS WRITTEN
Drafted from notes taken in customer-discovery and operating-team sessions with the AI employees we run for clients. Reviewed and fact-checked by the Rebotify team. AI-assisted at the drafting stage; every named source, model and harness component was checked against a public reference before publication. Updated when the source links change.
FAQ
What is an agent harness, in one sentence?
An agent harness is the runtime software around a language model that decides what the model sees, what tools it can call, how state persists between turns, when work needs human approval, and what happens when something fails.
How is a harness different from an agent framework?
“Framework”, “scaffold”, “runtime” and “harness” overlap. The 2026 convention is that the harness is the production execution layer — the running orchestration loop plus tools, memory, context, verification and safety — while a framework is a code library used to build one.
Why does the harness matter more than the model?
Models keep changing every few weeks. The harness is what stays. Tool wiring, memory, eval coverage, approval gates and incident response are where most of the real work — and most of the failure modes — actually live.
Should we build our own harness or use a managed one?
Build the harness if shipping the runtime is your core product. Use a managed one if you want to employ the work, not maintain the infrastructure. Most operating teams are in the second group.