Building a Coding Agent Harness From First Principles
A coding agent is a stateful control loop, not a chat request. How I’d design the harness—loop, context pipeline, scheduling, and permissions—so it can run for hours without corrupting state.
April 2, 2026
Most teams start building coding agents by wiring a model to a prompt and a few tools. That works for demos. It fails quietly in production.
What actually ships is not “the model”—it’s the harness: the control system that turns model output into real-world effects while keeping the system debuggable, safe, and coherent over long sessions.
In this post, I want to walk through how I’d build that harness from first principles, grounded in a concrete system design. I’m not trying to sell you a framework; I’m trying to leave you with a mental model. After reading this, you should be able to sketch a harness that can run for hours, survive failures, and coordinate tools without corrupting state.
Everything here stitches together ideas I’ve seen scattered across good system docs—loop design, memory, projection, scheduling, permissions—but I’m presenting them as one build narrative rather than a spec checklist.
The problem
A coding agent is not a request/response system. It’s a stateful control loop with side effects. Once you say that out loud, three classes of problems stop sounding like “edge cases” and start sounding like structural bugs in naive designs.
Iterated inference breaks “one request” assumptions. After the model calls a tool, you’re in a loop: model → tool → result → model → …. If your stack assumes one API call per user turn, you eventually get duplicate side effects, inconsistent transcripts, and partial execution you can’t replay.
Context explodes over time. Long sessions hit token limits, enormous tool outputs, and the slow erosion of earlier signal. “Just send the full history” is a strategy with a very short half-life.
Tool execution is not just function calling. Tools bring ordering constraints (writes vs reads), permission decisions, and failure recovery. Treating them as trivial RPC calls is how you get race conditions and unsafe execution—the kind that doesn’t show up in a ten-minute demo.
Initial thinking (and where the first version breaks)
The first version a lot of people build is intuitive: take user input, append to messages, call the model, if there’s a tool call then execute it, append the result, repeat. That almost works.
Where it breaks is less philosophical than operational. Without a clean split between transcript and prompt, context becomes unstable. Without scheduling, concurrent tool calls corrupt state. Without compaction, long sessions die on token limits. Without a permission layer, the system is unsafe by default. Without structured loop control, you can’t debug or abort cleanly—so you end up patching behavior reactively instead of designing a system.
A breakdown of the mental model
1. The harness is a control loop
At the center is a loop—not a request. The loop owns state, iteration, and termination. Everything else plugs into it. That’s the architectural center of gravity.
```mermaid
flowchart LR
  subgraph loop["Query loop"]
    M[Model stream]
    T[Tool calls]
    R[Results → log]
    M --> T --> R --> M
  end
  L[(Event log / transcript)]
  R --> L
  L --> P[Projection]
  P --> M
```
2. Transcript vs prompt projection
You really do want two representations, and conflating them is how systems become impossible to reason about.
| Layer | Purpose |
|---|---|
| Transcript (source of truth) | Full history—for UI, audit, replay |
| API projection | Exactly what the model sees this turn |
The projection step is where you handle compaction, trimming, tool-output replacement, and injected context. Skip that layer and you get something non-reproducible, cache-unstable, and painful to debug. In practice, I treat this separation as non-negotiable—not because I love ceremony, but because it’s where determinism lives.
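A minimal sketch of the split, assuming a transcript of plain dict events and an illustrative per-tool-result character budget (the names `Session`, `project`, and `TOOL_OUTPUT_BUDGET` are mine, not a real API):

```python
from dataclasses import dataclass, field

TOOL_OUTPUT_BUDGET = 200  # illustrative: max chars per tool result in the projection


@dataclass
class Session:
    # Transcript: append-only source of truth. Never rewritten.
    log: list[dict] = field(default_factory=list)


def project(log: list[dict]) -> list[dict]:
    """Build the API messages for this turn. The transcript stays intact;
    only the projection truncates oversized tool results."""
    projected = []
    for event in log:
        if event["role"] == "tool" and len(event["content"]) > TOOL_OUTPUT_BUDGET:
            event = {**event, "content": event["content"][:TOOL_OUTPUT_BUDGET] + "...[truncated]"}
        projected.append(event)
    return projected


session = Session()
session.log.append({"role": "user", "content": "run the tests"})
session.log.append({"role": "tool", "content": "x" * 10_000})

api_messages = project(session.log)
# The transcript keeps the full 10,000-char output; the projection does not.
```

The point of the shape, not the truncation policy: the projection is a pure function of the log, so the same transcript always replays to the same prompt.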
3. Context is a pipeline, not a string
Context assembly has to be ordered and explicit. A typical pipeline might look like this in principle:
- Apply tool output budgets
- Trim history (when needed)
- Run micro-compaction, then full compaction if you’re up against limits
- Inject system + user context
- Attach memory / repo state
Order isn’t an implementation detail—it affects caching, determinism, and correctness. Context isn’t “a prompt”; it’s a build pipeline per turn.
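One way to make that ordering explicit is a fixed list of stages, each a pure transformation over the message list. The stages below are deliberately naive stand-ins; the shape is what matters:

```python
from collections.abc import Callable

# Each stage takes and returns the message list; the order is fixed and visible.
Stage = Callable[[list[dict]], list[dict]]


def apply_tool_budgets(msgs: list[dict]) -> list[dict]:
    """Truncate oversized tool outputs (500 chars is an illustrative budget)."""
    return [
        m if m.get("role") != "tool" or len(m["content"]) <= 500
        else {**m, "content": m["content"][:500]}
        for m in msgs
    ]


def trim_history(msgs: list[dict], keep: int = 50) -> list[dict]:
    """Keep only the most recent messages (a crude stand-in for compaction)."""
    return msgs[-keep:]


def inject_system(msgs: list[dict]) -> list[dict]:
    return [{"role": "system", "content": "You are a coding agent."}] + msgs


PIPELINE: list[Stage] = [apply_tool_budgets, trim_history, inject_system]


def build_context(log: list[dict]) -> list[dict]:
    msgs = list(log)  # never mutate the transcript
    for stage in PIPELINE:
        msgs = stage(msgs)
    return msgs
```

Because the pipeline is data, reordering it is a one-line diff you can see in review—which is exactly where ordering bugs (budgeting after trimming, injecting before compacting) get caught.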
4. Tool execution is scheduling
Tool execution behaves like a constrained scheduler: read-only work can often parallelize; mutations should serialize; when you’re unsure about safety, you fall back to conservative ordering. That’s not premature optimization—it’s how you avoid filesystem races, inconsistent state, and flaky “it worked twice” behavior.
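A sketch of that partitioning, assuming each tool call is a dict with a `name` and that tools are classified by a read-only allowlist (the tool names are hypothetical):

```python
import asyncio

READ_ONLY = {"read_file", "grep", "list_dir"}  # assumed tool names, illustrative


async def run_tools(calls: list[dict], execute) -> list[dict]:
    """Run read-only calls concurrently, then mutations one at a time.
    Unknown tools are treated as mutations: the conservative default."""
    reads = [c for c in calls if c["name"] in READ_ONLY]
    writes = [c for c in calls if c["name"] not in READ_ONLY]

    # Reads can't race each other, so they fan out.
    results = list(await asyncio.gather(*(execute(c) for c in reads)))

    # Anything that can mutate state runs serially, in order.
    for call in writes:
        results.append(await execute(call))
    return results
```

The allowlist-with-conservative-fallback is the key move: a tool you can't classify is a tool you serialize.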
5. Permissions are a first-class system
Every tool call should pass through one decision point: allow, deny, or require confirmation—no exceptions. When permission logic scatters, you lose auditability, introduce security gaps, and stop being able to explain what the system would have done under different policy. Permissions aren’t a product feature tacked onto tools; they’re part of the execution model.
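The single decision point can be as small as one function returning a closed enum. The policy below is purely illustrative—the structural point is that there is exactly one place where it lives:

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    CONFIRM = "confirm"  # surface to the user before executing


def check_permission(tool: str, args: dict) -> Decision:
    """The one gate every tool call passes through. Example policy:
    reads are allowed, destructive shell commands are denied,
    everything else requires confirmation."""
    if tool in {"read_file", "grep"}:
        return Decision.ALLOW
    if tool == "bash" and "rm -rf" in args.get("command", ""):
        return Decision.DENY
    return Decision.CONFIRM
```

Because every call routes through `check_permission`, "what would the system have done under policy X?" becomes a question you can answer by calling a function—which is what auditability means in practice.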
6. Input is a compiler
User input isn’t raw text floating into the model. It gets transformed into structured messages: slash commands, attachments, environment context, meta messages about the session. That pipeline is what stops UI semantics from leaking into the model and keeps replay behavior consistent. I think of it as a small compiler from intent → messages.
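A toy version of that compiler, assuming slash commands and an environment dict as the two non-text inputs (the message shapes and command names are invented for illustration):

```python
def compile_input(raw: str, env: dict) -> list[dict]:
    """Turn raw user input into structured messages. Slash commands and
    environment context become explicit messages instead of leaking UI
    semantics into the prompt string."""
    messages: list[dict] = []
    if env:
        # Environment context travels as its own message, not inlined into text.
        messages.append({"role": "meta", "type": "environment", "data": env})
    if raw.startswith("/"):
        # Slash commands are parsed here, never forwarded as literal text.
        name, _, rest = raw[1:].partition(" ")
        messages.append({"role": "meta", "type": "command", "name": name, "args": rest})
    else:
        messages.append({"role": "user", "content": raw})
    return messages
```

Replay consistency falls out for free: the transcript stores the compiled messages, so a replayed session never depends on re-parsing UI input.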
7. Memory is three separate systems
Most teams collapse these; they behave badly when you merge them.
| Type | Role |
|---|---|
| Prompt context | What the model sees now |
| Transcript | Full interaction history |
| Session memory | Durable, externalized notes—often updated asynchronously |
Session memory should not block the main loop. When you mix all three into one blob, you get bloated prompts, lost information, and unpredictable recall. Keeping them orthogonal is annoying upfront and saves you weeks later.
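The "doesn't block the main loop" property is concrete: memory writes are scheduled as background tasks and only awaited at shutdown. A minimal sketch (the `SessionMemory` class and its method names are assumptions, not a real library):

```python
import asyncio


class SessionMemory:
    """Durable notes updated off the hot path. The query loop schedules a
    write and moves on; persistence happens in the background."""

    def __init__(self) -> None:
        self.notes: list[str] = []
        self._tasks: set[asyncio.Task] = set()

    def record(self, note: str) -> None:
        """Non-blocking: returns immediately, write happens later."""
        task = asyncio.get_running_loop().create_task(self._persist(note))
        self._tasks.add(task)  # keep a strong reference so the task isn't GC'd
        task.add_done_callback(self._tasks.discard)

    async def _persist(self, note: str) -> None:
        await asyncio.sleep(0)  # stand-in for a real write (disk, DB, subagent)
        self.notes.append(note)

    async def flush(self) -> None:
        """Await only at shutdown, never inside the query loop."""
        await asyncio.gather(*self._tasks)


async def main() -> list[str]:
    memory = SessionMemory()
    memory.record("user prefers pytest over unittest")
    # ...the main loop keeps iterating here without awaiting the write...
    await memory.flush()
    return memory.notes
```

The strong-reference set is not decoration—asyncio tasks held only by the event loop can be garbage-collected mid-flight, which is exactly the kind of silent failure this post is about.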
The architecture I’d implement
At a high level, the components line up cleanly:
- Event log (transcript) — append-only, source of truth
- Query loop — async generator handling iteration, streaming, termination
- Projection layer — log → API messages, with compaction and budgeting
- Tool scheduler — partitions parallel vs serial work
- Permission gate — centralized decisions
- Context pipeline — deterministic ordering
- Optional memory system — async, often a forked or secondary process/agent
Build order (practical): I’d still phase this in so complexity doesn’t arrive before the loop exists.
Phase 1 — Loop: Implement the query loop, handle tool calls, add abort support.
Phase 2 — Safety: Introduce the permission system and basic tool scheduling.
Phase 3 — Projection: Add compaction and tool-result budgeting.
Phase 4 — Subagents: Reuse the same loop recursively where it makes sense.
Phase 5 — Memory: Add async summarization or external memory writes.
That order avoids painting yourself into a corner: you commit to the architecture early without having to build every subsystem on day one.
Implementation sketch
Here’s a minimal Python shape of the loop that captures the core idea—projection is explicit, the loop owns control flow, tools feed back into state, and termination is obvious.
```python
from collections.abc import AsyncIterator
from typing import Any


async def query_loop(state: Any, deps: Any) -> AsyncIterator[Any]:
    while True:
        # Projection happens every turn: the transcript is never sent raw.
        messages = prepare_messages_for_api(state.log, state.budgets)
        stream = await deps.model.stream(messages=messages, tools=deps.tools)

        tool_calls: list[Any] = []
        async for delta in stream:
            yield delta  # stream deltas to the caller as they arrive
            if delta.get("type") == "tool_use":
                tool_calls.append(delta)

        # "No tools requested" is the loop's terminal condition.
        if not tool_calls:
            return

        # Tool results land back in durable state before the next iteration.
        results = await run_tools(tool_calls, deps)
        state.log.extend(results)
```
What matters isn’t the surface syntax—it’s the invariants: projection happens every turn, the loop is the control plane, tool results land back in durable state, and “no tools” is a real terminal condition. Everything else (compaction policies, permission checks, scheduling rules) plugs into `prepare_messages_for_api`, `run_tools`, and how you append to `state.log`.
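Those invariants are checkable. Here’s the sketch driven end to end by a fake model that issues one tool call and then plain text—everything here (`FakeModel`, the stubbed helpers, the loop repeated so the snippet runs standalone) is illustrative scaffolding, not a real API:

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass, field
from typing import Any


# Stub helpers with the shapes the sketch assumes (hypothetical).
def prepare_messages_for_api(log: list, budgets: dict) -> list:
    return list(log)


async def run_tools(tool_calls: list, deps: Any) -> list:
    return [{"type": "tool_result", "for": c["name"]} for c in tool_calls]


async def query_loop(state: Any, deps: Any) -> AsyncIterator[Any]:
    # The loop from above, repeated so this snippet is self-contained.
    while True:
        messages = prepare_messages_for_api(state.log, state.budgets)
        stream = await deps.model.stream(messages=messages, tools=deps.tools)
        tool_calls: list[Any] = []
        async for delta in stream:
            yield delta
            if delta.get("type") == "tool_use":
                tool_calls.append(delta)
        if not tool_calls:
            return
        state.log.extend(await run_tools(tool_calls, deps))


class FakeModel:
    """Turn 1: one tool call. Turn 2: plain text, which ends the loop."""

    def __init__(self) -> None:
        self.turn = 0

    async def stream(self, messages: list, tools: list):
        self.turn += 1
        first = self.turn == 1

        async def deltas():
            if first:
                yield {"type": "tool_use", "name": "read_file"}
            else:
                yield {"type": "text", "content": "done"}

        return deltas()


@dataclass
class State:
    log: list = field(default_factory=list)
    budgets: dict = field(default_factory=dict)


@dataclass
class Deps:
    model: FakeModel
    tools: list = field(default_factory=list)


async def main() -> tuple:
    state, deps = State(), Deps(model=FakeModel())
    received = [delta async for delta in query_loop(state, deps)]
    return received, state
```

Running `asyncio.run(main())` exercises both branches: the tool call round-trips through the log, and the tool-free turn terminates the generator cleanly—precisely the two behaviors a harness test should pin down first.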
What I’d do differently next time
If I were building this again from scratch, a few regrets would guide me earlier.
I’d start with projection sooner. Most teams add compaction after the pain is already chronic; even a naive projection layer from day one keeps you honest.
I’d over-invest in observability: time to first token, tool execution latency, compaction frequency. Without those signals, you’re tuning blind.
I’d be stricter about tool schemas earlier. Ambiguous inputs destroy concurrency guarantees and make scheduling guesses worthless.
I’d treat subagents as first-class earlier—recursive reuse of the same harness simplifies the overall system, but only if you design for it instead of bolting it on.
And I’d write invariants in comments—message shape constraints, streaming edge cases, tool-call semantics. That sounds boring until 3 a.m. when production diverges and those comments are the only map you have.