All Vaults
AH

Agent Harnesses

Coding Agent Infrastructure

The canonical reading list on agent harnesses — the scaffolding that wraps LLM calls into reliable coding agents. 31 essays, papers, and podcasts covering Claude Code, Cursor, Devin, Aider, Replit Agent, MCP, and the harness-engineering discipline.

31 articles · 4 phases · Updated 5/6/2026
Curated by @burn451
Get Burn 451

The harness — not the model — decides whether your agent ships. Memory, tools, permissions, telemetry: that's the surface area that matters in production.

About this vault

An 'agent harness' is the production scaffolding around an LLM — the retry logic, memory store, tool dispatcher, permission gate, and telemetry pipeline that turns a raw model into something an engineer trusts to run for hours. The model is the engine; the harness is the chassis. By 2026, the gap between research demos and shipped code has collapsed onto exactly this distinction: every team building agentic coding tools (Anthropic's Claude Code, Cursor's Composer, Cognition's Devin, Replit Agent, OpenHands, Aider) is fundamentally competing on harness quality, not raw model intelligence.

This vault collects the thirty-one most-cited primary sources on harness design — Anthropic's engineering posts on context engineering, tools-for-agents, managed agents, and effective harnesses for long-running work; Princeton's SWE-Agent paper that introduced 'agent-computer interface' as a discipline; Boris Cherny's Latent Space episode on why Claude Code is a Unix utility, not a product; Cursor's Composer architecture and Background Agents announcements; Cognition's Devin technical reports; Lilian Weng's foundational LLM-agent post; Karpathy's vibe-coding origin tweet and follow-up agentic-engineering thread; Harrison Chase on deep agents; Simon Willison's running architectural commentary; and Thorsten Ball's field reports from inside Zed and Sourcegraph Amp.

The articles are organized into four phases mapping the 2023-2026 evolution: from hand-rolled agent loops, through IDE-embedded autocomplete, into the Claude Code / MCP era, and now into harness comparison and long-running orchestration. Burn 451's read-later infrastructure is itself a harness pattern applied to personal knowledge — this vault is the reading list for any builder evaluating, copying, or shipping agent infrastructure.

31 articles

All 31 articles

Building Effective Agents

Anthropic's December 2024 post that named the discipline. Distinguishes 'workflows' (LLMs orchestrated through predefined code paths) from 'agents' (LLMs dynamically directing their own tool use), then catalogs the composable patterns that actually ship: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. The thesis — simple composable patterns beat heavyweight frameworks — has become the founding citation for nearly every harness-design discussion since. Required reading before you reach for LangChain, LangGraph, or any other orchestration library.
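
Two of the post's workflow patterns — prompt chaining and routing — can be sketched in a few lines. This is a minimal illustration, with a stubbed model call standing in for a real LLM API; the point is that control flow lives in predefined code paths, not in the model:

```python
# Minimal sketch of prompt chaining and routing with a stubbed LLM.
# call_model is a placeholder: a real harness swaps in an API client.

def call_model(prompt: str) -> str:
    """Stub LLM. Replies are illustrative, not real model output."""
    if "classify" in prompt:
        return "refund"
    return f"response to: {prompt}"

# Prompt chaining: each step's output feeds the next predefined step.
def chained(task: str) -> str:
    outline = call_model(f"outline a fix for: {task}")
    draft = call_model(f"write code following this outline: {outline}")
    return call_model(f"review and correct this code: {draft}")

# Routing: classify the input, then dispatch to a specialized prompt.
ROUTES = {
    "refund": "You handle refund requests. ",
    "bug": "You triage bug reports. ",
}

def routed(user_input: str) -> str:
    label = call_model(f"classify into refund/bug: {user_input}").strip()
    system = ROUTES.get(label, "You are a general assistant. ")
    return call_model(system + user_input)

print(routed("I want my money back"))
```

In an 'agent' (the post's other category), the model itself would choose the next tool call instead of following these fixed paths.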

Effective Context Engineering for AI Agents

Anthropic's reframing of prompt engineering as a strict subset of a larger discipline: managing the entire token budget across system prompts, tools, examples, message history, and retrieved data. Argues the central question has shifted from 'what words do I send?' to 'what configuration of context is most likely to generate the desired behavior?' Diagnoses the most common failure mode — bloated tool sets with ambiguous decision points — and gives concrete techniques for compaction, summarization, and tool curation. The companion piece to Building Effective Agents.
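
A compaction pass of the kind the post describes can be sketched simply: keep the system prompt and recent turns verbatim, collapse older turns into a summary once the budget is exceeded. Token counts here are approximated by word counts and the summary is a stub — a real harness would use the model's tokenizer and an LLM call:

```python
# Sketch of context compaction under a token budget (assumptions: word
# count as a token proxy, a stubbed summarizer, keep-last-4 policy).

def approx_tokens(text: str) -> int:
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Stub: a real harness would ask the model for a faithful summary.
    return f"[summary of {len(turns)} earlier turns]"

def compact(system: str, history: list[str], budget: int,
            keep_recent: int = 4) -> list[str]:
    msgs = [system] + history
    if sum(approx_tokens(m) for m in msgs) <= budget:
        return msgs                      # under budget: send everything
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [system, summarize(old)] + recent

history = [f"turn {i}: " + "word " * 50 for i in range(10)]
compacted = compact("You are a coding agent.", history, budget=300)
print(len(compacted))  # system + summary + 4 recent turns = 6
```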

Writing Effective Tools for AI Agents — Using Agents

How to design the tool surface that an agent calls. Treat tool descriptions like onboarding docs for a new hire: name parameters unambiguously (user_id, not user), spell out edge cases, return structured errors. Demonstrates the meta-pattern of using Claude itself to optimize tool definitions, with measured improvements on internal benchmarks. The post that made tool-schema design a first-class concern instead of an API afterthought.
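
The post's advice translates into schemas like the following sketch. The field layout follows the JSON-Schema style most function-calling APIs use; the tool itself (get_user_orders) and its ID format are hypothetical:

```python
# A tool definition written as the post recommends: unambiguous
# parameter names, an onboarding-doc description, structured errors.

GET_USER_ORDERS = {
    "name": "get_user_orders",
    "description": (
        "Return the most recent orders for one user. "
        "Use this when the user asks about order status or history. "
        "If user_id is unknown, call search_users first — do not guess."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {  # 'user_id', not the ambiguous 'user'
                "type": "string",
                "description": "Canonical user ID, e.g. 'usr_1a2b3c'. Never an email.",
            },
            "limit": {
                "type": "integer",
                "description": "Max orders to return (1-50). Defaults to 10.",
            },
        },
        "required": ["user_id"],
    },
}

def get_user_orders(user_id: str, limit: int = 10) -> dict:
    if not user_id.startswith("usr_"):
        # Structured error the agent can act on, not a bare exception.
        return {"error": "invalid_user_id",
                "hint": "IDs look like 'usr_1a2b3c'; call search_users first."}
    return {"orders": [], "user_id": user_id, "limit": limit}

print(get_user_orders("alice@example.com")["error"])  # invalid_user_id
```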

How We Built Our Multi-Agent Research System

The engineering write-up behind Claude's Research feature. A lead agent plans, spawns parallel sub-agents to search, then a CitationAgent attaches sources. Documents the early failure modes — 50 sub-agents for trivial queries, endless searches for non-existent sources, sub-agents distracting each other with status updates — and the prompt-engineering fixes that made it production-grade. The clearest publicly available case study of a multi-agent harness in production.

Effective Harnesses for Long-Running Agents

How to make an agent stay coherent across hours or days when each session starts with a blank context window. Anthropic's solution: an initializer agent that sets up the environment once, plus a coding agent that makes incremental progress and writes a claude-progress.txt file alongside git history for the next session to consume. The post that finally gave 'long-running agent' an answer that isn't just 'bigger context window.'
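
The handoff mechanism can be sketched in a few lines: each session reads claude-progress.txt (plus git history) to rebuild context, does one increment of work, and appends its state for the next session. The entry format below is an assumption — the post specifies the idea, not a schema:

```python
# Sketch of the progress-file handoff between blank-context sessions.
# The [timestamp] DONE/NEXT line format is illustrative.

from pathlib import Path
from datetime import datetime, timezone

PROGRESS = Path("claude-progress.txt")

def load_progress() -> str:
    """Read prior sessions' notes to prepend to the agent's context."""
    return PROGRESS.read_text() if PROGRESS.exists() else "No prior sessions.\n"

def record_progress(done: str, next_step: str) -> None:
    """Append this session's increment for the next session to consume."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with PROGRESS.open("a") as f:
        f.write(f"[{stamp}] DONE: {done} | NEXT: {next_step}\n")

# One session: load context, work, record the handoff.
context = load_progress()
record_progress("added failing test for parser bug", "make the test pass")
print("NEXT: make the test pass" in PROGRESS.read_text())  # True
```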

Scaling Managed Agents: Decoupling the Brain From the Hands

The architectural shift behind Claude Managed Agents: separate the 'brain' (Claude plus harness) from the 'hands' (sandboxes and tools) and the 'session' (the event log). Hands become stateless cattle, not pets. Inference can start before container provisioning finishes, cutting p50 time-to-first-token by ~60% and p95 by >90%. The blueprint enterprise teams are now copying for their own internal agent platforms.

Introducing the Model Context Protocol

The November 2024 launch post for MCP — Anthropic's open standard for connecting AI assistants to data sources and tools. Defines the client/server architecture, ships SDKs in Python, TypeScript, C#, and Java, and seeds the ecosystem with reference servers for Drive, Slack, GitHub, Postgres, and Puppeteer. By 2026, MCP is the closest thing the agent world has to USB-C. Required reading for anyone shipping an agent that needs to talk to anything outside its own process.
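
Under the hood MCP is JSON-RPC 2.0. This stdlib-only sketch builds the tools/list request a client sends and a plausible server response; the method name and inputSchema field follow the spec's tool-discovery shape, while the example tool and its values are illustrative:

```python
# Sketch of an MCP tools/list exchange as plain JSON-RPC 2.0 messages.
# The query_database tool is hypothetical.

import json

request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "query_database",
            "description": "Run a read-only SQL query.",
            "inputSchema": {
                "type": "object",
                "properties": {"sql": {"type": "string"}},
                "required": ["sql"],
            },
        }]
    },
}

wire = json.dumps(request)        # what actually crosses the transport
print(json.loads(wire)["method"])  # tools/list
```

The official SDKs (Python, TypeScript, C#, Java) wrap this wire format so server authors only declare tools and handlers.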

Building Agents With the Claude Agent SDK

The renaming announcement: the Claude Code SDK becomes the Claude Agent SDK to reflect that the same harness powers far more than coding agents. Explains the design principle 'give the agent a computer' — file system, shell, browser — instead of trying to predict every tool it will need. Documents the best practices that emerged from running Claude Code internally, then generalizes them as an SDK any developer can build on. The official replacement for hand-rolled agent loops.

Claude Code Sub-Agents

The reference docs for Claude Code's sub-agent feature. Each sub-agent runs in its own context window with a custom system prompt, scoped tool access, and independent permissions, and is described in a markdown file with YAML frontmatter under .claude/agents/. The parent agent delegates and receives only a summary back. The pattern that made hierarchical agent context management cheap enough for everyone — not just teams with platform engineers.
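
A definition file of the kind the docs describe might look like this — a hypothetical code-reviewer sub-agent at .claude/agents/code-reviewer.md. The frontmatter fields follow the documented name/description/tools shape; the prompt body is illustrative:

```markdown
---
name: code-reviewer
description: Reviews diffs for bugs and style issues. Use after any non-trivial edit.
tools: Read, Grep, Glob
---

You are a careful code reviewer. Examine the changed files, flag bugs,
security issues, and style violations, and return a short prioritized
summary. Do not edit files yourself.
```

The scoped tools line is the permission gate: this sub-agent can read and search but never write, and the parent sees only its summary.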

Claude Code: Anthropic's Agent in Your Terminal — Boris Cherny & Cat Wu (Latent Space)

The Latent Space episode where Claude Code's lead engineer Boris Cherny and PM Cat Wu lay out the design philosophy. Memorable lines: 'Claude Code is not a product, it's a Unix utility', 'do the simple thing first' (memory is just a markdown file, not a vector store), and the disclosure that 80-90% of Claude Code's own code is written by Claude Code. The clearest articulation of why the harness is opinionated and minimal — and why that opinionation is the product.

Head of Claude Code: What Happens After Coding Is Solved — Boris Cherny (Lenny's Newsletter)

Cherny's longest-form interview on where Claude Code goes once code itself is no longer the bottleneck. Discusses the internal Anthropic origin (a side experiment, no master plan), the bet on terminal-native over IDE-embedded, the role of CLAUDE.md as durable context, and the next frontier: agents that ship code without human review in the loop. The product-side companion to the Latent Space technical episode.

Agentic Coding: The Future of Software Development With Agents

Simon Willison's June 2025 long-form post arguing that agentic coding (Claude Code, Cursor agent mode, Codex CLI) is now the dominant pattern for using LLMs to write production code, and that prompt engineering as a standalone discipline is being absorbed into harness design. Walks through his own workflow with Claude Code, the rituals that work, and the failure modes that don't. The clearest single observation post on what 'agentic' actually means in 2025 from someone running it daily.

Embracing the Parallel Coding Agent Lifestyle

Willison's argument that running multiple coding agents in parallel — across git worktrees, branches, or cloud sandboxes — is the natural next step once a single agent reaches Claude Code's reliability. Documents his own setup: Claude Code in worktrees plus Cursor Background Agents plus an occasional Codex CLI run, all coordinated through git. The post that pushed the parallel-agent pattern from edge case to default workflow for power users.

Claude Skills Are Awesome, Maybe a Bigger Deal Than MCP

Willison's contrarian read on Anthropic's Skills feature: organized folders of instructions, scripts, and resources that an agent loads on demand — and why this might matter more than MCP for everyday harness design. Argues Skills solve the actual day-to-day problem (composable, version-controllable agent capabilities) more elegantly than MCP servers, which solve the boundary-crossing data-source problem. The kind of practitioner take that reframes the entire ecosystem conversation.

Composer: Building a Fast Frontier Model With RL

Cursor's technical write-up of Composer — a mixture-of-experts model trained with RL across hundreds of thousands of concurrent sandboxed coding environments. Composer is positioned as the in-house engine specialized for software engineering, with frontier-quality output at ~4x the generation speed of comparable models. Documents the infrastructure (rewriting the VM scheduler to handle bursty training workloads) and the training-vs-production environment unification. Required reading on what it takes to ship a model tuned for the harness rather than the other way around.

Introducing Cursor 2.0 and Composer

The Cursor 2.0 launch — agent-first interface, multi-agent architecture, up to 8 parallel agents isolated via git worktrees, and Composer as the default model. Reframes Cursor from 'AI editor' to 'agent harness with an editor attached', and turns the multi-agent pattern Anthropic and others described in research into a shipped product. The bookend to Aman Sanger's 2023 Latent Space appearance: same company, completely different positioning three years later.

A New Tab Model

Cursor's engineering post on the Fusion Tab model — next-edit prediction, not just next-token autocomplete. Predicts where the cursor will move, which files will be touched, and which lines around the current edit will need changes. ~25% better at difficult edits per line and ~10x longer suggestion stretches than the original 2024 release, deployed via online RL with multiple model rollouts per day. The harness pattern for pre-emptive small edits that complements long-running agent mode.

Cursor.so: The AI-First Code Editor — Aman Sanger of Anysphere (Latent Space)

The first podcast that told the Cursor story — August 2023, before any of the agent vocabulary existed. Aman Sanger explains why a small team forked VSCode rather than wait for Microsoft, the early bet on GPT-4, and the original thesis that humans should focus on bigger problems than code. Worth re-reading in 2026 against the Cursor 2.0 announcement to see how much of the original vision held and how much got rewritten by the harness era.

Introducing Devin, the First AI Software Engineer

Cognition's March 2024 launch post for Devin — the demo that made 'autonomous software engineer' a category. Devin plans, executes, browses, runs commands, and iterates on failures inside its own sandbox, with the user looking over its shoulder via a screen-share style UI. Though controversial in its claims, the post defined the autonomous end of the harness spectrum (long-running, fully delegated) against which IDE-embedded harnesses (Cursor, Claude Code) get measured.

SWE-bench Technical Report (Devin)

Cognition's technical write-up of how Devin reaches 13.86% on SWE-bench — far above the previous 1.96% unassisted baseline. Documents the standardized prompt protocol, the deterministic unit-test evaluation, and the surprising result that 72% of passing solutions take >10 minutes, suggesting iteration depth (not raw model capability) is the dominant factor. The benchmark report that pushed the field toward iteration-budget as a first-class harness metric.

Devin's 2025 Performance Review: 18 Months of Agents at Work

Cognition's year-end retrospective on what Devin actually shipped in 2025 — the categories of tasks it succeeds and fails on, the cost-per-task economics, and the workflow patterns customers converge on (Slack-driven, Linear-ticket-driven, GitHub-PR-driven). The most candid post-mortem on autonomous-agent reality from the company that has the most production data. Counter-narrative to both the hype cycle and the dismissive 'Devin is fake' threads.

On Programming With Agents

Thorsten Ball's framework for working with agents from inside Zed: keep tasks small enough to review in one sitting, watch for early signs of drift (unexpected file changes, repeated retries), require human review of every line before it ships. Written from the editor-builder perspective — Zed's job is to design the surface that makes this review practical. The most honest field report on what 'agent in the loop' actually feels like as a daily practice for someone who refuses to abdicate craft.

Aider: AI Pair Programming in Your Terminal

Aider's canonical README — the open-source terminal harness that pre-dates Claude Code by a year. Connects to Claude, GPT-4, Gemini, or local models, builds a git-backed repo map for context, and commits every edit as an atomic descriptive commit (so reverting is just git reset). Architect/Editor mode pairs a strong-reasoning planner with a fast-cheap editor, hitting 85% on Aider's own multi-file edit benchmark. The reference implementation of the 'git-native pair programmer' harness pattern.

LLM-Powered Autonomous Agents

The June 2023 Lil'Log post that codified the LLM-agent triad: planning (subgoal decomposition, reflection, refinement), memory (short-term context plus long-term external store), and tool use. Cites AutoGPT, GPT-Engineer, BabyAGI as proof-of-concept demos and lays out the architecture pattern every subsequent harness builds on. The single most-cited founding text for the entire 2024-2026 agent harness literature — read this first or you'll keep rediscovering its categories.
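
The triad reduces to a short loop: the planner decides the next action, tool calls run, and each observation feeds back into memory. In this sketch the planner is scripted; a real agent replaces plan_next with an LLM call over the goal plus accumulated memory:

```python
# Minimal planning / memory / tool-use loop with a scripted planner.
# Tool outputs are canned stand-ins for real file and test results.

def read_file(path: str) -> str:
    return f"<contents of {path}>"

def run_tests(target: str) -> str:
    return f"1 test failed in {target}"

TOOLS = {"read_file": read_file, "run_tests": run_tests}

# Scripted plan: (tool, argument) pairs, then None to signal completion.
SCRIPT = [("read_file", "parser.py"), ("run_tests", "tests/"), None]

def plan_next(memory: list[str], step: int):
    """Stub planner: a real agent asks the LLM, given goal + memory."""
    return SCRIPT[step]

def agent_loop(goal: str) -> list[str]:
    memory: list[str] = [f"goal: {goal}"]        # short-term memory
    for step in range(len(SCRIPT)):
        action = plan_next(memory, step)
        if action is None:                        # planner declares done
            break
        tool, arg = action
        result = TOOLS[tool](arg)                 # tool use
        memory.append(f"{tool}({arg}) -> {result}")  # observation fed back
    return memory

trace = agent_loop("fix the parser bug")
print(len(trace))  # goal + 2 observations = 3
```

Every harness in this vault is, at bottom, this loop plus engineering around its failure modes.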

SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering

The Princeton paper that named the discipline of 'agent-computer interface' (ACI) design — arguing language-model agents are a new category of end users and deserve specially-built interfaces, not human IDEs retrofitted. Demonstrates that careful ACI design alone (no model fine-tuning) takes SWE-bench from 3.8% to 12.5%. The academic citation that legitimized harness engineering as a research field rather than just a product trick.

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

The OpenHands (formerly OpenDevin) paper — a community-built MIT-licensed platform for coding agents with sandboxed code execution, multi-agent coordination, and built-in evaluation harness for SWE-bench, WebArena, and 13 other benchmarks. 2,100+ contributions from 188+ contributors by publication. The open-source reference architecture for anyone who wants to study or extend a full agent harness without a corporation behind it.

There's a New Kind of Coding I Call 'Vibe Coding'

The February 2025 throwaway tweet that named a category. Karpathy describes 'vibe coding' as fully giving in to the vibes, embracing exponentials, forgetting the code exists — possible because Cursor Composer plus Sonnet plus SuperWhisper voice input crossed a usability threshold. 4.5M+ views, made it into Merriam-Webster, and split the discourse into 'vibe coding' (raises the floor for beginners) versus 'agentic engineering' (raises the ceiling for professionals). The single most-cited tweet in the agent era.

The Rhythm of AI-Assisted Coding (Stuff Everything Into Context, Then Iterate)

Karpathy's follow-up thread on his actual workflow when the code matters: stuff everything relevant into context first (this can take a while), then iterate. The thread that distinguished 'agentic engineering' from 'vibe coding' — same tools, completely different discipline. Worth pinning next to the original vibe-coding tweet to remind yourself which mode you're in. The closest thing the field has to a written practitioner method.

Deep Agents

Harrison Chase's July 2025 post that named the next generation of agent architecture. A simple LLM-in-a-loop is 'shallow' and breaks on long horizons; a 'deep agent' adds four things: a planning tool, sub-agents, file-system access, and a detailed system prompt. Cites Deep Research, Manus, and Claude Code as the three reference implementations. The post that gave LangChain a coherent positioning post-LangGraph and pulled the rest of the orchestration ecosystem along with it.
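
The shallow/deep distinction can be expressed as a config sketch. Tool names here (write_todos, web_search, etc.) are illustrative, not a real library's API — the point is which four things the deep configuration layers on:

```python
# Shallow vs deep agent configuration (illustrative names throughout).

from dataclasses import dataclass, field

@dataclass
class Agent:
    system_prompt: str
    tools: list[str]
    subagents: dict[str, "Agent"] = field(default_factory=dict)

shallow = Agent(system_prompt="You are a helpful coder.", tools=["bash"])

deep = Agent(
    system_prompt=(
        "You are a long-horizon coding agent. Write a plan with "
        "write_todos before acting, delegate research to sub-agents, "
        "and persist intermediate notes to files so work survives "
        "context resets."
    ),                                            # detailed system prompt
    tools=["write_todos",                         # planning tool
           "read_file", "write_file",             # file-system access
           "bash"],
    subagents={                                   # sub-agents
        "researcher": Agent(
            system_prompt="You research APIs and report findings.",
            tools=["web_search"],
        ),
    },
)

def is_deep(a: Agent) -> bool:
    """Check for the post's four additions."""
    return ("write_todos" in a.tools and "write_file" in a.tools
            and bool(a.subagents) and len(a.system_prompt) > 100)
```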

Reflections on Three Years of Building LangChain

Harrison Chase's October 2025 retrospective — three years from chatbot prototypes to production agents. Honest about what LangChain got wrong (over-abstraction in the early days, framework-vs-pattern friction with the Anthropic 'Building Effective Agents' camp), and what the orchestration layer needs to look like now. Pairs with 'Deep Agents' and 'Not Another Workflow Builder' as the trilogy that explains LangChain's 2026 positioning. Useful as a counterweight to the Anthropic-only reading list.

GitHub Copilot Workspace: Welcome to the Copilot-Native Developer Environment

GitHub's April 2024 announcement of Copilot Workspace — the first major attempt to wrap an agent harness inside the GitHub PR flow itself. Start from a GitHub issue, brainstorm, get a generated plan, iterate in natural language, ship a PR. Useful as the canonical 'platform-embedded harness' example to contrast against terminal-native (Claude Code, Aider) and IDE-embedded (Cursor, Zed). Also the post that confirmed every major dev-tools vendor would now have a harness story whether they wanted one or not.

Start reading, not hoarding.

Import this vault to Burn 451 and actually read what matters.

Frequently asked questions

What is the Agent Harnesses vault?

The Agent Harnesses vault is a Burn 451 collection focused on coding agent infrastructure: the canonical reading list on agent harnesses — the scaffolding that wraps LLM calls into reliable coding agents. It gathers 31 essays, papers, and podcasts covering Claude Code, Cursor, Devin, Aider, Replit Agent, MCP, and the harness-engineering discipline.

How was the Agent Harnesses vault curated?

The Agent Harnesses vault was hand-curated by the Burn 451 editorial team from publicly available essays, blog posts, podcast transcripts, and social threads. Each piece includes an AI-generated summary so readers can triage in seconds. The vault auto-syncs as new writing on agent harnesses is published.

How many articles are in the Agent Harnesses vault?

The Agent Harnesses vault currently contains 31 curated pieces organized by topic, not chronology. Each article has an AI summary and a direct link to the original source. Items are refreshed hourly through Burn 451's ISR pipeline, so new publications appear within a day.

How do I use this vault with Claude or Cursor?

Install the burn-mcp-server package from npm and connect it to Claude, Cursor, or any MCP-compatible AI tool. The vault becomes queryable as live context — your AI can search, summarize, and cite articles from the Agent Harnesses vault directly in conversation without manual copy-paste or re-uploading files.

What is Burn 451?

Burn 451 is a read-later app built around a 24-hour burn timer that forces daily triage. Articles you save must be read, vaulted, or released within 24 hours. The Vault layer — including this Agent Harnesses collection — holds permanent curated reading lists for AI thought leaders, founders, and researchers.

Content attributed to original authors. Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.