Long-Running Agents

A practical handbook of principles and lessons for operating autonomous, long-running, self-improving agents on complex tasks.

Nayeem Zen · @nayeemzen · linkedin.com/in/nayeemzen · February 2026

The operating loop: PLAN (plan.md) → BUILD (checklist.yaml) → VERIFY (build · test · lint) → AUDIT (run-log.md) → EVOLVE (skills & evals), underpinned by constraints, checkpoints, evidence, decisions, and improvement.
The Short Version — If you only remember a few things
  1. Start with Codex CLI + GPT-5.2 (high/xhigh). This is the current frontier. Use other stacks when you have a clear reason.
  2. Make "done" measurable. Write acceptance criteria the agent can prove with tools, like tests, builds, lint, or queries. The build-verify loop is non-negotiable.
  3. Keep durable memory outside the chat. plan.md is the blueprint. checklist.yaml is current status. run-log.md is the audit trail. Your agent should refer to them across compaction and restarts.
  4. Observability is key. Durable memory plus audit logs are for async observability and post-mortems; Codex reasoning traces and thinking blocks are real-time observability.
  5. Steer early. Drift compounds. Use your observability system to steer the agent. A short correction in the first 10 minutes beats an hour of cleanup.
  6. Keep the verification loop cheap. Fast tests often, full suite at milestones, and CI always running in the background, continuously polled by the agent.
  7. Checkpoint constantly. Small commits, reversible steps, and separate checkouts for parallel work.
  8. Review by risk. Agent-first review, then human review on high blast radius areas.
  9. Self-evolve after every run. Extract skills, agent guideline updates (AGENTS.md), and repo evals from run-log.md.
Section

Introduction

Chapter 01

Who this is for

Chapter 02

How to use this playbook

  1. Skim Stack and setup, Operating system, and Cost management so you pick a stack and a workflow.
  2. Copy the durable memory system templates into your repo.
  3. Run one small long job first, in a safe environment, with a strict verification loop.
  4. Only then scale to overnight runs and parallel worktrees.
  5. After you ship, extract follow-ups from run-log.md into skills, repo evals, and agent guidelines (AGENTS.md).
Chapter 03

What is a long-running agent?

Long-running agents are coding agents that can run for hours or days with minimal intervention. You hand them a scoped job, a verification loop, and a memory system. They keep working, checkpointing, and producing evidence until your task is verifiably complete.

Cursor has written about how their agents work continuously over a week on a single codebase. Examples include a web browser built from scratch with 1M+ lines across about 1,000 files, and a Java LSP with 2.5M+ lines across about 10,000 files. They also report peaks around 10M tool calls in a week, with bursts around 1,000 commits per hour. Source: Scaling long-running autonomous coding.

This playbook is the operating model I used to achieve similar results. It is opinionated. It assumes you want throughput and you still want to ship safely: green CI, reviewable diffs, rollback paths, and no surprises in production.

In concrete terms, a long-running agent:

  • Plans multi-step work
  • Verifies with a build-verify loop
  • Self-heals flakes, timeouts, and errors
  • Maintains state across compaction, resume, and fork
  • Audits with checkpoints, logs, and evidence
  • Self-evolves skills, evals, and guides

Where long-running agents shine

Great fits:

Where they are a bad fit

Prefer interactive work when:

Section

Stack and setup

Chapter 04

Models

What you want in a long-running model

Look for:

My point-in-time take (Feb 2026)

If you want one default: Codex CLI + GPT-5.2 (high or xhigh).

For long-running agent work, this is the frontier stack right now. It runs tool loops for hours, compaction stays coherent, and it is easy to steer in real time.

The big advantage is in-flight visibility. Codex shows the reasoning trace in real-time so you can redirect early. In my experience this saves a ton of time compared to harnesses where you mostly see output after a long step finishes.

Codex's reasoning trace (thinking blocks) pairs strongly with its steering feature. You can see where the agent is headed mid-flight and cut off bad paths early by sending messages in real-time without interrupting its agent loop.

Use your durable memory system for async observability. It is what you read in the morning, after compaction, and after restarts. Treat it as the system of record for decisions and evidence, even if you have thinking blocks.

Claude Code does not expose internal reasoning the same way today. You can still run long jobs, but you should compensate with shorter execution steps, more checkpoints, and stricter logging.

I use other harnesses for three reasons:

Claude notes (Opus 4.6)

Opus 4.6 is strong and I still use it for interactive work.

For long-running background runs, it is not my default. Cursor put it bluntly: "Opus 4.5 tends to stop earlier and take shortcuts when convenient." Source: Scaling long-running autonomous coding.

While Anthropic's Opus 4.6 release claims better performance on long-running tasks, anecdotally it still gives up more easily than the GPT-5.2/Codex-5.3 models. Apart from training differences, I suspect this is partly because OpenAI allocates a very generous thinking budget at the high and xhigh reasoning levels in Codex.

If you are in a Claude-first stack, you can still run long jobs. You will usually want a stronger harness loop (hooks and a "Ralph loop"), plus the same fundamentals in this playbook. More on that in Harnesses.

Reasoning effort (thinking levels)

Most serious harnesses expose a reasoning effort setting like low, medium, high, xhigh.

Practical rule:

Open-weight models as execution engines

Open-weight models are now good enough to matter. With a strict verification loop, they can do a lot of the execution work at a fraction of the cost.

A couple outside takes match what I see:

How I use them:

Two choices matter here:

Pick the harness for workflow. Pick the provider for model access, pricing, caching, and compliance.

Kimi K2.5 cost vs performance chart (source: Kimi K2.5).
GLM-5 benchmark summary chart (source: Z.ai LLM guide).
MiniMax M2.5 benchmark summary chart (source: MiniMax M2.5).
Chapter 05

Harnesses (agent runners)

A harness is the software around the model. It is the difference between "chat with a model" and "operate an agent".

The harness landscape at a glance:

  • Codex CLI: default for long runs; plan mode, steering, session persistence, built-in review
  • Claude Code: feature-rich; hooks and stop hooks, the Ralph loop pattern, a plugins ecosystem
  • Cursor: background agents; concrete public writing on scaled agent ops; a portable operating model
  • OpenCode · Pi: model-agnostic; terminal-first UX; extensible, hackable, multi-provider

What to look for in a harness: plan + execute modes, session persistence, coherent compaction, steering ergonomics, tool permissions, background terminals, skills + subagents, multi-provider support.

What to look for

Codex (CLI)

This is my default harness for long-running coding work.

Why it is best-in-class for long runs:

Two features I use constantly:

Practical tips:

Claude Code

Claude Code is one of the most feature-rich harnesses today.

If you want Claude Code to behave like a long-running runner, you need a "keep going even if you want to stop" loop for your agent:

Anthropic documents hooks and the Ralph Wiggum plugin. See Claude Code hooks and Claude Code plugins. Many people also roll their own bash loop so they can restart fresh sessions with a clean context when long runs start to suffer from context rot.
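
A minimal sketch of such a loop, assuming a harness that can be invoked non-interactively. AGENT_CMD is a placeholder for your actual harness invocation, and the stop condition reads the task statuses from the checklist.yaml template later in this playbook:

```shell
# Hypothetical "Ralph loop": restart a fresh agent session each iteration
# until checklist.yaml has no pending tasks. AGENT_CMD stands in for your
# harness command (e.g. a headless Claude Code call).
AGENT_CMD=${AGENT_CMD:-"echo simulated-agent-run:"}
MAX_RUNS=${MAX_RUNS:-5}
PROMPT="Reread plan.md, checklist.yaml, and run-log.md, then continue the next task."

i=0
while [ "$i" -lt "$MAX_RUNS" ]; do
  i=$((i + 1))
  echo "run $i: starting fresh session"
  $AGENT_CMD "$PROMPT"
  # Stop when no task is still not_started or in_progress.
  if ! grep -Eq "status: (not_started|in_progress)" checklist.yaml 2>/dev/null; then
    echo "no pending tasks; stopping"
    break
  fi
done
```

The cap on MAX_RUNS matters: without it, a stuck task turns the loop into an unbounded token burn.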

If you are running Opus for hours, external memory is the crux. Treat plan.md, checklist.yaml, and run-log.md as mandatory, and design your loop to reread them after compaction and restarts.

Cursor background agents

Cursor has some of the most concrete public writing on long-running agents at scale. Read Scaling long-running autonomous coding even if you do not use Cursor.

The operating model is portable: treat agent runs like running software. Monitor them, recover from errors, and expect loops that can span a full day.

OpenCode

OpenCode is a model-agnostic harness worth understanding.

It has:

Docs: OpenCode.

Pi

Pi is a small, extensible terminal harness. It is a good fit when you want a model-agnostic core that you can extend, and you want to bake in your own workflows and memory system.

Why it matters for long-running work:

Pi is also used as an SDK in projects like OpenClaw, which is a good testament to its extensibility. See Pi and OpenClaw.

Chapter 06

Tracking the frontier

This space moves fast. The meta-skill is switching stacks without chaos.

My loop:

Evals (measure your reality)

An eval is a small, repeatable task with a known "done" state. It measures the stuff you care about: repo conventions, tool access, flaky tests, migrations, CI, and constraints.

Benchmarks are useful. They are not enough. Your evals are what you should optimize for.

How to build them:

Chapter 07

Cost management

Long-running agents can burn a lot of tokens. Set a budget. Pick a routing strategy. Then run your verification loop often so you do not pay to learn the same thing twice.

Pick a cost model that matches the work

ChatGPT Pro vs Claude Max

If you want to run long background agents today, ChatGPT Pro is the best subscription for it.

Price check: ChatGPT Pro is $200/month as of 2026-02-24. See ChatGPT pricing.

Two real reasons:

In one refactor run that consumed 52M+ tokens (and 2B+ cached tokens), I still had about 35% of my weekly limit remaining on Pro. I was also doing other heavy Codex usage that did not appear in the token count. The limit window reset by the time the job finished, so it had no real impact on my next runs.

These plans evolve. Check the current limits before you commit to a workflow. Some people buy multiple subscriptions to increase headroom. Read the terms and use judgment.

Claude Max has strict five-hour windows and message caps. That is fine for interactive work. It becomes a bottleneck for multi-hour background runs and parallel agents. See Anthropic usage limit best practices.

Corporate reality and model routing

If you are doing this on a company codebase, a personal flat-rate plan might not be viable. Compliance, billing, and data policies usually push you toward enterprise plans or API usage.

If you are paying per token, do not run everything on the most expensive model. Route:

Concrete examples:

The rule is simple: switching models should not change the verification bar. Keep the same acceptance criteria, evidence, and audit trail.

A real run (token scale)

This is what one big refactor run looked like for me:

Metric Tokens
Total 52,146,126
Input 46,053,693
Cached input 2,164,821,248
Output 6,092,433
Reasoning 3,542,400

Cached input is the story here. That is repeated context. Prompt caching changes the economics by an order of magnitude.

What this run would cost on APIs

Assumptions:

Option Input ($/1M) Cached input ($/1M) Output ($/1M) Est cost (this run) Notes
ChatGPT Pro n/a n/a n/a $200.00 Flat subscription if you stay within limits
GPT-5.2 1.75 0.175 14.00 $544.73 OpenAI API pricing
Claude Opus 4.6 5.00 0.50 25.00 $1,464.99 Anthropic pricing + prompt caching reads
Kimi K2.5 0.60 0.10 3.00 $262.39 Moonshot pricing + context caching
GLM-4.7 0.60 0.11 2.20 $279.17 Z.ai pricing (cached storage limited-time free)
GLM-5 1.00 0.20 3.20 $498.51 Z.ai pricing (cached storage limited-time free)
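
As a sanity check, the GPT-5.2 row follows directly from the token table above and the listed per-million rates (reasoning tokens are billed as part of output; the rates come from the table, not independent verification):

```shell
# Reproduce the GPT-5.2 estimate from the token counts and $/1M rates.
awk 'BEGIN {
  input  = 46053693   / 1e6 * 1.75
  cached = 2164821248 / 1e6 * 0.175
  output = 6092433    / 1e6 * 14.00
  printf "input=$%.2f cached=$%.2f output=$%.2f total=$%.2f\n",
         input, cached, output, input + cached + output
}'
# → input=$80.59 cached=$378.84 output=$85.29 total=$544.73
```

The other API rows reproduce the same way from their listed rates.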

Pricing sources:

Spend tokens where they buy leverage

Good places to spend:

Places to be stingy:

Chapter 08

Safety and permissions

Long-running agents work because they can run unattended. That also means they can fail unattended.

  • Constrained mode: sandbox + allowlists, read-only by default, approvals for destructive steps, no secrets by default
  • Autonomy mode: disposable machine, no personal data or keys, low-privilege tokens, --yolo for full access

There are two sane ways to operate.

Constrained mode (sandbox + allowlists)

Use this when the agent is running on your real machine, or it can touch anything you care about.

This mode takes setup. It reduces surprise.

Autonomy mode (free rein on a disposable machine)

If you want real autonomy, isolate first and then loosen permissions.

If the environment is disposable, the agent can safely do open-ended work like installing CLIs, running scripts, and chasing flaky failures.

Secrets and data

Rules I follow:

When to use --yolo

--yolo (or similar "dangerous" modes) is for environments that are already isolated.

This mode is often necessary for open-ended jobs. When an agent is stuck, it may need to install tooling, poke at the network, or run cleanup tasks without waiting for you.

Good patterns:

If you go this route, treat the environment as disposable. Assume it may get corrupted.
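
A sketch of what this can look like, assuming Docker is available. The image, mounts, and agent invocation are all placeholders to adapt; DRY_RUN defaults to printing the command so you can inspect it before running it on a machine you can discard:

```shell
# Assemble a throwaway-container command for an autonomous run.
# --rm guarantees the environment is disposable; nothing but the mounted
# repo survives the run.
REPO_DIR=$(pwd)
AGENT_CMD="codex --yolo 'Execute checklist.yaml until done'"   # hypothetical invocation
DOCKER_CMD="docker run --rm -v $REPO_DIR:/work -w /work node:22-bookworm sh -lc \"$AGENT_CMD\""

if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$DOCKER_CMD"   # inspect before you run
else
  eval "$DOCKER_CMD"
fi
```

Mount only the checkout, never your home directory, so credentials and personal files stay outside the blast radius.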

Isolation options

If you want full autonomy without fear:

Section

Operating system

Long-running agents succeed when you treat them like production systems.

Non-negotiables:

Chapter 09

The three-file memory system

For long runs, I keep three files at the root of the repo:

  1. plan.md for the blueprint and constraints
  2. checklist.yaml for executable work items and acceptance criteria
  3. run-log.md for an append-only ops log (decisions, evidence, failures, fixes)
plan.md holds the blueprint: constraints, architecture, and the verification strategy. checklist.yaml holds work items, acceptance criteria, and current status, and tracks the active task. run-log.md is the append-only ops log of decisions, evidence, failures, and fixes: the audit trail.

These files do three jobs:

Chapter 10

Skills, MCP, and CLI tools

Tooling is the difference between a plain-text LLM and an autonomous agent.

Skills (small, reusable playbooks)

Skills are small instruction bundles that teach your agent how to do a specific thing inside your world.

There is an emerging "Agent Skills" standard for packaging skills in a portable way. See Agent Skills.

Typical structure:

The key trick is progressive disclosure:

Good skills:

Examples worth having in most repos:

A strong pattern:

  1. Do a small slice interactively.
  2. Turn that into a skill with a few golden examples.
  3. Run the long job using the skill.

If you want a good mental model for skills and progressive disclosure, read Skills and shell tips.

MCP servers (capabilities with context cost)

MCP (Model Context Protocol) is a standard way to connect agents to external tools and data sources.

Common uses:

High leverage examples:

Tradeoff: MCP can be heavy. It can bloat the context with tool instructions and verbose outputs.

Use MCP when:

Use CLI tools when:

You might not need MCP

MCP is useful. It is also easy to reach for too early.

Most harnesses load tool descriptions eagerly at the start of the run so the model knows what is available. This also helps prompt caching. The context cost still counts, even when the tokens are cheap.

Two problems show up fast:

  1. Context tax. Big tool surfaces steal working memory. For browser tooling, it is common to burn about 14k to 18k tokens just describing the toolset. That cost hits before the job starts. (What if you don't need MCP at all?)

  2. Weak composability. A lot of composition happens through inference instead of code. Outputs flow back through the model, then you ask it to stitch the steps together. Long runs get harder to rerun, harder to audit, and harder to keep deterministic. Code and CLI workflows compose cleanly. (Tools: Code Is All You Need)

A code-first setup is boring and it works:

Use MCP when you need real system access with auth, or when the external system is the work (SaaS, tickets, CRM, on-call tooling). Keep servers minimal and outputs structured.

Cloudflare's "Code Mode" is a cool example of where this is heading: a fixed tool surface (search + execute), a typed SDK, and a sandboxed runtime. It is new, so treat it as a signal and watch the space. (Code Mode: give agents an entire API in 1,000 tokens)

Browser automation (real verification)

Browser automation is a great verification loop when:

Common approaches:

Two rules:

  1. Keep flows short and deterministic.
  2. Save screenshots and logs to files, then link them in run-log.md.
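
A sketch of rule 2, using the run-log.md entry format from the Templates chapter; the artifact path and message are placeholders, and the screenshot step itself is left to whatever browser tooling you use:

```shell
# Append timestamped evidence to run-log.md so artifacts survive
# compaction and restarts.
log_evidence() {
  # $1 = tag (decision|evidence|failure|fix|note), $2 = message
  printf -- "- %s %s: %s\n" "$(date '+%Y-%m-%d %H:%M')" "$1" "$2" >> run-log.md
}

mkdir -p artifacts
log_evidence evidence "checkout flow renders; screenshot saved at artifacts/checkout-flow.png"
```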

CLI tools are underrated

Once you internalize the code-first approach, CLI tools become the default tool layer.

CLI tools are perfect for long-running agents:

If a tool is well known (like gh), the agent will usually figure it out. If a tool is obscure, have the agent write a wrapper script and a short README, then turn it into a skill.

Chapter 11

Planning

Long runs are won in the planning phase.

Start by forcing clarity

Ask the agent to ask you questions until scope is crisp:

Kickoff (one prompt, three files)

A long run starts the same way every time:

  1. Copy the templates from this playbook into your repo root: plan.md, checklist.yaml, run-log.md.
  2. Paste the kickoff prompt from Templates into your harness. It should fill in those files, then stop and wait for approval.
  3. Approve the plan and checklist.
  4. Execute checklist.yaml until the tasks reach terminal states.

One rule: do not keep a second plan in chat. The plan lives in plan.md. Execution state lives in checklist.yaml. Evidence lives in run-log.md.

Define your verification loop

Long-running agents need a self-verifying feedback loop. Pick the loop, then make it explicit in the plan.

Common loops:

If you do not have a loop, build the loop first.

Pattern: red/green TDD. A simple, agent-friendly build-verify loop is: write a test that fails (red), make it pass (green), repeat. It forces the agent to prove progress and prevents plausible-but-wrong changes.

  1. Red: Add or update a test that captures the next acceptance criterion. Run it and confirm it fails for the expected reason.
  2. Green: Make the smallest change that makes that test pass. Rerun and capture the passing output as evidence.
  3. Repeat: Move to the next slice. If the new test never failed, it is not testing the behavior you think it is.

Use "red/green TDD" explicitly in prompts; models understand the shorthand. Source: Simon Willison: Red-green TDD (agentic engineering patterns).
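
The cycle above can be sketched as shell helpers, with the test command left as a placeholder for your fast loop (for example, a single Gradle test pattern):

```shell
# Red: the new test must fail before the change, for the expected reason.
confirm_red() {
  # $1 = command that runs the new test
  if eval "$1"; then
    echo "RED FAILED: test passed before the change, so it does not test the new behavior"
    return 1
  fi
  echo "red confirmed: test fails for the expected reason"
}

# Green: after the smallest change, rerun and capture the passing output.
confirm_green() {
  eval "$1" && echo "green confirmed: record this output in run-log.md"
}
```

Having the agent call these (or an equivalent skill) keeps it honest: a test that never failed is evidence of nothing.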

Configure skills and MCP (before refactors)

Skills (and MCP, if you need data from external systems) are critical for making sure your agent has the tools it needs for a successful run.

Skills

MCP

Golden examples (especially for refactors)

Golden examples work best when they are encoded into skills.

Workflow:

  1. Ask the agent to scan the repo and propose 10 to 20 candidate files that represent the diversity of patterns.
  2. You pick 3 to 5.
  3. Refactor those interactively, verify them, and lock the pattern.
  4. Turn the pattern into a skill and a checklist section.

Now the long-running agent can replicate the pattern safely.

A standard overnight refactor recipe

Pre-flight:

  1. Make sure main is green, or capture current failures in run-log.md.
  2. Create a checkout for the run (worktree or separate clone).
  3. Use the repo scan method to select golden examples, then implement them interactively.
  4. Convert the pattern into a skill and reference it from plan.md.

Kickoff:

  1. Generate checklist.yaml with tasks that have real acceptance criteria.
  2. Open a draft PR early and let CI run continuously.
  3. Start the long run with clear rules about commits, verification, and logging.

During the run:

  1. Check in once early to confirm it is running the build-verify loop.
  2. Use the harness thinking view and steer when you see drift, retries, or unsafe changes.
  3. Keep the agent moving forward. Do not let it fight flakes for hours.

Landing:

  1. Run the full verification suite from scratch.
  2. Run agent-first review (/review in Codex or your review subagent).
  3. Do a human review by risk in the code review tool.
Chapter 12

Templates

Copy these into your repo. These files are the memory, observability, and self-evolution layer for long-running agents.

This section has four copy-paste blocks:

The plan.md template below borrows ideas from OpenAI's ExecPlan approach for Codex. See Using PLANS.md for multi-hour problem solving.

Kickoff prompt

Paste this into your harness to start a long run. It forces the agent to fill in plan.md, checklist.yaml, and run-log.md at the repo root instead of inventing new structure.

You are going to run a long background job in this repo.

Before writing code:
1) Ask me clarifying questions until you can fill in `plan.md`, `checklist.yaml`, and `run-log.md` at the repo root. Follow the structure already in those files. Do not invent a new format.
2) Create or update those three files at the repo root.
3) Stop and wait for approval.

After I approve:
- Execute `checklist.yaml` until tasks reach terminal states.
- Treat the build-verify loop as the source of truth. Propose the fast loop and the full suite.
- Keep `checklist.yaml` updated as you work. Do not delete tasks. Append new tasks when discovered.
- Record decisions and evidence in `run-log.md` with timestamps.
- After compaction or restarts, reread `plan.md`, `checklist.yaml`, and `run-log.md` before continuing.
- Keep diffs reviewable: small commits, reversible steps, milestone checkpoints.

plan.md

# Plan: <project name>

Last updated: <YYYY-MM-DD HH:MM>

This plan is a living document. Keep it self-contained. Assume a new person, or a fresh agent session, will restart from:
- `plan.md` (this file)
- `checklist.yaml` (execution state machine + acceptance criteria)
- `run-log.md` (append-only ops log: decisions, evidence, failures, fixes)

## Purpose / Big picture

In 3 to 6 sentences:
- Why this work matters
- What changes for a user
- How to see it working

## Progress and logs

- Progress is tracked in `checklist.yaml`. Keep it updated as you go.
- Decisions, surprises, and evidence go in `run-log.md` with timestamps.

## Context and orientation

Explain the current state as if the reader knows nothing about this repo.

- System overview:
- Key files (repo-relative paths):
  - `path/to/file`: <why it matters>
- Glossary (define any non-obvious term you use):
  - <term>: <definition>

## Scope

### Goal
<1 paragraph, concrete and testable>

### Non-goals
- <bullets>

### Acceptance criteria (behavior)
- <bullets that a human can verify>

## Constraints and invariants

- Must not change:
- Safety:
- Compatibility:
- Performance:

## Plan of work (milestones)

Write this as a short narrative. For each milestone, say what exists at the end that does not exist now, and how to prove it.

1. Milestone 1: <name>
   - Outcome:
   - Proof:
2. Milestone 2: <name>
   - Outcome:
   - Proof:

## Implementation map

Describe the concrete edits you expect.

- Change:
  - File:
  - Location (function, module, class):
  - What to change:
- Add:
  - File:
  - What to add:

Keep `checklist.yaml` aligned with this plan. If you discover new required work, add tasks. Do not delete tasks.

## Verification

- Fast loop (iteration):
  - `...`
- Full suite (milestones):
  - `...`
- Expected outputs:
  - <what "green" looks like>
- Evidence to capture in `run-log.md`:
  - <commands run, links, screenshots, perf numbers, query results>

## Rollout and rollback

- Rollout steps:
- Feature flags:
- Backfills and migrations:
- Rollback plan:

## Idempotence and recovery

- What can be rerun safely:
- What can fail halfway:
- How to retry safely:
- How to clean up:

## Interfaces and dependencies

Be explicit about:
- New APIs or contracts:
- New dependencies:
- Migrations and compatibility:

## Skills and tools

- Skills to use:
  - <skill name>: <why>
- MCP servers (if any):
  - <server>: <why>

## Golden examples

- <links to commits/files that define the pattern>

## Open questions

- <questions that block execution>

checklist.yaml

This is the agent checklist format I use across my repos.

summary:
  last_updated: "<YYYY-MM-DD HH:MM>"
  active_task_id: ""
  status: "not_started"
  blockers: []
  next_actions: []
  verification_loop:
    fast: []
    full: []

instructions: |-
  This `checklist.yaml` is the execution plan for the repo. It is the source of truth for what "done" means.

  Agent workflow:
  - Implement tasks until they reach a terminal state: `complete`, `failed`, or `archived_as_irrelevant`.
  - Do not mark a task `complete` until every item in `acceptance_criteria` has been met and verified.
  - Verify acceptance criteria with real evidence (commands/tests run, screenshots, API calls, queries).
  - Record assumptions, decisions, and verification evidence in each task's `note`, and in `run-log.md`.
  - If a task fails after multiple approaches, set status to `failed` and summarize what was tried and why it failed.
  - Keep tasks updated as you work: `not_started` -> `in_progress` -> `complete` (or `failed` / `archived_as_irrelevant`).
  - Add new granular tasks when new required work is discovered. Do not delete tasks. Archive if irrelevant.

  Task schema:
  - `task_id`: Stable unique identifier (string).
  - `title`: Short summary (string).
  - `description`: What to build and why (string; can be multiline).
  - `acceptance_criteria`: Verifiable checklist items (list of strings).
  - `implementation_notes`: Optional tips or constraints (string; optional).
  - `status`: One of `not_started`, `in_progress`, `complete`, `failed`, `archived_as_irrelevant`.
  - `note`: Freeform working log for the agent (string). Leave empty until there is something worth recording.

guidelines:
  - "Do not change public APIs without updating callers."
  - "Do not add dependencies without justification."
  - "Prefer small, reversible steps and frequent verification."
  - "Do not commit secrets."

tasks:
  - task_id: T001
    title: "Establish baseline"
    description: "Run the fast loop and full suite on main, record failures if any."
    acceptance_criteria:
      - "Fast loop passes (or failures recorded in run-log.md)."
      - "Full suite passes (or failures recorded in run-log.md)."
    implementation_notes: ""
    status: not_started
    note: ""

  - task_id: T002
    title: "<fill in>"
    description: "<fill in>"
    acceptance_criteria:
      - "<fill in>"
    implementation_notes: ""
    status: not_started
    note: ""

run-log.md

Append-only. Timestamp everything. Tag entries so you can scan it quickly.

# Run log

## Context
- Repo:
- Branch/worktree:
- Start time:

## Events (append-only)
- <YYYY-MM-DD HH:MM> decision: ...
- <YYYY-MM-DD HH:MM> evidence: ...
- <YYYY-MM-DD HH:MM> failure: ...
- <YYYY-MM-DD HH:MM> fix: ...
- <YYYY-MM-DD HH:MM> note: ...

## Follow-ups
- Skills to extract:
- Evals to add:
- Agent guidelines to update (`AGENTS.md`):
- Docs to update:
- Tech debt spotted:
Chapter 13

Execution

Once plan.md and checklist.yaml look good, execution should feel boring.

Execution rules I use:

A default execution loop

Pick next task → re-read plan + code → smallest change → verify (fast loop) → commit checkpoint → update & repeat.

This is the loop I want the agent to run for hours:

  1. Pick the next not_started task in checklist.yaml.
  2. Re-read plan.md and the relevant code.
  3. Make the smallest change that moves the task forward.
  4. Run the fastest verification that can catch the likely failure.
  5. Commit a checkpoint.
  6. Update the task note with what changed and what was verified.
  7. Append any surprises, learnings, post-mortems from failures to run-log.md.
  8. Repeat.
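
Step 1 can be sketched against the checklist.yaml template with nothing but awk (a YAML-aware tool like yq would be sturdier; the task schema assumed here is the one from the Templates chapter):

```shell
# Print the task_id of the first not_started task in checklist.yaml.
next_task() {
  awk '
    /task_id:/            { id = $3 }        # remember the most recent task id
    /status: not_started/ { print id; exit } # first pending task wins
  ' checklist.yaml
}
```

An agent that selects work this way stays anchored to the checklist instead of improvising a new plan in chat.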

Milestones:

Chapter 14

Compaction

Every long run eventually hits the context window.

Harnesses handle this with compaction: they summarize history, prune tool output, and keep going. Done well, it feels invisible. Done poorly, the agent forgets constraints and starts making reasonable, wrong decisions.

How to make compaction a non-issue:

If a run goes sideways, do not salvage it for hours. Fork the session and restart from a clean checkpoint.

Chapter 15

Steering

Steering is how you correct the agent mid-run so the next cycle adapts fast.

It works best when you have observability:

My cadence:

What to look for:

When it drifts, steer quickly:

High-leverage steer messages (copy/paste):

Chapter 16

Review

Review is where long-running work becomes shippable.

Agent-first review

Start by asking an agent to review the diff against your acceptance criteria.

In Codex, use /review or codex review with xhigh effort. If you are not on Codex, create a review skill or a review subagent that always does the same workflow.

Example review prompt:

Review this branch against `main`.

Output:
- A risk-ranked list of files and changes (high, medium, low)
- Bugs and correctness issues
- Security issues (auth, permissions, secrets, injection risks)
- Concurrency, retries, and idempotency risks
- Consistency with repo conventions (patterns, style, architecture)
- Missing tests and weak verification
- Migration or rollback risks
- Performance and resource risks
- Suggested follow-up tasks to add to checklist.yaml

Then propose a minimal patch set to fix the top issues.

Human review by risk

After agent review, do a human pass that is biased toward high blast radius areas:

Use CI as an active loop

Open a draft PR early and let CI run repeatedly while the agent continues.

A strong pattern:

This avoids a giant "everything fails at the end" moment and keeps your local machine responsive.
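
A sketch of the polling half of that pattern. `gh pr checks` is a real GitHub CLI command, but the branch name is a placeholder, and CHECK_CMD can be swapped for any CI tool:

```shell
# CI check the agent can run in a background terminal between pushes.
CHECK_CMD=${CHECK_CMD:-"gh pr checks agent/refactor --watch"}

poll_ci() {
  if eval "$CHECK_CMD"; then
    echo "CI green: safe to continue"
  else
    echo "CI failing: read the failed job logs, fix, push, and rerun"
    return 1
  fi
}
```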

Land it safely

For big refactors and migrations:

Chapter 17

Self-evolution

Every long run produces edge cases. Turn them into permanent leverage.

After the PR lands:

The self-evolution cycle (every run improves the next): ship the PR (land the work) → mine run-log.md (extract patterns) → extract skills (codify workflows) → update AGENTS.md (refine guidelines) → add evals (measure quality) → next run (start stronger).
Section

Scaling up

Chapter 18

Parallel task management

Parallel agents are a force multiplier when the work streams are independent.

Use separate checkouts, not chaos

Give each agent its own checkout.

Two common patterns:

Worktrees:

git worktree add ../repo-agent-refactor -b agent/refactor
git worktree add ../repo-agent-migration -b agent/migration

Separate clones (cleaner isolation, slower setup):

git clone "$(pwd)" ../repo-agent-review
git clone "$(pwd)" ../repo-agent-perf

Rules that keep you sane:

What not to parallelize

Avoid parallelizing interdependent work.

When tasks depend on each other, you get:

The hidden cost is lost learning. During execution, the agent learns repo quirks, test behavior, and edge cases. That context is hard to propagate across parallel runs. If the work is coupled, run it sequentially or force the learning into shared memory files.

Sequential wins here. Let one agent carry the context forward.

Checkpointing via git

Ask the agent to commit early and often.
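
One way to make this mechanical is a small helper the agent calls after each verified slice; the message format here is just a convention that keeps `git log` scannable, not a standard:

```shell
# Checkpoint commit helper: one call per verified slice of work.
checkpoint() {
  # $1 = task id from checklist.yaml, $2 = short description
  git add -A
  git commit -m "checkpoint($1): $2" -m "verified: fast loop green"
}
```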

What you get:

Skill: clean-commit (narrative rebuild)

This comes up constantly in long-running work. Agents can land the right end state with a messy branch. "Clean-commit" is the workflow I use to rebuild the same end state with a commit history you can review and ship.

If the end state is correct but the branch is a mess, rebuild it with a clean commit storyline.

Workflow:

  1. Validate the source branch (no uncommitted changes, up to date with main).
  2. Study the full diff to understand the intended end state.
  3. Create a new branch off main.
  4. Plan a commit storyline (self-contained steps, tutorial-style).
  5. Reimplement commit-by-commit on the clean branch.
  6. Verify the final end state matches the source branch and run the full checks.
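
Step 6 can be sketched as a small git helper (branch names are placeholders):

```shell
# Prove the rebuilt branch matches the messy source before deleting anything.
verify_end_state() {
  # $1 = source (messy) branch, $2 = rebuilt (clean) branch
  if git diff --quiet "$1" "$2"; then
    echo "end states identical: $1 == $2"
  else
    echo "end states diverged: keep $1 until the clean branch is fixed"
    git diff --stat "$1" "$2"
    return 1
  fi
}
```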

Rules:

Tools that help

Use whatever keeps your operating surface simple.

Chapter 19

Running this in a team

Long-running agents work best when the team agrees on a few defaults.

If you do not standardize, every long run becomes a bespoke ops problem.

Section

Troubleshooting and improvement

Chapter 20

Common failure modes (and fixes)

These are the traps that waste the most time. Read this once, then treat it as a reference.

01 No objective verification
Symptoms
  • The agent keeps "making progress" but you cannot prove it
  • You only learn it broke something at the end
Fix
  • Make "done" measurable. Tests, lint, typecheck, CI, screenshots, queries.
  • If the repo does not have a verification loop, build that first.
02 Flaky or slow verification loops
Symptoms
  • The agent keeps retrying tests
  • It takes 30 minutes to learn one thing
Fix
  • Create a fast loop for iteration, then run the full suite at milestones.
  • Run a targeted subset of tests using pattern matching.
    • Example (Gradle): ./gradlew :app:test --tests 'com.yourco.payments.*'
    • Example (Gradle single test): ./gradlew :app:test --tests '*BillingServiceTest'
  • Prefer a persistent test environment when setup time is the bottleneck.
    • Long-running local DB containers
    • Reused service instances in a dev namespace
    • Cached build artifacts and dependencies
  • Watch for disk and memory creep on long runs.
    • Keep an eye on btop or htop
    • Write a small cleanup skill if your test suite drops gigabytes of artifacts (tmp/, logs/, browser profiles, build caches)
    • If Docker state creeps, restart containers. Drop volumes only if you can recreate the data safely.
  • Quarantine flakes. Make them someone's problem, or the agent will burn hours.
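The cleanup skill can be as small as this sketch (paths are illustrative; only delete artifacts your suite can regenerate):

```shell
# Throwaway directory standing in for a repo with test droppings.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p tmp logs
dd if=/dev/zero of=tmp/artifact.bin bs=1024 count=64 2>/dev/null  # fake build artifact
echo "old run output" > logs/run-1.log

du -sh tmp logs            # see what is accumulating before deleting anything
rm -rf tmp/* logs/*.log    # regenerable artifacts only; never source or data
```

For Docker creep, `docker system prune -f` reclaims stopped containers and dangling images; add `--volumes` only if the data is safely recreatable.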
03 Tool output floods the context
Symptoms
  • The agent pastes huge logs into chat
  • Compaction triggers constantly and quality drops
Fix
  • Write long outputs to files (logs/, tmp/, artifacts/) and reference them.
  • Teach the agent to summarize, not paste.
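The pattern, sketched with a noisy stand-in command (in a real run this would be your build or test invocation):

```shell
tmp=$(mktemp -d) && cd "$tmp" && mkdir -p logs

# Stand-in for a noisy tool run, e.g. ./gradlew :app:test
seq 1 5000 | sed 's/^/test output line /' > logs/test-run.log 2>&1

# Reference the file; surface only the signal, never the transcript:
wc -l < logs/test-run.log      # how big the output was
tail -n 3 logs/test-run.log    # the part that usually matters
```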
04 Context drift after compaction
Symptoms
  • It forgets a constraint and starts changing forbidden areas
  • It redoes work it already did
Fix
  • Use plan.md, checklist.yaml, and run-log.md.
  • Make rereading those files part of the execution loop, especially after compaction.
05 Infinite loops and false progress
Symptoms
  • It keeps trying the same fix
  • It bounces between two approaches
Fix
  • Add a loop breaker rule: after N failed attempts, stop and change strategy.
  • Force a short diagnosis before continuing.
  • If your harness supports stop hooks, use them to prevent "give up" behavior.
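A minimal loop-breaker sketch (the `false` is a stand-in for the real fix attempt, and N=3 is an illustrative threshold):

```shell
max_attempts=3
attempt=1
while [ "$attempt" -le "$max_attempts" ]; do
  if false; then                     # stand-in: replace with the actual fix + test
    echo "fixed on attempt $attempt" && break
  fi
  attempt=$((attempt + 1))
done
if [ "$attempt" -gt "$max_attempts" ]; then
  echo "loop breaker: $max_attempts attempts failed; stop, diagnose, change strategy"
fi
```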
06 Oversized diffs
Symptoms
  • One PR has everything
  • Review is impossible
Fix
  • Commit checkpoints early and often.
  • Prefer milestone commits (mechanical refactor first, behavior changes later) or stacked PRs.
  • If the end state is right but the commit history is messy, rebuild the branch with a clean narrative history.
Chapter 21

Observability

Long runs fail quietly when you cannot see what is happening.

Make the run observable:

  • Durable memory files: plan.md for the blueprint, checklist.yaml for current status, run-log.md for evidence
  • Reasoning traces and thinking blocks for real-time visibility into what the agent is doing
  • CI running in the background, polled continuously by the agent

Two rules that keep it sane:

  • Evidence lives in run-log.md with links, not in your head or the chat scrollback.
  • Check in early and briefly; a short correction in the first ten minutes beats an hour of cleanup.

For the post-run loop, see Self-evolution.
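The polling habit above can be sketched with illustrative file contents (file names follow this playbook's conventions):

```shell
# Stand-in memory files; a real run writes these continuously.
tmp=$(mktemp -d) && cd "$tmp"
printf 'task-1: done\ntask-2: done\ntask-3: pending\n' > checklist.yaml
printf -- '- ran unit tests: all green\n- committed checkpoint\n' > run-log.md

tail -n 2 run-log.md             # latest evidence and decisions
grep -c 'done' checklist.yaml    # rough progress signal: here, 2 of 3 tasks
```

For CI, `gh pr checks <number> --watch` (GitHub CLI) keeps check status in view without polling the web UI.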

Chapter 22

Conclusion

Long-running agents change the shape of engineering work. You spend less time on mechanical execution. You spend more time on direction, constraints, and review.

The dream output is a PR you can ship: green CI, a reviewable diff, and a run-log.md that explains what happened with links to evidence. Mine the run log for repeatable workflows, then turn them into skills.

You get there with boring discipline:

  • Measurable acceptance criteria and a cheap verification loop
  • Durable memory outside the chat
  • Constant checkpoints and reviewable diffs
  • Risk-based review, then self-evolution after every run

Models, pricing, and limits will keep moving. Build a repeatable system from first principles: the agent makes a change, runs tools to verify, records evidence, and repeats. Keep the run observable so you can direct it when needed. If that loop is solid, switching models is a routing decision. Your process stays the same.

A good run feels calm. The checklist stays accurate. The agent keeps producing evidence. Your interventions are short and early. Landing the PR feels routine.

Chapter 23

Glossary

Harness
the runner around the model (tools, permissions, sessions, compaction, UI).
Build-verify loop
iterative cycle where the agent generates an output, tests or evaluates it against defined criteria, and refines it until it meets the desired goal.
Compaction
summarizing and pruning old context so the run can continue past the context window.
Golden examples
a small, human-verified slice that becomes the pattern for a large refactor.
Worktree
a separate checkout of the same git repo so each agent works in isolation.
Steering
corrective instructions mid-run to prevent drift and wasted work.
Skills
reusable procedures that teach the agent how to work in your environment.
MCP
a standard way to connect agents to external tools and data sources.
Ralph loop
a "loop-even-if-you-want-to-stop" pattern primarily used with Claude Code.
Stacked PRs
a chain of small PRs that build on each other so review stays tractable.
Chapter 24

References