A practical handbook of principles and lessons for operating autonomous, long-running, self-improving agents on complex tasks.
`plan.md` is the blueprint. `checklist.yaml` is current status. `run-log.md` is the audit trail. Your agent should refer to them across compaction and restarts. After each run, mine `run-log.md` into skills, repo evals, and agent guidelines (`AGENTS.md`).

Long-running agents are coding agents that can run for hours or days with minimal intervention. You hand them a scoped job, a verification loop, and a memory system. They keep working, checkpointing, and producing evidence until your task is verifiably complete.
Cursor has written about how their agents work continuously over a week on a single codebase. Examples include a web browser built from scratch with 1M+ lines across about 1,000 files, and a Java LSP with 2.5M+ lines across about 10,000 files. They also report peaks around 10M tool calls in a week, with bursts around 1,000 commits per hour. Source: Scaling long-running autonomous coding.
This playbook is an operating model for how I achieved similar results. It is opinionated. It assumes you want throughput and you still want to ship safely: green CI, reviewable diffs, rollback paths, and no surprises in production.
In concrete terms, a long-running agent:
Great fits:
Prefer interactive work when:
Look for:
If you want one default: Codex CLI + GPT-5.2 (high or xhigh).
For long-running agent work, this is the frontier stack right now. It runs tool loops for hours, compaction stays coherent, and it is easy to steer in real time.
The big advantage is in-flight visibility. Codex shows the reasoning trace in real time so you can redirect early. In my experience this saves a ton of time compared to harnesses where you mostly see output after a long step finishes.
Codex's reasoning trace (thinking blocks) pairs strongly with its steering feature. You can see where the agent is headed mid-flight and cut off bad paths early by sending messages in real time without interrupting its agent loop.
Use your durable memory system for async observability. It is what you read in the morning, after compaction, and after restarts. Treat it as the system of record for decisions and evidence, even if you have thinking blocks.
Claude Code does not expose internal reasoning the same way today. You can still run long jobs, but you should compensate with shorter execution steps, more checkpoints, and stricter logging.
I use other harnesses for three reasons:
Opus 4.6 is strong and I still use it for interactive work.
For long-running background runs, it is not my default. Cursor put it bluntly: "Opus 4.5 tends to stop earlier and take shortcuts when convenient." Source: Scaling long-running autonomous coding.
While Anthropic's Opus 4.6 release claims better performance at long-running tasks, anecdotally it still gives up more easily than GPT-5.2/Codex-5.3 models. Apart from training, I also suspect it's partly because OpenAI allocates a very generous thinking budget at the high and xhigh reasoning levels in Codex.
If you are in a Claude-first stack, you can still run long jobs. You will usually want a stronger harness loop (hooks and a "Ralph loop"), plus the same fundamentals in this playbook. More on that in Harnesses.
Most serious harnesses expose a reasoning effort setting like low, medium, high, xhigh.
Practical rule:
- Long-running background work: `high` or `xhigh`.
- Lighter interactive work: `medium` or coding-specific models (e.g. `gpt-5.3-codex-xhigh`).

Open-weight models are now good enough to matter. With a strict verification loop, they can do a lot of the execution work at a fraction of the cost.
A couple outside takes match what I see:
How I use them:
Two choices matter here:
Pick the harness for workflow. Pick the provider for model access, pricing, caching, and compliance.
A harness is the software around the model. It is the difference between "chat with a model" and "operate an agent".
That includes agent guidelines (`AGENTS.md`) and durable memory files (`plan.md`, `checklist.yaml`, `run-log.md`).

Codex CLI is my default harness for long-running coding work.
Why it is best-in-class for long runs:
- `--yolo` when you want full access on a disposable machine
- Built-in code review (`/review`)

Two features I use constantly:
Practical tips:
- Make sure `plan.md` and `checklist.yaml` are crisp.
- Use `--yolo` only when the environment is isolated and disposable.

Claude Code is one of the most feature-rich harnesses today.
If you want Claude Code to behave like a long-running runner, you need a "keep going even if you want to stop" loop for your agent:
Anthropic documents hooks and the Ralph Wiggum plugin. See Claude Code hooks and Claude Code plugins. Many also roll their own bash loop so they can restart fresh sessions with a clean context window when long runs start to experience context rot.
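A home-rolled version can be a few lines of bash. This is a sketch, not Anthropic's plugin: `agent_cmd` is whatever launches one non-interactive session in your harness, and the done check assumes the `checklist.yaml` task statuses used later in this playbook.

```shell
# "Ralph loop" sketch: restart a fresh agent session until every task in
# checklist.yaml reaches a terminal state, or we hit a run cap.
ralph_loop() {
  local agent_cmd="$1" checklist="${2:-checklist.yaml}" max_runs="${3:-50}"
  local i
  for ((i = 1; i <= max_runs; i++)); do
    $agent_cmd || true   # a crashed session should not kill the loop
    # Done when no task is still pending or in flight.
    if ! grep -Eq 'status: (not_started|in_progress)' "$checklist"; then
      echo "done after $i run(s)"
      return 0
    fi
  done
  echo "hit max_runs=$max_runs without finishing" >&2
  return 1
}
```

The point of the loop is that each iteration starts a fresh session, so the agent rereads the memory files instead of dragging a rotted context forward.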
If you are running Opus for hours, external memory is the crux. Treat plan.md, checklist.yaml, and run-log.md as mandatory, and design your loop to reread them after compaction and restarts.
Cursor has some of the most concrete public writing on long-running agents at scale. Read Scaling long-running autonomous coding even if you do not use Cursor.
The operating model is portable: treat agent runs like running software. Monitor them, recover from errors, and expect loops that can span a full day.
OpenCode is a model-agnostic harness worth understanding.
It has:
- A server mode (`opencode serve`) with an API for automation

Docs: OpenCode.
Pi is a small, extensible terminal harness. It is a good fit when you want a model-agnostic core that you can extend, and you want to bake in your own workflows and memory system.
Why it matters for long-running work:
Pi is also used as an SDK in projects like OpenClaw, which is a good testament to its extensibility. See Pi and OpenClaw.
This space moves fast. The meta-skill is switching stacks without chaos.
My loop:
An eval is a small, repeatable task with a known "done" state. It measures the stuff you care about: repo conventions, tool access, flaky tests, migrations, CI, and constraints.
Benchmarks are useful. They are not enough. Your evals are what you should optimize for.
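In practice, a repo eval is just a script with a known done state: set up a fixture, run the candidate (an agent, a skill, a fix), and assert the world afterwards. A minimal sketch, where the fixture files and fix command are placeholders:

```shell
# Minimal repo eval: pass only if the candidate command turns the
# known-bad start state into the expected end state.
run_eval() {
  local fix_cmd="$1" workdir
  workdir="$(mktemp -d)"
  printf 'expected\n' > "$workdir/golden.txt"   # what done looks like
  printf 'broken\n'   > "$workdir/actual.txt"   # the known-bad start state
  (cd "$workdir" && eval "$fix_cmd")            # the work under evaluation
  diff -q "$workdir/golden.txt" "$workdir/actual.txt" >/dev/null
}
```

Real evals swap the two files for a fixture branch and a verification command, but the shape stays the same: deterministic setup, one command, a mechanical pass/fail.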
How to build them:
- Mine `run-log.md` for failures and manual interventions, then turn those into eval tasks.
- Periodically ask an agent to review `run-log.md` and propose new evals.

Long-running agents can burn a lot of tokens. Set a budget. Pick a routing strategy. Then run your verification loop often so you do not pay to learn the same thing twice.
If you want to run long background agents today, ChatGPT Pro is the best subscription for it.
Price check: ChatGPT Pro is $200/month as of 2026-02-24. See ChatGPT pricing.
Two real reasons:
In one refactor run that consumed 52M+ tokens (and 2B+ cached tokens), I still had about 35% of my weekly limit remaining on Pro. I was also doing other heavy Codex usage that did not appear in the token count. The limit window reset by the time the job finished, so it had no real impact on my next runs.
These plans evolve. Check the current limits before you commit to a workflow. Some people buy multiple subscriptions to increase headroom. Read the terms and use judgment.
Claude Max has strict five-hour windows and message caps. That is fine for interactive work. It becomes a bottleneck for multi-hour background runs and parallel agents. See Anthropic usage limit best practices.
If you are doing this on a company codebase, a personal flat-rate plan might not be viable. Compliance, billing, and data policies usually push you toward enterprise plans or API usage.
If you are paying per token, do not run everything on the most expensive model. Route:
Concrete examples:
The rule is simple: switching models should not change the verification bar. Keep the same acceptance criteria, evidence, and audit trail.
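The routing itself can be a dumb lookup, as long as the verification bar stays fixed. A sketch: the model names come from this playbook's tables, and the task-class split is my own default, not a standard.

```shell
# Route by task class, not by run. Judgment-heavy steps get the
# expensive model; verified execution can go to a cheaper one.
pick_model() {
  case "$1" in
    plan|review) echo "gpt-5.2-xhigh" ;;  # spend on judgment-heavy steps
    execute)     echo "kimi-k2.5"     ;;  # cheap execution, same verification
    *)           echo "gpt-5.2-high"  ;;  # safe default
  esac
}
```

Whatever `pick_model` returns, the acceptance criteria, evidence, and audit trail stay identical.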
This is what one big refactor run looked like for me:
| Metric | Tokens |
|---|---|
| Total | 52,146,126 |
| Input | 46,053,693 |
| Cached input | 2,164,821,248 |
| Output | 6,092,433 |
| Reasoning | 3,542,400 |
Cached input is the story here. That is repeated context. Prompt caching changes the economics by an order of magnitude.
Assumptions:
| Option | Input ($/1M) | Cached input ($/1M) | Output ($/1M) | Est cost (this run) | Notes |
|---|---|---|---|---|---|
| ChatGPT Pro | n/a | n/a | n/a | $200.00 | Flat subscription if you stay within limits |
| GPT-5.2 | 1.75 | 0.175 | 14.00 | $544.73 | OpenAI API pricing |
| Claude Opus 4.6 | 5.00 | 0.50 | 25.00 | $1,464.99 | Anthropic pricing + prompt caching reads |
| Kimi K2.5 | 0.60 | 0.10 | 3.00 | $262.39 | Moonshot pricing + context caching |
| GLM-4.7 | 0.60 | 0.11 | 2.20 | $279.17 | Z.ai pricing (cached storage limited-time free) |
| GLM-5 | 1.00 | 0.20 | 3.20 | $498.51 | Z.ai pricing (cached storage limited-time free) |
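As a sanity check, the GPT-5.2 estimate can be reproduced from the token table, assuming cached input bills at the cached rate and reasoning tokens are already counted inside output:

```shell
# Reproduce the GPT-5.2 row: input + cached input + output, each at $/1M.
awk 'BEGIN {
  input  = 46053693;   input_rate  = 1.75
  cached = 2164821248; cached_rate = 0.175
  output = 6092433;    output_rate = 14.00   # reasoning is inside output
  cost = (input * input_rate + cached * cached_rate + output * output_rate) / 1e6
  printf "estimated cost: $%.2f\n", cost
}'
```

Run it and the cached-input term dominates: without caching, those 2.16B tokens would bill at the full input rate, roughly ten times more.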
Pricing sources:
Good places to spend:
Places to be stingy:
Long-running agents work because they can run unattended. That also means they can fail unattended.
There are two sane ways to operate.
Use this when the agent is running on your real machine, or it can touch anything you care about.
This mode takes setup. It reduces surprise.
If you want real autonomy, isolate first and then loosen permissions.
- Run `--yolo` or equivalent so the agent can unblock itself without you babysitting it

If the environment is disposable, the agent can safely do open-ended work like installing CLIs, running scripts, and chasing flaky failures.
Rules I follow:
`--yolo` (or similar "dangerous" modes) is for environments that are already isolated.
This mode is often necessary for open-ended jobs. When an agent is stuck, it may need to install tooling, poke at the network, or run cleanup tasks without waiting for you.
Good patterns:
If you go this route, treat the environment as disposable. Assume it may get corrupted.
If you want full autonomy without fear:
Long-running agents succeed when you treat them like production systems.
Non-negotiables:
- Observability: `run-log.md`, `checklist.yaml`, CI, and in-flight visibility (tool traces, plus thinking blocks in Codex) should make it obvious what happened and what is verified.
- Self-evolution: turn lessons into skills, agent guidelines (`AGENTS.md`), and repo-local evals.

For long runs, I keep three files at the root of the repo:
- `plan.md` for the blueprint and constraints
- `checklist.yaml` for executable work items and acceptance criteria
- `run-log.md` for an append-only ops log (decisions, evidence, failures, fixes)

These files do three jobs:
- Self-evolution: after the run, mine `run-log.md` and `checklist.yaml` into skills, agent guidelines, and repo evals.

Tooling is the difference between a plain-text LLM and an autonomous agent.
Skills are small instruction bundles that teach your agent how to do a specific thing inside your world.
There is an emerging "Agent Skills" standard for packaging skills in a portable way. See Agent Skills.
Typical structure:
- `SKILL.md` with YAML metadata (name, description)
- `scripts/` for deterministic automation
- `references/` for docs, schemas, and runbooks
- `assets/` for static files

The key trick is progressive disclosure:
- The harness loads skill names and descriptions up front, and reads the `SKILL.md` body only when needed.

Good skills:
Examples worth having in most repos:
A strong pattern:
If you want a good mental model for skills and progressive disclosure, read Skills and shell tips.
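Following the structure above, a minimal skill might look like this. The skill name, steps, and referenced files are illustrative; `gh run view --log-failed` is a real GitHub CLI command.

```markdown
---
name: ci-triage
description: Diagnose a failing CI run in this repo and propose the smallest fix.
---

# CI triage

1. Pull the failing logs: `gh run view --log-failed`.
2. Reproduce locally with the fast loop from `plan.md`.
3. Record the root cause and the fix in `run-log.md`.

See `references/ci-runbook.md` for known flaky suites.
```

The frontmatter is what gets loaded eagerly; the body and `references/` are only read when the skill is actually invoked.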
MCP (Model Context Protocol) is a standard way to connect agents to external tools and data sources.
Common uses:
High leverage examples:
Tradeoff: MCP can be heavy. It can bloat the context with tool instructions and verbose outputs.
Use MCP when:
Use CLI tools when:
MCP is useful. It is also easy to reach for too early.
Most harnesses load tool descriptions eagerly at the start of the run so the model knows what is available. This also helps prompt caching. The context cost still counts, even when the tokens are cheap.
Two problems show up fast:
Context tax. Big tool surfaces steal working memory. For browser tooling, it is common to burn about 14k to 18k tokens just describing the toolset. That cost hits before the job starts. (What if you don't need MCP at all?)
Weak composability. A lot of composition happens through inference instead of code. Outputs flow back through the model, then you ask it to stitch the steps together. Long runs get harder to rerun, harder to audit, and harder to keep deterministic. Code and CLI workflows compose cleanly. (Tools: Code Is All You Need)
A code-first setup is boring and it works:
- Record commands, outputs, and evidence in `run-log.md`.

Use MCP when you need real system access with auth, or when the external system is the work (SaaS, tickets, CRM, on-call tooling). Keep servers minimal and outputs structured.
Cloudflare's "Code Mode" is a cool example of where this is heading: a fixed tool surface (search + execute), a typed SDK, and a sandboxed runtime. It is new, so treat it as a signal and watch the space. (Code Mode: give agents an entire API in 1,000 tokens)
Browser automation is a great verification loop when:
Common approaches:
Two rules:
- Record the evidence (screenshots, console output, traces) in `run-log.md`.

Once you internalize the code-first approach, CLI tools become the default tool layer.
CLI tools are perfect for long-running agents:
If a tool is well known (like gh), the agent will usually figure it out. If a tool is obscure, have the agent write a wrapper script and a short README, then turn it into a skill.
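For obscure tools, the wrapper can be tiny. A hypothetical example of the shape that works well for agents: one verb per subcommand, structured output on stdout, and a nonzero exit on misuse.

```shell
# Hypothetical wrapper an agent can discover and reuse. The tool name,
# subcommands, and log path are all illustrative.
billing_tool() {
  case "${1:-help}" in
    status) echo '{"ok": true, "service": "billing"}' ;;   # JSON for parsing
    logs)   tail -n "${2:-50}" logs/app.log ;;             # bounded output
    *)      echo "usage: billing_tool {status|logs [n]}" >&2; return 64 ;;
  esac
}
```

Pair it with a three-line README and the agent can self-serve; promote it to a skill once it proves useful.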
Long runs are won in the planning phase.
Ask the agent to ask you questions until scope is crisp:
A long run starts the same way every time:
- Fill in `plan.md`, `checklist.yaml`, `run-log.md`.
- Execute `checklist.yaml` until the tasks reach terminal states.

One rule: do not keep a second plan in chat. The plan lives in `plan.md`. Execution state lives in `checklist.yaml`. Evidence lives in `run-log.md`.
Long-running agents need a self-verifying feedback loop. Pick the loop, then make it explicit in the plan.
Common loops:
If you do not have a loop, build the loop first.
Pattern: red/green TDD. A simple, agent-friendly build-verify loop is: write a test that fails (red), make it pass (green), repeat. It forces the agent to prove progress and prevents plausible-but-wrong changes.
Use "red/green TDD" explicitly in prompts; models understand the shorthand. Source: Simon Willison: Red-green TDD (agentic engineering patterns).
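The cycle is mechanical enough to encode. A sketch that enforces the order; `fast_loop` and `make_change` stand in for your real test command and edit step:

```shell
# Red/green in executable form: the verify step must fail before the
# change (red) and pass after it (green).
red_green() {
  local fast_loop="$1" make_change="$2"
  if eval "$fast_loop"; then
    echo "test already passes: not a valid red state" >&2
    return 1
  fi
  eval "$make_change"
  eval "$fast_loop"   # green, or the cycle failed
}
```

The "already green" guard is the important part: it catches tests that never could have failed, which is how plausible-but-wrong changes sneak through.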
Skills (and MCP if you need data from external systems) are critical to make sure your agent has all the tools for a successful run.
Skills
- Skills to have: `repo-map`, `ci-triage`, `refactor-pattern`, `verify`, `clean-commit`, `review`.
- Reference them in `plan.md` so routing is deterministic.
- Add a warm-up task to `checklist.yaml` that runs the key skills once and records outputs in `run-log.md`.

MCP
- Add a smoke-test task to `checklist.yaml` to confirm each MCP server works, and write results to `run-log.md`.

Golden examples work best when they are encoded into skills.
Workflow:
Now the long-running agent can replicate the pattern safely.
Pre-flight:
- Confirm `main` is green, or capture current failures in `run-log.md`.
- Draft `plan.md`.

Kickoff:
- Create `checklist.yaml` with tasks that have real acceptance criteria.

During the run:
Landing:
- Run an agent review (`/review` in Codex or your review subagent).
This section has four copy-paste blocks:
- The kickoff prompt
- `plan.md` (blueprint and constraints)
- `checklist.yaml` (work items and acceptance criteria)
- `run-log.md` (decisions and evidence)

The `plan.md` template below borrows ideas from OpenAI's ExecPlan approach for Codex. See Using PLANS.md for multi-hour problem solving.
Paste this into your harness to start a long run. It forces the agent to fill in plan.md, checklist.yaml, and run-log.md at the repo root instead of inventing new structure.
You are going to run a long background job in this repo.
Before writing code:
1) Ask me clarifying questions until you can fill in `plan.md`, `checklist.yaml`, and `run-log.md` at the repo root. Follow the structure already in those files. Do not invent a new format.
2) Create or update those three files at the repo root.
3) Stop and wait for approval.
After I approve:
- Execute `checklist.yaml` until tasks reach terminal states.
- Treat the build-verify loop as the source of truth. Propose the fast loop and the full suite.
- Keep `checklist.yaml` updated as you work. Do not delete tasks. Append new tasks when discovered.
- Record decisions and evidence in `run-log.md` with timestamps.
- After compaction or restarts, reread `plan.md`, `checklist.yaml`, and `run-log.md` before continuing.
- Keep diffs reviewable: small commits, reversible steps, milestone checkpoints.
`plan.md`:

# Plan: <project name>
Last updated: <YYYY-MM-DD HH:MM>
This plan is a living document. Keep it self-contained. Assume a new person, or a fresh agent session, will restart from:
- `plan.md` (this file)
- `checklist.yaml` (execution state machine + acceptance criteria)
- `run-log.md` (append-only ops log: decisions, evidence, failures, fixes)
## Purpose / Big picture
In 3 to 6 sentences:
- Why this work matters
- What changes for a user
- How to see it working
## Progress and logs
- Progress is tracked in `checklist.yaml`. Keep it updated as you go.
- Decisions, surprises, and evidence go in `run-log.md` with timestamps.
## Context and orientation
Explain the current state as if the reader knows nothing about this repo.
- System overview:
- Key files (repo-relative paths):
- `path/to/file`: <why it matters>
- Glossary (define any non-obvious term you use):
- <term>: <definition>
## Scope
### Goal
<1 paragraph, concrete and testable>
### Non-goals
- <bullets>
### Acceptance criteria (behavior)
- <bullets that a human can verify>
## Constraints and invariants
- Must not change:
- Safety:
- Compatibility:
- Performance:
## Plan of work (milestones)
Write this as a short narrative. For each milestone, say what exists at the end that does not exist now, and how to prove it.
1. Milestone 1: <name>
- Outcome:
- Proof:
2. Milestone 2: <name>
- Outcome:
- Proof:
## Implementation map
Describe the concrete edits you expect.
- Change:
- File:
- Location (function, module, class):
- What to change:
- Add:
- File:
- What to add:
Keep `checklist.yaml` aligned with this plan. If you discover new required work, add tasks. Do not delete tasks.
## Verification
- Fast loop (iteration):
- `...`
- Full suite (milestones):
- `...`
- Expected outputs:
- <what "green" looks like>
- Evidence to capture in `run-log.md`:
- <commands run, links, screenshots, perf numbers, query results>
## Rollout and rollback
- Rollout steps:
- Feature flags:
- Backfills and migrations:
- Rollback plan:
## Idempotence and recovery
- What can be rerun safely:
- What can fail halfway:
- How to retry safely:
- How to clean up:
## Interfaces and dependencies
Be explicit about:
- New APIs or contracts:
- New dependencies:
- Migrations and compatibility:
## Skills and tools
- Skills to use:
- <skill name>: <why>
- MCP servers (if any):
- <server>: <why>
## Golden examples
- <links to commits/files that define the pattern>
## Open questions
- <questions that block execution>
`checklist.yaml`:

This is the agent checklist format I use in other repos.
summary:
last_updated: "<YYYY-MM-DD HH:MM>"
active_task_id: ""
status: "not_started"
blockers: []
next_actions: []
verification_loop:
fast: []
full: []
instructions: |-
This `checklist.yaml` is the execution plan for the repo. It is the source of truth for what "done" means.
Agent workflow:
- Implement tasks until they reach a terminal state: `complete`, `failed`, or `archived_as_irrelevant`.
- Do not mark a task `complete` until every item in `acceptance_criteria` has been met and verified.
- Verify acceptance criteria with real evidence (commands/tests run, screenshots, API calls, queries).
- Record assumptions, decisions, and verification evidence in each task's `note`, and in `run-log.md`.
- If a task fails after multiple approaches, set status to `failed` and summarize what was tried and why it failed.
- Keep tasks updated as you work: `not_started` -> `in_progress` -> `complete` (or `failed` / `archived_as_irrelevant`).
- Add new granular tasks when new required work is discovered. Do not delete tasks. Archive if irrelevant.
Task schema:
- `task_id`: Stable unique identifier (string).
- `title`: Short summary (string).
- `description`: What to build and why (string; can be multiline).
- `acceptance_criteria`: Verifiable checklist items (list of strings).
- `implementation_notes`: Optional tips or constraints (string; optional).
- `status`: One of `not_started`, `in_progress`, `complete`, `failed`, `archived_as_irrelevant`.
- `note`: Freeform working log for the agent (string). Leave empty until there is something worth recording.
guidelines:
- "Do not change public APIs without updating callers."
- "Do not add dependencies without justification."
- "Prefer small, reversible steps and frequent verification."
- "Do not commit secrets."
tasks:
- task_id: T001
title: "Establish baseline"
description: "Run the fast loop and full suite on main, record failures if any."
acceptance_criteria:
- "Fast loop passes (or failures recorded in run-log.md)."
- "Full suite passes (or failures recorded in run-log.md)."
implementation_notes: ""
status: not_started
note: ""
- task_id: T002
title: "<fill in>"
description: "<fill in>"
acceptance_criteria:
- "<fill in>"
implementation_notes: ""
status: not_started
note: ""
`run-log.md`:

Append-only. Timestamp everything. Tag entries so you can scan it quickly.
# Run log
## Context
- Repo:
- Branch/worktree:
- Start time:
## Events (append-only)
- <YYYY-MM-DD HH:MM> decision: ...
- <YYYY-MM-DD HH:MM> evidence: ...
- <YYYY-MM-DD HH:MM> failure: ...
- <YYYY-MM-DD HH:MM> fix: ...
- <YYYY-MM-DD HH:MM> note: ...
## Follow-ups
- Skills to extract:
- Evals to add:
- Agent guidelines to update (`AGENTS.md`):
- Docs to update:
- Tech debt spotted:
Once plan.md and checklist.yaml look good, execution should feel boring.
Execution rules I use:
- Work each task to a terminal state (`complete`, `failed`, or `archived_as_irrelevant`).
- Do not mark a task `complete` until acceptance criteria are met and verified.
- Keep working context in task `note` fields.
- Record decisions and evidence in `run-log.md` and in the task note.
- Summarize failed approaches in `note` fields and in `run-log.md`.
- Turn recurring lessons into skills and agent guidelines (`AGENTS.md`), and add a repo eval.
- Keep `summary` in `checklist.yaml` updated so a human can see current status in 30 seconds.

This is the loop I want the agent to run for hours:
- Pick the next `not_started` task in `checklist.yaml`.
- Reread `plan.md` and the relevant code.
- Update the task `note` with what changed and what was verified.
- Append evidence to `run-log.md`.

Milestones:
- Run the full verification suite at each milestone before marking it `complete`.

Every long run eventually hits the context window.
Harnesses handle this with compaction: they summarize history, prune tool output, and keep going. Done well, it feels invisible. Done poorly, the agent forgets constraints and starts making reasonable, wrong decisions.
How to make compaction a non-issue:
- Keep constraints and invariants in `plan.md`.
- Keep execution state in `checklist.yaml` (status, notes, acceptance criteria).
- Keep decisions and evidence in `run-log.md`.

If a run goes sideways, do not salvage it for hours. Fork the session and restart from a clean checkpoint.
Steering is how you correct the agent mid-run so the next cycle adapts fast.
It works best when you have observability:
- `checklist.yaml` tells you what it thinks it is doing right now.
- `run-log.md` tells you what went wrong and what it learned.

My cadence:
What to look for:
When it drifts, steer quickly:
High-leverage steer messages (copy/paste):
- "Reread `plan.md`, then continue with the next checklist task only."
- "Record the failure in `run-log.md`, then change approach."

Review is where long-running work becomes shippable.
Start by asking an agent to review the diff against your acceptance criteria.
In Codex, use /review or codex review with xhigh effort. If you are not on Codex, create a review skill or a review subagent that always does the same workflow.
Example review prompt:
Review this branch against `main`.
Output:
- A risk-ranked list of files and changes (high, medium, low)
- Bugs and correctness issues
- Security issues (auth, permissions, secrets, injection risks)
- Concurrency, retries, and idempotency risks
- Consistency with repo conventions (patterns, style, architecture)
- Missing tests and weak verification
- Migration or rollback risks
- Performance and resource risks
- Suggested follow-up tasks to add to checklist.yaml
Then propose a minimal patch set to fix the top issues.
After agent review, do a human pass that is biased toward high blast radius areas:
Open a draft PR early and let CI run repeatedly while the agent continues.
A strong pattern:
This avoids a giant "everything fails at the end" moment and keeps your local machine responsive.
For big refactors and migrations:
Every long run produces edge cases. Turn them into permanent leverage.
After the PR lands:
- Ask an agent to read `run-log.md` and `checklist.yaml` and propose:
  - New or updated skills
  - Agent guideline updates (`AGENTS.md`)
  - New repo evals
- Keep `run-log.md` current so the next run starts stronger.

Parallel agents are a force multiplier when the work streams are independent.
Give each agent its own checkout.
Two common patterns:
Worktrees:
git worktree add ../repo-agent-refactor -b agent/refactor
git worktree add ../repo-agent-migration -b agent/migration
Separate clones (cleaner isolation, slower setup):
git clone "$(pwd)" ../repo-agent-review
git clone "$(pwd)" ../repo-agent-perf
Rules that keep you sane:
Avoid parallelizing interdependent work.
When tasks depend on each other, you get:
The hidden cost is lost learning. During execution, the agent learns repo quirks, test behavior, and edge cases. That context is hard to propagate across parallel runs. If the work is coupled, run it sequentially or force the learning into shared memory files.
Sequential wins here. Let one agent carry the context forward.
Ask the agent to commit early and often.
What you get:
This comes up constantly in long-running work. Agents can land the right end state with a messy branch. "Clean-commit" is the workflow I use to rebuild the same end state with a commit history you can review and ship.
If the end state is correct but the branch is a mess, rebuild it with a clean commit storyline.
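A runnable miniature of that rebuild, using a toy repo. All branch, path, and commit names are illustrative; the real version slices by whatever logical boundaries your diff has.

```shell
# Toy setup: agent/refactor has the correct end state but a messy history.
cd "$(mktemp -d)"
git init -q -b main .
git config user.email agent@example.com
git config user.name  agent
echo base > README.md && git add -A && git commit -qm "init"
git checkout -qb agent/refactor
mkdir -p src/billing
echo client > src/billing/client.txt
echo caller > main.txt
git add -A && git commit -qm "wip: everything at once"
git checkout -q main

# The clean-commit rebuild: import the final tree, commit in logical slices.
# Note: `git restore` will not delete files the messy branch removed.
git checkout -qb agent/refactor-clean
git restore --source=agent/refactor --worktree .
git add src/billing/
git commit -qm "refactor: extract billing client"
git add -A
git commit -qm "refactor: migrate callers"
git diff --stat agent/refactor   # empty output means the trees match
```

The final `git diff` against the messy branch is the safety net: if it is empty, the clean branch has the exact verified end state, just with a reviewable history.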
Workflow:
- Start from the merge base (usually `main`).
- Create a fresh branch from `main` and re-apply the end state as reviewable commits.

Rules:
- `brew install try-cli` (or `gem install try-cli`)
- `eval "$(try init)"` (put this in your `.zshrc`)
- `try . agent-refactor`

Use whatever keeps your operating surface simple.
Long-running agents work best when the team agrees on a few defaults.
- Shared memory file conventions (`plan.md`, `checklist.yaml`, `run-log.md`).
- A post-run habit: turn lessons into skills and agent guidelines (`AGENTS.md`), and add repo evals.

If you do not standardize, every long run becomes a bespoke ops problem.
These are the traps that waste the most time. Read this once, then treat it as a reference.
- Overly broad test commands. Scope the loop to what changed: `./gradlew :app:test --tests 'com.yourco.payments.*'` or `./gradlew :app:test --tests '*BillingServiceTest'` beats rerunning the whole suite every cycle.
- Resource exhaustion. Watch long runs with `btop` or `htop`, and clean up what agents leave behind (`tmp/`, `logs/`, browser profiles, build caches).
- Scattered outputs. Write artifacts to standard directories (`logs/`, `tmp/`, `artifacts/`) and reference them.
- Memory loss. After compaction or restarts, reread `plan.md`, `checklist.yaml`, and `run-log.md`.
Make the run observable:
- `checklist.yaml` is your status page. It should answer: what is active, what is next, what is blocked.
- `run-log.md` is your audit trail. It should answer: what changed, what was verified, what went wrong, what got fixed.

Two rules that keep it sane:
For the post-run loop, see Self-evolution.
Long-running agents change the shape of engineering work. You spend less time on mechanical execution. You spend more time on direction, constraints, and review.
The dream output is a PR you can ship: green CI, a reviewable diff, and a run-log.md that explains what happened with links to evidence. Mine the run log for repeatable workflows, then turn them into skills.
You get there with boring discipline:
- Scope work into `checklist.yaml`, with acceptance criteria the agent can verify.
- Keep the memory files honest: `plan.md` is the blueprint. `checklist.yaml` is current status and "done". `run-log.md` is decisions, evidence, and an audit trail. Your agent should refer to them across compaction and restarts.
- Observe runs through `checklist.yaml`, `run-log.md`, and CI.

Models, pricing, and limits will keep moving. Build a repeatable system from first principles: the agent makes a change, runs tools to verify, records evidence, and repeats. Keep the run observable so you can direct it when needed. If that loop is solid, switching models is a routing decision. Your process stays the same.
A good run feels calm. The checklist stays accurate. The agent keeps producing evidence. Your interventions are short and early. Landing the PR feels routine.