Ralph Loop | Ralph Loop Blog

How to Inspect and Debug Inside an AI Agent Sandbox

Wed, 27 May 2026 00:00:00 GMT

When an autonomous agent gets stuck, do not guess from the outside. Open a shell in the running sandbox and look. A Ralph sandbox is a normal container once you are inside it, so you find it with sbx ls, shell in with sbx exec -it <name> bash, and debug the agent the same way you debug any other process. This is the field guide to inspecting and debugging an AI agent sandbox: locating the microVM, getting an interactive shell, reproducing the failure by hand, and reading the network log.

When to get inside the sandbox

A loop driven by Ralph does not fail silently. The agent emits an explicit signal when it cannot proceed: <promise>BLOCKED:reason</promise> when it needs human help (exit code 2) or <promise>DECIDE:question</promise> when it needs a decision (exit code 3). A run that hits the iteration cap exits with 1, and a clean finish exits with 0. The promise text tells you what the agent thinks went wrong. It does not always tell you why.
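If you wrap the loop in your own script, the exit codes above are easy to branch on. This is a minimal sketch: describe_ralph_exit is a hypothetical helper name, and only the four codes listed in the text are assumed.

```shell
# Map a Ralph exit code to the meaning described in the text.
# describe_ralph_exit is a hypothetical helper, not part of Ralph itself.
describe_ralph_exit() {
  case "$1" in
    0) echo "finished cleanly" ;;
    1) echo "hit the iteration cap" ;;
    2) echo "BLOCKED: needs human help" ;;
    3) echo "DECIDE: needs a decision" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Usage (commented out so the snippet has no side effects):
#   ./ralph.sh; describe_ralph_exit "$?"
```

Codes 2 and 3 are the ones where opening a shell in the sandbox usually pays off.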

That gap is what a shell closes. You get inside when a task keeps failing the same way, when an install hangs, when tests pass for you but not for the agent, or when you just want to see the state the agent left behind. Because the agent runs in bypass-permissions mode (see running agents in YOLO mode safely for why that is fine inside a microVM), the inside of the sandbox holds the real evidence: the working tree, the logs, the installed packages, and the exact environment the agent saw.

The sandbox is the same boundary described in the pillar guide, how to run AI coding agents in Docker Sandboxes safely. Debugging it does not weaken that boundary. You are stepping into the contained space on purpose, doing your work, and stepping back out.

Find the sandbox you need to debug

You cannot shell into a sandbox you cannot name. Ralph names every sandbox deterministically, so the name is predictable, but you still want to confirm what is actually running.

List everything with sbx ls

Start with the list:

sbx ls

This prints every sandbox on the machine. The one you want follows the Ralph format:

ralph-<agent>-<current-dir>-<hash8>

A project at /Users/me/Work/My App running Claude shows up as ralph-claude-my-app-a1b2c3d4. If you ran more than one agent against the same project, you will see one sandbox per agent, because the agent slug is part of the name. That is deliberate, so a Claude run and a Codex run never stomp on each other’s state.

Ask Ralph for the exact name

You do not have to read the name off a list. Ralph prints it on the Starting Ralph line at launch, and you can ask for it any time:

./ralph.sh --print-name
./ralph.sh --print-name --agent cursor

The --agent flag targets a specific agent’s sandbox for the same project. Capture the value once and reuse it for every sbx command below, so you are not retyping a hash.

Decode the deterministic name

The name carries information, which helps when you have several sandboxes open. The pieces are documented in the repo’s sandbox naming notes:

<agent> is the agent slug, lowercased: claude, codex, copilot, cursor, gemini, or opencode.
<current-dir> is the basename of the project directory, sanitized to [a-z0-9-].
<hash8> is the first eight hex characters of sha256 of the absolute project path.

The path hash is the part that prevents collisions. Two directories both named app, one at ~/Work/app and one at /tmp/app, get different hashes and therefore different sandboxes. So when sbx ls shows two ralph-claude-app-... entries, the trailing hash is how you tell them apart.
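The scheme is simple enough to sketch in shell. Treat this as an illustration of the format rather than Ralph's actual code: it assumes the path is hashed without a trailing newline and that runs of disallowed characters collapse to a single dash, neither of which the naming notes quoted here confirm. The function names are hypothetical.

```shell
# Sketch of the naming scheme: ralph-<agent>-<current-dir>-<hash8>.
# Assumptions (not confirmed by the source): the absolute path is hashed
# without a trailing newline, and disallowed characters collapse to one dash.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | cut -d ' ' -f 1
  else
    shasum -a 256 | cut -d ' ' -f 1   # fallback where sha256sum is missing
  fi
}

ralph_sandbox_name() (
  agent="$1"
  project_path="$2"
  # Lowercase the basename and keep only [a-z0-9-].
  slug="$(basename "$project_path" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9-]/-/g' -e 's/--*/-/g' -e 's/^-//' -e 's/-$//')"
  # First eight hex characters of sha256 of the absolute project path.
  hash8="$(printf '%s' "$project_path" | sha256_hex | cut -c 1-8)"
  printf 'ralph-%s-%s-%s\n' "$agent" "$slug" "$hash8"
)

# Same basename, different absolute paths, different trailing hashes.
ralph_sandbox_name claude "/tmp/app"
ralph_sandbox_name claude "/home/me/Work/app"
```

If the output disagrees with ./ralph.sh --print-name, trust Ralph; the sketch only shows why two app directories cannot collide.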

Shell into the sandbox with sbx exec

This is the command you came for. Open an interactive shell:

sbx exec -it ralph-claude-my-app-a1b2c3d4 bash

The -it flags allocate an interactive terminal. Now you are inside the microVM with full control, the same as any container. You can install packages, run the test suite, read files, and poke at the environment the agent ran in.

Navigate to the project path

The project is mounted at the same absolute path it has on your host. If your project lives at /Users/you/Work/my-app on the host, it lives at /Users/you/Work/my-app inside the sandbox too. That is on purpose, so tooling, config files, and lockfiles resolve identically and the agent never trips over a path difference.

So the first move after shelling in is usually:

cd /Users/you/Work/my-app

Now your working directory matches the agent’s. Any relative path the agent used resolves the same way for you.

Run the agent CLI by hand

The most useful trick is driving the agent yourself, interactively, from inside the sandbox. You see exactly what it sees, including the permission mode and the network reality:

sbx exec -it ralph-claude-my-app-a1b2c3d4 bash
cd /Users/you/Work/my-app
claude

Swap claude for codex, copilot, cursor, gemini, or opencode depending on which agent that sandbox was built for. Running the CLI by hand is how you tell apart a prompt problem from an environment problem. If the agent stalls the same way interactively, the issue is the task, the prompt, or the repo state. If it works by hand but fails in the loop, the issue is in how the loop set things up, and you can compare the two.

Reattach a stopped sandbox with sbx run

sbx exec works while the sandbox is running. To reattach the exact sandbox Ralph uses, for a manual login or a longer debugging session, use the attach form of sbx run:

sbx run ralph-claude-my-app-a1b2c3d4

There is one sharp edge here that trips people up, so here it is plainly. The create form, sbx run --name <name> <agent> ., is create-only. Passing --name for a sandbox that already exists fails with an error that says --name can only be used when creating a new sandbox. So you use the --name create form exactly once, before the sandbox exists, and the bare sbx run <name> attach form every time after.

Ralph handles this branching for you. It probes sbx ls before every iteration and emits the create form when the sandbox is missing and the attach form when it exists. That probe is also why a loop self-heals: if you sbx rm a sandbox mid-run while debugging, the next iteration simply creates it again. Re-running ./ralph.sh, ./ralph.sh --login, or ./ralph.sh --ports is therefore safe at any point.
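The branching can be sketched as a small function that reads a sandbox listing and prints the right form. This is not Ralph's implementation: sbx_run_form is a hypothetical name, and the sketch assumes the sandbox name appears as a whitespace-delimited field somewhere in the sbx ls output.

```shell
# Print the create form or the attach form of sbx run.
# Reads a sandbox listing on stdin, in the style of `sbx ls` output.
# Assumption: the sandbox name shows up as a whitespace-delimited field.
sbx_run_form() (
  name="$1"
  agent="$2"
  if awk -v n="$name" '
       { for (i = 1; i <= NF; i++) if ($i == n) found = 1 }
       END { exit !found }
     '; then
    printf 'sbx run %s\n' "$name"                      # exists: attach
  else
    printf 'sbx run --name %s %s .\n' "$name" "$agent" # missing: create
  fi
)

# Usage (commented out because it needs sbx on the PATH):
#   sbx ls | sbx_run_form ralph-claude-my-app-a1b2c3d4 claude
```

Because the probe runs against the live listing each time, removing the sandbox between calls simply flips the output back to the create form.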

Debug inside: logs, reproduction, and missing tools

Once you have a shell, the sandbox is just an environment. Three moves cover most debugging sessions.

Read the agent’s own memory layer

Ralph keeps its state on disk, not in chat history, because the filesystem and git history are the memory layer. That means everything the agent knows is sitting in the project, ready to read:

cat .agent/logs/LOG.md
ls .agent/history/
cat .agent/tasks.json
git log --oneline -20

.agent/logs/LOG.md is the running log. .agent/history/ holds per-iteration logs, so you can see what happened on the iteration that broke. .agent/tasks.json is the task lookup table, and the git log shows what the agent actually committed. Reading these in order usually tells you whether the agent thrashed on one task, committed something broken, or stalled waiting on a decision. For the wider picture of making a run auditable from the outside, including live output and screenshots, see observability for autonomous coding agents.
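When the history directory has many entries, jumping straight to the newest one saves time. A minimal sketch, assuming the most recently modified entry in .agent/history is the iteration you care about; the entry naming is not specified here, so the helper (a hypothetical name) sorts by modification time instead.

```shell
# Print the path of the most recently modified entry in the history directory.
# Assumption: modification time reflects iteration order.
latest_history_file() (
  dir="${1:-.agent/history}"
  newest="$(ls -t "$dir" 2>/dev/null | head -n 1)"
  # Fail quietly when the directory is missing or empty.
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
)

# Usage, from the project path inside the sandbox:
#   cat "$(latest_history_file)"
```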

Reproduce the failing command

The fastest way to understand a failure is to run the thing that failed. Inside the sandbox, at the project path, run the verification stack the loop assumes:

npm test
npx tsc --noEmit
npx eslint .
npx playwright test

If the agent reported a failing test, run that test directly and read the full output without the loop’s framing. The repo mantra is “if you didn’t test it, it doesn’t work,” and the same applies to your debugging: reproduce the exact failure before you theorize about it. Because you are inside the same microVM the agent used, a failure you reproduce here is the real failure, not a near miss caused by a different machine.
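Running the stack one command at a time stops at the first failure you notice. A small wrapper that runs every check and then summarizes can show whether one layer or several are broken. This is a sketch with a hypothetical helper name, run_checks; the check commands are the ones listed above.

```shell
# Run every check even if an earlier one fails, then summarize.
# Each argument is one command line; its output goes to stderr so the
# PASS/FAIL summary on stdout stays readable.
run_checks() (
  failed=0
  for check in "$@"; do
    if sh -c "$check" >&2; then
      printf 'PASS  %s\n' "$check"
    else
      printf 'FAIL  %s\n' "$check"
      failed=$((failed + 1))
    fi
  done
  printf '%d of %d checks failed\n' "$failed" "$#"
  [ "$failed" -eq 0 ]
)

# Usage, inside the sandbox at the project path:
#   run_checks "npm test" "npx tsc --noEmit" "npx eslint ." "npx playwright test"
```

The function returns non-zero when any check fails, so it composes with && in a longer debugging script.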

Install missing tooling

Sometimes the agent is blocked because a binary it needs is not installed. You have full root-style control inside the sandbox, so install it and confirm the theory:

apt-get update && apt-get install -y <package>

If an install itself fails, that is almost never a package manager bug. It is the network gate, which is the next section.

Debug the network with sbx policy log

Docker Sandboxes are deny-by-default on outbound HTTP and HTTPS. So the single most common “the agent is stuck” cause is a blocked connection: npm install fails, an API call is refused, or a package download hangs. The gate is doing its job, but you need to see what it blocked.

Read the connection history:

sbx policy log

This shows which outbound connections the sandbox attempted and which the policy allowed or denied. List the active rules to see what is currently permitted:

sbx policy ls

When the log shows a denied host the task legitimately needs, allow that specific domain rather than opening everything:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 registry.npmjs.org

Allow several at once with a comma-separated list, which is the usual case when an install is blocked:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "*.npmjs.org,*.pypi.org,files.pythonhosted.org,github.com"

Changes take effect immediately and persist across sandbox restarts. Resist the urge to debug with the full-open "**" rule and leave it there, because that throws away half the boundary. The disciplined way to build an allowlist that lets installs through while keeping exfiltration out is covered in network policies for AI agent sandboxes. The primary reference for the policy engine is the Docker Sandboxes documentation.

A map of debug entry points

There are only a handful of ways into and out of a sandbox, and they compose cleanly. This is the whole surface you work with.

flowchart LR
  Host["Your machine (host)"]
  subgraph Entry["Debug entry points"]
    List["sbx ls: find the name"]
    Shell["sbx exec -it name bash: interactive shell"]
    Attach["sbx run name: reattach the loop sandbox"]
    NetLog["sbx policy log: connection history"]
    Stop["sbx stop name: end the session"]
  end
  subgraph Sandbox["Sandbox: ralph-claude-my-app-a1b2c3d4"]
    Proj["Project at the same path as the host"]
    Logs[".agent/logs, .agent/history, git log"]
    CLI["Agent CLI you can run by hand"]
    Net["Network gate: deny-by-default"]
  end
  Host --> List
  Host --> Shell
  Host --> Attach
  Host --> NetLog
  Host --> Stop
  Shell --> Proj
  Shell --> Logs
  Shell --> CLI
  Attach --> CLI
  NetLog --> Net

sbx ls gives you the name. sbx exec and sbx run put you inside. sbx policy log explains network failures from the outside. sbx stop ends the session. You rarely need anything else.

Clean up with sbx stop

When you finish a manual debugging session, stop the sandbox you were poking at:

sbx stop ralph-claude-my-app-a1b2c3d4

Stopping targets only that one name. Sandboxes you started for other agents in the same project are left alone. Ralph runs this same command for you through its exit trap when a loop ends, by normal exit, by a double Ctrl+C, or by any path that fires the trap, and the cleanup is guarded so it runs at most once.
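The "at most once" guard is a common shell pattern and worth copying into your own wrappers. The sketch below is not Ralph's trap; the variable and function names are hypothetical, and the sbx stop call is skipped when sbx is not on the PATH.

```shell
# A cleanup function guarded so it runs at most once, wired to an EXIT trap.
# Names are illustrative; this is not Ralph's actual trap.
SANDBOX_NAME="ralph-claude-my-app-a1b2c3d4"
CLEANED_UP=0
CLEANUP_RUNS=0

cleanup() {
  if [ "$CLEANED_UP" -eq 1 ]; then
    return 0                      # already ran: do nothing
  fi
  CLEANED_UP=1
  CLEANUP_RUNS=$((CLEANUP_RUNS + 1))
  if command -v sbx >/dev/null 2>&1; then
    sbx stop "$SANDBOX_NAME" || true   # stop only this one sandbox
  fi
}

trap cleanup EXIT
```

Setting the flag before calling sbx stop matters: if a second signal arrives while the stop is in flight, the re-entered handler returns immediately instead of stopping twice.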

Stopping is not deleting. A stopped sandbox can be reattached later with sbx run <name>, which is handy when you want to come back to the same state. When you are truly done and want the microVM gone, remove it explicitly:

sbx rm ralph-claude-my-app-a1b2c3d4

If sbx is not installed at all, Ralph fails fast before any of this with a clear pointer to the Docker Sandboxes getting-started guide rather than a confusing downstream error.

A debugging session, end to end

Putting the moves in order, here is what a real session looks like when a loop reports BLOCKED.

First, get the name and confirm the sandbox is there:

./ralph.sh --print-name
sbx ls

Shell in and go to the project path:

sbx exec -it ralph-claude-my-app-a1b2c3d4 bash
cd /Users/you/Work/my-app

Read what the agent recorded, then reproduce the failure it hit:

cat .agent/logs/LOG.md
git log --oneline -10
npm test

If the failure is a blocked install, check the network log from your host in another terminal, allow the domain, and retry the install inside the sandbox:

sbx policy log
sbx policy allow network ralph-claude-my-app-a1b2c3d4 registry.npmjs.org

If the failure is a task or prompt problem, run the agent CLI by hand and watch it work the task interactively. When you understand the cause, exit the shell, fix the task spec or the prompt on the host, and let the loop pick up where it left off. Ralph re-probes the sandbox on the next iteration, so your fix runs in the same contained environment.

This is the Ralph technique applied to its own failures. The loop, in the tradition of Geoffrey Huntley’s original Ralph writeup, keeps its state on disk so you can always step in, read the evidence, reproduce the problem, and step back out. The sandbox is contained, so debugging it costs you nothing on the host. The worst case stays cheap, and the agent gets back to work.

Frequently asked questions

How do I open a shell inside a running AI agent sandbox?

Find the sandbox name with sbx ls or with ./ralph.sh --print-name, then run sbx exec with the interactive flags and bash, for example sbx exec -it ralph-claude-my-app-a1b2c3d4 bash. You now have full control inside the microVM, so you can navigate to the project path, run tests, install packages, and inspect files like any container.

What is the difference between sbx exec and sbx run for debugging?

Use sbx exec with -it and bash to open an interactive shell in a sandbox that is already running. Use sbx run with the name to reattach the sandbox Ralph uses for a manual login or a longer session. The create form, sbx run --name <name> <agent> ., is create-only and only works the first time, before the sandbox exists.

Why does an install or API call fail inside the sandbox?

Docker Sandboxes block outbound network by default. Run sbx policy log to see which connections were denied, then allow the specific domain with sbx policy allow network using the sandbox name. Allow only the hosts the task needs, such as registry.npmjs.org or the comma-separated npm, PyPI, and GitHub domains, rather than opening all traffic.

Where does the agent store the logs I should read when it gets stuck?

Ralph keeps its state on disk inside the project, not in chat history. Read .agent/logs/LOG.md for the running log, list .agent/history for per-iteration logs, check .agent/tasks.json for the task table, and run git log to see what was committed. Together these usually show whether the agent thrashed on a task or stalled waiting on a decision.

How do I clean up a sandbox after debugging?

Run sbx stop with the sandbox name to stop only that sandbox; sandboxes for other agents are untouched. Stopping is not deleting, so you can reattach later with sbx run and the name. When you want the microVM gone for good, remove it with sbx rm. Ralph also stops its own sandbox automatically through its exit trap when a loop ends.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop


Network Policies for AI Agent Sandboxes

Thu, 21 May 2026 00:00:00 GMT

An agent needs some network and not all of it. A Docker Sandbox blocks outbound HTTP and HTTPS by default, so the agent starts with zero reach and you grant exactly the domains a task needs with sbx policy allow network. That is the whole network policy model for an AI agent sandbox: deny everything, then allowlist the package registries, code hosts, and APIs the work requires, and nothing else. This guide shows the commands, the matching rules, and a practical allowlist you can paste for a normal project. It is the network half of the Docker Sandboxes pillar guide.

Why a sandbox blocks the network by default

Filesystem isolation is only half of a sandbox boundary. The other half is the network. An agent with unrestricted outbound access can fetch arbitrary code, run a curl | bash it found in a README, and in the worst case send your source or secrets to a host you never approved. That last risk is the one that should keep you honest. The agent does not need to be malicious. It is a probabilistic system running shell commands, and a shell command that uploads a file looks exactly like a shell command that downloads one.

So Docker Sandboxes ship with a deny-by-default network posture. All outbound HTTP and HTTPS to domains not on the allowlist is blocked, and all non-HTTP protocols (raw TCP, UDP including DNS, and ICMP) are blocked at the network layer, as documented in the default security posture reference. Traffic to private IP ranges, loopback, and link-local addresses is blocked too, so the agent cannot reach back into your LAN or a cloud metadata endpoint.

The practical symptom is that npm install fails, a pip install hangs, or an API call is refused inside the sandbox. That is not a bug. That is the gate doing its job. You fix it by allowlisting the specific domains the task needs, not by opening everything. The point of the sandbox was to make the agent fearless inside a contained blast radius, and a wide-open network puts a hole in the wall you built.

Docker offers three starting policies, described in the policies documentation:

Open: all outbound traffic allowed, equivalent to a global allow-all rule. No restrictions.
Balanced: default deny with a baseline allowlist covering common AI provider APIs, package managers, code hosts, container registries, and cloud services. You extend it with sbx policy allow.
Locked Down: everything blocked, including model provider APIs like api.anthropic.com. You allow each host explicitly.

Most Ralph users sit on Balanced or stricter. If you chose Balanced, package installs from the big registries often work out of the box and you only add what is missing. If you chose Locked Down, you allow even the model API the agent talks to. Check which one you are on before debugging a blocked request.

How a request flows through the network gate

Every outbound connection the agent makes hits the policy gate before it leaves the sandbox. The gate compares the destination against your allow and deny rules, and if nothing matches, the default policy decides. On a deny-by-default install, nothing matching means blocked.

flowchart TD
  Agent["Agent in YOLO mode"] -->|"outbound request to a host"| Gate{"Network policy gate"}
  Gate -->|"matches a deny rule"| Block["Blocked: connection refused"]
  Gate -->|"matches an allow rule"| Pass["Allowed: leaves the sandbox"]
  Gate -->|"matches nothing (deny-by-default)"| Block
  Pass --> Net["Internet: only allowlisted hosts"]
  Block --> Log["Recorded in sbx policy log"]

The diagram is the mental model. The agent never talks to the internet directly. It talks to a gate, and the gate enforces a list you control from the host. You set that list once for a project and stop thinking about individual requests.

How to allow a domain with sbx policy

Grant access with sbx policy. Changes take effect immediately and persist across sandbox restarts. You target a single sandbox by passing its name, or every sandbox on the machine with the global flag -g. Get the name from sbx ls or from Ralph directly:

./ralph.sh --print-name
./ralph.sh --print-name --agent codex

Ralph builds a deterministic sandbox name in the form ralph-<agent>-<current-dir>-<hash8>, so a project at /Users/me/Work/my-app running Claude becomes ralph-claude-my-app-a1b2c3d4. Use that string wherever a command below shows a name.

Allow one domain for one sandbox

The base command takes a sandbox name and a domain:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 api.example.com

That sandbox can now reach api.example.com. Nothing else changed, and no other sandbox was affected.

Allow several domains at once

Pass a comma-separated list instead of a single host. This is the common case when a package install is blocked and you want to fix it in one command:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "*.npmjs.org,*.pypi.org,files.pythonhosted.org,github.com"

Quote the list so your shell does not try to glob the asterisks. Each entry is matched independently against outbound requests.
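The quoting advice is easy to verify without touching sbx. The sketch below creates a throwaway directory containing a file whose name happens to match the pattern, then compares what the shell passes along with and without quotes. The file name is made up for the demonstration.

```shell
# Show what the shell does to an unquoted asterisk pattern.
demo_dir="$(mktemp -d)"
touch "$demo_dir/cdn.npmjs.org"        # a file that happens to match the glob

# Unquoted: the shell expands the pattern to the matching file name.
unquoted="$(cd "$demo_dir" && echo *.npmjs.org)"
# Quoted: the pattern reaches the command untouched.
quoted="$(cd "$demo_dir" && echo "*.npmjs.org")"

rm -rf "$demo_dir"

printf 'unquoted: %s\n' "$unquoted"    # unquoted: cdn.npmjs.org
printf 'quoted:   %s\n' "$quoted"      # quoted:   *.npmjs.org
```

In a directory with no matching file most shells pass the pattern through unchanged, which is why the missing quotes can go unnoticed for a long time.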

Apply a rule to every sandbox with -g

When a domain is something every project on your machine needs (a model provider API, your internal package mirror), set it globally:

sbx policy allow network -g "api.anthropic.com,*.npmjs.org,*.pypi.org"

Global rules are convenient and also a bigger commitment, because they widen the network for sandboxes you have not thought about yet. Prefer per-sandbox rules for anything task-specific and reserve -g for genuinely universal hosts.

The allow-all escape hatch

You can open a single sandbox completely with the double-asterisk wildcard, which opts it out of network filtering:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "**"

Use "**" sparingly and on purpose. It is the right tool when you genuinely cannot enumerate the domains a task needs and you accept the tradeoff for that one sandbox. It is the wrong default, because it throws away the network half of the boundary you set up the sandbox to get. If you find yourself reaching for "**" on every run, that is a signal to capture the real allowlist instead, which the practical section below walks through.

Inspect, log, and revoke

You do not have to guess what the gate is doing. List the active rules:

sbx policy ls

Watch what the agent actually tried to reach, which is the fastest way to turn a blocked install into a precise allow rule:

sbx policy log

And remove access with the mirror of allow. A deny rule blocks a host even inside a broader allow:

sbx policy deny network ralph-claude-my-app-a1b2c3d4 ads.example.com
sbx policy deny network -g telemetry.example.com

The workflow that scales is: run the task, read sbx policy log for the refused hosts, add a tight allow rule for the ones the task legitimately needs, and re-run. After two iterations you usually have the exact allowlist a project needs, and you never touched "**".

How domain matching works

The matching rules are simple once you have seen them, and getting them wrong is the usual reason an allow rule appears to do nothing. The policies reference is the source of truth, and here is the working summary.

Exact domains versus wildcard subdomains

A bare domain matches only itself:

example.com matches example.com on any port. It does NOT match api.example.com.
*.example.com matches subdomains like api.example.com and cdn.example.com. It does NOT match the bare example.com.

To cover both the root and its subdomains, list both patterns:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "example.com,*.example.com"

This trips people up constantly. A registry that serves its index from the apex domain and its tarballs from a CDN subdomain needs both entries, or installs stall halfway through.
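The two rules above fit in a few lines of shell, which makes a handy sanity check before you add a rule. This is a sketch of the matching semantics as described here, not Docker's implementation: host_matches is a hypothetical name, ports are ignored, and the sketch assumes a wildcard covers subdomains at any depth, which the examples do not confirm.

```shell
# host_matches PATTERN HOST: succeed if HOST matches PATTERN.
# Assumptions: ports are ignored, and "*." matches subdomains at any depth.
host_matches() {
  case "$1" in
    '**')
      return 0 ;;                      # allow-all wildcard matches everything
    '*.'*)
      # "${1#?}" drops the leading "*", leaving ".example.com".
      case "$2" in
        "${1#?}")  return 1 ;;         # nothing before the dot: not a subdomain
        *"${1#?}") return 0 ;;         # ends with ".example.com": subdomain
        *)         return 1 ;;
      esac ;;
    *)
      [ "$1" = "$2" ] ;;               # bare domain: exact match only
  esac
}

for pair in "example.com example.com" "example.com api.example.com" \
            "*.example.com api.example.com" "*.example.com example.com"; do
  set -f; set -- $pair; set +f         # split the pair without globbing
  if host_matches "$1" "$2"; then verdict=match; else verdict="no match"; fi
  printf '%-16s vs %-16s %s\n' "$1" "$2" "$verdict"
done
```

The second and fourth pairs are the ones that report no match, which is exactly the gap that listing both patterns closes.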

Ports

Append a port to scope a rule to one port:

example.com:443 matches requests to example.com on port 443, the default HTTPS port.
*.example.com:443 matches any subdomain on port 443.
example.com with no port matches any port.

Non-HTTP TCP traffic such as SSH is blocked by default and can be allowed by adding a rule for the destination IP address and port, for example sbx policy allow network -g "10.1.2.3:22". UDP and ICMP are blocked at the network layer and cannot be unblocked with a policy rule, so do not expect to ping out or run a custom UDP protocol from inside the sandbox.

Most specific wins, and deny beats allow

When several patterns could match one request, the most specific pattern decides. The order runs from an exact hostname with a port, to an exact hostname on any port, to wildcard patterns (longest match first), to catch-all wildcards, and finally the default policy. That specificity is what lets you block a broad pattern while allowing a narrow exception. You can deny example.com and *.example.com, then allow api.example.com:443, and only that one host and port gets through.

For the case people ask about most: when an allow rule and a deny rule both match a request at the same specificity, the deny rule wins. Deny is the safer default to lose to, and it means you can layer a tight deny over a broad allow without worrying that the allow quietly re-opens the host.
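One way to internalize the precedence order is to rank patterns and compare ranks. The sketch below is a simplified model of the rules described here, not the policy engine itself: both function names are hypothetical, and the longest-match-first ordering among wildcards is not modeled.

```shell
# pattern_rank PATTERN: lower number means more specific.
# Simplified model; "longest wildcard match first" is not modeled.
pattern_rank() {
  case "$1" in
    '**')      echo 4 ;;   # catch-all wildcard
    '*.'*)     echo 3 ;;   # wildcard subdomain, with or without a port
    *:[0-9]*)  echo 1 ;;   # exact hostname with a port
    *)         echo 2 ;;   # exact hostname on any port
  esac
}

# allow_or_deny ALLOW_RANK DENY_RANK: the more specific rule wins,
# and deny wins when both match at the same specificity.
allow_or_deny() {
  if [ "$1" -lt "$2" ]; then echo allow; else echo deny; fi
}

# The example from the text: deny *.example.com, allow api.example.com:443.
allow_or_deny "$(pattern_rank 'api.example.com:443')" "$(pattern_rank '*.example.com')"
```

The final line prints allow, because an exact host with a port outranks the wildcard deny; swap in two rules of equal rank and the result is deny.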

Why this matters: installs versus exfiltration

The reason to bother with any of this is that the two things you want from the network point in opposite directions.

You want package installs to work. An autonomous agent that cannot run npm install, pip install, or cargo fetch is an agent that stalls on its first dependency and burns iterations doing nothing useful. Installs need the registries, the CDNs that serve the actual artifacts, and usually a code host for git dependencies.

You do not want exfiltration to work. The same outbound channel that pulls a tarball in can push a file out. An agent that has read your project (the whole point of the shared workspace) and unrestricted network can, in principle, POST your source or a leaked credential to any host. You are not assuming the model is hostile. You are removing the capability so a confused or prompt-injected agent cannot do the damage even by accident.

A tight allowlist resolves the tension. Installs flow to the registries and code hosts you named, and there is no open path to an arbitrary collection endpoint. The narrower the list, the smaller the surface. This is the same argument the pillar makes for the filesystem, applied to the wire: enforce the boundary from the outside, and keep the grant minimal. For the broader case on why every autonomous run belongs behind a boundary you control rather than a permission prompt the agent can blow past, the Docker Sandboxes pillar is the place to start.

A practical allowlist for a typical project

Here is a concrete starting point for a JavaScript or Python project that pulls dependencies and clones a few git repos. Get the sandbox name, then grant the registries and code host in one command.

For an npm-based project:

./ralph.sh --print-name
sbx policy allow network ralph-claude-my-app-a1b2c3d4 "registry.npmjs.org,*.npmjs.org,github.com,*.githubusercontent.com,codeload.github.com"

For a Python project add the PyPI hosts:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "pypi.org,*.pypi.org,files.pythonhosted.org"

A few notes on why those specific entries:

registry.npmjs.org is the index, and *.npmjs.org covers the related hosts npm reaches during an install. Listing both the apex and the wildcard avoids the half-finished install problem.
github.com handles git operations, codeload.github.com serves tarball downloads for git dependencies, and *.githubusercontent.com serves raw files and release assets. Git installs that work for the clone but fail on the download usually need codeload.github.com.
pypi.org is the index and files.pythonhosted.org is where the wheels and source distributions actually live, so PyPI needs both.
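If you set up sandboxes often, the lists above can live in a small function next to the loop. A minimal sketch, assuming the same hosts as the commands in this section; allowlist_for and the ecosystem labels npm and python are hypothetical names.

```shell
# Print the comma-separated allowlist for an ecosystem.
# Hosts are the ones used in this section; the function name is illustrative.
allowlist_for() {
  case "$1" in
    npm)
      echo "registry.npmjs.org,*.npmjs.org,github.com,*.githubusercontent.com,codeload.github.com" ;;
    python)
      echo "pypi.org,*.pypi.org,files.pythonhosted.org" ;;
    *)
      echo "unknown ecosystem: $1" >&2
      return 1 ;;
  esac
}

# Usage (commented out because it needs sbx and ./ralph.sh):
#   sbx policy allow network "$(./ralph.sh --print-name)" "$(allowlist_for npm)"
```

Keeping the list in one place also gives you something to review in a pull request when a task needs a new host.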

If your agent talks to a model API from inside the sandbox (you are on Locked Down, or you run the CLI by hand in there), allow that host too, for example api.anthropic.com for Claude. On a Balanced policy these are often already covered by the baseline allowlist, so check sbx policy ls before adding duplicates.

Then run the loop and walk away. The agent works in bypass-permissions mode, but the network gate is one of the walls that makes that safe:

./ralph.sh -n 50

When the loop finishes, Ralph stops the sandbox through its exit trap. Your allowlist persists with the sandbox, so the next run starts from the same known-good network posture instead of a blank deny.

Where network policy fits the rest of the boundary

A network policy is one wall, not the whole house. The sandbox also keeps the agent out of your home directory and off your host Docker daemon, which is the filesystem side of the same idea. If you are weighing a Docker Sandbox against a container you wire up yourself, the network gate is a good example of the difference, and Docker Sandboxes versus plain containers for AI agents covers where a hand-rolled setup tends to leak.

When a request is blocked and you are not sure why, get inside and look. The sandbox is a normal container from the inside, so you can reproduce the failing request, check how the hostname resolves, and watch sbx policy log from the host at the same time. The full routine for that lives in how to inspect and debug inside an AI agent sandbox.

Finally, the network you allow depends on which agent CLI you run and what it phones home to. Each tool reaches a slightly different set of hosts for auth, models, and telemetry, so the allowlist for a Codex run is not identical to a Gemini run. The agentic coding CLIs guide is the cross-hub companion for picking a CLI and knowing what it expects from the wire. Ralph itself is a hackable Bash loop in the tradition of Geoffrey Huntley’s original Ralph technique, so if a policy rule needs to live somewhere repeatable, it is plain shell you can script next to the loop.

Set the allowlist once, keep it minimal, and let the agent run. Installs go through, exfiltration does not, and the gate is something you can read in one sbx policy ls.

Frequently asked questions

What is the default network policy for a Docker Sandbox?

Deny-by-default for outbound HTTP and HTTPS. Any domain not on the allowlist is blocked, and non-HTTP protocols such as raw TCP, UDP (including DNS), and ICMP are blocked at the network layer. Docker also offers an Open policy that allows everything and a Locked Down policy that blocks even model provider APIs, so check which one your install uses before debugging a refused request.

How do I allow a domain for an AI agent sandbox?

Use sbx policy allow network with the sandbox name and the domain, for example sbx policy allow network ralph-claude-my-app-a1b2c3d4 api.example.com. Pass a comma-separated list to allow several at once, add the -g flag to apply the rule to every sandbox on the machine, and quote any pattern that contains an asterisk so your shell does not expand it.

Why does my allow rule for example.com not match api.example.com?

Because a bare domain matches only the exact host. example.com does not match its subdomains, and the wildcard form *.example.com does not match the bare root domain. To cover both the apex and its subdomains, allow both patterns in one rule, such as "example.com,*.example.com".

What happens when an allow rule and a deny rule both match a request?

The most specific matching pattern wins overall, running from an exact hostname with a port, to an exact hostname on any port, to wildcards, to catch-all rules, and finally the default policy. When an allow and a deny match at the same specificity, the deny rule wins, so you can layer a narrow deny over a broad allow safely.

How do I see which hosts the agent tried to reach?

Run sbx policy ls to list the active allow and deny rules, and sbx policy log to see the connection history including blocked attempts. Reading the log is the fastest way to turn a failed install into a precise allow rule, because it tells you exactly which hosts were refused so you can add only those.


Running Agents in YOLO Mode (--dangerously-skip-permissions) Safely

Fri, 15 May 2026 00:00:00 GMT

Yes, you can run an agent with --dangerously-skip-permissions safely, but only when the agent is locked inside a sandbox. The exact same flag is reckless on your laptop and a non-event inside a Docker Sandbox microVM. The flag never changes. What changes is what the agent can reach when it stops asking permission. This post explains what YOLO mode does, why you want it for unattended loops, and the one rule that makes "dangerously skip permissions, safely" an honest phrase instead of a contradiction.

What “YOLO mode” actually means

YOLO mode is any configuration where the agent stops asking you to approve commands and just executes them. No “can I run this?” prompt, no confirmation step, no human in the loop on each shell call. The agent reads a task, decides what to do, and does it.

That is the entire appeal and the entire risk in one sentence. The appeal is speed and autonomy. The risk is that you have removed the only gate between a probabilistic model and a real shell.

Claude Code: --dangerously-skip-permissions

In Claude Code the flag is --dangerously-skip-permissions. There is an equivalent longer form, --permission-mode bypassPermissions, documented in the Claude Code docs. Both do the same thing: the agent skips the per-action approval prompt for the rest of the session.

claude --dangerously-skip-permissions
claude --permission-mode bypassPermissions

The name is refreshingly honest. The word “dangerously” is in the flag because Anthropic wants you to feel a little sick typing it on a machine that has your credentials on it. That instinct is correct. The fix is not to avoid the flag. The fix is to change where you run it.

The same idea in other agents

Every serious agent CLI has its own version of this switch, because every long autonomous run needs it. The Codex CLI has a full-auto mode. Gemini, Cursor, Copilot, and opencode each expose an auto-accept or YOLO-style setting that turns off interactive approvals. The spelling differs per tool, and you do not have to memorize each one, because Ralph Loop launches each supported agent in bypass-permissions mode for you. Pass any extra agent-specific flags after a -- separator:

./ralph.sh --agent codex -- --model gpt-5.5
./ralph.sh -a gemini -- --model pro

Supported agents are claude (the default), codex, copilot, cursor, gemini, and opencode. The common thread across all of them is the same trade. You give up the approval prompt to get an agent that can work without you.

Why you want YOLO mode for unattended loops

Here is the case for turning the prompts off, because it is easy to read “dangerous” and conclude you should always approve commands by hand. You should not, and the reason is structural.

An autonomous loop runs the agent many times. A single iteration of a Ralph loop does real work: it finds the highest-priority incomplete task in .agent/tasks.json, works the steps in the task spec, runs tests, linting, and type checks, commits, and repeats. Across ./ralph.sh -n 50 that is fifty iterations, each one issuing dozens of shell commands. If every command needs your approval, the loop is not autonomous. It is a very slow pair-programming session where you are the bottleneck and you are asleep.

The whole point of running an agent overnight is that you are not there. A permission prompt at 3am is a permission prompt that blocks forever. The agent stalls on the first npm install waiting for a yes that never comes, and you wake up to a loop that did nothing for eight hours.

So unattended runs and per-command approval are mutually exclusive. You either babysit, or you go YOLO. For a long loop, babysitting defeats the purpose, which means YOLO mode is not an edge case. It is the normal operating mode for any agent meant to run while you are away. If you want the deeper version of this argument applied to a single CLI, the run Claude Code in a loop guide walks through the same trade for autonomous Claude Code runs.

The one rule: only go YOLO when the blast radius is contained

Now the rule that makes all of this safe. Run YOLO mode only when the blast radius is contained. The blast radius is everything the agent can touch when it stops asking. Your job is to make that set small, disposable, and external to anything you care about.

A permission prompt is not a security boundary. When the agent asks “can I run this command?” and you click yes, you are the boundary, and you are slow, distracted, and often not awake. When you tell the agent to stop asking, you have not made the agent safe. You have removed the only thing standing between the model and your filesystem. The real fix is to put a wall around the agent that the agent cannot reach through.

On your host, that wall does not exist. An agent running with your user account can read your SSH keys, your ~/.aws/credentials, your browser session cookies, and every other git repository you have checked out. It can push to remotes, delete files, and run a curl | bash it found in a README. None of that requires malice. It is a probabilistic system executing shell commands, and shell commands do not have an undo. For the longer version of why every autonomous run belongs behind a wall, read why you should sandbox every autonomous coding agent.

Inside a Docker Sandbox, the wall is real. A Docker Sandbox is not a plain docker run. It is a lightweight microVM with its own guest kernel, documented in the Docker Sandboxes docs. The agent sees your project directory and a network gate. It does not see the rest of your machine. Run --dangerously-skip-permissions in there and “anything the agent can do” now means “anything it can do inside a microVM that only contains one project directory and a locked-down network.” The dangerous flag becomes a non-event, because the danger had a target on the host and the sandbox removed the target.

This is the inversion worth internalizing. You are not making the agent trustworthy. You are making trust unnecessary by shrinking the blast radius to something you can reset with git or throw away entirely.

flowchart TB
  subgraph Laptop["YOLO on your laptop (host)"]
    HAgent["Agent: --dangerously-skip-permissions"]
    HAgent --> Keys["SSH keys, cloud creds, cookies"]
    HAgent --> Repos["Every other git repo"]
    HAgent --> Net1["Open network: fetch any URL"]
  end
  subgraph Box["YOLO inside a Docker Sandbox"]
    SAgent["Agent: --dangerously-skip-permissions"]
    SAgent --> Proj["Shared project directory only"]
    SAgent --> Gate["Network gate: deny-by-default"]
    SAgent -. "no path to" .-> HostHome["Host home directory"]
  end

Same flag in both boxes. The only difference is the box. On the left, “skip permissions” means skip the last protection your machine had. On the right, it means move fast inside a container that contains nothing precious.

How Ralph runs YOLO mode inside a named sandbox

You can wire this up by hand: create a microVM, launch the agent with the bypass flag, and tear it down when you are done. The point of Ralph is that you do not have to. It computes the sandbox name, checks that the sbx CLI exists, decides whether to create or attach, launches the agent in bypass-permissions mode, and stops the sandbox on exit. Ralph is a hackable Bash loop in the tradition of Geoffrey Huntley’s original Ralph technique, and the YOLO plumbing is part of what it automates.

Answering “Yes” to bypass permissions inside the sandbox

The first time you launch an agent in bypass mode, the CLI shows a one-time warning that asks you to confirm you really want to skip permission prompts. With Ralph, you hit that confirmation inside the sandbox, not on your host, which is exactly where you want to be saying yes. Authenticate and accept the bypass warning in one step with the login action:

./ralph.sh --login
./ralph.sh --login --agent codex

This opens the selected agent inside its correctly named sandbox. You answer the bypass-permissions prompt once, in the microVM, and from then on the agent runs without prompts inside that contained environment. Saying yes to “skip permissions” feels very different when the worst case is a disposable container instead of your home directory.

The deterministic sandbox name is the boundary

Ralph names every sandbox deterministically so the same project and agent pair always reuses the same microVM. The format is:

ralph-<agent>-<current-dir>-<hash8>

The <agent> slug is the lowercased agent name, <current-dir> is the sanitized basename of your project directory, and <hash8> is the first 8 hex characters of a sha256 of the absolute project path. A project at /Users/me/Work/My App running Claude becomes ralph-claude-my-app-a1b2c3d4. The path hash keeps two same-named directories on different paths from colliding, and the agent slug in the name means switching --agent gives you a separate sandbox rather than a surprising state swap.
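A rough sketch of how a name in that format could be derived. The sanitization rule (lowercase, then collapse anything outside a-z and 0-9 into a hyphen) and the exact bytes fed to sha256 are assumptions, so Ralph's real output may differ; ./ralph.sh --print-name stays the source of truth. The function name sandbox_name is hypothetical.

```shell
# Sketch only: derive ralph-<agent>-<current-dir>-<hash8> from an agent
# name and an absolute project path.
sandbox_name() {
  local agent="$1" project_path="$2"
  local agent_slug dir_slug hash8

  # Lowercase the agent name.
  agent_slug="$(printf '%s' "$agent" | tr '[:upper:]' '[:lower:]')"

  # Assumed sanitization: lowercase the basename, replace runs of other
  # characters with a hyphen, and trim leading or trailing hyphens.
  dir_slug="$(basename "$project_path" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')"

  # First 8 hex characters of a sha256 of the path (no trailing newline).
  if command -v sha256sum >/dev/null 2>&1; then
    hash8="$(printf '%s' "$project_path" | sha256sum | cut -c1-8)"
  else
    hash8="$(printf '%s' "$project_path" | shasum -a 256 | cut -c1-8)"
  fi

  printf 'ralph-%s-%s-%s\n' "$agent_slug" "$dir_slug" "$hash8"
}

sandbox_name Claude "/Users/me/Work/My App"
```

The a1b2c3d4 suffix used in the examples is a placeholder, so the hash printed here will not match it.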

You never have to memorize it. Ralph prints the name on its startup line, and you can ask for it directly:

./ralph.sh --print-name
./ralph.sh --print-name --agent cursor

That name is the handle for everything else: the sandbox you shell into, the sandbox you allowlist a domain for, and the sandbox Ralph stops on exit.

Create versus attach, every iteration

Each iteration of the loop probes whether the deterministic sandbox already exists. Iteration one usually creates it. Iteration two and onward attach to it. If you manually sbx rm the sandbox mid-run, the next probe simply creates it again. That re-probe is what makes a long YOLO run resilient to you poking at the sandbox by hand.

flowchart TD
  Start["./ralph.sh -n 50"] --> Name["Compute name: ralph-agent-dir-hash8"]
  Name --> CheckSbx{"sbx installed?"}
  CheckSbx -->|"no"| Fail["Exit with docs link"]
  CheckSbx -->|"yes"| Iter["Start iteration"]
  Iter --> Probe{"sandbox exists?"}
  Probe -->|"no"| Create["sbx run --name ... agent ."]
  Probe -->|"yes"| Attach["sbx run name"]
  Create --> Work["Agent works one task in YOLO mode"]
  Attach --> Work
  Work --> Signal{"completion signal?"}
  Signal -->|"COMPLETE"| Done["Exit 0, then sbx stop"]
  Signal -->|"keep going"| Iter

The loop stops on an explicit signal, not a vibe. The agent emits a promise tag: <promise>COMPLETE</promise> when all tasks are done, <promise>BLOCKED:reason</promise> when it needs human help, or <promise>DECIDE:question</promise> when it needs a decision. Those map to exit codes 0, 2, and 3, with 1 reserved for hitting the iteration cap. When the loop ends, Ralph stops the sandbox through its exit trap and hands you back a contained microVM you can inspect or discard.
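Those exit codes are easy to act on from a wrapper script. A minimal sketch, using only the mapping described above; the function name describe_exit is hypothetical and the ralph.sh call is commented out so the snippet runs on its own.

```shell
# Translate a ralph.sh exit code into a human-readable outcome.
describe_exit() {
  case "$1" in
    0) echo "COMPLETE: all tasks done" ;;
    1) echo "iteration cap reached" ;;
    2) echo "BLOCKED: agent needs human help" ;;
    3) echo "DECIDE: agent needs a decision" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Typical use after a run:
# ./ralph.sh -n 50
# describe_exit "$?"

describe_exit 2   # prints BLOCKED: agent needs human help
```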

A safe YOLO run, end to end

Putting the pieces in order, here is what a clean bypass-permissions run looks like. None of these steps put the agent’s hands on your host.

First, install Ralph in your project:

npx @pageai/ralph-loop

Next, authenticate the agent inside its sandbox and accept the bypass-permissions warning once, in the microVM:

./ralph.sh --login

If the task needs network access for installs, allowlist only the domains it needs instead of opening everything. Get the name, then add a rule:

./ralph.sh --print-name
sbx policy allow network ralph-claude-my-app-a1b2c3d4 "*.npmjs.org,github.com"

The network gate matters as much as the filesystem boundary when the agent runs unattended in YOLO mode. A tight allowlist lets installs through and keeps exfiltration out. The full treatment of building that allowlist lives in network policies for AI agent sandboxes.

Then run the loop and walk away. The agent runs with permissions bypassed, but the sandbox is the boundary, so fast and contained are the same thing:

./ralph.sh -n 50

If something looks off mid-run, shell in and look. The sandbox is a normal container from the inside:

sbx exec -it ralph-claude-my-app-a1b2c3d4 bash

When the loop finishes, Ralph stops the sandbox for you. The agent skipped every prompt and ran at full speed, and it never had its hands on your host. This is the default posture Ralph ships, and the broader setup is covered in the pillar guide on running AI coding agents in Docker Sandboxes safely.

Where YOLO mode still bites

A sandbox is a strong boundary and not a magic one. Two limits are worth naming so you do not over-trust the setup.

First, the project directory is shared, which is the entire point, so the agent can absolutely wreck your working tree. The protection there is git, not the sandbox. Commit often, work on a branch, and treat the sandbox as protection for everything outside the project rather than a substitute for version control inside it. A YOLO agent that thrashes on a bad task can still produce a messy diff. The fix is the same as for any agent: clean tasks, frequent commits, and a verification gate of tests, type checks, and linting on every iteration.
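One way to put that advice into practice is to give every run its own branch. This is a sketch rather than part of Ralph: the ralph/run- prefix and the helper name run_branch_name are made up for illustration, main is assumed to be your base branch, and the git and ralph.sh calls are commented out so the snippet is safe to paste.

```shell
# Hypothetical helper: a timestamped branch name for one loop run.
run_branch_name() {
  printf 'ralph/run-%s\n' "$(date +%Y%m%d-%H%M%S)"
}

branch="$(run_branch_name)"
echo "$branch"

# Start the run on its own branch, then review the result afterwards:
# git switch -c "$branch"
# ./ralph.sh -n 50
# git diff main..."$branch"
```

If the run thrashes, deleting the branch discards the mess without touching your base branch.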

Second, whatever you allowlist on the network is genuinely reachable. If you grant a domain that can receive uploads, an agent in YOLO mode could in principle send data there. A "**" allow-all rule throws the whole network surface wide open and undoes half the reason you built the sandbox. Treat network grants like firewall rules: minimal, specific, and reviewed.

Inside those limits, the model is simple and it holds. Skip permissions only when the blast radius is a disposable microVM, enforce the boundary from the outside, and let the agent be fearless where fearless is cheap.

Frequently asked questions

Is it safe to run an agent with --dangerously-skip-permissions?

It is unsafe on your host and safe inside a Docker Sandbox. On the host the flag gives the agent full shell access with no confirmation, so a single bad command can touch your SSH keys, credentials, or other repositories. Inside a microVM sandbox the same flag only grants access to the shared project directory and an allowlisted network, so the worst case is something you can reset with git or throw away.

What is YOLO mode for an AI coding agent?

YOLO mode is any configuration where the agent stops asking you to approve each command and just executes. In Claude Code it is the flag --dangerously-skip-permissions, also written as --permission-mode bypassPermissions. Other agents have their own full-auto or auto-accept setting that does the same thing.

Why do unattended agent loops need bypass-permissions mode?

A long loop issues many shell commands per iteration. If every command needs your approval, the loop blocks the moment you step away and does nothing until you return. Bypass-permissions mode is what lets an agent run overnight without a human approving each action.

Does Ralph Loop run agents in YOLO mode by default?

Yes. Ralph launches each supported agent in bypass-permissions mode because the Docker Sandbox is the boundary. You accept the one-time bypass warning inside the sandbox during ./ralph.sh --login, and every iteration after that runs without prompts inside the contained microVM.

Can a sandboxed YOLO agent still cause damage?

It can damage the shared project directory, since that directory is shared on purpose, so commit often and work on a branch. It can also reach any network domain you allowlist, so keep the allowlist minimal and avoid the allow-all double-star rule. It cannot reach the rest of your host, which is the boundary the sandbox enforces.

Spec-Driven Development vs Vibe Coding: When Each One Wins

Sun, 10 May 2026 00:00:00 GMT

Vibe coding wins when the code is disposable. Spec-driven development wins when the code has to survive: production systems, code other people maintain, and anything an autonomous agent builds while you are asleep. The split is not about taste. It is about whether the work has verifiable acceptance criteria or just a feeling that the output looks right.

This post defines both honestly, including where vibe coding is genuinely the correct choice. Then it covers why an autonomous loop has no choice but to go spec-driven, what vibe coding actually costs once it scales past one person and one afternoon, and how to pick per task instead of picking a religion. It ends with how Ralph leans spec-driven through a PRD and a task list, without forcing you to spec a five-line script.

What vibe coding actually is

Vibe coding is writing code by feel. You describe what you want in loose terms, you read the output, you eyeball whether it looks correct, and you keep nudging until it runs. The acceptance test is your gut. The spec lives in your head and changes as you go.

For an AI workflow, vibe coding means prompting an agent with intent and judging the result by reading it. “Add a settings page.” “Make this faster.” “Fix the layout on mobile.” You accept the diff because it looks plausible and the page renders, not because a test confirmed a specific condition.

This is not an insult. Vibe coding is the fastest way to explore an idea you do not fully understand yet. When the goal is to learn the shape of a problem, writing a spec first is premature: you do not know enough to write a good one. You vibe a spike, see what the problem actually wants, then throw the spike away.

Where vibe coding is genuinely fine:

Spikes and prototypes you intend to delete.
One-off scripts: a data migration you run once, a quick scraper, a throwaway chart.
Demos and hackathon code where shipping in an hour beats shipping correctly.
Exploratory work where the spec would just be a guess anyway.
Personal tools with exactly one user who is also the author.

The common thread is that nobody inherits the code. There is no future maintainer, no on-call engineer, no agent that has to extend it next week. When the blast radius of a mistake is your own afternoon, the overhead of a spec is not worth it. Vibe away.

What spec-driven development actually is

Spec-driven development inverts the order. You write the specification first, then the code exists to satisfy it. The spec is not a wishlist. It is goals (what to build and why), constraints (the stack, the boundaries, what is out of scope), and verifiable acceptance criteria (conditions something can check by running a command and reading the output).

The phases come from GitHub Spec Kit: specify the intent, plan the approach, break it into tasks, then implement and verify. Each phase produces an artifact the next phase consumes. The full version of this workflow, with PRDs, task lists, and breakdown, is laid out in spec-driven development with AI.

The defining property is that “done” is not a feeling. A spec-driven task is done when a specific, checkable condition is true. “Login works” is a vibe. “POST /api/login with a wrong password returns 401 and the body { error: 'Invalid credentials' }” is a criterion a machine can confirm without your opinion. That single difference is what makes the rest of this comparison fall out.
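To make the contrast concrete, here is a sketch of that login criterion as a check a machine can run. The function name login_rejected, the JSON shape of the body, and the localhost URL and payload in the commented curl call are all assumptions for illustration.

```shell
# Pass only if the status is 401 and the body carries the expected error.
# Assumption: the server replies with JSON such as
# {"error":"Invalid credentials"}.
login_rejected() {
  status="$1"
  body="$2"
  [ "$status" = "401" ] &&
    printf '%s' "$body" |
      grep -Eq '"error"[[:space:]]*:[[:space:]]*"Invalid credentials"'
}

# Against a running server (URL and payload are placeholders):
# status="$(curl -s -o /tmp/login.json -w '%{http_code}' -X POST \
#   -H 'Content-Type: application/json' \
#   -d '{"email":"a@example.com","password":"wrong"}' \
#   http://localhost:3000/api/login)"
# login_rejected "$status" "$(cat /tmp/login.json)"

if login_rejected 401 '{"error":"Invalid credentials"}'; then echo pass; else echo fail; fi
if login_rejected 200 '{"ok":true}'; then echo pass; else echo fail; fi
```

The first call prints pass and the second prints fail, with no opinion involved in either result.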

Why an autonomous agent needs a spec

Here is the part that decides the whole debate for anyone running agents in a loop. A human can vibe code because a human carries the unwritten spec. You know the system. You remember the edge case from last month. You ask a teammate when something is ambiguous. An autonomous agent does none of that.

An agent reads what is on disk. When the spec is silent, it does not pause and ask. It invents an answer and proceeds with full confidence. The guess looks fine in the diff and breaks on the case nobody wrote down. Multiply that across a long run and you get a pile of confident, untested code that drifts further from your intent on every iteration.

The Ralph technique makes this concrete. Each iteration starts the agent with a fresh context window, which is what keeps a long run from rotting (the agent losing the plot over a marathon session). The mechanics are covered in what is the Ralph technique, the loop popularized by Geoffrey Huntley in his original Ralph writeup. Fresh context is the reason the loop survives. It is also the reason vibe coding cannot drive it.

Think about what fresh context implies. The agent that starts iteration 30 has no memory of the conversation from iterations 1 through 29. It rebuilds its entire understanding from files: the PRD, the task list, the logs, the git history. There is no “you know what I meant” to fall back on. If the intent is not written down, it does not exist for that iteration’s agent.

So the loop needs the spec for two reasons:

No guessing. Written goals and constraints mean the fresh-context agent reorients to the same target every pass instead of inventing a new one.
A stop condition. The loop ends on an explicit completion signal, not a vibe. The agent emits <promise>COMPLETE</promise> when every task passes its criteria, and ralph.sh exits with code 0. Without verifiable criteria there is nothing for the loop to check, so it can never honestly say it is done.

That is the core argument. You cannot run an agent unattended against “make it good.” You can run it against a task list where every task carries acceptance criteria the verification stack can confirm.

What vibe coding costs at scale

Vibe coding feels free because the cost is deferred. It shows up later, somewhere else, and usually larger. Three costs dominate once the work outgrows a single afternoon.

Rework. Code accepted on a vibe gets rejected on a test you write three weeks later, or worse, by an incident. The fix is rarely a one-line change, because the original code encoded a misunderstanding, not a typo. You are not patching a bug. You are re-deriving the requirement that was never written down, then rewriting against it. Specifying first front-loads that thinking when it is cheap.

Drift. Every implicit decision is a decision someone, or some agent, makes for you. Vibe coding scatters those decisions across files and weeks. One place validates email with a regex, another with a library, a third not at all. There was never a source of truth, so the codebase has three answers to one question. With a spec, the answer is written once and every task inherits it.

Unverifiable output. This is the expensive one for AI workflows. If you cannot state how to check that the work is correct, you cannot automate the check, which means a person has to read every diff and judge it by hand. That person becomes the bottleneck. The whole promise of an autonomous loop is that the machine verifies its own work. Vibe coding removes the thing the machine would verify against, so you are back to manual review at the exact moment you wanted to step away.

The repo mantra is blunt: if you didn’t test it, it doesn’t work. Vibe coding at scale is a bet that you will remember every unwritten assumption and that nobody else will touch the code. Both halves of that bet lose over time.

How to pick per task

The honest answer is that you do not pick one approach for your whole life. You pick per task, and the deciding questions are quick.

flowchart TD
  Start["New piece of work"] --> Survive{"Does the code survive past today?"}
  Survive -->|"No, it is a spike or demo"| Vibe["Vibe code it, delete it later"]
  Survive -->|"Yes"| Criteria{"Can you write pass or fail acceptance criteria?"}
  Criteria -->|"Not yet, too fuzzy"| Spike["Vibe a spike first, then spec what you learned"]
  Criteria -->|"Yes"| Agent{"Will an autonomous agent build it unattended?"}
  Agent -->|"No, you are at the keyboard"| Mixed["Spec the risky parts, vibe the glue"]
  Agent -->|"Yes"| Spec["Spec-driven: PRD, tasks, criteria, verify"]

Walk the branches:

Does it survive past today? If the answer is no, stop reading and vibe it. A migration script you run once does not need a PRD.
Can you write pass or fail criteria? If the problem is still fuzzy, you do not know enough to spec well. Vibe a spike to learn the shape, then write the spec from what you learned. The spike was reconnaissance, not the deliverable.
Will an agent build it unattended? This is the hard cutoff. The moment a fresh-context agent is doing the work without you watching, you need the spec. There is no in-between, because the agent has no judgment to fall back on and no way to ask you mid-iteration.
You at the keyboard, code that survives? This is the common case, and it is mixed. Spec the parts where a wrong guess is expensive (auth, money, data integrity, public APIs). Vibe the glue where a mistake is cheap and obvious. You do not need acceptance criteria for a button’s hover color.

The mistake is treating this as ideology. Spec-driven purists waste hours writing criteria for code they will delete. Vibe coding diehards ship confident nonsense to production. The discipline is matching the rigor to the stakes.

How Ralph leans spec-driven

Ralph is built for the unattended case, so it leans spec-driven by design. It does not force you to write a PRD for a one-line fix, but the loop architecture assumes a spec exists when you run it for real.

The spec lives in two files under .agent/prd/. PRD.md is the full document: goals, constraints, core features, technical stack, security considerations, and assumptions. SUMMARY.md is the short executive overview sent to the agent every iteration so it reorients fast without rereading the entire PRD. Long document for depth, short summary for the working context on each pass.

You do not write all of this by hand. The prd-creator skill, run in plan mode, interviews you one question at a time, researches the problem, and writes both files plus a task list. The details of writing that document, and acceptance criteria an agent can actually verify, are in how to write a PRD an AI agent can actually build from.

The PRD then decomposes into a task lookup table. tasks.json holds the list, and each tasks/TASK-{ID}.json carries its own description, dependencies, and an acceptanceCriteria array. This is the structure that lets a loop grind through hundreds of tasks without losing track, covered in task lookup tables for agents.

npx @pageai/ralph-loop
./ralph.sh -n 50

Each iteration the loop finds the highest-priority incomplete task, works its steps, runs tests and linting and type checking, takes a screenshot, flips the task to passing only when the gate is green, and commits. The verification stack is Playwright, Vitest, TypeScript, ESLint, and Prettier. The agent does not get to declare victory on a vibe. It declares victory when the criteria pass, and the loop stops on the completion promise.

That is the whole point of the comparison in one workflow. The spec is what lets the machine verify its own output, which is what lets you close the laptop. Vibe coding is the right tool when nobody inherits the code. The moment an agent inherits it, the spec stops being overhead and starts being the only thing keeping the loop honest.

Where to go next

If you are deciding how much structure a piece of work deserves, read across the spec-driven cluster:

Spec-driven development with AI for the full Specify, Plan, Tasks, Implement workflow.
How to write a PRD an AI agent can actually build from for the document that sits at the top of it.
Task lookup tables for agents for scaling a spec to hundreds of tasks.

For the loop that reads your spec on every pass and why fresh context changes the rules, read what is the Ralph technique.

Frequently asked questions

What is the difference between spec-driven development and vibe coding?

Vibe coding writes code by feel: you prompt with loose intent, read the output, and accept it because it looks plausible and runs. Spec-driven development writes the specification first, so the code exists to satisfy concrete goals, explicit constraints, and verifiable acceptance criteria. The defining difference is whether done is a feeling or a checkable condition a machine can confirm by running a command and reading the output.

Is vibe coding ever the right choice?

Yes, when the code is disposable and nobody inherits it. Spikes you intend to delete, one-off scripts, demos, hackathon code, and personal tools with one user are all fine to vibe. The overhead of a spec is not worth it when the blast radius of a mistake is your own afternoon. Vibe a spike to learn a fuzzy problem, then write the spec from what you learned.

Why do autonomous AI agents need a spec instead of vibe coding?

An autonomous agent reads only what is on disk, and a Ralph loop starts each iteration with a fresh context window that has no memory of previous iterations. When the spec is silent, the agent invents an answer and proceeds with full confidence, and that guess compounds across a long run. Verifiable acceptance criteria give the loop something to check and a real stop condition, so it can honestly signal completion instead of guessing it is done.

What does vibe coding cost once a project scales?

Three things: rework, because code accepted on a vibe encodes a misunderstanding you later re-derive and rewrite; drift, because every implicit decision gets answered differently across files with no source of truth; and unverifiable output, because if you cannot state how to check correctness, a person has to read every diff by hand. That manual review is the exact bottleneck an autonomous loop is supposed to remove.

How does Ralph use spec-driven development?

Ralph keeps the spec in .agent/prd/PRD.md and a short SUMMARY.md sent to the agent each iteration, generated by the prd-creator skill in plan mode. The PRD decomposes into a task lookup table where each TASK-ID.json carries acceptance criteria. Every iteration the loop works one task, runs Playwright, Vitest, TypeScript, ESLint, and Prettier, and only marks the task passing when the gate is green, then commits and stops on a completion promise.

Docker Sandboxes vs Plain Containers for AI Agents

Tue, 05 May 2026 00:00:00 GMT

A plain container is a start, not a boundary. Running an autonomous coding agent inside docker run does isolate it from your host filesystem, which is already better than running the agent directly on your laptop. The gap is that a normal container shares your host kernel, so the wall between an agent executing arbitrary shell commands and your machine is thinner than most people assume. Docker Sandboxes close that gap by giving each sandbox its own microVM with a separate guest kernel, and that is the difference that matters when you compare a Docker Sandbox with a plain container for an AI agent.

If you want the full safety argument first, the pillar guide on how to run AI coding agents in Docker Sandboxes safely covers the whole boundary. This post zooms in on a narrower set of questions: what does a hand-rolled container actually give you, where does it leak, and why does a microVM hold where a container does not?

Container vs sandbox, the short answer

The decision comes down to what you are isolating from and how much you trust the workload.

A container is the right tool for software you trust. You built it, you reviewed it, and you are shipping it. Namespaces and cgroups keep processes tidy and resource bounded, and the shared kernel is a feature because it is fast and cheap.

A microVM is the right tool for software you explicitly do not trust to behave. An autonomous agent running in bypass-permissions mode is exactly that. It decides which commands to run, it can curl | bash something it found in a README, and it does not have an undo. For that workload you want a virtualization boundary, not just a kernel namespace.

So the short answer: use a container when the code is yours and reviewed, use a sandbox when a probabilistic system is executing shell commands on your behalf. The rest of this post is the detail behind that rule.

What a hand-rolled container actually gives you

Start with the honest version of the container story, because containers are useful and a docker run is genuinely a step up from nothing.

A container gives the agent its own process namespace, its own mount namespace, its own network namespace, and a cgroup that can cap CPU and memory. The agent sees its own process tree, not yours. It sees the filesystem you mounted, not your whole home directory by default. That isolation is real, and for a lot of workloads it is enough.

Here is a minimal version of what people reach for first:

docker run -it --rm \
  -v "$PWD":/work -w /work \
  node:22 bash

This drops you into a container with your project mounted at /work. The agent can edit those files, run installs, and run tests. Your ~/.ssh and ~/.aws are not mounted, so they are not visible. For a quick experiment that is a reasonable boundary.

The problem is that the defaults of a plain container were designed for running trusted software conveniently, not for caging an untrusted process. So the moment you run a real agent for hours, the soft spots start to matter.

Where the container leaks

There are four leaks worth naming, because each one is a place a long autonomous run can go wrong.

First, root by default. Unless you say otherwise, the process inside the container runs as root (uid 0). That root is confined by namespaces and a reduced capability set, but unless user-namespace remapping is enabled it is the same uid 0 as host root, which widens what a container escape or a misconfigured mount can reach. An agent that runs chmod, chown, or installs system packages as root is one surprise closer to your host than it needs to be.

Second, the shared kernel. This is the big one. Every container on the machine, including the agent’s, talks to the same host kernel you are using. Namespaces are a kernel feature that isolates a process from other processes; they are not a separate operating system. A kernel-level bug, a privileged syscall, or a known container escape has your actual kernel as its target. You are trusting the kernel boundary to hold against code you specifically decided not to trust.

Third, bind mounts. The convenient -v "$PWD":/work is a two-way door. The agent writes to your real working tree on the host, which is the point, but it also means a bad rm -rf or an overzealous “let me clean this up” reaches your real files, not a copy. People also tend to mount more than they should over time (a config directory here, a credentials file there), and each added mount is another path out of the box. Mounting the Docker socket (/var/run/docker.sock) is the worst case, because a process that can talk to the Docker daemon can start a new container as host root and own the machine.

Fourth, the network. By default a container gets full outbound network access through the default bridge. An agent with unrestricted egress can fetch arbitrary code and, in the worst case, send data out. There is no allowlist unless you build one, and building one with raw containers means writing firewall rules by hand.

None of these are reasons to never use a container. They are reasons to not assume docker run is a security boundary for an untrusted agent. For the broader case on why every autonomous run needs a wall around it, see why you should sandbox every autonomous coding agent.

How Docker Sandboxes differ

Docker Sandboxes (the sbx CLI) are built for the untrusted workload case, and the Docker Sandboxes documentation is the primary source for the internals. Three differences matter most for agents.

A microVM per sandbox

Each sandbox runs inside its own lightweight virtual machine with its own guest kernel, not a namespaced process sharing your host kernel. That single change upgrades the boundary from “trust the kernel namespaces to hold” to “an escape has to defeat a virtualization layer.” A process that escapes the agent’s view inside the sandbox lands in a guest kernel that is not your kernel, with a virtual machine monitor between it and your host.

For the agent this is invisible. It still gets a Linux environment, a shell, a filesystem, and the tools it expects. The isolation is underneath, where the agent cannot see it and does not need to.

flowchart TB
  subgraph PlainPath["Plain container"]
    direction TB
    PApp["Agent process, often root"]
    PNS["Namespaces + cgroups"]
    PKernel["Shared host kernel"]
    PHost["Host: files, keys, network"]
    PApp --> PNS --> PKernel --> PHost
  end
  subgraph SbxPath["Docker Sandbox (sbx)"]
    direction TB
    SApp["Agent process in YOLO mode"]
    SGuest["Guest kernel (microVM)"]
    SVMM["Virtual machine monitor"]
    SHost["Host: isolated, project dir shared only"]
    SApp --> SGuest --> SVMM --> SHost
  end

Read the two stacks top to bottom. In the plain container, the agent process reaches the host across one shared kernel. In the sandbox, the agent process reaches the host only after getting through a guest kernel and a virtual machine monitor, and the host shares nothing but the one project directory you pointed it at.

Standalone, no Docker Desktop, seconds to spin up

Docker Sandboxes run as a standalone CLI. You do not need Docker Desktop running to use sbx, which keeps the dependency surface small and makes it easy to script. A sandbox spins up in seconds, so the microVM is not a heavy thing you provision once and babysit. It is closer to the ergonomics of docker run: you ask for it, you get it quickly, you throw it away when you are done.

That speed matters for a loop. A tool like Ralph Loop creates or reattaches a sandbox on every iteration, so a slow boundary would tax every pass. Fast startup is what makes “isolate every run” practical instead of aspirational.

Deterministic names and a network gate

Two more things come for free with the sandbox model that you would otherwise build by hand.

Sandboxes get deterministic names so the same project and agent pair reuses the same microVM. Ralph uses the form ralph-<agent>-<current-dir>-<hash8>, for example ralph-claude-my-app-a1b2c3d4. You list them with sbx ls, shell into one with sbx exec -it <name> bash, and reattach with sbx run <name>. No tracking container IDs by hand.
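To make the naming scheme concrete, here is a minimal sketch of how a name of the form ralph-<agent>-<current-dir>-<hash8> could be derived. The hash input is an assumption: this sketch hashes the absolute project path with SHA-256 and keeps the first eight hex digits, and Ralph's actual script may compute the suffix differently. The function name sandbox_name is hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def sandbox_name(agent: str, project_dir: str) -> str:
    """Build a name shaped like ralph-<agent>-<current-dir>-<hash8>.

    Assumption: the 8-character suffix is the first 8 hex digits of a
    SHA-256 over the absolute project path. The real hash input may differ.
    """
    dir_name = PurePosixPath(project_dir).name
    hash8 = hashlib.sha256(project_dir.encode("utf-8")).hexdigest()[:8]
    return f"ralph-{agent}-{dir_name}-{hash8}"


name = sandbox_name("claude", "/home/dev/my-app")
print(name)
```

The property that matters is determinism: the same agent and directory always produce the same name, so a later iteration can reattach instead of creating a new microVM.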

The network is a gate, not an open door. Docker Sandboxes default to deny-by-default egress, and you allowlist exactly what a task needs:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "*.npmjs.org,github.com"

You can apply a rule globally with -g, and the "**" pattern opens a single sandbox fully when you genuinely cannot enumerate the domains. The full treatment of building an allowlist that lets installs through while keeping exfiltration out lives in network policies for AI agent sandboxes. The point versus a plain container: you get this gate without writing iptables rules.
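If it helps to picture what deny-by-default means, the sketch below models an allowlist check with shell-style wildcard matching from Python's standard library. This illustrates the idea only and is not how sbx evaluates rules: the real matching semantics (for example, whether "*.npmjs.org" also covers the bare domain) are defined by Docker Sandboxes, and the function name egress_allowed is hypothetical.

```python
from fnmatch import fnmatchcase


def egress_allowed(host: str, allowlist: list[str]) -> bool:
    """Deny by default: a host passes only if some pattern matches it."""
    return any(fnmatchcase(host, pattern) for pattern in allowlist)


# The same comma-separated rule string used in the sbx example above.
rules = "*.npmjs.org,github.com".split(",")

for host in ("registry.npmjs.org", "github.com", "evil.example.com"):
    print(host, egress_allowed(host, rules))
```

Anything that no pattern matches is refused, which is the inverse of a plain container where everything is reachable until you block it.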

If you must use a plain container, harden it

Sometimes a microVM is not available in your environment and a container is what you have. You can close most of the obvious leaks with flags. This will not give you a separate kernel, so it is not equivalent to a microVM, but it is a real improvement over a naked docker run.

Run as a non-root user instead of root:

docker run --user 1000:1000 ...

Drop every Linux capability and add back only what you actually need:

docker run --cap-drop=ALL ...

Block privilege escalation so a setuid binary cannot regain capabilities:

docker run --security-opt=no-new-privileges ...

Make the root filesystem read-only and give the agent a scratch space that is not your host:

docker run --read-only --tmpfs /tmp ...

Cut the network entirely when a task does not need it, or attach a tightly scoped network when it does:

docker run --network none ...

Cap resources so a runaway loop cannot starve the host:

docker run --memory 4g --cpus 2 --pids-limit 512 ...

Put together, a hardened run looks like this:

docker run -it --rm \
  --user 1000:1000 \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  --read-only --tmpfs /tmp \
  --memory 4g --cpus 2 --pids-limit 512 \
  --network none \
  -v "$PWD":/work -w /work \
  node:22 bash

Two things to keep in mind. Never mount the Docker socket into an agent container, because that hands the agent the daemon and therefore the host. And remember that all of this still rides on your shared kernel, so a kernel-level escape defeats every flag above at once. Hardened containers raise the cost of an escape. A microVM changes what an escape even targets.
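If you script container launches, a small pre-flight check can catch a missing flag before the agent starts. The sketch below is one way to do that, assuming you build the docker run arguments as a list. The helper names has_flag and audit_docker_run are hypothetical, and the check only looks for the presence of each flag, not whether its value is sensible.

```python
def has_flag(args: list[str], *names: str) -> bool:
    """True if any name appears as `--flag value` or `--flag=value`."""
    return any(a == n or a.startswith(n + "=") for a in args for n in names)


def audit_docker_run(args: list[str]) -> list[str]:
    """Return one warning per hardening step the argument list is missing."""
    warnings = []
    if not has_flag(args, "--user", "-u"):
        warnings.append("runs as root: add --user")
    if not has_flag(args, "--cap-drop"):
        warnings.append("keeps default capabilities: add --cap-drop=ALL")
    if not any("no-new-privileges" in a for a in args):
        warnings.append("escalation possible: add --security-opt=no-new-privileges")
    if not has_flag(args, "--read-only"):
        warnings.append("root filesystem is writable: add --read-only")
    if not has_flag(args, "--network"):
        warnings.append("default bridge egress: add --network none or a scoped network")
    if not has_flag(args, "--memory", "-m"):
        warnings.append("no memory cap: add --memory")
    if any("docker.sock" in a for a in args):
        warnings.append("Docker socket is mounted: remove it")
    return warnings


naked = ["-it", "--rm", "-v", "/proj:/work", "-w", "/work", "node:22", "bash"]
hardened = [
    "-it", "--rm", "--user", "1000:1000", "--cap-drop=ALL",
    "--security-opt=no-new-privileges", "--read-only", "--tmpfs", "/tmp",
    "--memory", "4g", "--cpus", "2", "--pids-limit", "512",
    "--network", "none", "-v", "/proj:/work", "-w", "/work", "node:22", "bash",
]

for warning in audit_docker_run(naked):
    print(warning)
```

A clean audit still says nothing about the shared kernel. It only confirms the flags from this section are present.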

Why Ralph defaults to sbx

Ralph runs every agent inside a Docker Sandbox by default, and the reasoning is the whole post in one line: the sandbox is the boundary, so the agent is free to move fast inside it. Because the boundary is external and real, Ralph runs agents in bypass-permissions mode (Claude Code’s --dangerously-skip-permissions, with equivalents for other CLIs) without the usual dread. The danger of YOLO mode comes from the target on the host, and the microVM removes the target. The deeper version of that argument is in running agents in YOLO mode safely.

The loop wires the sandbox lifecycle so you never manage it by hand. Each iteration computes the deterministic name, checks that sbx exists, creates the sandbox on the first pass and attaches on later passes, runs the agent on exactly one task, and stops the sandbox through an exit trap when the run ends. The loop stops on an explicit completion promise (<promise>COMPLETE</promise>, <promise>BLOCKED:reason</promise>, or <promise>DECIDE:question</promise>), not on a vibe, and those map to exit codes 0, 2, and 3 with 1 reserved for hitting the iteration cap.
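The promise-to-exit-code mapping is small enough to sketch. The following is an illustration in Python, not Ralph's actual shell implementation: the list of strings stands in for the output of each agent iteration, and the names PROMISE_RE and run_loop are hypothetical.

```python
import re

# Matches <promise>COMPLETE</promise>, <promise>BLOCKED:reason</promise>,
# and <promise>DECIDE:question</promise>.
PROMISE_RE = re.compile(
    r"<promise>(COMPLETE|BLOCKED|DECIDE)(?::(.*?))?</promise>", re.DOTALL
)
EXIT_CODES = {"COMPLETE": 0, "BLOCKED": 2, "DECIDE": 3}
MAX_ITERATIONS_EXIT = 1


def run_loop(iteration_outputs: list[str], max_iterations: int) -> tuple[int, str]:
    """Return (exit_code, detail) for a sequence of per-iteration outputs.

    Each string stands in for one agent run. The loop stops at the first
    explicit promise, or exits with 1 once the iteration cap is used up.
    """
    for output in iteration_outputs[:max_iterations]:
        match = PROMISE_RE.search(output)
        if match:
            kind, detail = match.group(1), match.group(2) or ""
            return EXIT_CODES[kind], detail
    return MAX_ITERATIONS_EXIT, "iteration cap reached"


print(run_loop(["still working", "<promise>BLOCKED:missing DATABASE_URL</promise>"], 10))
```

Because the stop condition is a literal tag in the output, a wrapper script can branch on the exit code without parsing free-form agent prose.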

Starting a run is the same regardless of which agent you pick:

./ralph.sh -n 50
./ralph.sh --agent codex -- --model gpt-5.5

Supported agents are claude (the default), codex, copilot, cursor, gemini, and opencode. For the field guide on which CLI behaves best inside a long sandboxed loop, the agentic coding CLIs pillar is the cross-hub companion to this one. The takeaway stands on its own: a plain container is where you start, and a per-sandbox microVM is the boundary you actually want around an autonomous agent.

Frequently asked questions

Is a Docker container enough to sandbox an AI coding agent?

A plain container is a start, not a full boundary. It isolates the process and the filesystem you mount, which is better than running the agent on the host, but it shares your host kernel, often runs as root, and has open outbound network by default. For an untrusted agent running arbitrary commands, a microVM that gives each sandbox its own guest kernel is a stronger boundary.

What is the difference between a Docker Sandbox and a plain Docker container?

A plain container shares the host kernel and isolates processes with namespaces and cgroups. A Docker Sandbox, run with the sbx CLI, gives each sandbox its own lightweight virtual machine with a separate guest kernel and a virtual machine monitor between it and the host. The sandbox also defaults to deny-by-default network egress, which a plain container does not.

Where does a hand-rolled container leak when running an agent?

Four common places: the process runs as root by default, the container shares your host kernel so a kernel-level escape reaches your machine, bind mounts write straight to your real files and can expose more than intended, and the default network gives the agent full outbound access. Mounting the Docker socket is the worst case because it hands the agent control of the host.

How can I harden a plain container if I cannot use a microVM?

Run as a non-root user, drop all capabilities with cap-drop ALL, set no-new-privileges, make the root filesystem read-only with a tmpfs scratch space, cap memory, CPU, and pids, and restrict the network. Never mount the Docker socket. These flags raise the cost of an escape but still rely on the shared kernel, so they are weaker than a separate guest kernel.

Why does Ralph default to Docker Sandboxes instead of plain containers?

Because the sandbox is the boundary that makes bypass-permissions mode safe. The microVM removes the target an agent could hit on the host, so the agent can run fast and autonomously while the worst case stays contained. Ralph also automates the lifecycle, computing a deterministic sandbox name and creating, attaching, and stopping the sandbox across loop iterations.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop


Task Lookup Tables: Scaling Autonomous Agents to Hundreds of Tasks

Thu, 30 Apr 2026 00:00:00 GMT

A flat task list does not scale. Once you push past a few dozen entries, a single file that holds every task plus every step, criterion, and note becomes too big to load into an agent’s context window on every iteration. The fix is a task lookup table: a lean index of tasks that points to a separate detailed spec file per task. The agent reads the small index to decide what to do next, then loads exactly one spec file to do it. That separation is what lets an autonomous loop work through hundreds of tasks without choking on its own backlog.

This post shows how Ralph models that pattern with .agent/tasks.json and .agent/tasks/TASK-{ID}.json, why the split scales, how status and priority selection work each iteration, and how you grow the table over time. It assumes you already understand spec-driven development with AI and want the data structure that makes a long run survivable.

What a task lookup table is

A lookup table is an index. Instead of one giant document that mixes the list of work with the detail of each item, you keep two layers.

The first layer is a flat array where each entry is small: an id, a title, a category, a pointer to the detail file, and a status flag. That is the lookup table. It answers one question fast: what is left to do, and which thing is next.

The second layer is one spec file per task. Each file is the full contract for a single unit of work: description, acceptance criteria, ordered steps, dependencies, complexity, and technical notes. The agent only opens this when it has already decided to work that task.

The reason to split is the same reason a database keeps an index separate from the rows. You scan the cheap thing to find the right record, then you read the expensive thing once. An agent that has to load 200 fully detailed tasks just to pick the next one burns its context window before it writes a line of code.

How Ralph models it: tasks.json plus per-task specs

Ralph keeps the index in .agent/tasks.json and the detail in .agent/tasks/TASK-{ID}.json. The root file is deliberately thin. Each entry carries only what the loop needs to choose a task.

[
  {
    "id": "TASK-1",
    "title": "Verify project prerequisites and access",
    "category": "setup",
    "specFilePath": ".agent/tasks/TASK-1.json",
    "passes": false
  },
  {
    "id": "TASK-2",
    "title": "User table with authentication fields",
    "category": "data-model",
    "specFilePath": ".agent/tasks/TASK-2.json",
    "passes": false
  },
  {
    "id": "TASK-3",
    "title": "POST /api/auth/register creates new user account",
    "category": "api-endpoint",
    "specFilePath": ".agent/tasks/TASK-3.json",
    "passes": false
  }
]

Five fields, nothing more. The specFilePath is the pointer that turns this flat list into a lookup table. The passes flag is the status. Everything heavy lives behind the pointer, in the per-task file.

Here is what one of those spec files holds. This is TASK-3.json, the detail behind the thin index entry above.

{
  "id": "TASK-3",
  "title": "POST /api/auth/register creates new user account",
  "category": "api-endpoint",
  "description": "Validate input, hash the password, store the user, and return a success response.",
  "acceptanceCriteria": [
    "POST with a valid email and password returns 201 with the user id and email",
    "Invalid email format returns 400 with the error text Please enter a valid email",
    "Password shorter than 8 characters returns 400",
    "Duplicate email returns 409, not a generic 500",
    "The stored password starts with $2b$ and is never the plaintext value"
  ],
  "steps": [
    {
      "step": 1,
      "description": "Add the register route handler",
      "details": "Validate with a zod schema, hash with bcrypt, insert into the users table.",
      "pass": false
    },
    {
      "step": 2,
      "description": "Write Vitest cases for every acceptance criterion",
      "details": "Cover valid registration, invalid email, short password, duplicate email, and the stored hash prefix.",
      "pass": false
    }
  ],
  "dependencies": ["TASK-1", "TASK-2"],
  "estimatedComplexity": "medium",
  "technicalNotes": [
    "Never log passwords, even in error branches",
    "Return 409 on duplicate email rather than a generic 500"
  ]
}

Notice the asymmetry. The index entry for TASK-3 is five fields, one per line. The spec file is on the order of forty lines. Multiply that across a real project and the savings are obvious: a 200 task project has a tasks.json of roughly a thousand lines of thin entries, while the detail sits in 200 separate files the agent never loads all at once.

The diagram below is the whole idea. The loop scans the lean table, picks one id, follows its specFilePath, and loads only that file into the working context.

flowchart LR
  Index["tasks.json: lean index, one thin entry per task"]
  Index --> Scan["Scan for highest-priority task where passes is false"]
  Scan --> Pick["Selected: TASK-3"]
  Pick --> Path["Follow specFilePath"]
  Path --> Spec["tasks/TASK-3.json: full spec loaded into context"]
  Spec --> Work["Work the steps, run the gate"]
  Work --> Flip["Set passes true in tasks.json, commit"]
  Index -. "not loaded" .-> Rest["tasks/TASK-4.json ... TASK-200.json"]

The dotted edge is the point. Every other spec file stays on disk, unread, until its turn comes. The agent never pays the token cost of a task it is not working.
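The lookup itself is only a few lines. Here is a minimal sketch in Python, assuming the .agent/tasks.json layout shown above. The function name load_spec and the throwaway demo project are hypothetical, and the demo spec files are stubs rather than real specs.

```python
import json
import tempfile
from pathlib import Path


def load_spec(project_root: Path, task_id: str) -> dict:
    """Scan the lean index for task_id, then read only that task's spec file."""
    index = json.loads((project_root / ".agent" / "tasks.json").read_text())
    entry = next(e for e in index if e["id"] == task_id)
    return json.loads((project_root / entry["specFilePath"]).read_text())


# Demo: build a tiny project in a temporary directory, then look up one task.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".agent" / "tasks").mkdir(parents=True)
    demo_index = [
        {"id": f"TASK-{n}", "title": f"Demo task {n}", "category": "demo",
         "specFilePath": f".agent/tasks/TASK-{n}.json", "passes": False}
        for n in (1, 2, 3)
    ]
    (root / ".agent" / "tasks.json").write_text(json.dumps(demo_index))
    for entry in demo_index:
        stub = {"id": entry["id"], "steps": [], "dependencies": []}
        (root / entry["specFilePath"]).write_text(json.dumps(stub))

    loaded = load_spec(root, "TASK-3")
    print(loaded["id"])
```

Only two files are read: the index and one spec. TASK-1.json and TASK-2.json stay on disk untouched.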

Why this scales to hundreds of tasks

The scaling property comes from one fact: the agent only loads the one task it needs. Every iteration of the Ralph loop starts the agent with a fresh context window. It does not carry the previous iteration in chat history. It rebuilds its understanding from files. So the cost of each iteration is whatever the agent reads off disk, not the size of the whole project.

Walk through the math. Suppose the average per-task spec is 40 lines and the project has 200 tasks. If the agent had to load every detailed task to orient itself, that is 8000 lines on every single iteration, repeated 200 times. With a lookup table, each iteration loads the thin index (around 1000 lines at five lines per entry) plus exactly one spec file (40 lines), about 1040 lines in total. Only the thin index grows with the backlog. The detail loaded per iteration stays at one file, no matter how large the table gets.
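The same arithmetic as a short script, using the figures from the walk-through. The only assumption added here is five lines per index entry, which is what puts the index at roughly 1000 lines for 200 tasks.

```python
TASKS = 200
SPEC_LINES = 40            # average lines per spec file
INDEX_LINES_PER_TASK = 5   # assumption: five field lines per index entry

# Flat file: every iteration loads every detailed task.
flat_per_iteration = TASKS * SPEC_LINES

# Lookup table: every iteration loads the thin index plus one spec.
index_lines = TASKS * INDEX_LINES_PER_TASK
lookup_per_iteration = index_lines + SPEC_LINES

# One iteration per task across the whole run.
flat_total = flat_per_iteration * TASKS
lookup_total = lookup_per_iteration * TASKS

print(flat_per_iteration, lookup_per_iteration)   # 8000 1040
print(flat_total, lookup_total)                   # 1600000 208000
```

Under these assumptions the lookup table reads a little under an eighth of what the flat file reads, and the gap widens as specs get longer.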

That is why a flat, fully detailed task file hits a wall. It works fine at ten tasks. At a hundred it crowds out the actual code in the context window, and the agent starts skimming, missing acceptance criteria, and contradicting earlier decisions. Splitting the index from the detail keeps the working context flat as the project grows.

The token economics reinforce this. Tokens in the context window cost money and attention on every pass. You do not want to spend that budget reloading 199 tasks the agent will not touch this iteration. A lean index plus one spec is the minimum the agent needs to choose work and do it correctly. This is the same instinct behind keeping a short SUMMARY.md for the PRD instead of resending the full document every iteration: load the index constantly, load the detail on demand.

Status tracking with the passes flag

Status lives in two places, and the split mirrors the index-and-detail structure.

At the index level, each entry in tasks.json has a single passes boolean. It starts false. The agent only flips it to true after the work is built and verified. The loop reads this flag to decide what is finished and what remains. Scanning for incomplete work is a pass over the lean index, which stays cheap even at hundreds of entries.

At the detail level, each step inside a spec file has its own pass boolean. These track progress within a single task: step one done, step two not yet. A task is not complete until every step passes and the acceptance criteria hold. Only then does the top-level passes in the index flip.

The rule that protects this system is strict: never flip passes to true until the work is verified. Tasks are generated with passes: false and stay false until the agent runs the verification stack the loop assumes, which is Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for lint, and Prettier for format. The repo mantra is blunt: if you didn’t test it, it doesn’t work. A status flag that flips on a vibe instead of a passing gate turns the lookup table into a lie, and a fresh-context agent will trust that lie on the next iteration.
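The rule can be written as a guard. The sketch below assumes the index and spec shapes shown earlier and treats the verification stack as a single boolean, gate_passed, that the caller supplies after running its checks. The function name try_mark_passed is hypothetical. In Ralph the agent performs this update itself, not a helper like this one.

```python
def try_mark_passed(index: list[dict], spec: dict, gate_passed: bool) -> bool:
    """Flip the index-level passes flag only when every step passed
    and the verification gate is green. Returns True if the flag flipped."""
    all_steps_done = all(step["pass"] for step in spec["steps"])
    if not (all_steps_done and gate_passed):
        return False
    for entry in index:
        if entry["id"] == spec["id"]:
            entry["passes"] = True
            return True
    raise KeyError(spec["id"])


index = [{"id": "TASK-3", "passes": False}]
spec = {"id": "TASK-3", "steps": [{"step": 1, "pass": True}, {"step": 2, "pass": False}]}

print(try_mark_passed(index, spec, gate_passed=True))   # step 2 is unfinished
spec["steps"][1]["pass"] = True
print(try_mark_passed(index, spec, gate_passed=True))   # now every step passed
```

Both conditions have to hold. Finished steps with a red gate leave the flag false, and so does a green gate with an unfinished step.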

Because status is just a flag in a file, the run is resumable and auditable. Any fresh agent on any machine can read tasks.json, see which entries are still false, and pick up exactly where the last one stopped. Watching those flags flip across a run, alongside the per-iteration history and screenshots, is the basis of observability for autonomous coding agents. The table is both the work queue and the progress report.

Priority selection each iteration

Picking the next task is a scan over the index, and the order is not arbitrary. Each iteration, the loop finds the highest-priority incomplete task in tasks.json, then loads that one spec file and works its steps.

Two things shape the selection.

First, dependencies. Each spec file carries a dependencies array of task ids that must finish before it can start. TASK-3 (the register endpoint) lists ["TASK-1", "TASK-2"], so the loop will not select it until the prerequisite gate and the users table are both passes: true. The agent never builds on a foundation that is not there yet.

Second, the prerequisite gate. TASK-1 is always reserved for prerequisite verification: environment variable placeholders exist, database access works, required tools are authenticated, and any open gaps have an explicit proceed or block decision. Every downstream task that needs those prerequisites lists TASK-1 as a dependency. You do not want an agent discovering halfway through a 200 task run that it never had database credentials.

flowchart TD
  Start["Fresh context"] --> Read["Read tasks.json index"]
  Read --> Filter["Filter to passes false"]
  Filter --> Deps{"Dependencies satisfied?"}
  Deps -->|"no, skip"| Next["Try next candidate"]
  Next --> Deps
  Deps -->|"yes"| Select["Select highest-priority task"]
  Select --> Load["Load its TASK-ID.json spec"]
  Load --> Build["Work the steps"]
  Build --> Gate["Tests, lint, types, screenshot"]
  Gate -->|"fail"| Build
  Gate -->|"pass"| Update["Set passes true, commit"]
  Update --> Stop["Stop. Next iteration starts clean"]

The loop completes exactly one task per invocation, commits, and stops. It never batches. That discipline is what keeps a long run reliable, and the reasoning behind it is covered in one task per iteration. The lookup table is what makes one-task-per-iteration cheap to execute: selection is a quick scan of flags and dependencies, not a re-read of the entire project.
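The selection flow above can be sketched in a few lines. Two assumptions are baked in, and neither is confirmed by the index format shown earlier: priority is taken to be array order, since the index has no explicit priority field, and the dependency lists are passed in as a dict, standing in for whatever the agent reads from the spec files. The function name select_next is hypothetical.

```python
from typing import Optional


def select_next(index: list[dict], dependencies: dict[str, list[str]]) -> Optional[str]:
    """Return the id of the first incomplete task whose dependencies all pass.

    Assumption: earlier entries in the index have higher priority.
    """
    done = {entry["id"] for entry in index if entry["passes"]}
    for entry in index:
        if entry["passes"]:
            continue  # already finished
        if all(dep in done for dep in dependencies.get(entry["id"], [])):
            return entry["id"]
    return None  # nothing left that is ready to run


index = [
    {"id": "TASK-1", "passes": True},
    {"id": "TASK-3", "passes": False},
    {"id": "TASK-2", "passes": False},
]
deps = {"TASK-2": ["TASK-1"], "TASK-3": ["TASK-1", "TASK-2"]}

print(select_next(index, deps))
```

TASK-3 comes first in the array but is skipped because TASK-2 has not passed yet, so the scan lands on TASK-2.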

You control how many of these iterations run. The default is 10, and the -n flag raises it. Of the two commands below, the first installs Ralph and the second runs up to 50 iterations.

npx @pageai/ralph-loop
./ralph.sh -n 50

If the table has 200 tasks and you cap the loop at 50 iterations, the run stops at the cap with exit code 1 (MAX_ITERATIONS) and the remaining tasks stay false in the index, ready for the next run to resume. Nothing is lost. The lookup table is the durable record.

Adding tasks over time with the prd-creator skill

A lookup table is not a frozen document. You grow it as the project grows, and you do not hand-edit a 200 entry JSON file to do that.

Ralph ships a prd-creator skill that turns unstructured requirements into a PRD plus a task list. The first time you run it, it interviews you, writes .agent/prd/PRD.md and .agent/prd/SUMMARY.md, then generates tasks.json with one TASK-{ID}.json spec per task. TASK-1 is always the prerequisite gate, and every task is initialized with passes: false. For a typical project that is dozens to hundreds of entries, not five, because the skill keeps tasks small: anything too complex to finish in a short sitting gets split.

When you want to add a feature or fix a bug later, you run the skill again. It updates the PRD and appends new tasks to the index, each with its own spec file, each starting false. The completed entries keep their passes: true status, so the loop ignores them and works only the new false ones. The table grows without disturbing the finished work.

Use the prd-creator skill in plan mode. Add a password reset flow to the
existing PRD and append the new tasks to .agent/tasks.json with one spec
file each under .agent/tasks/.

Run it in plan mode, where the agent is read-only and asks questions instead of writing code. The skill decomposes the new feature into atomic tasks the same way it did the first batch, which is its own discipline covered in breaking a PRD into atomic agent tasks. The result is a lookup table that accretes work over the life of the project while staying scannable, because the index entries stay thin no matter how many you add.
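To show what appending looks like at the data level, here is a minimal sketch. It is not the prd-creator skill, which is driven by the agent. It assumes new ids continue from the highest existing number, which the source does not specify, and the function name append_tasks is hypothetical.

```python
def append_tasks(index: list[dict], new_tasks: list[tuple[str, str]]) -> list[dict]:
    """Return a grown index: existing entries untouched, new ones start false.

    new_tasks is a list of (title, category) pairs.
    Assumption: ids continue from the highest existing TASK number.
    """
    next_number = max((int(e["id"].split("-")[1]) for e in index), default=0) + 1
    grown = list(index)
    for offset, (title, category) in enumerate(new_tasks):
        task_id = f"TASK-{next_number + offset}"
        grown.append({
            "id": task_id,
            "title": title,
            "category": category,
            "specFilePath": f".agent/tasks/{task_id}.json",
            "passes": False,
        })
    return grown


existing = [
    {"id": "TASK-1", "title": "Verify prerequisites", "category": "setup",
     "specFilePath": ".agent/tasks/TASK-1.json", "passes": True},
]
grown = append_tasks(existing, [("Password reset request endpoint", "api-endpoint")])
print([(e["id"], e["passes"]) for e in grown])
```

The finished entry keeps passes set to true and the new one starts false, so the next loop run picks up only the new work.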

Where to go next

If you are building the spec that drives a long run, read across the spec-driven cluster:

Spec-driven development with AI for the full Specify, Plan, Tasks, Implement workflow the lookup table sits inside.
Breaking a PRD into atomic agent tasks for decomposing work into the packets that fill the table.
One task per iteration for the rule that makes the lookup table cheap to execute.

For watching the table fill in across a run, with logs, history, and screenshots, read observability for autonomous coding agents.

Frequently asked questions

What is a task lookup table for an AI agent?

It is a two-layer structure that separates the list of work from the detail of each item. The first layer is a lean index where each entry has an id, title, category, a pointer to a detail file, and a status flag. The second layer is one spec file per task with the full description, acceptance criteria, steps, and dependencies. The agent scans the cheap index to pick the next task, then loads only that one spec file to do the work.

Why does a flat task list stop scaling for autonomous agents?

A flat list that holds every task plus all of its detail gets too large to load into the context window on every iteration. At ten tasks it is fine. At a hundred it crowds out the actual code, so the agent skims, misses acceptance criteria, and contradicts earlier decisions. A lookup table keeps the working context flat because each iteration loads the thin index plus exactly one detailed spec, no matter how many tasks exist.

How does Ralph store tasks on disk?

The index is .agent/tasks.json, a flat array where each entry has id, title, category, specFilePath, and a passes flag. The detail for each task lives in .agent/tasks/TASK-{ID}.json with description, acceptance criteria, ordered steps, dependencies, estimated complexity, and technical notes. The agent reads these files fresh on every iteration, so the filesystem is the memory rather than the chat history.

How does the agent choose which task to work next?

Each iteration it scans tasks.json for the highest-priority entry where passes is false and whose dependencies are already satisfied. TASK-1 is always prerequisite verification, and downstream tasks list it as a dependency, so feature work never starts before access and environment are confirmed. The agent loads that one spec file, works the steps, runs the verification gate, then flips passes to true and commits.

How do I add more tasks to the lookup table over time?

Run the prd-creator skill again in plan mode. It updates the PRD and appends new tasks to tasks.json, each with its own spec file under .agent/tasks/ and each initialized with passes set to false. Completed entries keep their passes true status, so the loop ignores them and works only the new false ones. The table grows without disturbing finished work and stays scannable because the index entries stay thin.


Why You Should Sandbox Every Autonomous Coding Agent

Sat, 25 Apr 2026 00:00:00 GMT

Sandbox every autonomous coding agent because the agent runs with your permissions. When you start an agent on your laptop, it does not run as some restricted service account. It runs as you, with your user, which means it can read your SSH keys, your cloud credentials, and your full git history, and it can execute any shell command those permissions allow. A sandbox moves all of that out of reach, so the worst case becomes a throwaway environment you reset instead of a credential leak you cannot undo. That is the entire argument for why you sandbox AI agent work, and the rest of this post is the detail behind it.

An agent runs as you, with everything you can touch

Think about what your own user account can do on a normal dev machine. You can read ~/.ssh/id_ed25519, the private key that authenticates you to GitHub and every server you SSH into. You can read ~/.aws/credentials, ~/.config/gh/hosts.yml, .env files scattered across projects, and the browser session cookies sitting in your home directory. You can git push --force, delete directories, and run a curl | bash line you copied from a README.

An autonomous agent inherits all of it. The model is not malicious. It is a probabilistic system that emits shell commands, and shell commands do not ask for intent. When the agent decides to “clean up some old files” or “reset the environment to a known state,” there is no separate permission layer protecting your home directory. The agent is you, and you have access to everything.

This is fine when a human drives each command and reviews it. Autonomy removes the human from that loop on purpose. The whole point of a Ralph-style loop is that the agent works for hours without you watching, picking up the next task in .agent/tasks.json, running commands, and committing. The thing that made interactive use safe (you reading each command before approving it) is exactly the thing autonomy deletes.

So the question is not whether the agent will eventually run a command you would not have approved. Over a long enough run, it will. The question is what that command can reach when it does.
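You can check the reach of your own account directly. The sketch below lists which of a few well-known credential paths the current user can read under a given home directory. The path list comes from the examples in this section, the function name readable_secrets is hypothetical, and the demo runs against a temporary directory so it never touches real keys.

```python
import os
import tempfile
from pathlib import Path

# Paths named in this section, relative to the home directory.
SENSITIVE = [".ssh/id_ed25519", ".aws/credentials", ".config/gh/hosts.yml"]


def readable_secrets(home: Path, candidates: list[str] = SENSITIVE) -> list[str]:
    """List the candidate credential files the current user can read under home."""
    return [rel for rel in candidates if os.access(home / rel, os.R_OK)]


# Demo against a fake home directory that contains a single placeholder key.
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    (fake_home / ".ssh").mkdir()
    (fake_home / ".ssh" / "id_ed25519").write_text("not a real key")
    found = readable_secrets(fake_home)
    print(found)
```

Pointing it at Path.home() on a real dev machine shows what an agent running as you could read. The same check inside a sandbox should come back empty for host credentials.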

The blast radius problem with YOLO mode on your laptop

To run unattended, an agent has to stop asking permission. In Claude Code that switch is --dangerously-skip-permissions (also --permission-mode bypassPermissions, documented in the Claude Code docs). Other CLIs ship their own version of the same idea, usually called YOLO mode. The naming is honest. You are telling the agent to execute whatever it decides, with no confirmation, for the entire run.

On your host that is a genuinely bad idea, and the danger scales with the length of the run. One iteration is a handful of commands. A fifty-iteration overnight run is hundreds of commands, each one a chance for a hallucinated path, a destructive cleanup, or a misread instruction. You will not be awake to catch the one that matters.

The damage is not limited to the project either. Because the agent runs as you, a single bad command can reach far outside the directory you pointed it at:

It can read secrets from other projects and from your home directory, then paste them into a file, a log, or an outbound request.
It can push to remotes you have credentials for, including repositories that have nothing to do with the current task.
It can delete or rewrite files anywhere your user can write, with no undo.
It can install and execute arbitrary code pulled from the network.

None of these require the model to “go rogue.” They are the normal capabilities of your shell, exposed to a system running without a gate. Running bypass-permissions mode directly on a host is the failure mode, not a feature. For the safe version of that same flag, see running agents in YOLO mode safely.

flowchart LR
  subgraph NoSandbox["Without a sandbox: blast radius is your whole machine"]
    AgentA["Agent in YOLO mode (runs as you)"]
    AgentA --> Keys["SSH keys, cloud creds, cookies"]
    AgentA --> Repos["Every other git repo"]
    AgentA --> Net["Unrestricted network"]
    AgentA --> Proj1["Project directory"]
  end
  subgraph WithSandbox["With a sandbox: blast radius is one disposable microVM"]
    AgentB["Agent in YOLO mode (contained)"]
    AgentB --> Proj2["Project directory (only this)"]
    AgentB --> Gate["Network gate: deny-by-default"]
    AgentB -. "no path to" .-> HostKeys["Host keys and creds"]
    AgentB -. "no path to" .-> HostRepos["Other repos"]
  end

The two pictures use the same agent in the same mode. The only thing that changes is what “anything it can do” resolves to. On the left it resolves to your machine. On the right it resolves to a microVM you can throw away.

Why a sandbox makes autonomy acceptable

A sandbox does not make the agent trustworthy. It makes trust unnecessary. You stop trying to predict every command the model might emit and instead make the set of things any command can reach small and disposable. That inversion is what lets you walk away from a running agent and sleep.

Concretely, a sandbox shrinks the worst case from “the agent leaked my SSH key and force-pushed to a client repo” to “the agent trashed a throwaway environment, so I reset it and re-ran the loop.” The first outcome is a security incident. The second is a Tuesday. Same model, same flags, completely different cost when something goes wrong.

This is also why a sandbox is the precondition for real autonomy rather than a nice-to-have. Long unattended runs are the entire premise of running an AI coding agent overnight, and you cannot responsibly leave an agent running for hours unless the place it runs is one you can afford to lose. The sandbox is what makes the overnight run a reasonable decision instead of a gamble with your credentials.

There is a structural point underneath this. A permission prompt is not a security boundary, it is a question. The only real boundary is one enforced from outside the agent, by something the agent cannot reach through or talk its way around. A sandbox is that external boundary. The full version of this argument lives in the pillar guide, how to run AI coding agents in Docker Sandboxes safely.

How Ralph uses Docker Sandboxes by default

Ralph does not bolt on isolation as an option you remember to enable. It runs every agent inside a Docker Sandbox by default, using the sbx CLI. Each agent gets a lightweight virtual machine with its own kernel, which is a stronger boundary than a namespaced process sharing your host kernel. The Docker Sandboxes documentation is the primary source for how the microVM works underneath.

You do not manage that lifecycle by hand. The script computes a deterministic sandbox name, checks that sbx is installed, decides whether to create or attach, runs the agent in bypass-permissions mode inside the microVM, and stops the sandbox when the run ends. Because the sandbox is the boundary, the agent is free to skip permission prompts and move fast inside it.

Installing and running looks like this:

npx @pageai/ralph-loop
./ralph.sh -n 50

The first command drops Ralph into your project. The second runs the loop for fifty iterations, each one starting the agent with a fresh context inside the sandbox. Pick a different agent and pass agent-specific flags after a -- separator:

./ralph.sh --agent codex -- --model gpt-5.5
./ralph.sh -a gemini -n 5 -- --model pro

Supported agents are claude (the default), codex, copilot, cursor, gemini, and opencode. The default iteration count is 10, and --once runs exactly one iteration.

The sandbox name is deterministic so the same project and agent pair always reuses the same microVM:

ralph-<agent>-<current-dir>-<hash8>

<agent> is the agent slug, <current-dir> is the sanitized basename of the project directory, and <hash8> is the first eight hex characters of the SHA-256 hash of the absolute project path. The path hash keeps two same-named directories on different paths from colliding. You never have to memorize the name, because Ralph prints it on startup and on demand:

./ralph.sh --print-name
./ralph.sh --print-name --agent codex
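If you want to see how a name of that shape comes together, here is a minimal sketch. The sanitization rule (lowercase, runs of other characters collapsed to a dash) and the exact bytes fed to the hash are assumptions, so the result may not match what ralph.sh prints. Treat --print-name as the source of truth.

```python
import hashlib
import os
import re


def sandbox_name(agent: str, project_dir: str) -> str:
    """Build a name shaped like ralph-<agent>-<current-dir>-<hash8>."""
    abs_path = os.path.abspath(project_dir)
    base = os.path.basename(abs_path)
    # Assumed sanitization: lowercase, collapse anything else into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")
    # First eight hex characters of the SHA-256 of the absolute path.
    hash8 = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()[:8]
    return f"ralph-{agent}-{slug}-{hash8}"


print(sandbox_name("claude", "/work/client-a/my-app"))
print(sandbox_name("claude", "/work/client-b/my-app"))
```

The two calls share a basename but differ in path, so the names differ only in the trailing hash, which is the collision protection described above.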

When the run ends, by normal exit, by a double Ctrl+C, or by any path that fires the exit trap, Ralph stops only the sandbox it started:

sbx stop ralph-claude-my-app-a1b2c3d4

Stopping is not deleting, so you can reattach later to inspect what the agent did, then remove it with sbx rm when you are finished.

What good isolation actually looks like

Not every “sandbox” is a real boundary. A bare container that bind-mounts your whole home directory and has open network is theater. Three properties separate isolation that holds from isolation that only looks like it does.

Mount only the project, nothing else

The agent should see the project directory and nothing above it. Ralph shares your project at the same absolute path it has on your host, so tooling, config, and lockfiles resolve identically, while the rest of your home directory stays outside. No SSH keys, no ~/.aws, no unrelated repositories, no shell history full of tokens. The blast radius becomes the project you pointed at, plus whatever network you explicitly allow.

The corollary matters: because the project is shared, the agent can absolutely wreck your working tree. The protection there is git, not the sandbox. Work on a branch and commit often. The sandbox protects everything outside the project, and version control protects the code inside it.

Ephemeral by design

The environment has to be cheap to destroy and recreate. If resetting a contaminated sandbox is a multi-hour chore, you will avoid resetting it, and a sandbox you never reset slowly turns back into a pet you are afraid to lose. Ralph leans on the deterministic name here: tear a sandbox down with sbx rm, and the next iteration simply probes, finds nothing, and creates a clean one. That re-probe each iteration is what makes a long run resilient to you poking at the sandbox by hand.

Network limited, not wide open

Filesystem isolation is half the boundary. The other half is the network, because an agent with unrestricted outbound access can fetch arbitrary code and, in the worst case, send data out. Docker Sandboxes block outbound HTTP and HTTPS by default, and you then allowlist exactly what the task needs:

sbx policy allow network ralph-claude-my-app-a1b2c3d4 "*.npmjs.org,github.com"

The practical symptom of deny-by-default is that npm install fails or an API call is refused until you grant the domain. That is the gate doing its job. You open specific hosts, not everything. There is a full-open escape hatch with the "**" rule, and it is the right tool only when you genuinely cannot enumerate the domains and you accept the tradeoff for that one sandbox. Building a tight allowlist that lets installs through while keeping exfiltration out is its own topic, covered in network policies for AI agent sandboxes.
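As a mental model of deny-by-default, here is a toy allowlist check. It uses Python's fnmatch globbing, which is an assumption and not how sbx matches hosts internally; the helper name is_allowed is hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(host: str, allowlist: list[str]) -> bool:
    """Deny by default: a host passes only if some pattern matches it."""
    host = host.lower().rstrip(".")
    return any(fnmatchcase(host, pattern.lower()) for pattern in allowlist)


# Same pattern string as the sbx policy example, split into entries.
ALLOW = "*.npmjs.org,github.com".split(",")

for host in ("registry.npmjs.org", "github.com", "pastebin.com"):
    print(host, "allowed" if is_allowed(host, ALLOW) else "blocked")
```

An empty allowlist blocks everything, which is the starting state you want. Each grant you add is one more host an agent can reach.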

A plain container can satisfy some of this and miss the rest, which is why the kind of sandbox matters. For where a hand-rolled container leaks and where a microVM holds, read Docker Sandboxes vs plain containers for AI agents.

A quick test before you run an agent unsandboxed

If you are tempted to skip the sandbox for a “quick” autonomous run, ask one question: if this agent ran rm -rf ~ or piped the contents of ~/.ssh to a pastebin, what would it cost you? If the honest answer is “a rebuild and some annoyance,” you are already in a disposable environment and you are fine. If the answer involves rotating credentials, notifying anyone, or the phrase “client repository,” you need the boundary before you start the loop.

The reason to make this the default rather than a judgment call is that the judgment is the part autonomy removes. You are not going to evaluate the risk of each of the next four hundred commands. You evaluate the environment once, up front, and then let the agent be fearless inside it.

Where the boundary ends

A sandbox is a strong boundary and not a magic one. Two limits are worth stating so you do not over-trust the setup.

First, the shared project directory is genuinely shared. The agent can corrupt your working tree, and only git will save you. Branch and commit.

Second, whatever you allowlist is genuinely reachable. Grant a domain that accepts uploads and an agent could in principle send data there. Keep network grants minimal, specific, and reviewed, the same way you would treat firewall rules, and avoid the "**" rule except on purpose.

Inside those limits the model is simple and it holds. Enforce the boundary from the outside, give the agent a disposable place to be fearless, and let the loop run. The sandbox is the blast radius, so fast and autonomous becomes the same thing as contained.

Frequently asked questions

Why do I need to sandbox an AI coding agent at all?

An autonomous agent runs with your user permissions, so it can read your SSH keys, cloud credentials, and git history, push to your remotes, and delete files anywhere you can write. In autonomous mode there is no human reviewing each command, so over a long run the agent will eventually execute something you would not have approved. A sandbox limits what any command can reach to a disposable environment.

What is the worst case if I run an agent without a sandbox?

The worst case is a security incident rather than an inconvenience. A single bad command can leak credentials from your home directory, force-push to an unrelated repository, or delete files with no undo, because the agent has your full permissions. With a sandbox the same bad command reaches only a throwaway microVM and the shared project directory: you recover the project from git, reset the sandbox, and re-run.

Is bypass-permissions or YOLO mode ever safe?

It is unsafe on your host and acceptable inside a sandbox. On the host, --dangerously-skip-permissions gives the agent full shell access with no confirmation. Inside a Docker Sandbox microVM the same flag only grants access to the shared project directory and an allowlisted network, so the danger has no target on your machine outside that project.

Does Ralph sandbox agents automatically?

Yes. Ralph runs every agent inside a Docker Sandbox by default using the sbx CLI. It computes a deterministic sandbox name, creates or attaches the microVM each iteration, runs the agent in bypass-permissions mode inside it, and stops the sandbox when the run ends.

What makes isolation good rather than just theater?

Good isolation mounts only the project directory and nothing above it, stays ephemeral so it is cheap to destroy and recreate, and limits the network to an allowlist instead of leaving it wide open. A microVM with its own kernel is a stronger boundary than a container that shares your host kernel, especially for code you do not trust to behave.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop

Install from npm Star on GitHub Watch the walkthrough

One Task Per Iteration: The Rule That Makes Autonomous Agents Reliable

Tue, 21 Apr 2026 00:00:00 GMT

The most reliable rule for autonomous coding is also the most boring: one task per invocation, commit, then stop. Each iteration the agent picks the single highest-priority incomplete task, works it to done, verifies it, commits, and exits. It never batches two tasks into one invocation. Break that rule and a loop that should grind cleanly through fifty tasks turns into a branch full of half-finished, hard-to-review work.

This rule is the operational core of a Ralph loop, and it sits one level below the spec-driven development workflow that produces the task list in the first place. The spec decides what the agent builds. This rule decides how it builds it without losing the plot.

The rule: one task, one commit, one stop

Stated as plainly as it goes in the repo’s own AGENTS.md: one task per invocation. When working from .agent/tasks.json, the agent completes exactly one task, commits, and stops. It never batches multiple tasks.

That is the whole rule. Three verbs: complete, commit, stop. The reason it is worth a thousand words is that every instinct (yours and the model’s) pushes the other way. You have fifty tasks and an agent that can clearly do more than one. Letting it power through feels efficient. It is not. It is the fastest way to turn a clean autonomous run into a debugging session.

An iteration that follows the rule produces one self-contained unit of progress: a feature built, its tests passing, its types checked, and a single commit that maps to exactly one task in the list. The next iteration starts from a known-good state. That property (every iteration ends in a verified checkpoint) is what makes long runs survivable.

Why batching tasks wrecks an agent loop

Batching means telling the agent (or letting it decide) to knock out three or five tasks before it commits. It looks like a throughput win. In practice it degrades the run in three compounding ways.

Context bloat and context rot

A coding agent has a finite context window and finite attention inside it. The longer a single session runs, the more it fills with intermediate state: files it opened, tests it ran, dead ends it explored, decisions it half-remembers. This is context rot. The agent starts contradicting earlier choices, re-editing files it already finished, and forgetting which of the five tasks it was actually on.

One task per iteration keeps the working context small and on-topic. The agent reads the summary and the one spec it needs, builds the one thing, and exits before the window gets crowded. Batching does the opposite: it lets the context grow across multiple unrelated tasks, which is precisely the condition under which agents start producing confident nonsense. Context rot is one of the headline Ralph loop failure modes, and batching is the most direct route to it.

Half-finished work piles up

When an agent juggles several tasks at once, a failure on task four can leave tasks one through three in a partial state. It edited shared files, started a refactor it did not finish, and now nothing in the batch is cleanly done. You cannot ship any of it, and you cannot easily tell where the good work ends and the broken work begins.

One task per iteration makes failure local. If the agent cannot finish the current task, it emits a <promise>BLOCKED:reason</promise> and stops, and everything committed before that point is intact and verified. You lost one iteration, not five tasks of tangled progress.

Commits get impossible to verify

A commit that contains one task is reviewable. The diff maps to a single entry in tasks.json, the acceptance criteria for that task tell you what to check, and the tests in the diff prove it. A commit that bundles five tasks (or worse, one giant commit for the whole run) is archaeology. You cannot bisect within it, you cannot revert one piece of it, and you cannot read why any single change exists six months later.

Clean, atomic commits are not a nicety here. They are the audit trail that makes you trust a diff you did not write. Batching trades that trail for a vague sense of speed you do not actually get, because the review and rework cost lands on you in the morning.

Fresh context per iteration is why one task works

One task per iteration only makes sense because of the other half of the design: every iteration starts the agent with a fresh context window. The loop does not carry chat history forward. It reboots the agent’s understanding from files on disk each time.

This is the central idea of the Ralph technique, popularized by Geoffrey Huntley in his original Ralph writeup. The agent’s memory is not the conversation. It is the filesystem and the git history: .agent/tasks.json, the per-task specs, the logs, and the commits. Because state lives on disk, a fresh agent can reconstruct exactly where the project stands and pick up cleanly.

Fresh context and one-task scope are two sides of the same coin. Fresh context per iteration only helps if the unit of work fits inside one clean window, and one task is the unit sized to do that. Pair them and each iteration is crisp: clean slate in, one verified task out. Try to batch tasks on top of fresh context and you reintroduce the bloat that fresh context was meant to prevent.

flowchart TD
  Fresh["Fresh context window"] --> Read["Read SUMMARY.md and tasks.json"]
  Read --> Pick["Pick highest-priority incomplete task"]
  Pick --> Spec["Open one TASK-{ID}.json"]
  Spec --> Work["Work the steps for that one task"]
  Work --> Gate["Verify: tests, lint, types, screenshot"]
  Gate -->|"fail"| Work
  Gate -->|"pass"| Commit["Set passes true and commit one task"]
  Commit --> Stop["Stop. Iteration ends at a checkpoint"]
  Stop --> Next["Next iteration starts with fresh context"]
  Next --> Fresh

Read that diagram as a single lap. The only way out of a lap is a verified commit or an explicit stop signal. There is no path where the agent commits two tasks in one pass, because the lap ends the moment one task is done.

Clean commits are your checkpoints and rollback points

Treat each per-task commit as a save point in a game. When the agent finishes a task and commits, it records a state you can return to. If iteration twelve goes sideways, you do not lose the eleven good iterations before it. You reset to the last clean commit and keep the verified work.

This is why the commit is not optional and not deferred. The loop’s standard iteration ends by updating the task status, taking a screenshot for UI work, and committing in Conventional Commits format. One task, one commit, one checkpoint. The git log becomes a ledger of progress where each line is a unit you can read, verify, and if necessary revert in isolation.

Batching destroys this. A commit that bundles several tasks is a single, coarse save point. Revert it and you lose good work alongside the bad. Bisect it and every “bad” commit contains multiple changes, so you cannot pin the regression to one task. The granularity of your commits is the granularity of your recovery, and one task per commit is the finest useful grain.

There is a second benefit that matters for long runs. Because every checkpoint is verified before it is committed, the branch is always in a shippable-ish state between iterations. You can stop the loop at any point, after iteration three or after iteration thirty, and what you have is a set of completed, tested tasks rather than a half-built mess. That is what lets you run an agent overnight and trust the morning diff.

How Ralph enforces one task per iteration

The rule is not a suggestion the agent is free to ignore. It is wired into how the loop and the prompt work together.

The loop itself is a Bash script (ralph.sh) that runs the agent once per iteration. Each iteration follows the same shape:

Find the highest-priority incomplete task in .agent/tasks.json.
Work the ordered steps in .agent/tasks/TASK-{ID}.json.
Run tests, linting, and type checking.
Complete the task, take a screenshot, update task status, and commit.
Repeat until all tasks pass or the iteration cap is reached.

Step one is where scope gets enforced: the agent selects the highest-priority incomplete task, singular. The prompt sent each iteration tells the agent to complete that one task, commit, and stop rather than continuing to the next. Because the next iteration starts a brand new agent with fresh context, there is no natural way to “keep going” across tasks. The process boundary between iterations is the enforcement mechanism.

You control how many iterations run. The default is ten. Use the -n flag to set your own cap:

# Run up to 50 iterations (50 tasks, one per iteration)
./ralph.sh -n 50

# Run exactly one iteration to watch a single task closely
./ralph.sh --once

./ralph.sh --once is the cleanest demonstration of the rule. It runs a single iteration: one task, one commit, one stop. Use it the first time you point Ralph at a new task list, watch it complete exactly one task, review the commit, and only then turn it loose with a larger cap.

The loop does not stop on a feeling. It stops on an explicit signal. The agent emits a promise tag, and the script maps it to an exit code:

<promise>COMPLETE</promise>         all tasks finished      exit 0
<promise>BLOCKED:reason</promise>   needs human help        exit 2
<promise>DECIDE:question</promise>  needs a decision        exit 3

If the loop hits its iteration cap before the list is finished, it exits with code 1 (MAX_ITERATIONS). Either way, every task that did complete completed on its own iteration, with its own verified commit. The completion signal is per-run; the one-task discipline is per-iteration.
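The mapping from promise tag to exit code can be sketched in a few lines. This is an illustration, not ralph.sh itself: the regular expression, the function names, and the idea that an iteration with no tag simply hands over to the next one are assumptions.

```python
import re

EXIT_MAX_ITERATIONS = 1
EXIT_CODES = {"COMPLETE": 0, "BLOCKED": 2, "DECIDE": 3}

PROMISE_RE = re.compile(
    r"<promise>(COMPLETE|BLOCKED|DECIDE)(?::(.*?))?</promise>", re.DOTALL
)


def exit_code_for(output: str):
    """Return (exit_code, detail) for the last promise tag, or None."""
    matches = list(PROMISE_RE.finditer(output))
    if not matches:
        return None
    kind, detail = matches[-1].group(1), matches[-1].group(2)
    return EXIT_CODES[kind], (detail or "").strip()


def run_loop(run_iteration, max_iterations: int = 10) -> int:
    """Call run_iteration() once per lap until a promise tag or the cap."""
    for _ in range(max_iterations):
        signal = exit_code_for(run_iteration())
        if signal is not None:
            return signal[0]
    return EXIT_MAX_ITERATIONS  # cap reached before any stop signal


print(exit_code_for("work done\n<promise>BLOCKED:missing API key</promise>"))
```

Here run_iteration stands in for one fresh agent invocation and returns its output as text. Passing it in as a callable keeps the sketch runnable without an agent or a sandbox.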

What about tasks that are too small or too big?

The obvious objection: if the agent does only one task per iteration, the task had better be the right size. That is true, and it is why scoping happens before the loop ever runs, during breakdown.

If your tasks are too big, the agent runs out of room inside one iteration and you get half-finished work, the exact failure the rule was meant to prevent. The fix is not to relax the rule. It is to cut the task smaller. A good task is something an agent can finish in one short sitting, with its own acceptance criteria and its own verification. Getting tasks to that size is the subject of breaking a PRD into atomic agent tasks, and it is the prerequisite that makes one task per iteration practical.

If your tasks are too small, you simply burn an iteration on something trivial, which is cheap and harmless. The asymmetry matters: tasks that are slightly too small cost you a little time, while tasks that are too big cost you a tangled branch. When in doubt, split.

The other piece is ordering. One task per iteration only flows smoothly if the highest-priority incomplete task is actually buildable right now, with its dependencies already satisfied. That is what the task lookup table manages. Each task declares its dependencies, and the loop respects them so it never tries to build the dashboard before the data exists. Scaling that ordering to hundreds of tasks is covered in task lookup tables for agents. Get the breakdown and the ordering right, and one task per iteration is not a constraint you fight. It is the rhythm the whole run settles into.
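A dependency-aware pick can be sketched as follows. It assumes list order encodes priority and that each entry carries both its passes flag and its dependencies; in a real project the dependencies live in the per-task spec, so this merged shape is a simplification, and next_buildable is a hypothetical helper.

```python
from typing import Optional


def next_buildable(tasks: list[dict]) -> Optional[dict]:
    """First incomplete task whose dependencies have all passed."""
    done = {task["id"] for task in tasks if task["passes"]}
    for task in tasks:  # assumed: earlier in the list means higher priority
        if task["passes"]:
            continue
        if all(dep in done for dep in task.get("dependencies", [])):
            return task
    return None  # everything passed, or the rest is blocked on dependencies


TASKS = [
    {"id": "TASK-1", "passes": True, "dependencies": []},
    {"id": "TASK-3", "passes": False, "dependencies": ["TASK-1", "TASK-2"]},
    {"id": "TASK-2", "passes": False, "dependencies": ["TASK-1"]},
]

print(next_buildable(TASKS)["id"])  # prints TASK-2
```

TASK-3 comes first in the list but is skipped because TASK-2 has not passed yet, which is the "never build the dashboard before the data exists" behavior.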

Where to go next

One task per iteration is the discipline that turns a spec into a reliable run. To build the rest of the workflow around it:

Spec-driven development with AI for the full Specify, Plan, Tasks, Implement loop this rule lives inside.
Breaking a PRD into atomic agent tasks for sizing tasks so one fits cleanly in one iteration.
Task lookup tables for agents for ordering and scaling to hundreds of tasks.
Ralph loop failure modes for what goes wrong when this rule and its guardrails are not in place.

Frequently asked questions

What does one task per iteration mean?

It means the agent completes exactly one task per iteration of the loop, commits the result, and stops. The next iteration starts a fresh agent that picks the next highest-priority incomplete task. The agent never batches several tasks into a single invocation.

Why should an agent not do multiple tasks in one iteration?

Batching causes three problems. The context window fills with unrelated state and the agent starts contradicting itself, which is context rot. A failure mid-batch leaves several tasks half-finished and unshippable. And the resulting commit bundles many changes, so it is hard to review, revert, or bisect. One task per iteration keeps context small, failures local, and commits atomic.

How does one task per iteration relate to fresh context?

They are two halves of the same design. Each iteration starts the agent with a fresh context window and reconstructs state from files on disk, not from chat history. That only works if the unit of work fits inside one clean window, and one task is the unit sized to do that. Together they keep every iteration crisp: clean slate in, one verified task out.

How does Ralph enforce one task per iteration?

The loop runs the agent once per iteration. Each iteration the agent finds the single highest-priority incomplete task in .agent/tasks.json, works only that task spec, runs tests, lint, and type checks, updates the status, commits, and stops. Because the next iteration starts a brand new agent with fresh context, there is no way to carry work across tasks. The process boundary between iterations is the enforcement.

What if a task is too big to finish in one iteration?

Do not relax the rule; cut the task smaller. A good task is something an agent can finish in one short sitting with its own acceptance criteria. If the agent runs out of room inside one iteration, the breakdown was too coarse. Split the task during the breakdown phase so one task fits cleanly in one iteration.

Breaking a PRD Into Atomic Agent Tasks

Fri, 17 Apr 2026 00:00:00 GMT

To break a PRD into tasks an agent can build, you decompose it into atomic task packets: small, self-contained units of work, each with one objective, the files to inspect, ordered steps, and acceptance criteria a machine can check. The agent picks one packet, builds it, verifies it, commits, and stops. The PRD is the contract for the whole project. A task packet is the contract for one iteration.

This post is the decomposition itself: what goes inside a single packet, how Ralph stores packets in .agent/tasks.json and .agent/tasks/TASK-{ID}.json, how to size each one so a single iteration can finish and commit, and how dependencies and priority decide the order the loop works them. It assumes you already have a buildable PRD. If you do not, start with how to write a PRD an AI agent can actually build from, then come back here to chop it up.

Why a PRD is the wrong unit of work for an agent

A PRD describes a finished product. An agent does not build finished products. It edits files in a single context window, and that window is finite. Hand it the whole PRD and tell it to build, and one of two things happens. It tries to do everything at once and produces a sprawling diff you cannot review, or it loses the plot halfway through and starts contradicting decisions it made an hour earlier. That second failure is context rot, and it gets worse the longer a single session runs.

The fix is to never ask the agent to hold the whole project in its head. You decompose the PRD into the smallest pieces that still make sense on their own, and you feed the agent exactly one piece per iteration with a fresh context window each time. The Ralph technique is built around this: each loop starts the agent clean, it reads the task list, it picks one packet, and the filesystem carries the memory between passes instead of a long chat history.

So the real question is not “how do I get the agent to build my PRD.” It is “what is the smallest packet of work I can define that the agent can finish and verify in one sitting.” Get that unit right and the loop becomes reliable. Get it wrong and no amount of prompting saves you.

What goes inside an atomic task packet

A task packet is atomic when it has a single objective and everything the agent needs to hit it. Four parts make a packet complete.

The objective. One sentence describing the one thing this packet delivers. Not “build authentication.” That is a feature, not a task. “POST /api/auth/register creates a new user account” is a task. If you cannot state the objective in a single sentence without the word “and,” the packet is too big and you split it.

The files to inspect. Name the files the agent should read before it touches anything, and the files it is expected to change. This is the cheapest way to stop an agent from inventing a parallel module next to the one you already have. In Ralph these live in the step details and in technicalNotes: “extend the existing auth module in src/lib/auth.ts, do not add a second one.” A packet that points the agent at the right files reuses code instead of duplicating it.

The ordered steps. The packet breaks the objective into a short sequence the agent works in order. Each step is concrete enough to act on, and the last step is almost always “write the tests that prove the acceptance criteria.” Tests are steps inside the packet, not a separate task scheduled for later.

Testable acceptance criteria. Each criterion is a condition the agent can confirm by running a command and reading the output. “Login works” is a vibe. “POST with a wrong password returns 401 and the body { error: 'Invalid credentials' }” is a criterion. The test is simple: can a machine return yes or no without your opinion? If not, rewrite it. The discipline of writing verifiable criteria is the same one that makes the PRD buildable, covered in how to write a PRD for an AI agent.

The packet is the unit of verification, not just the unit of work. When the agent finishes, it does not ask you whether it is done. It checks the criteria, and the criteria answer for it.

How Ralph stores this: a lookup table plus per-task specs

Ralph splits the task list into two layers, and the split is what lets a run scale to hundreds of packets without drowning the agent in detail.

The root .agent/tasks.json is a lookup table, not the detail. It is a flat list where each entry is a stub that points at a spec file. The agent scans this list every iteration to find the next thing to do, so it stays lean on purpose.

[
  {
    "id": "TASK-1",
    "title": "Verify project prerequisites and access",
    "category": "setup",
    "specFilePath": ".agent/tasks/TASK-1.json",
    "passes": false
  },
  {
    "id": "TASK-2",
    "title": "Add avatars storage bucket and config",
    "category": "infrastructure",
    "specFilePath": ".agent/tasks/TASK-2.json",
    "passes": false
  },
  {
    "id": "TASK-3",
    "title": "POST /api/avatar uploads and stores a user avatar",
    "category": "api-endpoint",
    "specFilePath": ".agent/tasks/TASK-3.json",
    "passes": false
  }
]

Each stub carries a passes flag. It starts false and only flips to true after the agent verifies the work for that packet. The loop reads this flag to decide what is left. Keeping the table this thin matters: a run with two hundred packets still scans fast, because the table holds titles and pointers, not the full contracts. That scaling pattern is its own topic in task lookup tables for agents.
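Reading the passes flag takes very little code, which is part of why the table scans fast. The sketch below parses an inline copy of a lookup table; the sample data and the helper name remaining are made up for illustration, and a real script would read .agent/tasks.json from disk.

```python
import json

# Inline stand-in for .agent/tasks.json; TASK-1 is marked done for the demo.
TASKS_JSON = """
[
  {"id": "TASK-1", "specFilePath": ".agent/tasks/TASK-1.json", "passes": true},
  {"id": "TASK-2", "specFilePath": ".agent/tasks/TASK-2.json", "passes": false},
  {"id": "TASK-3", "specFilePath": ".agent/tasks/TASK-3.json", "passes": false}
]
"""


def remaining(tasks_json: str) -> list[dict]:
    """Stubs that have not been verified yet, in table order."""
    return [stub for stub in json.loads(tasks_json) if not stub["passes"]]


todo = remaining(TASKS_JSON)
print(len(todo), "left; next spec:", todo[0]["specFilePath"])
```

Only the one spec file named by the chosen stub needs to be opened afterward, so the cost of a scan stays flat as the spec files themselves grow.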

The detail lives in the per-task spec at .agent/tasks/TASK-{ID}.json. This is the full packet: the objective as a description, the files to inspect inside the steps, the ordered steps, the testable acceptance criteria, the dependencies, an estimated complexity, and any technical notes.

{
  "id": "TASK-3",
  "title": "POST /api/avatar uploads and stores a user avatar",
  "category": "api-endpoint",
  "description": "Accept an image upload from an authenticated user, validate it, store it in the avatars bucket, and save the URL on the user row.",
  "acceptanceCriteria": [
    "POST with a valid PNG or JPEG under 2 MB returns 200 with the stored avatar URL",
    "A file over 2 MB returns 413 with the error text File too large",
    "A non-image content type returns 415",
    "An unauthenticated request returns 401",
    "The avatar URL is persisted on the user row and survives a re-fetch"
  ],
  "steps": [
    {
      "step": 1,
      "description": "Add the upload route handler",
      "details": "Read src/lib/auth.ts for the session helper and src/lib/storage.ts for the bucket client. Add POST /api/avatar, validate size and content type, upload, then update users.avatarUrl.",
      "pass": false
    },
    {
      "step": 2,
      "description": "Write tests for every acceptance criterion",
      "details": "Add Vitest cases for valid upload, oversized file, wrong content type, unauthenticated request, and persistence after re-fetch.",
      "pass": false
    }
  ],
  "dependencies": ["TASK-1", "TASK-2"],
  "estimatedComplexity": "medium",
  "technicalNotes": [
    "Reuse the storage client in src/lib/storage.ts, do not add a second SDK",
    "Strip EXIF data before storing the image"
  ]
}

Notice the four parts of a packet map onto the JSON. description is the objective. The step details name the files to inspect and change. steps is the ordered sequence. acceptanceCriteria is the checkable definition of done. TASK-1 is always reserved for prerequisite verification, so the agent confirms credentials, tools, and access exist before any feature work starts.
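To make that mapping concrete, the sketch below pulls the four parts out of a trimmed copy of the TASK-3 spec. The `packet_view` helper and the path-matching regex are illustrative assumptions, not part of Ralph; the regex is a heuristic that treats any token with a slash and a file extension as a path.

```python
import re

# Trimmed copy of the TASK-3 spec above, keeping only the fields this sketch reads.
spec = {
    "description": "Accept an image upload from an authenticated user, validate it, "
                   "store it in the avatars bucket, and save the URL on the user row.",
    "acceptanceCriteria": [
        "A file over 2 MB returns 413 with the error text File too large",
        "An unauthenticated request returns 401",
    ],
    "steps": [
        {"step": 1, "description": "Add the upload route handler",
         "details": "Read src/lib/auth.ts for the session helper and "
                    "src/lib/storage.ts for the bucket client.",
         "pass": False},
        {"step": 2, "description": "Write tests for every acceptance criterion",
         "details": "Add Vitest cases for valid upload and oversized file.",
         "pass": False},
    ],
}

# Heuristic (assumption): a token with at least one slash and an extension is a file path.
PATH_RE = re.compile(r"[\w.-]+(?:/[\w.-]+)+\.\w+")


def packet_view(spec):
    """Project a spec onto the four parts: objective, files, steps, definition of done."""
    ordered = sorted(spec["steps"], key=lambda s: s["step"])
    files = []
    for step in ordered:
        for path in PATH_RE.findall(step["details"]):
            if path not in files:  # keep first-seen order, drop repeats
                files.append(path)
    return {
        "objective": spec["description"],
        "files": files,
        "steps": [s["description"] for s in ordered],
        "done_when": spec["acceptanceCriteria"],
    }


view = packet_view(spec)
print(view["files"])
```

If a packet's view comes back with no files or no criteria, that is usually a sign the spec needs sharpening before the loop sees it.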

Here is how a PRD turns into packets and lands in the lookup table.

flowchart TD
  PRD["prd/PRD.md: goals, constraints, criteria"] --> Decompose["Decompose into atomic packets"]
  Decompose --> P1["tasks/TASK-1.json: prerequisite gate"]
  Decompose --> P2["tasks/TASK-2.json: storage config"]
  Decompose --> P3["tasks/TASK-3.json: upload endpoint"]
  Decompose --> Pn["tasks/TASK-N.json: ..."]
  P1 --> Index["tasks.json lookup table (stubs + passes)"]
  P2 --> Index
  P3 --> Index
  Pn --> Index
  Index --> Loop["ralph.sh: pick one packet per iteration"]
  Loop --> Verify["Build, test, screenshot, commit, set passes true"]

The PRD is the source. Decomposition produces one spec file per packet. The lookup table indexes them. The loop reads the table, opens one spec, and works it. You do not write all of this by hand: the prd-creator skill generates the table and the spec files from your PRD, then you edit the packets that need sharpening.

Sizing a packet so one iteration can finish and commit

The hardest part of decomposition is sizing. Too big and the agent runs out of context before it finishes, leaves the packet half built, and the next iteration starts cold on a mess. Too small and you spend more iterations on overhead than on work. The target is the largest packet that still finishes, verifies, and commits inside a single iteration.

A few concrete rules keep packets in the right range.

One objective, no “and.” If the title needs the word “and” to be honest, it is two packets. “Add the upload endpoint and the avatar UI” is two: the endpoint is an api-endpoint packet, the UI is a ui-ux packet that depends on it. Split on the “and.”

One area of the codebase. A packet that touches the data model, the API, and the frontend in one pass is too wide. Each layer is its own packet. The migration is one. The endpoint that reads the new column is another. The component that renders it is a third. Narrow packets keep the diff readable and the failure isolated.

Verifiable in the steps it contains. If proving the criteria would take more steps than building the feature, the packet is doing too much. A well-sized packet has a handful of steps, and the test step covers the criteria without ballooning into a second project.

A clean commit at the end. The end state of every packet is a commit that builds, passes its tests, and changes nothing it did not need to. If you cannot imagine that commit as a single coherent change, the packet is wrong.

The economics back this up. The loop runs one packet per iteration, and you cap iterations with ./ralph.sh -n 50 (the default is 10). A run of small packets makes steady, auditable progress: one commit per packet, one test suite per criterion, one screenshot per UI change. A run of giant packets thrashes, because the agent keeps almost finishing and never quite committing. When you want to watch a single packet land before turning the loop loose, run exactly one iteration:

npx @pageai/ralph-loop
./ralph.sh --once

The rule that one invocation handles exactly one packet, commits, and stops is the reliability backbone of the whole approach. It has enough nuance to deserve its own treatment in one task per iteration.
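The sizing rules in this section can be turned into a rough pre-flight check. The sketch below is a heuristic under stated assumptions: `MAX_STEPS` is a made-up threshold for "a handful of steps", and the word-match on "and" only flags a title for human review; it cannot tell whether the title truly hides two objectives.

```python
import re

MAX_STEPS = 6  # hypothetical threshold; tune it to what one iteration can finish


def sizing_warnings(spec):
    """Return human-readable warnings for a packet that looks too big or unverifiable."""
    warnings = []
    if re.search(r"\band\b", spec["title"], flags=re.IGNORECASE):
        warnings.append("title contains 'and': check whether this is two packets")
    if len(spec.get("steps", [])) > MAX_STEPS:
        warnings.append(f"more than {MAX_STEPS} steps: packet may not fit one iteration")
    if not spec.get("acceptanceCriteria"):
        warnings.append("no acceptance criteria: nothing to verify before the commit")
    return warnings


too_wide = {
    "title": "Add the upload endpoint and the avatar UI",
    "steps": [{"step": 1}, {"step": 2}],
    "acceptanceCriteria": ["POST with a valid PNG returns 200"],
}
focused = {
    "title": "Add the upload endpoint",
    "steps": [{"step": 1}, {"step": 2}],
    "acceptanceCriteria": ["POST with a valid PNG returns 200"],
}

print(sizing_warnings(too_wide))
print(sizing_warnings(focused))
```

A check like this is cheap to run over every generated spec before starting a long run, and the warnings point at the packets worth editing by hand.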

Dependencies and priority ordering

Atomic packets are not independent of each other. The upload endpoint cannot exist before the storage bucket. The avatar UI cannot render a URL the endpoint does not yet return. Decomposition has to capture this order, or the agent builds on a foundation that is not there and the iteration fails.

Ralph encodes order in two places. The dependencies array on each spec names the packets that must pass first. In the example above, TASK-3 depends on ["TASK-1", "TASK-2"], so the loop will not start the upload endpoint until the prerequisite gate and the storage config both report passes: true. The agent respects this when it selects the next packet, which means you can write the whole task list up front without worrying that the agent jumps ahead.

Priority decides which packet runs when several are unblocked at once. On each iteration the loop finds the highest-priority incomplete task whose dependencies are satisfied, opens its spec, and works it. The selection logic on each pass is small.

flowchart TD
  Start["Fresh context"] --> Scan["Scan tasks.json for incomplete packets"]
  Scan --> Ready{"Dependencies all passes true?"}
  Ready -->|"no"| Skip["Skip, try next packet"]
  Ready -->|"yes"| Pick["Pick highest priority among ready packets"]
  Skip --> Scan
  Pick --> Work["Open TASK-{ID}.json, work the steps"]
  Work --> Gate["Tests, lint, types, screenshot"]
  Gate -->|"fail"| Work
  Gate -->|"pass"| Commit["Set passes true, commit, stop"]
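The selection branch of that flowchart fits in a few lines. This is a sketch, not Ralph's implementation: in Ralph the agent makes the choice while following the prompt, and the source does not show how priority is stored. The numeric `priority` field (lower number wins) and the merged record shape are assumptions for illustration.

```python
def next_packet(packets):
    """Pick the highest-priority incomplete packet whose dependencies all pass.

    Returns None when nothing is ready (everything done, or everything blocked).
    """
    passed = {p["id"] for p in packets if p["passes"]}
    ready = [
        p for p in packets
        if not p["passes"] and all(dep in passed for dep in p["dependencies"])
    ]
    if not ready:
        return None
    # Assumption: a lower priority number means more urgent.
    return min(ready, key=lambda p: p["priority"])


packets = [
    {"id": "TASK-1", "passes": True, "dependencies": [], "priority": 1},
    {"id": "TASK-2", "passes": False, "dependencies": ["TASK-1"], "priority": 3},
    {"id": "TASK-3", "passes": False, "dependencies": ["TASK-1", "TASK-2"], "priority": 2},
]

print(next_packet(packets)["id"])  # TASK-2: TASK-3 ranks higher but is still blocked
```

Dependencies act as a filter and priority only breaks ties among what survives the filter, which is why a high-priority packet never jumps ahead of its foundation.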

Two practices keep ordering dependable. First, front-load the foundation. The prerequisite gate is TASK-1 for a reason: nothing should run before the agent confirms it has the access and tools to run at all. Data-model packets come early, because most feature packets depend on them. Second, keep dependency chains shallow. A packet that depends on a packet that depends on a packet is a sign the work is really one larger unit you split too aggressively, or that an intermediate packet is doing too little. Wide and shallow beats narrow and deep, because shallow graphs give the loop more ready packets to choose from at any moment, which keeps it moving even if one branch is blocked.
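One way to see the difference between the two shapes is to measure them. The sketch below compares two hypothetical five-packet graphs, one narrow and deep, one wide and shallow; the helper names are illustrative, and the depth function assumes the graph has no cycles.

```python
from functools import lru_cache


def max_chain_depth(deps):
    """Length of the longest dependency chain; deps maps packet id -> ids it depends on."""
    @lru_cache(maxsize=None)
    def depth(task_id):
        parents = deps.get(task_id, [])
        if not parents:
            return 0
        return 1 + max(depth(parent) for parent in parents)

    return max(depth(task_id) for task_id in deps)


def ready_packets(deps, passed):
    """Packets that have not passed yet but whose dependencies all have."""
    return [
        task_id for task_id, parents in deps.items()
        if task_id not in passed and all(parent in passed for parent in parents)
    ]


# Hypothetical graphs: the same five packets arranged two ways.
narrow_and_deep = {
    "TASK-1": [], "TASK-2": ["TASK-1"], "TASK-3": ["TASK-2"],
    "TASK-4": ["TASK-3"], "TASK-5": ["TASK-4"],
}
wide_and_shallow = {
    "TASK-1": [], "TASK-2": ["TASK-1"], "TASK-3": ["TASK-1"],
    "TASK-4": ["TASK-1"], "TASK-5": ["TASK-1"],
}

print(max_chain_depth(narrow_and_deep), ready_packets(narrow_and_deep, {"TASK-1"}))
print(max_chain_depth(wide_and_shallow), ready_packets(wide_and_shallow, {"TASK-1"}))
```

After the prerequisite gate passes, the deep graph offers one ready packet while the shallow one offers four, so a single blocked branch stalls the first and barely slows the second.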

When you need to reorder mid-run, you do not kill the loop. You edit .agent/STEERING.md, and the agent handles that critical work on its next iteration before returning to the task list. That is how you inject “fix the failing migration before anything else” without losing momentum.

A short worked example

Take “let users upload a profile avatar” from a PRD and watch it become packets.

That sentence is a feature, not a task. Decomposition asks: what are the smallest verifiable units, and in what order? TASK-1 is the prerequisite gate, always. TASK-2 configures the storage bucket and is a dependency for everything that stores a file. TASK-3 is the upload endpoint, depending on the bucket. TASK-4 is the avatar component in the UI, depending on the endpoint because it renders the URL the endpoint returns. TASK-5 handles deletion, depending on the upload existing.

Each packet gets criteria a machine can check, not opinions. For the endpoint: a valid image under the size limit returns 200 with a URL, an oversized file returns 413, a wrong content type returns 415, an unauthenticated request returns 401. For the UI: the component shows a placeholder when no avatar is set, shows the image after a successful upload, and a Playwright run plus a screenshot confirms it. The feature that read as one line in the PRD is now five committed, tested packets, each small enough to finish in a single iteration, ordered so the loop never builds ahead of its foundation.
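Before handing a graph like this to the loop, it can help to confirm that a valid build order exists at all. The sketch below encodes the five packets from this example and uses Python's standard `graphlib` module to order them and to reject cycles. The TASK-2 dependency on TASK-1 is an assumption, since the text only says the gate comes first; `build_order` is an illustrative helper, not part of Ralph.

```python
from graphlib import CycleError, TopologicalSorter

# Packet id -> ids it depends on, following the worked example.
worked_example = {
    "TASK-1": [],                    # prerequisite gate
    "TASK-2": ["TASK-1"],            # storage bucket (dependency on the gate is assumed)
    "TASK-3": ["TASK-1", "TASK-2"],  # upload endpoint
    "TASK-4": ["TASK-3"],            # avatar component
    "TASK-5": ["TASK-3"],            # deletion
}


def build_order(deps):
    """Return one order in which every packet comes after its dependencies."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as error:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"dependency cycle: {error.args[1]}") from error


order = build_order(worked_example)
print(order)
```

A cycle here means two packets each wait on the other, so neither would ever become ready; catching that up front is cheaper than discovering it as a stalled run.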

The framing of phases that produces these packets (specify, plan, decompose into tasks, implement and verify) comes from GitHub Spec Kit, and the loop that runs the packets autonomously was popularized by Geoffrey Huntley in his original Ralph writeup. Decomposition is the phase that decides whether the loop succeeds, which is why it is worth slowing down for.

Where to go next

If you are building this workflow, read across the spec-driven cluster:

Spec-driven development with AI for the full Specify, Plan, Tasks, Implement workflow these packets sit inside.
How to write a PRD an AI agent can actually build from for the goals, constraints, and acceptance criteria you decompose here.
One task per iteration for the rule that makes one-packet-at-a-time reliable.
Task lookup tables for agents for scaling the table to hundreds of packets.

For the mechanics of the loop that reads these packets on every pass and the fresh-context design behind it, start with what is the Ralph technique.

Frequently asked questions

How do I break a PRD into tasks an AI agent can build?

Decompose the PRD into atomic task packets, where each packet has one objective, the files to inspect, ordered steps, and acceptance criteria a machine can check by running a command and reading the output. Split any objective that needs the word “and,” keep each packet inside one area of the codebase, and make sure the agent can finish and commit it in a single iteration.

What makes a task packet atomic?

A packet is atomic when it has a single objective and everything the agent needs to hit it without holding the rest of the project in its head. That means one deliverable, the files to read and change, a short ordered sequence of steps with tests as the final step, and verifiable acceptance criteria. If you cannot state the objective in one sentence without the word “and,” the packet is too big.

How does Ralph store a decomposed PRD?

Ralph uses two layers. The root .agent/tasks.json is a lean lookup table of stubs, each with an id, title, category, a pointer to a spec file, and a passes flag. The detail lives in per-task specs at .agent/tasks/TASK-{ID}.json, which hold the description, acceptance criteria, ordered steps, dependencies, estimated complexity, and technical notes. The agent scans the table to pick a packet and opens the spec to work it.

How big should a single agent task be?

The right size is the largest packet that still finishes, verifies, and commits inside one iteration. Too big and the agent runs out of context and leaves the work half done. Too small and overhead dominates. Aim for one objective, one area of the codebase, a handful of steps including the tests, and a single clean commit at the end.

How do dependencies and priority decide task order?

Each spec has a dependencies array naming the packets that must pass first, so the loop will not start a packet until its dependencies report passes true. Among packets whose dependencies are satisfied, the loop works the highest priority one. Front-load the prerequisite gate and data-model packets, keep dependency chains shallow, and edit STEERING.md to reorder mid-run without stopping the loop.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop

Install from npm Star on GitHub Watch the walkthrough

How to Steer a Running AI Agent Without Stopping the Loop

Mon, 13 Apr 2026 00:00:00 GMT

To steer a running AI agent, edit .agent/STEERING.md while the loop is going. The agent reads that file at the top of every iteration, before it picks a task. Anything you write there gets handled first, in order, then the agent removes the finished items and goes back to the task list. You change direction without killing the process and without losing the momentum of a warm run.

This is the redirect control inside the larger system for running an AI coding agent overnight. A loop that runs for hours will sometimes pick a task you no longer want, chase a dead end, or need a hotfix you only just noticed. Steering AI agent behavior through a file is how you correct course mid-flight instead of stopping, editing, and starting over.

What steering an AI agent actually means

Steering is not a chat message and not a new prompt. It is a single Markdown file, .agent/STEERING.md, that lives next to the rest of the agent state. The Ralph prompt has one job for it. In the “Before Starting” section of .agent/PROMPT.md, the agent is told to check .agent/STEERING.md for critical work, complete the items in sequence, remove them when done, and only proceed to implement tasks if no critical work is pending.

That ordering is the whole mechanism. Because each iteration starts the agent with a fresh context window, the agent rereads the prompt and the steering file every single loop. There is no stale chat history to fight and no need to interrupt the agent mid-thought. You write to a file, and the next clean iteration discovers it.

A steering file looks like a short work order. The default one in the repo handles sandbox setup before any feature work runs:

# Critical Steering Work

## Install and verify dependencies

Run `npm install --ignore-scripts`, then fix native arm64 binaries.

## Main Tasks

Install Playwright system deps, start the dev server, take a screenshot.

---

After you finish this work, exit with message `Steering complete`.

You can replace that content with whatever you need the agent to do next. The file is plain text, so you edit it with any tool from any machine that has the working tree, including a shell inside the sandbox.

When to steer a running agent

Steer when the loop is healthy enough to keep running but pointed at the wrong thing, or when something urgent jumps the queue. There are three common triggers.

The agent is stuck. You are watching the run and the same failure keeps coming back across iterations. Maybe a test depends on a binary that did not install, or a dev server never came up. The agent is not blocked enough to emit a promise tag, but it is spinning. A steering note that tells it exactly how to fix the environment clears the jam. Spotting this early is the payoff of observability for autonomous coding agents: you see the climbing iteration time and the repeated debugging step before the loop burns ten more iterations.

The agent went the wrong direction. The task list was fine when you wrote it, but the agent picked a task whose approach you now disagree with, or requirements changed since you started the run. You do not want to throw away the work it already shipped. You want it to switch tracks on the next iteration.

You have an urgent fix. A bug landed in production, a dependency needs pinning, or a teammate needs a small change merged into the same branch. The loop is the fastest path to get it done, and you would rather inject the work than wait for the run to finish.

Steering is the wrong tool for two cases. If the agent is fundamentally confused about the goal, fix .agent/tasks.json or the PRD instead, because steering is for one-off interventions, not for rewriting the plan. And if the agent emitted BLOCKED or DECIDE and the loop already exited, you do not steer a stopped process. You answer it and rerun, which I cover below.

How the loop reads STEERING.md every iteration

The reason steering works without a restart is that the loop already reopens the file on every pass. Each iteration is a clean run of the agent against the prompt, and the prompt routes through the steering check before it ever touches the task list.

flowchart TD
    Start["Iteration starts (fresh context)"] --> Read["Read .agent/PROMPT.md"]
    Read --> Check{"STEERING.md has critical work?"}
    Check -->|"yes"| Steer["Do steering items in sequence"]
    Steer --> Remove["Remove finished items from STEERING.md"]
    Remove --> Tasks["Pick highest-priority task in tasks.json"]
    Check -->|"no"| Tasks
    Tasks --> Verify["Run tests, lint, typecheck"]
    Verify --> Commit["Commit, emit promise tag"]
    Commit --> Next["Next iteration"]
    You["You edit STEERING.md mid-run"] -.->|"injected here"| Check

The dotted edge is the part you control. Your edit lands in the file at any moment, and the next iteration picks it up at the diamond. You never have to time it perfectly. If iteration 12 is in flight when you save the file, iteration 13 reads your note. Because the agent removes completed steering items, the file empties itself out and the loop returns to normal task work without you touching it again.

This is why a fresh context per iteration is a feature, not a limitation. A single long session would have buried your note under thousands of tokens of prior reasoning. A loop that reloads its instructions every pass treats the file as the source of truth, so the latest version of .agent/STEERING.md always wins.

Steering versus killing and restarting the loop

The blunt alternative is to stop the run, edit something, and start again. Steering beats that on three counts.

You keep warm state. A sandbox that already installed dependencies, built the project, and warmed its caches is expensive to recreate. Ralph runs each agent in a deterministic Docker Sandbox named ralph-<agent>-<dir>-<hash8>, and that sandbox persists between iterations. Killing the loop and recreating the box means paying the setup cost again. Steering reuses the running sandbox.

You keep the run history intact. Every iteration writes a clean transcript to .agent/history/, appends to .agent/logs/LOG.md, and commits per task. A restart starts a new session id and fragments that trail. Steering threads your intervention into the same continuous record, so the morning review reads as one story.

You avoid the gap. When you kill a loop manually, you have to remember to restart it. People walk away and forget. A steering note keeps the loop alive and self-correcting, which is the entire point of an unattended overnight run.

There is one case where stopping is correct: when you want to change how the loop runs, not what it does next. Flags like the iteration count and the agent are set at launch, so switching from ./ralph.sh -n 50 to a different agent with ./ralph.sh --agent codex does require a restart. For everything that is “do this work next,” steer instead.

If you only need to validate a single steering note before trusting it overnight, run one pass with ./ralph.sh --once, confirm the agent did what the file said, then start the full loop.

How to write a steering note that works

A good steering note reads like a task spec: specific, ordered, and verifiable. A bad one reads like a vibe. The agent reads your note with a fresh context, so it has no memory of the hallway conversation in your head. Spell it out.

Write the exact commands and file paths. “Fix the build” is vague. The agent has to guess what broke and what done looks like. Compare:

# Critical Steering Work

## Pin the failing dependency

1. In `package.json`, set `vite` to `5.4.10` exactly.
2. Run `npm install --ignore-scripts`.
3. Run `npm run build` and confirm it exits 0.
4. Commit with message `fix: pin vite to 5.4.10`.

---

After you finish this work, exit with message `Steering complete`.

Every line is checkable. The agent knows the version, the commands, the success condition (build exits 0), and the commit message. That is the same standard the loop expects from a real task, and the same reason verification loops matter: a steering note with a verifiable end state lets the agent prove it is done instead of guessing.

Give the agent an exit condition. The default file ends with “After you finish this work, exit with message Steering complete.” That tells the agent the boundary of the steering work so it does not bleed into task work uninvited. Keep that pattern. Without a clear end, the agent may either stop too early or wander.

Keep it small. Steering is for one or a few critical items, not a second task list. If your note grows into a dozen steps across unrelated areas, that is a sign the work belongs in .agent/tasks.json as real tasks with their own specs. Steering is the fast lane, not the highway.

Order matters. The prompt says complete items in sequence. List prerequisites first. If the agent must install a binary before a test can pass, the install goes above the test.

A few quick contrasts:

Vague: “Make the homepage better.” Specific: “On the homepage hero, change the CTA label to Start free, verify with a Playwright screenshot saved to .agent/screenshots/.”
Vague: “Tests are flaky, look into it.” Specific: “The test auth.spec.ts fails on a timeout. Increase the wait in that file to 10 seconds and rerun it until green.”
Vague: “Clean up the deps.” Specific: “Remove lodash from package.json, replace its three usages in src/utils/ with native equivalents, run the unit tests.”

The pattern is always the same. Name the file, name the command, name the proof.
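If you write steering notes often, a small generator keeps them in that shape. The sketch below is a convenience under stated assumptions: `steering_note` is a hypothetical helper, and its layout simply mirrors the example note earlier in this section (heading, numbered steps, separator, exit line).

```python
def steering_note(title, steps, exit_message="Steering complete"):
    """Render an ordered, verifiable steering note in the layout shown above."""
    if not steps:
        raise ValueError("a steering note needs at least one concrete step")
    lines = ["# Critical Steering Work", "", f"## {title}", ""]
    # Numbered steps preserve the order the agent should follow.
    lines += [f"{number}. {step}" for number, step in enumerate(steps, start=1)]
    lines += [
        "",
        "---",
        "",
        f"After you finish this work, exit with message `{exit_message}`.",
        "",
    ]
    return "\n".join(lines)


note = steering_note(
    "Pin the failing dependency",
    [
        "In `package.json`, set `vite` to `5.4.10` exactly.",
        "Run `npm install --ignore-scripts`.",
        "Run `npm run build` and confirm it exits 0.",
        "Commit with message `fix: pin vite to 5.4.10`.",
    ],
)
print(note)
```

Writing the result to .agent/STEERING.md is left to you; the helper only guarantees that every note has ordered steps and an explicit exit condition.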

Combining steering with BLOCKED and DECIDE promises

Steering and promise tags solve different halves of the human-in-the-loop problem, and they work together.

Promise tags are how the agent talks to you. When the agent cannot continue, it emits <promise>BLOCKED:reason</promise> and the loop exits with code 2. When it needs a decision it cannot make alone, it emits <promise>DECIDE:question</promise> and exits with code 3. A clean finish is <promise>COMPLETE</promise> with exit code 0, and hitting the iteration cap is exit code 1. Ralph plays a sound and sends a desktop notification on BLOCKED and DECIDE, so you find out the moment a human is actually needed.

Steering is how you talk back. The promise tells you the loop stopped and why. Your answer goes into .agent/STEERING.md, and then you rerun the loop. The next iteration reads your steering note first, handles the unblock or the chosen direction, removes it, and continues with the task list. The flow is tight:

# loop exits 2 with BLOCKED:cannot reach npm registry
# you allow the domain from the host, then leave a steering note
# in .agent/STEERING.md describing the retry, then:
./ralph.sh -n 50

For a DECIDE, the agent gave you a question like “Option A vs B.” You pick one in the steering file in plain terms: “Use option A. Implement the REST client, not the GraphQL one. Then continue tasks.” The agent does not relitigate the choice because the steering note is now the instruction, and the fresh context means it reads that note as current truth.

This pairing is the reason an overnight run stays autonomous without being unaccountable. The agent stops itself on the two events that genuinely need a person, notifies you, and waits. You drop a precise note into one file and restart. The promise mechanics behind those exit codes are worth knowing in full if you want the loop to stop on a signal rather than a guess, and the deny-by-default network policy that triggers many BLOCKED exits is enforced by the sandbox, not the agent.
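If you wrap the loop in your own tooling, the promise tags and exit codes described above are easy to interpret. The sketch below assumes the tag format quoted in this section and treats the last tag in the output as the one that counts; how ralph.sh itself parses the stream is not shown here, so take the regex and helper names as illustrative.

```python
import re

# Matches <promise>COMPLETE</promise>, <promise>BLOCKED:reason</promise>,
# and <promise>DECIDE:question</promise>.
PROMISE_RE = re.compile(
    r"<promise>(COMPLETE|BLOCKED|DECIDE)(?::(.*?))?</promise>", re.DOTALL
)

# Exit codes as described in the text above.
NEXT_ACTION = {
    0: "done: review the commits",
    1: "iteration cap reached: rerun with a higher -n if work remains",
    2: "blocked: write the unblock steps into .agent/STEERING.md, then rerun",
    3: "decision needed: write your choice into .agent/STEERING.md, then rerun",
}


def last_promise(output):
    """Return (kind, detail) for the final promise tag in the output, or None."""
    matches = PROMISE_RE.findall(output)
    if not matches:
        return None
    kind, detail = matches[-1]
    return kind, (detail.strip() or None)


def next_action(exit_code):
    """Translate a loop exit code into the human follow-up."""
    return NEXT_ACTION.get(exit_code, f"unexpected exit code {exit_code}: inspect the logs")


sample = "installing...\n<promise>BLOCKED:cannot reach npm registry</promise>\n"
print(last_promise(sample))
print(next_action(2))
```

The detail after the colon is the part worth copying into your steering note, since it is the agent's own statement of what stopped it.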

When a steering note itself is not enough, get inside the box. List the running sandboxes and open a shell to debug by hand, then write what you learned back into the steering file:

sbx ls
sbx exec -it <ralph-sandbox-name> bash

Print the exact name for your project with ./ralph.sh --print-name. Between promise tags that pull you in, a steering file that pushes instructions back, and a sandbox you can open and inspect, you can run a long autonomous loop and still keep both hands on the wheel whenever you want them there.

Frequently asked questions

How do I steer a running AI agent in a Ralph loop?

Edit the .agent/STEERING.md file while the loop is running. The agent reads that file at the top of every iteration, before it picks a task from tasks.json. It completes the steering items in sequence, removes them when done, and then resumes normal task work. Because each iteration starts with a fresh context, you do not have to time the edit precisely. The next iteration picks up whatever the file says.

Do I have to stop the loop to change what the agent is doing?

No. That is the point of steering. Editing STEERING.md injects work into the next iteration without killing the process, so you keep the warm sandbox, the running session, and the per-task git history intact. You only need to restart the loop when you want to change how it runs, such as switching the agent or the iteration count, since those flags are set at launch.

What makes a good steering note?

Specific, ordered, verifiable instructions. Name the exact files and commands, list prerequisites first, and give a clear success condition such as a build exiting 0 or a test going green. End the note with an explicit exit message so the agent knows where steering work stops. Keep it to one or a few critical items. If it grows into a full plan, that work belongs in tasks.json as real tasks instead.

How does steering work together with BLOCKED and DECIDE promises?

Promise tags are how the agent signals it needs you. BLOCKED exits with code 2 and DECIDE exits with code 3, both with a desktop notification. Steering is how you respond. You write the unblock steps or the chosen option into STEERING.md and rerun the loop, and the next fresh-context iteration reads your note first, handles it, removes it, and continues the task list.

Where does the agent look for steering instructions?

In .agent/STEERING.md. The .agent/PROMPT.md file tells the agent in its Before Starting section to check that file for critical work, complete items in sequence, remove them when done, and only proceed to implement tasks if no critical work is pending. So the latest version of STEERING.md is always the source of truth for what the agent does next.

Claude Code vs Codex vs Cursor vs Gemini: Best CLI for Long-Running Agent Loops

Wed, 08 Apr 2026 00:00:00 GMT

There is no single best agentic CLI for long-running loops. The honest answer is that you pick per task, and the agent matters less than the loop you wrap around it. Claude Code, OpenAI’s Codex CLI, the Cursor CLI agent, Gemini CLI, GitHub Copilot CLI, and opencode all run inside Ralph behind the same --agent flag, so swapping one for another is a one-word change, not a rewrite. This post compares the six on the dimensions that actually decide a multi-hour run: autonomy quality, cost, model options, tool use, and ecosystem.

I am not going to quote benchmark numbers. Vendor leaderboards move every few weeks and rarely reflect how an agent behaves over dozens of iterations against your codebase. What follows is concrete and opinionated, grounded in how each CLI is invoked and how each one tends to behave when you leave it running.

Why “best” is the wrong question for a multi-hour loop

A one-shot demo and a multi-hour loop are different sports. In a demo, raw reasoning on the first try wins. In a loop, the agent runs the same task family dozens or hundreds of times, sees its own test failures, and fixes them on the next pass. What you actually want is an agent that responds well to feedback, follows a structured prompt, and exits cleanly so the harness can decide whether to loop again.

That reframes the comparison. The strongest model on a leaderboard can still lose a long run if it ignores the prompt structure, never stops on its own, or burns your budget on token-heavy reasoning for mechanical work. A merely good model with tight verification gates will grind toward correct. This is the core point from the field guide to agentic coding CLIs: the harness around the agent carries more of the result than the agent does.

So the real question is not “which CLI is smartest.” It is “which CLI holds up when I leave it alone.” Hold that distinction while you read the comparison.

How Ralph makes the six CLIs interchangeable

Ralph is a Bash script you point at a project. It does not reimplement any agent. It wraps whichever one you pick, runs it inside an isolated Docker Sandbox, watches the output, and loops with a fresh context each iteration. Three design choices are what make the agent swappable.

First, one flag selects the agent. claude is the default, and you switch with --agent (short -a):

# Claude Code (default), 50 iterations
./ralph.sh -n 50

# Swap the agent, nothing else changes
./ralph.sh --agent codex
./ralph.sh -a cursor -n 5

Second, a bare -- separator forwards anything after it straight to the underlying CLI. Ralph parses its own flags first, then hands the rest to the agent untouched. That is how you pin a model without Ralph needing to know each vendor’s model names:

./ralph.sh --agent codex -- --model gpt-5.5
./ralph.sh -a gemini -- --model pro

Third, each agent gets its own deterministic sandbox, named ralph-<agent>-<current-dir>-<hash8>. Your Claude sandbox and your Codex sandbox never share credentials, history, or installed tools, so you can compare agents on the same project without them stepping on each other. Print the name without starting a run:

./ralph.sh --print-name --agent codex
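Because the name follows a fixed pattern, you can pick a sandbox's agent and project out of a listing. The sketch below parses the ralph-<agent>-<current-dir>-<hash8> shape; it assumes the hash segment is exactly eight characters with no hyphen, and the sample name is invented. For the authoritative name, use --print-name rather than reconstructing it.

```python
KNOWN_AGENTS = {"claude", "codex", "copilot", "cursor", "gemini", "opencode"}


def parse_sandbox_name(name):
    """Split ralph-<agent>-<dir>-<hash8> into its parts, or return None if it does not fit."""
    prefix, _, rest = name.partition("-")
    if prefix != "ralph":
        return None
    agent, _, rest = rest.partition("-")
    # The directory may itself contain hyphens, so take the hash from the right.
    directory, _, hash8 = rest.rpartition("-")
    if agent not in KNOWN_AGENTS or not directory or len(hash8) != 8:
        return None
    return {"agent": agent, "dir": directory, "hash8": hash8}


# Invented example name; a real one comes from ./ralph.sh --print-name.
print(parse_sandbox_name("ralph-codex-my-app-1a2b3c4d"))
```

Splitting the agent from the left and the hash from the right is what lets a hyphenated directory name survive intact in the middle.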

Under the hood, Ralph builds a different invocation per agent but keeps the loop identical. The expansions are:

# claude:    sbx run ... claude .   -- --output-format stream-json --verbose -p "$PROMPT_CONTENT"
# codex:     sbx run ... codex .    -- exec "$PROMPT_CONTENT"
# copilot:   sbx run ... copilot .  -- -p "$PROMPT_CONTENT"
# cursor:    sbx run ... cursor .   -- -p "$PROMPT_CONTENT"
# gemini:    sbx run ... gemini .   -- -p "$PROMPT_CONTENT"
# opencode:  sbx run ... opencode . -- run "$PROMPT_CONTENT"

Every one of those runs the same loop: pick the top task from .agent/tasks.json, work it, run the verification stack, commit, and either continue or stop on a promise tag. Because the loop is shared, you can treat the choice of CLI as a variable. Pick one, run it, and if it stalls on your codebase, change the flag and try another. The mechanics never move.
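The two mechanisms above, splitting on a bare -- and choosing per-agent arguments, can be sketched in a few lines. Ralph itself is a Bash script, so this Python version is only a model of the behavior described: the per-agent prefixes are copied from the expansions listed above, the function names are hypothetical, and the elided part of the sbx command is left out on purpose. Where Ralph places the forwarded flags relative to these arguments is not shown in the expansions, so the sketch keeps the two results separate.

```python
# Arguments each agent receives before the prompt, per the expansions above.
AGENT_PREFIX = {
    "claude": ["--output-format", "stream-json", "--verbose", "-p"],
    "codex": ["exec"],
    "copilot": ["-p"],
    "cursor": ["-p"],
    "gemini": ["-p"],
    "opencode": ["run"],
}


def split_passthrough(argv):
    """Split argv at the first bare '--': (flags for Ralph, flags for the agent CLI)."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return list(argv), []


def agent_args(agent, prompt):
    """Arguments that follow '--' in the sbx invocation for the chosen agent."""
    if agent not in AGENT_PREFIX:
        raise ValueError(f"unknown agent: {agent}")
    return [*AGENT_PREFIX[agent], prompt]


ralph_flags, forwarded = split_passthrough(["--agent", "codex", "--", "--model", "gpt-5.5"])
print(ralph_flags, forwarded)
print(agent_args("codex", "contents of .agent/PROMPT.md"))
```

Keeping the per-agent differences in one table is what makes the loop itself agent-agnostic: adding an agent means adding a row, not another branch in the loop.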

The dimensions that decide a long run

Five things separate these agents once you are running them unattended. I will go through each, then break the agents down one by one.

Autonomy quality is how well the agent works without a human in the chair. Does it follow the prompt structure, stay on one task per invocation, run its own tests, and emit a clean completion signal? A loop has nobody to answer “should I proceed?”, so any agent that pauses for approval will stall unless you put it in a non-interactive mode.

Cost over a loop is dominated by model choice and iteration count, not the per-call price you see in a demo. Fifty iterations on a flagship model add up. The lever is the model flag and the iteration cap, which I cover in depth alongside the overnight-run architecture in how to run an AI coding agent overnight.

Model options decide whether you can match the model to the task. Heavy reasoning (architecture, gnarly refactors) rewards the strongest model. Mechanical work (CRUD wiring, lint fixes, filling in tests) runs fine, and more cheaply, on a mid-tier model. The more model choices a CLI exposes through -- --model, the more you can tune.

Tool use is how the agent edits files, runs shell commands, and reads results. All six edit files and run commands. The difference shows up in how cleanly they run headless and how readable their output is while looping.

Ecosystem is auth, billing, and which world you already live in. A team standardized on GitHub auth has a different default than one paying for an Anthropic or OpenAI plan.

Claude Code

Claude Code is Ralph’s default for a reason. It is steady on long, multi-step tasks and follows a structured prompt closely, which is exactly what a fresh-context loop needs. Ralph runs it with --output-format stream-json --verbose, and that structured stream gives the loop the richest live, readable step view of any of the six. When you want to watch what the agent is doing each iteration, Claude Code shows the most.

It runs in bypass-permissions mode inside the sandbox (--dangerously-skip-permissions, or --permission-mode bypassPermissions), so it never pauses for approval during an unattended run. Model selection goes through -- --model. For autonomy quality and clarity of feedback, this is the smoothest first run. The end-to-end setup is in running Claude Code in a loop. See the Claude Code docs for the flag surface.

Codex CLI

Codex is OpenAI’s agent, run through its non-interactive exec mode. It is a strong reasoner on hard, self-contained problems, which makes it a natural pick when the task is logic-heavy rather than sprawling. Pin a model with -- --model gpt-5.5.

The thing to know about Codex in a loop: codex exec runs read-only by default, so a loop using the default mode will spin without ever editing a file. You grant write access deliberately with -- --sandbox workspace-write --ask-for-approval never, or bypass Codex’s own gates entirely with -- --dangerously-bypass-approvals-and-sandbox. The bypass flag is safe here because the microVM is the real boundary, not Codex policing itself. Codex also has a clean --json event stream for CI parsing. The full wiring, including the read-only gotcha and CI flags, is in running the Codex CLI in an autonomous loop.

Cursor CLI agent

The Cursor CLI agent brings the headless cursor-agent to the loop, invoked with -p. If your team already lives in Cursor, running the same agent unattended over a sandbox lets you review a finished diff in the morning instead of pair-programming all afternoon. The autonomy quality is solid, and the appeal is continuity: you keep the agent and the mental model you already trust, and you add looping on top. Model and other flags pass through after -- like every other agent.

Gemini CLI

Gemini CLI is Google’s agent, driven with -p and model selection through -- --model pro. It is worth reaching for when you want a different model family in the mix or you already live in the Google ecosystem. The argument for keeping a non-Anthropic, non-OpenAI option in your rotation is practical: when one agent stalls on a specific task, a different model family sometimes walks straight through it. Ralph makes that switch a one-word change.

GitHub Copilot CLI

Copilot CLI slots in for teams standardized on GitHub’s tooling and auth. Ralph runs it with the same -p prompt pattern and the same sandbox boundary as the rest. The pull here is ecosystem, not raw capability: if your auth, your billing, and your repos already run through GitHub, Copilot is the path of least friction. The loop treats it identically to the others.

opencode

opencode is the open source option, invoked with its run subcommand. It is the pick when you want to avoid vendor lock-in, run an agent you can fully inspect and modify, or route to a provider and model of your own choosing. For cost-sensitive runs where you want maximum control over the model layer, an open agent you can point at any backend is a real advantage. You trade some of the polished, batteries-included feel of the vendor CLIs for control and inspectability.

Pick this when

Here is the practical version, stripped of hedging. Match the agent to the situation rather than hunting for one winner.

Pick Claude Code when you want the smoothest first loop and the clearest live view of what the agent is doing. It is the right default, and the right place to start if you are new to running loops.
Pick Codex when the task is logic-heavy and self-contained, and you want explicit model pinning plus a clean JSON event stream for CI. Remember to grant write access, or it will not edit anything.
Pick Cursor when your team already uses Cursor and you want the same agent to run unattended so you review a diff instead of babysitting.
Pick Gemini when you want a different model family in your rotation, or you are already in the Google ecosystem.
Pick Copilot when your auth and billing already run through GitHub and you want the least new setup.
Pick opencode when you want an open, inspectable agent, no vendor lock-in, or full control over the model and provider behind it.

A decision guide for the common case:

flowchart TD
  Start(["Choosing an agent for a long loop"]) --> Q1{"New to running loops?"}
  Q1 -->|"yes"| Claude["Start with claude (default, richest live view)"]
  Q1 -->|"no"| Q2{"What matters most?"}
  Q2 -->|"hard reasoning task"| Codex["codex (grant write access)"]
  Q2 -->|"already in Cursor"| Cursor["cursor"]
  Q2 -->|"different model family"| Gemini["gemini -- --model pro"]
  Q2 -->|"GitHub-standardized team"| Copilot["copilot"]
  Q2 -->|"open, no lock-in"| Opencode["opencode"]
  Claude --> Swap{"Stalls on your codebase?"}
  Codex --> Swap
  Cursor --> Swap
  Gemini --> Swap
  Copilot --> Swap
  Opencode --> Swap
  Swap -->|"yes"| Change["Change one --agent flag, retry"]
  Swap -->|"no"| Ship["Let the loop run, review in the morning"]

The “yes” edge out of that final question is the important one. Because the harness is shared, “this agent stalled” is not a dead end. It is a flag change. That is the whole reason Ralph treats the agent as swappable rather than picking one for you.
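If you wanted to automate that retry, a wrapper could walk a list of agents. This is a minimal sketch, not something ralph.sh ships: run_with_fallback is a hypothetical helper, and it assumes a run that hits the iteration cap exits with 1 while any other non-zero code means the agent asked for a human.

```shell
#!/bin/sh
# Hypothetical helper, not part of ralph.sh: try agents in order and move on
# only when a run ends at the iteration cap (exit code 1).
# Usage: run_with_fallback RUNNER AGENT [AGENT...]
run_with_fallback() {
  runner="$1"; shift
  for agent in "$@"; do
    # Capture the exit code without tripping "set -e" in a calling script.
    "$runner" --agent "$agent" && code=0 || code=$?
    case "$code" in
      0) echo "finished with $agent"; return 0 ;;
      1) echo "$agent hit the iteration cap, trying the next agent" >&2 ;;
      *) return "$code" ;;   # BLOCKED or DECIDE needs a human, not a retry
    esac
  done
  return 1   # every agent in the list ran out of iterations
}

# Real usage would look like:
# run_with_fallback ./ralph.sh claude codex gemini
```

The design choice worth copying is the last case branch: a blocked or undecided run stops the chain instead of burning another agent's iterations on the same wall.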

What actually moves the result

If you take one thing from this comparison, take this: a weaker agent inside a good loop will out-ship a stronger agent you babysit in a single session. The loop gives the agent fresh context every iteration, keeps state on disk, isolates it in a sandbox, and runs real verification gates. Those four things matter more than the gap between any two of these CLIs.

Fresh context per iteration is what beats context rot. Each pass boots the agent clean, so it does not drag an hours-long transcript from one task to the next. The filesystem and git history are the memory layer: progress lives in .agent/tasks.json, the per-task spec files, .agent/logs/LOG.md, and the git log, not in a chat window. This is the mechanic Geoffrey Huntley described in the original Ralph writeup, and it applies identically to all six agents.

Verification is what lets a cheaper model succeed. The loop assumes a stack of Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for lint, and Prettier for format. The repo mantra is blunt: if you didn’t test it, it doesn’t work. The agent does not need to be right on the first try. It needs to write code, run the gates, read the failures, and fix them next pass. When you compare agents for long runs, you are really comparing how each one responds to that feedback, not how clever its first draft looks.

The sandbox is what makes any of this safe to walk away from. Ralph runs every agent inside a Docker Sandbox microVM with deny-by-default networking, so bypass-permissions mode is reasonable: the worst the agent can do is wreck a disposable VM, not read your SSH keys. You open exactly what a task needs with sbx policy allow network <name> <domain>. The full reasoning is in the Docker Sandboxes docs.

And the loop stops on a signal, not a vibe. Each agent emits a promise tag that Ralph reads:

<promise>COMPLETE</promise> means every task is finished.
<promise>BLOCKED:reason</promise> means the agent needs human help.
<promise>DECIDE:question</promise> means it needs a decision you have to make.

Those map to exit codes: 0 for COMPLETE, 2 for BLOCKED, and 3 for DECIDE. A run that reaches the iteration cap without emitting any of them exits with 1 (MAX_ITERATIONS). You branch on them in CI or a wrapper script, which means the comparison between agents is also fair: every one of them ends with a verdict you can act on.
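Branching on those codes in a wrapper can be as small as one case statement. The sketch below is illustrative: only the four codes and their meanings come from the text above, while the function name describe_ralph_exit and the message wording are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: translate a ralph.sh exit code into a next action.
# The codes 0-3 come from the article; the messages are illustrative.
describe_ralph_exit() {
  case "$1" in
    0) echo "COMPLETE: all tasks finished, review the commits" ;;
    1) echo "MAX_ITERATIONS: cap reached, read the log and decide whether to rerun" ;;
    2) echo "BLOCKED: the agent needs human help" ;;
    3) echo "DECIDE: the agent needs a decision from you" ;;
    *) echo "UNKNOWN: unexpected exit code $1" ;;
  esac
}

# In a wrapper script (commented out so the sketch runs without ralph.sh):
# ./ralph.sh --agent codex -n 50 -- --model gpt-5.5
# describe_ralph_exit "$?"
```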

Putting it together

There is no best agentic CLI, so stop looking for one. Start with Claude Code because it is the steady default with the clearest output. Reach for Codex on hard reasoning tasks, Cursor when your team already uses it, Gemini for a different model family, Copilot when you are GitHub-standardized, and opencode when you want an open agent with no lock-in. Then let the harness do the heavy lifting.

The shortest path to running any of them is three commands:

# 1. install
npx @pageai/ralph-loop

# 2. authenticate the agent inside its sandbox
./ralph.sh --login --agent codex

# 3. run the loop on your chosen agent and model
./ralph.sh --agent codex -n 50 -- --model gpt-5.5

Swap codex for any of the six and the loop is identical. Pick the agent, pin the model, and let it work one task per iteration inside the sandbox while you sleep. If it stalls, you already know the fix: change one flag.

Frequently asked questions

Which is the best agentic CLI for long-running loops?

There is no single winner. Claude Code is the steady default with the richest live output, Codex is strong on hard reasoning tasks, Cursor suits teams already using Cursor, Gemini brings a different model family, Copilot fits GitHub-standardized teams, and opencode is the open source option. The loop you wrap around the agent matters more than the agent, so pick one and swap with the --agent flag if it stalls.

Claude Code vs Codex: which should I use?

Use Claude Code as the default when you want the smoothest unattended run and the clearest live view of each iteration, since Ralph parses its stream-json output. Use Codex for logic-heavy, self-contained problems and CI pipelines that parse its JSON event stream. One gotcha: codex exec is read-only by default, so grant write access with -- --sandbox workspace-write --ask-for-approval never or it will not edit files.

How do I switch between agents in Ralph?

Pass the --agent flag, or -a for short. ./ralph.sh runs Claude Code by default, ./ralph.sh --agent codex runs Codex, and ./ralph.sh -a cursor -n 5 runs Cursor for five iterations. Each agent gets its own sandbox named ralph-<agent>-<dir>-<hash8>, so they do not share credentials or history. The loop itself is identical across all six.

Does the choice of agent matter more than the loop?

No. A weaker agent inside a good loop, with fresh context each iteration, state on disk, sandbox isolation, and real verification gates, will out-ship a stronger agent you babysit in a single session. Those four properties decide the result more than the gap between any two CLIs, which is why Ralph treats the agent as a swappable variable.

How do I pick a model for each agent?

Model selection is an agent flag, not a Ralph setting, so you pass it after the -- separator. For example, ./ralph.sh --agent codex -- --model gpt-5.5 or ./ralph.sh -a gemini -- --model pro. Match the model to the task: a strong model for heavy reasoning, a cheaper mid-tier model for mechanical work, since model choice and iteration count are the biggest cost levers in a long run.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop


Ralph Loop Failure Modes (Context Rot, Runaway Cost) and How to Avoid Them

Sat, 04 Apr 2026 00:00:00 GMT

A Ralph loop does not fail in mysterious ways. It fails in a short list of predictable ones: the agent forgets the plot, the loop burns money forever, it thrashes on a task that is too big or too vague, it ships work that looks done but is wrong, or it damages something on your machine. Each of these has a known cause and a specific guardrail. None of them require luck to avoid.

This post walks through every failure mode of running an AI coding agent in a loop, why each one happens, and the exact mechanism in ralph.sh that prevents it. If you are setting up a loop for the first time or wondering why a run went sideways, this is the checklist.

What are the failure modes of a Ralph loop?

Five, mostly. They are context rot, runaway cost, thrashing on a bad task, silent wrong work, and sandbox damage. Every one maps to a guardrail you can turn on or design around. Here is the short version before the details:

Context rot. Fixed by fresh context per iteration and state on disk.
Runaway cost and infinite loops. Fixed by an iteration cap, a completion promise, and a budget you set.
Thrashing on a bad or oversized task. Fixed by atomic tasks, one task per iteration, and the BLOCKED and DECIDE promise tags.
Silent wrong work. Fixed by verification gates and screenshots.
Sandbox escape or damage. Fixed by running the agent inside a Docker Sandbox.

The rest of this post takes each one in turn.

Context rot: the agent loses the plot

Context rot is the slow decay of an agent’s reasoning as its context window fills with stale and conflicting information. Old tool output, abandoned plans, your corrections, the agent’s own apologies, half-finished edits: all of it competes with the task that actually matters right now. The longer a single session runs, the worse the signal-to-noise ratio gets. You see it as repeated mistakes, forgotten constraints, and confident edits that quietly undo earlier good work.

This is the failure the Ralph technique was built to kill, so the guardrail is the core of the design rather than a bolt-on.

Fix: fresh context every iteration

Each iteration spawns the agent with a clean context window. It reads only what matters right now: the prompt, a short project summary, the task list, and the spec for the one task it is about to work. There is no transcript of the previous twelve iterations clogging the window. The model spends its attention on the current task instead of re-reading its own history.

Think of it as a series of sprints rather than one marathon. A marathon session accumulates fatigue. A fresh sprint each time stays sharp.

Fix: state lives on disk, not in chat

If the agent forgets everything between iterations, how does it make progress? Because progress does not live in the conversation. It lives in files. Ralph keeps state under .agent/:

.agent/
├── PROMPT.md      # Prompt sent to the agent each iteration
├── tasks.json     # Task lookup table
├── tasks/         # Per-task specs (TASK-{ID}.json)
├── logs/LOG.md    # Progress log
└── history/       # Per-iteration output logs

When a fresh agent boots, it reads .agent/tasks.json to see what is done, reads .agent/logs/LOG.md to see what happened recently, and reads the git log to see the actual committed changes. The code on disk is the record. The window never grows large enough to rot because it resets every pass. The deeper version of this design, including how the prompt file makes a fresh agent reorient in seconds, is in how to write the PROMPT.md file that drives a Ralph loop.

Runaway cost and infinite loops: the money fire

The naive Ralph loop is one line:

while :; do cat PROMPT.md | your-agent-cli; done

That loop never stops on its own. It has no iteration cap and no notion of success. Point it at a paid model and walk away, and it will happily keep calling the API until your bill says otherwise. Even when the work is genuinely done, a loop without a stop condition will keep re-running the agent to rediscover that there is nothing left to do, paying full freight each time.

Runaway cost has two shapes. One is the loop that never terminates. The other is the loop that terminates eventually but does far more iterations than the work needed. Both are guardrail problems, not model problems.

Fix: an iteration cap with -n

ralph.sh takes a hard cap on iterations. The default is ten. You set a higher one explicitly:

./ralph.sh -n 50
./ralph.sh --max-iterations 5

To smoke test the setup without committing to a long run, do exactly one pass:

./ralph.sh --once

When the loop hits the cap with work still pending, it exits with code 1 (MAX_ITERATIONS). That is not a crash. It is the safety net doing its job. You read the log, decide whether to top up the budget, and run it again.

Fix: a completion promise, not a vibe

The loop stops on an explicit signal, never on a guess. The agent emits a promise tag to declare where things stand:

<promise>COMPLETE</promise>: every task is finished.
<promise>BLOCKED:reason</promise>: the agent needs a human to clear something.
<promise>DECIDE:question</promise>: the agent hit a real decision point.

ralph.sh watches for these tags and translates them into exit codes:

Code	Meaning
0	COMPLETE. All tasks finished.
1	MAX_ITERATIONS. Reached the cap.
2	BLOCKED. Needs human help.
3	DECIDE. Needs a human decision.

A COMPLETE ends the run cleanly the moment the work is actually done, so you do not pay for victory laps. The full design of the stop condition, and why a promise beats letting the agent decide it feels finished, is in completion promises and exit codes.
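To see how a cap and a promise fit together, here is the naive one-liner rewritten with both. It is a sketch of the idea, not the code inside ralph.sh: run_capped_loop and its arguments are hypothetical, and it assumes the agent prints its promise tag to standard output or standard error.

```shell
#!/bin/sh
# Sketch of a capped loop with a promise-based stop condition.
# Not ralph.sh itself: run_capped_loop and its arguments are hypothetical.
# Usage: run_capped_loop MAX_ITERATIONS AGENT_COMMAND [ARGS...]
run_capped_loop() {
  max="$1"; shift
  i=1
  while [ "$i" -le "$max" ]; do
    # Run one iteration and capture everything the agent printed.
    output=$("$@" 2>&1) || true
    case "$output" in
      *"<promise>COMPLETE</promise>"*) return 0 ;;  # work is done
      *"<promise>BLOCKED:"*)           return 2 ;;  # needs human help
      *"<promise>DECIDE:"*)            return 3 ;;  # needs a decision
    esac
    i=$((i + 1))
  done
  return 1   # cap reached with no promise emitted
}
```

The loop can only end in one of four ways, and each has its own exit code, which is what makes the result safe to branch on.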

Fix: a budget you choose up front

Cost control is mostly about deciding limits before you start: how many iterations, which model, and what the loop is allowed to touch. A cheaper model on a well-specified task can outperform an expensive model on a vague one, because the expensive model spends its budget flailing. The full treatment of caps, model choice, and the verification gates that keep a loop from spinning is in cost control for autonomous AI coding agents.

Thrashing on a bad or oversized task

Give an agent a task like “refactor the billing system” and it will thrash. The task is too big to hold in one iteration and too vague to verify, so the agent makes a sprawling change, second-guesses it, partially reverts, and produces a commit you cannot review. Repeat that across iterations and the loop spins without converging. This is the most common reason a loop “does not work,” and it is almost always a task design problem rather than an agent problem.

Fix: atomic tasks

The unit of work matters. A task should be small enough to finish in a single iteration and specific enough to verify. “Add a unit test for the discount calculation in cart.ts” is atomic. “Improve test coverage” is not. The loop is only as good as the tasks you feed it. Garbage tasks in, garbage commits out. How to decompose a large project into atomic, independently verifiable packets is the heart of spec-driven development, and it is upstream of everything the loop does.

Fix: one task per iteration

The rule that keeps a loop reliable is blunt: the agent completes exactly one task, commits, and stops. It never batches several tasks into one iteration. Batching is how an agent drifts off scope and produces a tangled diff. One task per invocation keeps each commit reviewable and each iteration’s blast radius small.

This pairs with the task lookup table. The agent reads .agent/tasks.json, picks the highest-priority incomplete task, opens its spec at .agent/tasks/TASK-{ID}.json, works only those steps, and exits. The table is a lightweight index, so a project can hold hundreds of tasks while each iteration loads only the one it needs.

Fix: BLOCKED and DECIDE instead of guessing

When a task is genuinely underspecified or the agent hits a wall it cannot clear, the worst outcome is for it to guess and burn iterations on a wrong assumption. The promise tags give it an honest exit. BLOCKED:reason stops the loop and hands you the blocker. DECIDE:question stops and asks for your call. The loop returns exit code 2 or 3, and you are not staring at a dead terminal wondering what happened. A loop that stops to ask is far cheaper than a loop that thrashes in silence.

Silent wrong work: it looks done but it is broken

The scariest failure is the one that looks like success. The agent flips a task to done, commits, and moves on, but the code does not actually work. Maybe it compiles and the tests it wrote test the wrong thing. Maybe the UI renders but the button does nothing. Across an overnight run, silent wrong work compounds: later tasks build on the broken one, and you wake up to a green task list and a red product.

The cause is letting the agent self-assess on a vibe. The fix is to take that judgment away from the agent and give it to a tool that cannot be fooled by optimism.

Fix: verification gates

Before the agent calls a task done, it runs the project’s checks. Ralph assumes a verification stack: Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for linting, and Prettier for formatting. The repo mantra is exactly this blunt: if you didn’t test it, it doesn’t work.

The gate is binary. The checks pass or the iteration is not done. If a check fails, the agent loops back inside the same iteration and fixes the code until it passes. The agent does not get to claim success while the test suite is red. This is what makes a commit in the git log trustworthy rather than aspirational.
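A binary gate is easy to picture as a script that stops at the first red check. The sketch below assumes nothing about how Ralph wires its own checks: run_gates is a hypothetical helper, and the commented commands are the usual CLI entry points for the tools named above, which only work if your project has them installed.

```shell
#!/bin/sh
# Sketch of a binary gate: run each check in order, stop at the first failure.
# run_gates is a hypothetical helper; each argument is one shell command line.
run_gates() {
  for check in "$@"; do
    echo "gate: $check"
    if ! sh -c "$check"; then
      echo "FAILED: $check" >&2
      return 1
    fi
  done
  echo "all gates passed"
}

# Example wiring for the stack named above (commented out so the sketch
# runs on a machine without these tools):
# run_gates "npx tsc --noEmit" "npx eslint ." "npx prettier --check ." \
#           "npx vitest run" "npx playwright test"
```

Cheap checks go first so a type error fails in seconds instead of after a full end-to-end run.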

Fix: screenshots for the things tests miss

Some failures do not show up in a unit test. A layout that is technically rendered but visually broken passes a DOM assertion and fails a human glance. For UI work, the loop takes a screenshot as part of completing a task, so the artifact you review in the morning includes proof of what the agent saw. A screenshot is a cheap, honest signal that catches the class of bugs that slip past assertions.

Sandbox escape and damage to your machine

Autonomous agents run in bypass-permissions mode, often called YOLO mode, because stopping to approve every file write would defeat the point of an overnight loop. Claude Code calls this --dangerously-skip-permissions. The flag name is honest. An agent with your permissions and no approval prompts can read your SSH keys, touch files outside the repo, run arbitrary install scripts, and make network calls you never intended. On your laptop, that is a real risk. The mistake is treating the agent’s good behavior as the safety boundary.

Fix: the sandbox is the boundary

Run the agent inside an isolated Docker Sandbox microVM. The boundary is the sandbox, not the agent’s restraint. ralph.sh does this by default through the sbx CLI, giving each run a deterministic sandbox name of the form ralph-<agent>-<current-dir>-<hash8>. Inside that microVM, the agent can run in YOLO mode because the worst it can damage is the sandbox.

You can inspect and operate the sandbox like any container:

sbx ls
sbx exec -it <name> bash

Network is deny-by-default. The agent gets only the domains you allow:

sbx policy allow network <name> registry.npmjs.org

That last point matters for both safety and cost. A deny-by-default network blocks exfiltration and stops an agent from wandering off to install or call something you did not sanction. The isolation model is documented in the Docker Sandboxes docs, and the bypass flag the loop relies on is in the Claude Code docs.
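When a build needs more than one domain, it helps to keep the allowlist in one place and review it before applying. A minimal sketch, with assumptions flagged: allow_domains is a hypothetical helper that only prints the sbx policy command shown above, the sandbox name is an invented example of the documented naming form, and github.com stands in for whatever second domain your build needs.

```shell
#!/bin/sh
# Hypothetical helper: print one "sbx policy allow network" command per domain
# so the allowlist can be reviewed before anything is applied.
# Usage: allow_domains SANDBOX_NAME DOMAIN [DOMAIN...]
allow_domains() {
  sandbox="$1"; shift
  for domain in "$@"; do
    echo "sbx policy allow network $sandbox $domain"
  done
}

# The sandbox name below is a made-up example of ralph-<agent>-<current-dir>-<hash8>.
allow_domains ralph-claude-myapp-1a2b3c4d registry.npmjs.org github.com
```

Once the printed list looks right, pipe the same call into sh to apply it.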

Failure modes mapped to guardrails

Here is the whole picture in one view: each failure on the left, each guardrail on the right.

flowchart LR
    subgraph Failures["Failure modes"]
        F1["Context rot"]
        F2["Runaway cost / infinite loop"]
        F3["Thrashing on a bad task"]
        F4["Silent wrong work"]
        F5["Sandbox damage"]
    end
    subgraph Guards["Guardrails"]
        G1["Fresh context + state on disk"]
        G2["-n cap + completion promise + budget"]
        G3["Atomic tasks, one per iteration, BLOCKED / DECIDE"]
        G4["Verification gates + screenshots"]
        G5["Docker Sandbox microVM"]
    end
    F1 --> G1
    F2 --> G2
    F3 --> G3
    F4 --> G4
    F5 --> G5

Read it as a contract. If you have turned on the guardrail, the matching failure mode is handled. If a run went wrong, find which guardrail was missing or misconfigured.

A pre-flight checklist before a long run

Before you start a multi-hour or overnight loop, walk this list. It is the difference between waking up to a clean branch and waking up to a mess.

Cap the iterations. Pass -n with a number you are willing to pay for. Do not run the uncapped one-liner.
Test the setup with one pass. Run ./ralph.sh --once and read the output before trusting a long run.
Check your tasks are atomic. Each task should be finishable in one iteration and verifiable by a check. If you cannot write the acceptance criteria, the task is not ready.
Confirm the verification stack runs. Make sure tests, lint, and type checking actually execute in the project, because the loop leans on them as its truth signal.
Run in the sandbox. Keep the agent inside the Docker Sandbox and the network on deny-by-default. Allow only the domains the build genuinely needs.
Pick the right agent and model. Switch agents with --agent and pass model flags after a separator:

./ralph.sh --agent codex -- --model gpt-5.5
./ralph.sh -a gemini -- --model pro

Supported agents are claude (the default), codex, copilot, cursor, gemini, and opencode.

The honest framing

A Ralph loop is persistence with a memory and a stop button. Its failure modes are the failure modes of any long autonomous process: it forgets, it overspends, it chases a bad goal, it lies to itself about success, and it can break things it touches. The technique does not pretend these do not exist. It pairs each one with a guardrail and makes the guardrails the default.

None of this removes your judgment from the loop. You still write the tasks, set the budget, and review the commits. What the guardrails buy you is the confidence to let an agent grind for hours without standing over it. Get the five right and the loop becomes boring in the best way: it either finishes, or it stops and tells you exactly why.

Frequently asked questions

What are the most common ways a Ralph loop fails?

There are five common failure modes: context rot where the agent loses the plot over a long session, runaway cost or infinite loops, thrashing on a task that is too big or too vague, silent wrong work that looks done but is broken, and damage from running an agent without a sandbox. Each one maps to a specific guardrail in ralph.sh.

How does a Ralph loop avoid context rot?

It resets the context window every iteration and stores project state on disk. Because the agent reads the task list, logs, and git history each pass instead of carrying a growing conversation, the window never gets large enough to rot. Progress lives in files like tasks.json and commits, not in chat memory.

How do I stop a Ralph loop from burning money?

Set an iteration cap with -n, rely on the completion promise so the loop stops the moment all tasks are done, and choose your model and budget before you start. The loop exits with code 1 when it hits the cap, which is a safety net rather than a failure. Running a single pass with --once first confirms the setup before a long run.

Why does my agent thrash instead of finishing the task?

Almost always because the task is too large or too vaguely specified to finish and verify in one iteration. The fix is atomic tasks, one task per invocation, and clear acceptance criteria. When a task is genuinely blocked or needs a decision, the agent should emit a BLOCKED or DECIDE promise and stop rather than guess.

Is it safe to run an autonomous coding agent in YOLO mode?

Only inside a sandbox. Bypass-permissions mode is dangerous on your laptop because the agent can read credentials and touch files outside the repo. Run it inside a Docker Sandbox microVM with deny-by-default networking, so the boundary is the sandbox rather than the agent's good behavior.


How to Write a PRD an AI Agent Can Actually Build From

Tue, 31 Mar 2026 00:00:00 GMT

A PRD an AI agent can build from is not a feature wishlist. It is three things in one document: goals (what to build and why), constraints (the stack, the boundaries, what is out of scope), and verifiable acceptance criteria (conditions the agent can check by running a command and reading the output). Drop any one of those and the agent fills the gap with a guess. The guess looks fine in the diff and breaks on the case nobody wrote down.

This post is the practical version: what goes in .agent/prd/PRD.md, what goes in the short SUMMARY.md the loop sends every iteration, how to draft both with the prd-creator skill, and how to write acceptance criteria an agent can actually verify. It assumes you already know the shape of spec-driven development with AI and want to write the document that sits at the top of it.

What a PRD for an AI agent actually is

A human PRD and an agent PRD overlap, but they are not the same document. A human reads a PRD, fills the gaps with judgment, and asks a teammate when something is unclear. An agent does none of that. It reads what is on disk, and when the spec is silent, it invents an answer and proceeds with full confidence. So the agent PRD has to do more work up front.

Three properties decide whether a PRD is buildable.

Goals are concrete, not aspirational. “Make onboarding delightful” is not a goal an agent can build toward. “A new user reaches the dashboard in three steps or fewer after submitting the signup form” is. State the outcome in terms something can observe.

Constraints are explicit. Name the framework, the data store, the auth approach, and the libraries you have already committed to. Name what is out of scope just as clearly. An out-of-scope section is a fence: without it, an agent on a long run happily adds a feature you never asked for and now have to maintain.

Acceptance criteria are verifiable. Each criterion is a condition the agent can confirm by running a test, hitting an endpoint, or reading a file. “Login works” is a vibe. “POST /api/login with a wrong password returns 401 and the body { error: 'Invalid credentials' }” is a criterion. The difference is whether a machine can return a yes or a no without your opinion.
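The login criterion above reduces to a check a script can answer with yes or no. This is a sketch under stated assumptions: login_criterion_met is a hypothetical helper, and the host, port, and request field names in the commented curl call are placeholders, not something the PRD format prescribes.

```shell
#!/bin/sh
# Hypothetical check for the login criterion: pass only when the status is
# 401 and the body carries the expected error message.
# Usage: login_criterion_met STATUS BODY
login_criterion_met() {
  [ "$1" = "401" ] || return 1
  case "$2" in
    *"Invalid credentials"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Against a running app (host, port, and field names are assumptions):
# status=$(curl -s -o /tmp/body.json -w '%{http_code}' -X POST \
#   -H 'Content-Type: application/json' \
#   -d '{"email":"user@example.com","password":"wrong"}' \
#   http://localhost:3000/api/login)
# login_criterion_met "$status" "$(cat /tmp/body.json)" && echo PASS || echo FAIL
```

If you cannot write a function like this for a criterion, the criterion is still a vibe.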

The reason this matters more for agents than for people: an autonomous loop amplifies whatever you feed it. Feed it ambiguity and it amplifies ambiguity across every iteration. The Ralph technique runs an agent against your task list until the work is done, and the whole design rests on the agent rebuilding its understanding from files on each pass. If you want the mechanics of that loop, start with what is the Ralph technique. The PRD is the document the entire loop reads from.

Where the PRD lives: PRD.md and SUMMARY.md

Ralph keeps the product specification in two files under .agent/prd/, and the split is deliberate.

PRD.md is the full document. It is for depth: a human reads it once to understand the project, and the agent reads it when it needs detail it cannot get from the summary. The prd-creator skill writes it with a consistent set of sections:

App overview and objectives
Target audience
Success metrics and KPIs
Competitive analysis
Core features and user flows
Technical stack
Prerequisites and access
Security considerations
Assumptions and dependencies

SUMMARY.md is the short executive overview. This is the file that gets sent to the agent every iteration so it reorients fast without rereading the entire PRD. It contains an overall description of the project, the main features, the key user flows, and a short list of key requirements. Nothing more.

The economics drive the split. Every iteration starts the agent with a fresh context window, and tokens in that window cost money and attention. You do not want to spend that budget reloading a 3,000-word PRD on every pass when a tight summary reorients the agent just as well. Long PRD for depth, short summary for the working context on every iteration.
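The arithmetic is worth doing once. The sketch below is a rough estimate only: it assumes about four tokens for every three words, a rule of thumb that varies by tokenizer, and the 300-word summary length is a hypothetical figure for comparison.

```shell
#!/bin/sh
# Rough estimate of tokens spent reloading a document on every iteration.
# Assumes about 4 tokens per 3 words; real tokenizers vary.
# Usage: context_tokens WORDS ITERATIONS
context_tokens() {
  echo $(( $1 * $2 * 4 / 3 ))
}

context_tokens 3000 50   # full PRD on every pass: prints 200000
context_tokens 300 50    # hypothetical 300-word summary: prints 20000
```

Over fifty iterations that is a tenfold difference in reorientation cost before the agent has touched a single task.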

flowchart LR
  Reqs["Unstructured requirements"] --> Interview["prd-creator interview, plan mode"]
  Interview --> PRD["prd/PRD.md: goals, constraints, criteria"]
  PRD --> Summary["prd/SUMMARY.md: short overview, sent each iteration"]
  PRD --> Tasks["tasks.json and tasks/TASK-ID.json"]
  Summary --> Loop["ralph.sh loop, fresh context each iteration"]
  Tasks --> Loop
  Loop --> Verify["Run tests, lint, types, screenshot"]
  Verify --> Commit["Commit, set passes true"]

The PRD feeds two children. The summary is the version the loop reads constantly. The task list is the executable decomposition. Get the PRD right and both children inherit clear goals and criteria. Get it vague and both inherit the vagueness.

Draft it with the prd-creator skill in plan mode

You do not write all of this by hand. Ralph ships a prd-creator skill that turns unstructured requirements into a PRD plus a task list. Run it in plan mode, where the agent is read-only and focused on asking questions instead of writing code. Plan mode matters here: you want the agent interrogating your idea, not racing ahead to scaffold files before the spec exists.

The flow is a conversation, not a one shot. The instinct most people have is to paste a paragraph and expect a finished plan. The skill instead pushes back. It interviews you to fill the gaps, asking clarifying questions one at a time, and it researches the competitive landscape before it commits anything to PRD.md. When a question can be answered by reading the codebase, it reads the codebase instead of asking you.

A prompt to kick it off looks like plain language:

Use the prd-creator skill in plan mode. I want to build a link shortener
with accounts, custom slugs, and click analytics. Interview me, write the
PRD to .agent/prd/PRD.md and the summary to .agent/prd/SUMMARY.md, then
generate the task list in .agent/tasks.json.

During the interview, the skill also verifies prerequisites and creates or updates .env.local with placeholder values only. It never writes a real secret to the PRD, the tasks, the logs, or .env.local. You fill the real values in by hand. This is the moment the spec records what credentials and access the project needs, so the agent is not discovering halfway through a 50-task run that it never had database access.

When it finishes, you have PRD.md, SUMMARY.md, and tasks.json with one TASK-{ID}.json spec per task. From there you run the loop:

npx @pageai/ralph-loop
./ralph.sh -n 50

You can amend later. When you want to add a feature or fix a bug mid-project, run the skill again to update the PRD and append tasks. The spec grows with the project instead of going stale the moment you start coding.

Write acceptance criteria the agent can verify

This is the part that separates a PRD an agent can build from a PRD that produces confident nonsense. An acceptance criterion is only useful if the agent can check it without you. The test is simple: can the agent confirm this by running something and reading the output? If not, rewrite it.

Three rules make a criterion verifiable.

Name the input and the expected output. Vague: “the endpoint validates email.” Verifiable: “POST /api/register with email ‘not-an-email’ returns 400 and the body { error: 'Please enter a valid email' }.” Now the agent can send the request, read the status and body, and compare.

Point at an observable artifact. “Passwords are secure” is unprovable. “The stored password starts with the bcrypt prefix $2b$ and never equals the plaintext value” can be confirmed by reading the row. Anchor the criterion to something the agent can inspect: a database row, a response header, a file on disk, a process exit code.

Make it pass or fail, never partial. A criterion that needs interpretation is a criterion the agent will interpret in its favor. “The UI looks clean” invites argument. “The submit button is disabled until both fields are non-empty” does not.
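
The three rules can be sketched as plain predicates. This is an illustration, not Ralph code: CapturedResponse, rejectsInvalidEmail, and passwordIsHashed are hypothetical names, and the hash string is a made-up placeholder.

```typescript
// Hypothetical shape of a captured HTTP response; not part of Ralph.
interface CapturedResponse {
  status: number;
  body: Record<string, unknown>;
}

// Rule 1: name the input and the expected output.
function rejectsInvalidEmail(res: CapturedResponse): boolean {
  return res.status === 400 && res.body["error"] === "Please enter a valid email";
}

// Rule 2: point at an observable artifact (the stored password value).
function passwordIsHashed(stored: string, plaintext: string): boolean {
  return stored.slice(0, 4) === "$2b$" && stored !== plaintext;
}

// Rule 3: every check returns true or false, never "mostly".
const criterionChecks: boolean[] = [
  rejectsInvalidEmail({ status: 400, body: { error: "Please enter a valid email" } }),
  passwordIsHashed("$2b$10$placeholderHashForIllustration", "hunter2"),
];
const allChecksPass = criterionChecks.every((passed) => passed);
```

Each predicate takes the evidence as input, so the agent can gather it however the project allows and still get a yes or no.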

In Ralph, the criteria do not float in the PRD. They land in the per-task spec files. Each TASK-{ID}.json carries an acceptanceCriteria array, and the tests that prove those criteria are steps inside the task, not separate tasks scheduled for later.

{
  "id": "TASK-3",
  "title": "POST /api/auth/register creates a new user account",
  "category": "api-endpoint",
  "description": "Validate input, hash the password, store the user, return a success response.",
  "acceptanceCriteria": [
    "POST with valid email and password returns 201 with the user id and email",
    "Invalid email format returns 400 with the error text Please enter a valid email",
    "Password shorter than 8 characters returns 400",
    "Duplicate email returns 409, not a generic 500",
    "The stored password starts with $2b$ and is never the plaintext value"
  ],
  "steps": [
    {
      "step": 1,
      "description": "Add the register route handler",
      "details": "Validate with a zod schema, hash with bcrypt, insert into users.",
      "pass": false
    },
    {
      "step": 2,
      "description": "Write Vitest cases for every acceptance criterion",
      "details": "Cover valid registration, invalid email, short password, duplicate email, and the stored hash prefix.",
      "pass": false
    }
  ],
  "dependencies": ["TASK-1", "TASK-2"],
  "estimatedComplexity": "medium"
}

Every line in acceptanceCriteria maps to something the verification stack can confirm. The loop assumes that stack: Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for lint, Prettier for format. The repo mantra is blunt: if you didn’t test it, it doesn’t work. The agent runs the gate, and only after it passes does it flip passes to true, take a screenshot, and commit. Turning criteria into atomic, independently verifiable packets like this is its own discipline, covered in breaking a PRD into atomic agent tasks.
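
One way to picture the gate is as a function that refuses to set passes until the evidence is in. This is a sketch under assumptions: the step fields mirror the TASK-3 example, while the task-level passes field and the recordResult helper are hypothetical, not Ralph's actual implementation.

```typescript
// Step shape copied from the TASK-3 example.
interface TaskStep {
  step: number;
  description: string;
  details: string;
  pass: boolean;
}

// Subset of the task spec; the task-level `passes` field is an assumption.
interface TaskSpec {
  id: string;
  acceptanceCriteria: string[];
  steps: TaskStep[];
  dependencies: string[];
  passes?: boolean;
}

// Return a copy with `passes` true only if the gate was green and every step passed.
function recordResult(task: TaskSpec, gateGreen: boolean): TaskSpec {
  const stepsDone = task.steps.every((s) => s.pass);
  return { ...task, passes: gateGreen && stepsDone };
}
```

Returning a copy rather than mutating the input keeps the original spec intact if the gate fails and the agent has to try again.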

Name env vars, libraries, and user flows so the agent does not guess

The fastest way to get an agent to invent something wrong is to leave a decision implicit. Three categories cause the most trouble, and the PRD should pin all three.

Environment variables. List every variable the project reads, with a one-line note on what each is for. The prd-creator skill writes placeholders into .env.local during prerequisite verification, so the agent knows the keys exist without ever seeing a real secret. Without this, an agent guesses variable names, scatters them across files, and you spend the morning reconciling DATABASE_URL against DB_CONNECTION_STRING.

Libraries and the stack. Say which framework, which ORM, which validation library, which test runner. If the project already uses zod, the PRD should say so, or the agent will reach for whatever it saw most recently in its training and add a second validation library next to your first. Naming the stack is also where you encode reuse: tell the agent to extend the existing auth module rather than write a parallel one.

User flows. A flow is a sequence of steps with branches, and the branches are where agents guess. “Users can reset their password” hides a dozen decisions. Does the reset link expire? After how long? What happens on an expired link? Does requesting a reset for an unknown email reveal that the account does not exist? Write the flow as steps and edge cases, and each edge case becomes an acceptance criterion instead of a surprise in production.

The pattern across all three: every decision you leave out is a decision the agent makes for you, silently, at the moment it is most expensive to change. The PRD is where you make those decisions while they are still cheap.
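
For the environment variable case, a check like the following could catch drift between the PRD and .env.local early. It is a minimal sketch: parseEnvKeys and missingEnvVars are hypothetical helpers, SESSION_SECRET is an invented variable name, and the parser only handles simple KEY=value lines.

```typescript
// Collect the keys from .env-style text, skipping blank lines and # comments.
function parseEnvKeys(text: string): string[] {
  const keys: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq > 0) keys.push(line.slice(0, eq).trim());
  }
  return keys;
}

// Names the PRD lists that the env file does not define.
function missingEnvVars(required: string[], envText: string): string[] {
  const present = parseEnvKeys(envText);
  return required.filter((key) => present.indexOf(key) === -1);
}

// Placeholder values only, as the text recommends.
const exampleEnv = "# database\nDATABASE_URL=postgres://placeholder\n";
const missing = missingEnvVars(["DATABASE_URL", "SESSION_SECRET"], exampleEnv);
```

Here missing holds only SESSION_SECRET, which is exactly the kind of gap you want surfaced before a long run rather than halfway through it.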

A short worked example

Take the link shortener from the prompt above and watch the three properties show up.

The goal is concrete: “an authenticated user creates a short link with an optional custom slug and sees total clicks per link.” Not “build a great link tool.” The constraint section names the stack and fences the scope: custom domains and team accounts are out of scope for version one. The acceptance criteria get specific per feature. For slug generation: “a generated slug is 7 characters of base62” and “a collision retries up to 3 times before returning an error.” For the redirect: “GET /:slug on an unknown slug returns 404” and “a valid slug records exactly one click row and issues a 302 to the target URL.”
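
The two slug criteria are specific enough to sketch directly. The version below assumes “retries up to 3 times” means one initial attempt plus three retries, uses Math.random for brevity, and takes a hypothetical isTaken lookup in place of a real database query.

```typescript
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Build a 7-character base62 slug. `rand` must return a float in [0, 1).
// Math.random is fine for a sketch; it is not a secure source of randomness.
function randomSlug(rand: () => number = Math.random): string {
  let slug = "";
  for (let i = 0; i < 7; i++) {
    slug += BASE62.charAt(Math.floor(rand() * BASE62.length));
  }
  return slug;
}

// One initial attempt plus up to 3 retries (an assumed reading of the criterion).
// `isTaken` is a hypothetical lookup against the stored links.
function createSlug(isTaken: (slug: string) => boolean, rand: () => number = Math.random): string {
  for (let attempt = 0; attempt <= 3; attempt++) {
    const slug = randomSlug(rand);
    if (!isTaken(slug)) return slug;
  }
  throw new Error("Could not generate a unique slug after 3 retries");
}
```

Injecting rand and isTaken is what makes the criteria testable: a unit test can force a collision on every attempt and assert that the error appears after the fourth try.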

The prd-creator interview is where these get extracted. Are slugs unique globally or per account? What happens on a collision? Do analytics count unique visitors or raw hits? Each answer becomes a line in the PRD, and the unanswered questions become the edge cases that would otherwise blow up the loop. By the time the PRD is approved, the ambiguity is gone, and the task list inherits criteria a machine can check. The loop then runs one task at a time, which is the rule that keeps a long run from drifting, explained in one task per iteration.

The framing of phases here (specify intent, plan the approach, decompose into tasks, implement and verify) comes from GitHub Spec Kit, and the loop that runs it autonomously was popularized by Geoffrey Huntley in his original Ralph writeup. The PRD is the artifact the first phase produces and every later phase consumes.

Where to go next

If you are writing the spec that drives a loop, read down through the spec-driven cluster:

Spec-driven development with AI for the full Specify, Plan, Tasks, Implement workflow this PRD sits inside.
Breaking a PRD into atomic agent tasks for turning the PRD into independently verifiable packets.
One task per iteration for the rule that keeps long runs reliable.

For the mechanics of the loop that reads your PRD on every pass, the fresh-context design, and where the technique came from, read what is the Ralph technique.

Frequently asked questions

What makes a PRD buildable by an AI agent rather than a person?

A buildable PRD states concrete goals, explicit constraints, and verifiable acceptance criteria. A person can fill gaps with judgment and ask a teammate, but an agent invents an answer whenever the spec is silent. So a PRD written for an agent must name the stack, fence what is out of scope, and define each acceptance criterion as a condition a machine can confirm by running a command and reading the output.

What is the difference between PRD.md and SUMMARY.md in Ralph?

PRD.md is the full document with app overview, target audience, success metrics, core features and user flows, technical stack, prerequisites, security considerations, and assumptions. SUMMARY.md is a short executive overview with the main features, key user flows, and key requirements. The summary is what gets sent to the agent every iteration so it reorients fast without rereading the entire PRD.

How do I write acceptance criteria an agent can verify?

Name the input and the expected output, point at an observable artifact, and make every criterion pass or fail with no interpretation. For example: “POST with an invalid email returns 400 with a specific error text,” or “the stored password starts with the bcrypt prefix and is never the plaintext value.” The agent confirms each one with Playwright, Vitest, type checks, and lint before it marks the task done.

How do I create the PRD without writing it all by hand?

Use the prd-creator skill in plan mode. It interviews you one question at a time, researches the competitive landscape, verifies prerequisites, and writes placeholder values into .env.local without ever storing a real secret. It then writes PRD.md and SUMMARY.md and generates tasks.json with a prerequisite verification task first. You can run it again later to amend the PRD and append tasks.

Why does naming env vars, libraries, and user flows matter so much?

Every decision left implicit is a decision the agent makes for you, silently, at the moment it is most expensive to change. If you do not name the validation library, the agent may add a second one. If you do not spell out the password reset flow, it guesses how expiry and unknown emails behave. Listing variables, the stack, and flows with their edge cases turns those guesses into acceptance criteria.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop

Install from npm Star on GitHub Watch the walkthrough

Verification Loops: Why Autonomous Agents Need Tests and Screenshots

Fri, 27 Mar 2026 00:00:00 GMT

If you didn’t test it, it doesn’t work. That one rule is what makes autonomy safe. An AI agent that writes code with no way to check the result is just generating plausible text. The same agent wired to a verification loop (run the tests, read the failures, fix them, run again) can grade its own work and only mark a task done when the evidence says so. Verification is not a nice extra on top of an autonomous loop. It is the feedback signal the loop runs on.

This post is about that signal. What the verification stack looks like, why type checks and lint are the cheap gates you run first, why screenshots are the proof for UI work, and how the agent feeds a failing test back into the next iteration instead of declaring victory. The short version: every task ends with a result a machine can confirm, and the agent reads that result before it moves on.

Why verification is what makes autonomy safe

The danger of an autonomous agent is not that it writes bad code once. It is that it writes bad code, believes the code is fine, marks the task done, and builds the next ten tasks on top of the broken one. By the time you wake up, the loop has compounded a small mistake into a tangled diff. Verification is the thing that stops compounding. It forces the agent to confront reality at the end of every task.

An agent left to self-assess on vibes will tell you it is confident. Confidence is not evidence. A failing assertion is evidence. A red type error is evidence. A screenshot that shows the button in the wrong place is evidence. The job of a verification loop is to replace the agent’s opinion of its work with a result that came from running the work.

This is why the loop architecture and the verification stack are inseparable. The overnight run pillar covers how a Ralph loop keeps an agent productive for hours by resetting context and storing state on disk. None of that matters if the recorded state is a lie. Verification is what keeps status: done honest, so a fresh agent in the next iteration can trust the disk instead of re-checking everything. The mantra in the repo is blunt: if you didn’t test it, it doesn’t work.

In a Ralph loop, each iteration follows the same shape, and verification sits in the middle of it:

Find the highest-priority incomplete task in .agent/tasks.json.
Work the steps in .agent/tasks/TASK-{ID}.json.
Run tests, linting, and type checking.
Complete the task, take a screenshot, update task status, and commit.
Repeat until all tasks pass or the iteration cap is reached.

Step 3 is the gate. A task does not reach step 4 until step 3 is green. That single ordering is what separates an autonomous loop you can leave running from a code generator you have to babysit.
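
Step 1 of that list could look like the selection function below. It is a sketch, not Ralph's implementation: the priority, passes, and dependencies fields on TaskEntry are assumptions about the shape of .agent/tasks.json, as is the choice to skip tasks whose dependencies are not done yet.

```typescript
// Hypothetical shape of an entry in .agent/tasks.json.
interface TaskEntry {
  id: string;
  priority: number; // assumed: lower number means higher priority
  passes: boolean;
  dependencies: string[];
}

// Pick the highest-priority task that is not done and whose dependencies all pass.
function pickNextTask(tasks: TaskEntry[]): TaskEntry | undefined {
  const done = tasks.filter((t) => t.passes).map((t) => t.id);
  const ready = tasks.filter(
    (t) => !t.passes && t.dependencies.every((d) => done.indexOf(d) !== -1)
  );
  ready.sort((a, b) => a.priority - b.priority);
  return ready[0];
}
```

Because the function trusts the passes flag, it is only as reliable as the gate in step 3. A task marked done without a green run would unlock its dependents on a false premise.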

The verification stack: tests, types, lint, format, screenshots

The verification stack the loop assumes is five tools, each catching a different class of mistake. You run them as gates, fastest and cheapest first, so the agent gets a signal in seconds instead of waiting on a full browser run for a problem a type check would have caught.

TypeScript and ESLint are the cheap gates

Run the static checks first because they are fast and they catch the dumbest mistakes. A type error or a lint failure tells the agent the code is wrong before a single test boots a runtime.

# Type check: no emit, just verify the types hold
npx tsc --noEmit

# Lint: catch unused vars, bad imports, banned patterns
npx eslint .

These run in seconds and they fail loud. An agent that renamed a function but missed a caller gets a type error pointing at the exact file and line. That is a precise signal the agent can act on without guessing. Cheap gates first means the expensive gates (the browser tests) only run on code that already passes the basics.

Prettier keeps the diff reviewable

Formatting is not about taste in an autonomous loop. It is about keeping the morning diff readable. If every iteration reformats the file its own way, your git diff fills with noise and you cannot see what actually changed.

# Verify formatting without writing changes
npx prettier --check .

Run this as a gate and the agent is forced to leave the code in the canonical format. The reviewer (you) gets a diff that shows logic changes, not whitespace churn.

Vitest catches logic regressions

Unit tests are where the agent proves the logic does what the task said. Vitest runs fast enough to run on every iteration, which is the property that matters. A test suite you only run nightly is not part of the feedback loop.

# Run the unit suite once, no watch mode
npx vitest run

A unit test failure gives the agent the most actionable signal of all: an expected value, an actual value, and the exact assertion that broke. The agent reads that diff and knows precisely what its change got wrong. This is the difference between “something is off” and “the function returned 3 when the test expected 4”.

Playwright proves the user flow

For anything a user clicks through, unit tests are not enough. Playwright drives a real browser, so the agent verifies the flow end to end: navigate, fill the form, submit, assert the result on screen.

# Run the end-to-end suite headless
npx playwright test

End-to-end tests catch the class of bug that passes every unit test and still breaks in the browser: a missing prop, a broken route, a handler wired to the wrong element. They are slower, which is exactly why they sit last in the gate order. By the time Playwright runs, the cheap checks have already filtered out the obvious failures.
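
The gate order described above can be sketched as a fail-fast runner. The commands are the ones shown in this section; runGates and the injected run function are hypothetical, with run standing in for whatever executes a command and returns its exit code.

```typescript
// Gate commands from the sections above, cheapest first.
const GATES: string[] = [
  "npx tsc --noEmit",
  "npx eslint .",
  "npx prettier --check .",
  "npx vitest run",
  "npx playwright test",
];

interface GateResult {
  green: boolean;
  failedGate?: string;
  ran: string[];
}

// Run gates in order and stop at the first non-zero exit code.
// `run` is a hypothetical executor supplied by the caller.
function runGates(run: (command: string) => number, gates: string[] = GATES): GateResult {
  const ran: string[] = [];
  for (const gate of gates) {
    ran.push(gate);
    if (run(gate) !== 0) return { green: false, failedGate: gate, ran };
  }
  return { green: true, ran };
}
```

Stopping at the first failure is the design choice that matters here: a lint error means the browser suite never boots, so the agent gets its signal in seconds.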

Screenshots are the proof for UI work

A test that passes tells you the DOM is correct. It does not tell you the page looks right. For UI work, the agent takes a screenshot and that screenshot is the proof. It is the one artifact that lets a human (or a vision-capable agent) confirm the thing actually renders the way the task described.

Playwright captures screenshots as part of a test run, so this folds into the same gate:

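import { test, expect } from '@playwright/test';

// Capture proof of the rendered state (the relative URL assumes baseURL is set in the Playwright config)
test('dashboard shows the Save button', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page.getByRole('button', { name: 'Save' })).toBeVisible();
  await page.screenshot({ path: 'artifacts/dashboard.png', fullPage: true });
});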

Two reasons screenshots earn their place in the loop. First, they catch what assertions miss. A button can be present in the DOM, pass every toBeVisible check, and still sit behind a modal or off the edge of the viewport. The screenshot shows it. Second, they are the audit trail. When you review a long autonomous run in the morning, the screenshots are how you confirm each UI task landed without re-running anything yourself. That auditability is the same idea covered in observability for autonomous coding agents: you cannot trust what you cannot see, and a screenshot is the cheapest way to see it.
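
The off-screen case is one you can partly check in code. This sketch assumes you already have the element's bounding box and the viewport size, for example from Playwright's locator.boundingBox() and page.viewportSize(); isFullyInViewport is a hypothetical helper.

```typescript
// Box matches the { x, y, width, height } shape that locator.boundingBox() resolves to.
interface Box { x: number; y: number; width: number; height: number; }
interface Viewport { width: number; height: number; }

// True only when the whole box lies inside the viewport rectangle.
function isFullyInViewport(box: Box, viewport: Viewport): boolean {
  return (
    box.x >= 0 &&
    box.y >= 0 &&
    box.x + box.width <= viewport.width &&
    box.y + box.height <= viewport.height
  );
}

// A button pushed past the right edge of a 1280x720 viewport.
const offscreen = isFullyInViewport(
  { x: 1300, y: 40, width: 120, height: 36 },
  { width: 1280, height: 720 }
);
```

This catches the off-viewport case but not an element covered by a modal, which is one more reason the screenshot stays in the loop.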

Screenshots and type checks sit at opposite ends of the cost spectrum. Type and lint checks are nearly free and run constantly. A full browser screenshot is expensive and runs once per UI task at the end. You want both: the cheap gates to fail fast on logic, the screenshot to confirm the pixels.

How verification results feed the next iteration

Here is the part that turns verification from a checkbox into a loop. The agent does not just run the gates and pass or fail. It reads the failing output and feeds it back into the next attempt. A stack trace, a failed assertion, a type error: each is structured feedback the agent uses to make the specific fix, then it runs the gates again.

flowchart TD
  Task["Read task spec and acceptance criteria"]
  Implement["Implement the change"]
  Verify["Run gates: tsc, eslint, prettier, vitest, playwright"]
  Pass{"All gates green?"}
  ReadFail["Read failing output: stack trace, assertion, type error"]
  Fix["Make the targeted fix"]
  Screenshot["Capture screenshot for UI work"]
  Commit["Update tasks.json, commit"]
  Task --> Implement --> Verify --> Pass
  Pass -->|no| ReadFail --> Fix --> Verify
  Pass -->|yes| Screenshot --> Commit

The inner cycle (verify, read failure, fix, verify) is the verification loop proper. It can run several times inside a single task before the gates go green. This is the same reason iteration beats a single shot in general: the agent sees its own mistake and corrects it instead of guessing once and hoping. The Ralph loop vs one-shot prompting comparison makes that case directly. A one-shot prompt has no failing test to read, so it cannot self-correct. A verification loop hands the agent a precise error message and a chance to act on it.
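
The inner cycle can be sketched as a bounded loop. verify and fix are hypothetical hooks standing in for running the gates and handing the failing output to the agent, and maxAttempts is an assumed safety cap that the text does not specify.

```typescript
interface GateOutcome {
  green: boolean;
  output: string; // failing output: stack trace, assertion diff, type error
}

// Verify, read the failure, fix, verify again. Returns true once the gates are green.
function verifyLoop(
  verify: () => GateOutcome,
  fix: (failingOutput: string) => void,
  maxAttempts: number
): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = verify();
    if (outcome.green) return true; // safe to screenshot and commit
    // Only fix when another verification will follow, so no fix goes unchecked.
    if (attempt < maxAttempts) fix(outcome.output);
  }
  return false; // still red: do not commit
}
```

The function never returns true without a green verify call immediately before it, which is the whole contract: no commit without evidence.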

The quality of the feedback decides how well this works. Good test output names the file, the line, and the difference between expected and actual. That precision is what lets a fresh-context agent fix a bug it has never seen, because the failure itself carries enough information to locate and correct the problem. Vague output (a generic “something failed” with no detail) starves the loop of signal and the agent thrashes. Invest in error messages and assertions that say exactly what went wrong.
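
As an illustration of what precise output means, compare a bare “something failed” with a message built like this. describeFailure and expectEqual are hypothetical helpers, the file name is invented, and comparing via JSON.stringify is a simplification that real test runners handle more carefully.

```typescript
// Build a failure message that carries the location, the expected value, and the actual value.
function describeFailure(file: string, line: number, expected: unknown, actual: unknown): string {
  return (
    file + ":" + line +
    " expected " + JSON.stringify(expected) +
    " but received " + JSON.stringify(actual)
  );
}

// Throw with the precise message when the values differ.
function expectEqual(actual: unknown, expected: unknown, file: string, line: number): void {
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error(describeFailure(file, line, expected, actual));
  }
}

// Mirrors the earlier example: the function returned 3 when the test expected 4.
const exampleMessage = describeFailure("src/cart.ts", 42, 4, 3);
```

A fresh-context agent reading that one line knows where to look and what the fix has to change, with no transcript to fall back on.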

This also connects to how the loop keeps its memory honest. Because each task only commits after the gates pass, the git history and .agent/tasks.json become a trustworthy record. A later iteration that reads status: done does not need to re-verify; the commit is the receipt. That is the discipline described in context engineering for long-running agents, where the filesystem and git log are the memory layer. Verification is what makes that memory worth trusting.

How do you design machine-checkable acceptance criteria?

Verification only works if the task tells the agent what “done” means in terms a machine can check. An acceptance criterion like “the login page should work well” is useless to a loop. There is nothing to run. A criterion like “submitting valid credentials redirects to /dashboard and the Vitest suite for auth passes” is checkable. The agent can run it and get a yes or no.

The rule of thumb: every acceptance criterion in a .agent/tasks/TASK-{ID}.json spec should map to a gate. Write criteria that a test, a type check, or a screenshot can confirm. If you cannot point at the command that proves a criterion, the criterion is too vague to belong in an autonomous task.

What machine-checkable criteria look like in practice:

Bind to a named test. “The new parsePrice unit test passes” beats “prices parse correctly”. The agent runs npx vitest run and reads the result.
Bind to an end-to-end flow. “A user can add an item to the cart and the cart count shows 1” maps to a Playwright spec the agent can run.
Bind to a screenshot. “The settings page renders with the dark-mode toggle visible” is confirmed by capturing the page and checking the toggle is in frame.
Bind to the static gates. “No type errors, no lint errors, formatting clean” is the floor every task clears.

Criteria written this way are what let the agent finish a task without asking you. It knows it is done because the gates it was given are green. This is the same property that makes one task per iteration reliable: an atomic task with checkable criteria is a unit the agent can verify on its own, commit, and move past. A task with fuzzy criteria forces the agent to guess at completion, which is exactly when an autonomous loop goes off the rails.
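
A quick way to enforce that rule is to lint the spec itself. The sketch below assumes each criterion is stored alongside the command that proves it; the BoundCriterion shape and the unboundCriteria helper are hypothetical and not part of the TASK-{ID}.json format.

```typescript
// Hypothetical pairing of a criterion with the command that proves it.
interface BoundCriterion {
  text: string;
  command?: string; // e.g. "npx vitest run"; missing means nothing proves it
}

// Criteria with no proving command are too vague for an autonomous task.
function unboundCriteria(criteria: BoundCriterion[]): string[] {
  return criteria
    .filter((c) => c.command === undefined || c.command.trim() === "")
    .map((c) => c.text);
}

const flagged = unboundCriteria([
  { text: "The new parsePrice unit test passes", command: "npx vitest run" },
  { text: "The login page should work well" },
]);
```

Here flagged contains only the login page criterion, which is the one the text calls useless to a loop.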

A useful habit is to write the test first as part of the spec. When the task spec includes the failing test the agent must make pass, the acceptance criterion and the verification command are the same thing. The agent’s whole job becomes “turn this red test green”, and the loop can confirm completion mechanically.

Putting it together: verification as the loop’s nervous system

Verification is not a phase that runs after the work. It is the nervous system the autonomous loop runs through. Strip it out and every other part of the architecture loses its meaning. Fresh context per iteration is pointless if the agent records unverified work. Disk-based memory is a liability if the state on disk is wrong. Atomic tasks do not help if “done” is a guess.

The stack does the catching, cheapest gate first:

TypeScript and ESLint fail in seconds on the obvious mistakes.
Prettier keeps the diff clean so the morning review is fast.
Vitest proves the logic with precise, actionable assertions.
Playwright proves the user flow in a real browser.
Screenshots prove the UI renders the way the task described.

The loop does the correcting: run the gates, read the failures, make the targeted fix, run again, and only commit when everything is green. Acceptance criteria that bind to those gates are what let the agent grade itself and a human trust the result without redoing it.

Get this right and an autonomous run stops being a leap of faith. Every task in the morning diff is backed by a green suite and a screenshot. You are not trusting the agent. You are trusting the evidence it was forced to produce. That is what makes it safe to close the laptop and let the loop run.

Frequently asked questions

What is a verification loop for an AI coding agent?

A verification loop is the cycle where an agent implements a change, runs automated gates like type checks, lint, unit tests, and end-to-end tests, reads any failures, makes a targeted fix, and runs the gates again until they pass. Only then does it mark the task done and commit. The verification result, not the agent’s confidence, decides when a task is finished.

Which tests should an autonomous agent run on every iteration?

Run the cheap static gates first because they are fast: TypeScript with no emit to catch type errors, ESLint for code issues, and Prettier in check mode for formatting. Then run Vitest for unit logic and Playwright for end-to-end flows. Run them in that order so the slow browser tests only execute on code that already passes the basics.

Why do AI agents take screenshots when they finish UI work?

A passing test confirms the DOM is correct but not that the page looks right. A screenshot is proof that the UI actually renders the way the task described, catching things like an element hidden behind a modal or pushed off the viewport. Screenshots also serve as an audit trail, so a human reviewing a long run can confirm each UI task landed without re-running anything.

How do verification results feed back into the next iteration?

The agent reads the failing output: a stack trace, a failed assertion with expected and actual values, or a type error pointing at a file and line. That precise signal tells the agent what to fix. It makes the targeted change and runs the gates again. Because a task only commits after the gates pass, later iterations can trust the recorded status instead of re-checking the work.

What makes an acceptance criterion machine-checkable?

A machine-checkable criterion maps to a command that returns a yes or no. Bind each criterion to a named test, an end-to-end flow, a screenshot, or the static gates. For example, “submitting valid credentials redirects to /dashboard and the auth Vitest suite passes” is checkable, while “the login page should work well” is not. If you cannot point at the command that proves a criterion, it is too vague for an autonomous task.

How to Run the Gemini CLI in an Autonomous Coding Loop

Mon, 23 Mar 2026 00:00:00 GMT

To run Google’s Gemini CLI as an autonomous coding agent, point Ralph at it with one flag: ./ralph.sh --agent gemini. Ralph runs the Gemini CLI in its non-interactive prompt mode inside a Docker Sandbox, starts the agent with a fresh context window every iteration, and keeps re-running it against your task list until the work is done or you hit the iteration cap. Pick a model after the -- separator, authenticate once inside the sandbox, and walk away.

This is the Gemini-specific walkthrough in the larger guide to agentic coding CLIs. The loop mechanics are identical to running the Codex CLI in a loop and running the Cursor CLI agent in a loop. Only the agent binary and its flags change.

Run the Gemini CLI in a loop with one flag

Ralph is a Bash script you point at a project. Claude is the default agent, so you switch to Gemini explicitly:

./ralph.sh --agent gemini

That runs 10 iterations, the default. Change the count when you want a longer unattended session or a single smoke test:

# 50 iterations
./ralph.sh --agent gemini -n 50

# exactly one iteration (good for a dry run)
./ralph.sh --agent gemini --once

# explicit cap
./ralph.sh --agent gemini --max-iterations 5

The short form -a works too: ./ralph.sh -a gemini -n 20. Supported agents are claude (default), codex, copilot, cursor, gemini, and opencode, so the same harness drives any of them with the same flags.

Under the hood, Ralph builds a Gemini command and runs it inside a Docker Sandbox. The expansion for ./ralph.sh --agent gemini looks like this:

sbx run --name ralph-gemini-<project>-<hash8> gemini . -- -p "$PROMPT_CONTENT"

The -p flag (long form --prompt) is the Gemini CLI’s non-interactive mode. It reads a prompt, does the work, prints a final message, and exits. That clean exit is what lets Ralph treat each iteration as a discrete unit instead of one long interactive session. See the Gemini CLI documentation for the full command surface.

Select a Gemini model after the -- separator

Anything to the right of Ralph’s own -- separator is forwarded straight to the agent. For Gemini, Ralph inserts those arguments after the sandbox’s --, before the -p prompt. So this:

./ralph.sh -a gemini -- --model pro

expands to:

sbx run --name ralph-gemini-<project>-<hash8> gemini . -- --model pro -p "$PROMPT_CONTENT"

The --model flag (short form -m) picks the model for the run. Use the separator for any valid Gemini CLI flag, not just the model:

# pick a model
./ralph.sh -a gemini -- --model pro

# combine the model with a longer run
./ralph.sh -a gemini -n 50 -- --model pro

The rule to remember: everything left of -- configures Ralph (agent, iteration count, login). Everything right of -- configures Gemini. Keep your arguments on the correct side and the loop behaves.
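
The split rule is simple enough to express in a few lines. This is a TypeScript illustration of the behavior described above, not Ralph's Bash source; splitAtSeparator and buildSandboxArgv are hypothetical names, and the sandbox name in the usage is a placeholder.

```typescript
// Split argv at the first "--": the left side is for Ralph, the right side for the agent.
function splitAtSeparator(argv: string[]): { ralphArgs: string[]; agentArgs: string[] } {
  const i = argv.indexOf("--");
  if (i === -1) return { ralphArgs: argv.slice(), agentArgs: [] };
  return { ralphArgs: argv.slice(0, i), agentArgs: argv.slice(i + 1) };
}

// Mirror the expansion shown above: agent args go after the sandbox "--", before -p.
function buildSandboxArgv(sandboxName: string, agentArgs: string[], prompt: string): string[] {
  return ["sbx", "run", "--name", sandboxName, "gemini", ".", "--"].concat(agentArgs, ["-p", prompt]);
}

const parts = splitAtSeparator(["-a", "gemini", "-n", "50", "--", "--model", "pro"]);
const sandboxArgv = buildSandboxArgv("ralph-gemini-<project>-<hash8>", parts.agentArgs, "PROMPT");
```

Keeping the command as an array of arguments rather than one string sidesteps quoting problems when the prompt contains spaces or quotes.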

Log in to Gemini inside the sandbox

Gemini runs inside an isolated Docker Sandbox, not on your host, so it needs credentials in that environment. Authenticate once with the login action:

./ralph.sh --login --agent gemini

This prints the login command for every supported agent, highlights the one for Gemini, and drops you into the sandbox shell. Inside, you run gemini once and complete its authentication (sign in with your Google account or set a GEMINI_API_KEY). The credential persists in that named sandbox, so later runs attach to the same box and start already authenticated.

Each agent gets its own deterministic sandbox name, derived from the agent slug, the project directory, and a hash of the absolute path:

ralph-<agent>-<project-dir>-<hash8>

For Gemini that is ralph-gemini-<project>-<hash8>. Print the exact name for your project without starting a run:

./ralph.sh --print-name --agent gemini

Per-agent names matter because they keep state separate. Your Gemini sandbox and your Claude sandbox never share credentials, history, or installed tools. If Gemini is not authenticated when the loop starts, Ralph watches for auth-failure patterns like API key not valid, stops, and tells you to run ./ralph.sh --login --agent gemini. No silent thrashing on a box that can never make progress.
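
To make the naming scheme concrete, here is a sketch that composes a name from the three parts. The text does not say which hash Ralph uses, so 32-bit FNV-1a stands in as an assumption, and the output will not match real Ralph sandbox names. hash8 and sandboxName are hypothetical helpers, and taking the last path segment as the project directory is also assumed.

```typescript
// FNV-1a (32-bit) stands in for Ralph's hash, which the text does not specify.
function hash8(input: string): string {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, then wrap to 32 bits.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

// Compose ralph-<agent>-<project-dir>-<hash8> from an absolute project path.
function sandboxName(agent: string, absolutePath: string): string {
  const segments = absolutePath.split("/").filter((s) => s !== "");
  const projectDir = segments.length > 0 ? segments[segments.length - 1] : "root";
  return "ralph-" + agent + "-" + projectDir + "-" + hash8(absolutePath);
}

// Same project, different agents: the names differ only in the agent slug.
const geminiBox = sandboxName("gemini", "/home/dev/shortener");
const claudeBox = sandboxName("claude", "/home/dev/shortener");
```

Hashing the full path is what keeps two checkouts with the same directory name from colliding, while the agent slug keeps credentials and history separate per agent.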

What happens each iteration (fresh context)

Ralph’s loop is the Bash loop Geoffrey Huntley described in the original Ralph writeup. Each pass is mechanical and identical:

Find the highest-priority incomplete task in .agent/tasks.json.
Work the steps in .agent/tasks/TASK-{ID}.json.
Run tests, linting, and type checking.
Complete the task, take a screenshot, update the task status, and commit.
Repeat until all tasks pass or the iteration cap is reached.

The critical part is that each iteration spawns a fresh Gemini process with a clean context window. The agent does not carry a bloated, hours-long transcript from one task to the next. It reads the current state from disk, does one task, and exits. That is the fix for context rot, the failure mode where an agent slowly loses the plot over a long session.

flowchart TD
    Start(["./ralph.sh -a gemini -- --model pro"]) --> Pick["Pick top task from .agent/tasks.json"]
    Pick --> Spawn["sbx run gemini . -- -p (fresh context)"]
    Spawn --> Work["Gemini reads state from disk, edits files, runs commands"]
    Work --> Verify["Run tests, lint, type check, screenshot"]
    Verify --> Commit["Commit and update task status"]
    Commit --> Check{"Promise tag emitted?"}
    Check -->|"none"| Pick
    Check -->|"COMPLETE"| Done(["exit 0, all tasks done"])
    Check -->|"BLOCKED or DECIDE"| Stop(["exit 2 or 3, wants a human"])

A loop also needs a stop condition that is a signal, not a vibe. Gemini emits a semantic promise tag in its final message, and Ralph reads it:

<promise>COMPLETE</promise> means every task is finished.
<promise>BLOCKED:reason</promise> means the agent needs human help.
<promise>DECIDE:question</promise> means it needs a decision you have to make.

Those map to exit codes: 0 for COMPLETE, 1 for hitting MAX_ITERATIONS, 2 for BLOCKED, and 3 for DECIDE. Wire those into a wrapper script or a CI step and you get clean branching: ship on 0, page yourself on 2 or 3, extend the cap on 1.

One rule keeps the whole thing reliable: one task per invocation. Gemini completes exactly one task, commits, and stops. It never batches several tasks into a single iteration, which is what keeps each commit small, each diff reviewable, and each context window focused on a single goal.

Verify every iteration with tests and screenshots

A loop is only as good as its feedback. If Gemini cannot tell whether its change worked, it will happily mark a broken task done and move on. The repo mantra is blunt: if you didn’t test it, it doesn’t work.

Ralph assumes a verification stack and runs it inside step three of every iteration:

Playwright for end-to-end tests.
Vitest for unit tests.
TypeScript for type checking.
ESLint for linting.
Prettier for formatting.

Gemini runs those commands itself, reads the failures, and fixes them before committing. Screenshots add a second channel: the agent captures the UI state so you can eyeball the result in the morning instead of reading diffs blind.
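To make the gate concrete, here is a minimal sketch of running the checks in order and stopping at the first red one. The run_gates helper is hypothetical and not part of Ralph, and the npx commands in the comment are assumptions about a typical TypeScript project, so swap in whatever your package scripts actually run.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Ralph: run each verification gate in
# order and stop at the first failure so the agent sees one clear error.
run_gates() {
  local gate
  for gate in "$@"; do
    printf 'gate: %s\n' "$gate"
    if ! sh -c "$gate"; then
      printf 'FAILED: %s\n' "$gate" >&2
      return 1
    fi
  done
  printf 'all gates green\n'
}

# Example call (commands are assumptions, adjust to your project):
# run_gates "npx tsc --noEmit" "npx eslint ." "npx prettier --check ." \
#           "npx vitest run" "npx playwright test"
```

Ordering the cheap checks first (types, lint, format) keeps a failing iteration short, because the slow end-to-end suite only runs once everything else is green.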

Because every iteration starts fresh, verification is also how the next iteration learns what the last one did. The agent does not remember the previous run. It reads the test results, the updated task status, and the new commits, then decides what is next. That feedback loop is the whole point, and it works the same across every agent in this family.

Safe autonomy: the sandbox is the boundary

For an unattended loop there is nobody to approve a file write or a shell command, so the agent has to run without pausing for permission. On your laptop that is reckless. Inside a sandbox it is fine, because the blast radius is the microVM, not your machine.

A Ralph loop runs Gemini inside a Docker Sandbox: an isolated microVM with its own kernel, an isolated filesystem, and a network that is deny-by-default. The sandbox is the boundary you enforce, so you do not need the agent policing itself. For the full argument, including why a microVM beats a hand-rolled container, read how to run AI coding agents in Docker sandboxes safely.

When the agent needs a package, the deny-by-default network blocks it until you allow the domain:

sbx policy allow network ralph-gemini-<project>-<hash8> registry.npmjs.org

That is a feature, not a hurdle. The agent can install what a task needs without a path to reach arbitrary hosts or exfiltrate your source. The Docker Sandboxes documentation covers the policy model in full, including the global -g form and the "**" wildcard for the rare case where you want to open everything.
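If a task needs more than one domain, a small wrapper saves retyping the sandbox name. This is a sketch under assumptions: allow_domains is a hypothetical helper, the SBX variable exists only so you can dry-run it with echo, and the usage comment assumes --print-name writes nothing but the name to stdout.

```shell
#!/usr/bin/env bash
# SBX is overridable so the helper can be dry-run with SBX=echo.
SBX="${SBX:-sbx}"

# allow_domains <sandbox-name> <domain>...
# Hypothetical helper: allow each domain for one named sandbox.
allow_domains() {
  local name="$1"
  shift
  local domain
  for domain in "$@"; do
    "$SBX" policy allow network "$name" "$domain" || return 1
  done
}

# Usage, assuming --print-name prints only the sandbox name:
# name="$(./ralph.sh --print-name --agent gemini)"
# allow_domains "$name" registry.npmjs.org
```

Keeping the list explicit, one domain per argument, preserves the point of deny-by-default: you can read exactly what the agent is allowed to reach.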

Inspect and debug the Gemini sandbox

When a run stalls or a task keeps failing, get inside the box. The sandbox is a normal container you can poke at. List what exists:

sbx ls

Open a shell in the Gemini sandbox and look around:

sbx exec -it ralph-gemini-<project>-<hash8> bash

From there you can check the working tree, re-run a failing test by hand, inspect installed tools, or read .agent/logs/LOG.md and the per-iteration logs in .agent/history/. Reattach to a sandbox session with:

sbx run ralph-gemini-<project>-<hash8>

Most stalls trace back to one of three things: Gemini was never authenticated (so every iteration fails the auth check), a network policy is blocking an install, or the prompt in .agent/PROMPT.md lacks a clear completion criterion. The sandbox shell shows you which one it is.

If you need to redirect a running loop without killing it, edit .agent/STEERING.md. Ralph reads it and folds critical work into the next iteration before resuming the normal task list. That is steering, not stopping, and it keeps momentum while you correct course.

Putting it together

A real Gemini loop, start to finish, is three commands:

# 1. authenticate once (creates the sandbox, you sign in inside it)
./ralph.sh --login --agent gemini

# 2. confirm the sandbox name for network policies and debugging
./ralph.sh --print-name --agent gemini

# 3. run the loop with a model, inside the sandbox boundary
./ralph.sh -a gemini -n 50 -- --model pro

That is Google’s Gemini CLI running unattended: a fresh context per iteration, state on disk, the microVM as the real boundary, and a hard stop on a completion promise. Define your tasks in .agent/tasks.json, write a clear .agent/PROMPT.md, and let it work through the list while you do something else.

Frequently asked questions

How do I run the Gemini CLI in a coding loop?

Use Ralph and pass the agent flag: ./ralph.sh --agent gemini. Ralph runs the Gemini CLI in its non-interactive -p prompt mode inside a Docker Sandbox, starts a fresh context window each iteration, and repeats until every task in .agent/tasks.json is done or the iteration cap is reached. The default is 10 iterations; raise it with -n 50.

How do I choose a Gemini model through Ralph?

Put it after the -- separator. Anything to the right of -- is forwarded to the agent, so ./ralph.sh -a gemini -- --model pro expands to sbx run gemini . -- --model pro -p with the Ralph prompt. The same separator works for any valid Gemini CLI flag.

How do I log in to Gemini inside the sandbox?

Run ./ralph.sh --login --agent gemini. It drops you into the sandbox shell where you run gemini once and complete its authentication, either by signing in with your Google account or setting a GEMINI_API_KEY. The credential persists in that named sandbox, so future runs attach to the same box already authenticated. The sandbox is named ralph-gemini-<project>-<hash8>.

Why does the loop start each iteration with a clean context?

Long sessions rot. An agent that carries an hours-long transcript loses track of the goal. Ralph spawns a fresh Gemini process every iteration with a clean context window, and the agent rebuilds its understanding from disk: .agent/tasks.json, the task spec files, .agent/logs/LOG.md, and the git history. That filesystem state is the memory layer, not the chat.

How does the loop know when to stop?

Gemini emits a promise tag in its final message. <promise>COMPLETE</promise> stops the loop with exit code 0, BLOCKED exits with 2, and DECIDE exits with 3. Hitting the iteration cap without completing exits with 1. You branch on those exit codes in a wrapper script or CI.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop

Install from npm Star on GitHub Watch the walkthrough

Completion Promises and Exit Codes: How a Ralph Loop Knows When to Stop

Thu, 19 Mar 2026 00:00:00 GMT

A Ralph loop stops on an explicit signal, not a guess. Each iteration the agent prints a promise tag, a short machine-readable status, and the loop reads it to decide whether to run again, stop, or hand control back to you. When the loop exits, ralph.sh returns a numeric exit code that tells you and any surrounding automation exactly why it stopped. No “looks done”, no agent declaring victory on a feeling. A signal the script can match, and a code your shell can branch on.

This matters because the alternative is an agent that loops forever, or one that quits the moment it gets tired of the task. The completion promise is the part of the Ralph technique that turns an open-ended loop into a finite, scriptable job.

How does a Ralph loop know when to stop?

It watches the agent’s output for a <promise> tag and reacts to it. There are three signals the agent can emit, and four primary exit codes the script can return. That is the whole stop mechanism.

The agent never decides when to terminate the process. It only reports status. The loop owns the decision to continue or halt, which keeps control in the script you can read and edit rather than buried inside the model.

The three promise tags

A promise tag is a semantic status the agent writes to its output. The format is fixed so the loop can pattern-match it reliably:

<promise>TYPE:content</promise>

The loop scans both the raw agent output and the final summary for these tags after every iteration. There are three that change control flow.
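Because the format is fixed, plain grep and sed are enough to pull tags out of a transcript. The sketch below is not Ralph's parser. The function names are hypothetical, and it assumes a tag never contains a < character inside its content.

```shell
#!/usr/bin/env bash
# promise_tags: read agent output on stdin, print each tag body on its own line.
promise_tags() {
  grep -o '<promise>[^<]*</promise>' \
    | sed -e 's/^<promise>//' -e 's#</promise>$##'
}

# promise_type <tag-body>: the part before the first colon (or the whole body).
promise_type() {
  printf '%s\n' "${1%%:*}"
}

# promise_content <tag-body>: the part after the first colon, empty if none.
promise_content() {
  case "$1" in
    *:*) printf '%s\n' "${1#*:}" ;;
    *)   : ;;
  esac
}
```

Splitting on the first colon only means a reason or question can itself contain colons without being truncated.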

<promise>COMPLETE</promise>

This means every task is finished. The agent has worked through .agent/tasks.json, verified each one, committed, and found nothing left to do. When the loop sees COMPLETE, it prints a success banner and exits cleanly with code 0.

<promise>COMPLETE</promise>

COMPLETE is the happy path. It is also the only tag that should be earned, not asserted. The agent is instructed to emit it only after the task list is empty and the verification gates have passed, which is the difference between real completion and an agent that wants to stop.

<promise>BLOCKED:reason</promise>

This means the agent cannot continue without you. A missing credential, an ambiguous external dependency, a failing service it does not control. The agent attaches a human-readable reason after the colon, and the loop surfaces it instead of silently stalling.

<promise>BLOCKED:Missing API credentials for the payments service</promise>

When the loop detects a BLOCKED tag, it extracts the reason, plays a notification, prints the blocked message with the iteration number, and exits with code 2. You read the reason, fix the thing, and run the loop again. The point is that a blocked agent tells you why it is blocked rather than thrashing on a task it can never finish.

<promise>DECIDE:question</promise>

This means the agent has hit a real decision point and wants your call before it commits to a direction. Not a blocker, a fork. Two valid architectures, a naming convention that will ripple across the codebase, a tradeoff the spec did not pin down.

<promise>DECIDE:Should the new endpoint use REST or GraphQL?</promise>

The loop extracts the question, notifies you, and exits with code 3. You answer the question (usually by updating the task spec or .agent/STEERING.md), then restart the loop so a fresh agent picks up with the decision settled.

There is a fourth tag worth knowing, even though it does not stop the loop. <promise>TASK-{ID}:DONE</promise> reports that a specific task finished during the current iteration. The loop collects these for progress display (the Tasks: TASK-1, TASK-2 line you see after each pass) but keeps running. It is bookkeeping, not a stop signal.
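A progress line like that can be assembled from the transcript in one pipeline. This is an illustrative sketch, not the script's own code: tasks_done_line is a hypothetical name, and the ID shape (TASK- followed by anything up to the colon) is an assumption.

```shell
#!/usr/bin/env bash
# tasks_done_line: read agent output on stdin and print a line such as
# "Tasks: TASK-1, TASK-2". Prints nothing when no task finished.
tasks_done_line() {
  local ids
  ids="$(grep -o '<promise>TASK-[^:<]*:DONE</promise>' \
    | sed -e 's/^<promise>//' -e 's/:DONE<\/promise>$//' \
    | paste -sd, - \
    | sed 's/,/, /g')"
  if [ -n "$ids" ]; then
    printf 'Tasks: %s\n' "$ids"
  fi
}
```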

Where the agent learns to emit these tags is the prompt file. The instruction to print COMPLETE, BLOCKED, or DECIDE lives in .agent/PROMPT.md, which is why writing a good PROMPT.md is what makes the promise reliable. A prompt that never tells the agent how to signal a blocker gets you an agent that fakes progress instead of asking for help.

The promise and exit code decision flow

Here is the control flow the loop runs after every iteration. The agent works one task, verifies it, commits, and prints its status. The loop reads the output and branches.

flowchart TD
    Start(["Start ralph.sh -n N"]) --> Run["Run iteration i: agent works one task"]
    Run --> Scan["Scan output and final summary for promise tags"]
    Scan --> Complete{"COMPLETE tag?"}
    Complete -->|Yes| Exit0["Exit 0 COMPLETE"]
    Complete -->|No| Blocked{"BLOCKED tag?"}
    Blocked -->|Yes| Exit2["Exit 2 BLOCKED, print reason"]
    Blocked -->|No| Decide{"DECIDE tag?"}
    Decide -->|Yes| Exit3["Exit 3 DECIDE, print question"]
    Decide -->|No| Cap{"i reached max iterations?"}
    Cap -->|Yes| Exit1["Exit 1 MAX_ITERATIONS"]
    Cap -->|No| Next["Increment i, fresh context"]
    Next --> Run

Read it as a priority order. COMPLETE wins first, then BLOCKED, then DECIDE. If none of the three fire and there is still iteration budget left, the loop spawns a fresh agent and runs again. Only when the budget runs out does it stop with the max-iterations code.

Exit codes and what to do with them

When ralph.sh returns, its exit code is the single source of truth for why it stopped. These are defined in scripts/lib/constants.sh:

Code	Name	Meaning
0	COMPLETE	All tasks finished and verified.
1	MAX_ITERATIONS	Hit the iteration cap with work pending.
2	BLOCKED	Agent needs human help.
3	DECIDE	Agent needs a human decision.
4	DOCKER_ERROR	The sandbox failed to start or run.
5	AUTH_ERROR	The agent is not authenticated.

Codes 0 through 3 map one-to-one onto the loop’s logical outcomes. Codes 4 and 5 are environment failures: the Docker Sandbox did not come up, or the agent CLI is not logged in. Treat 4 and 5 as setup problems to fix, not as something the loop did wrong.

A key point: exit code 1 is not a failure. It means the loop spent its iteration budget and there is still work in the queue, which is exactly what a safety cap is supposed to do. You read the log, decide whether to top up the budget, and run again.

Branching on the exit code in a script

Because the codes are standard, you can wrap the loop in a script and react to each outcome. A case statement on $? covers every branch:

#!/usr/bin/env bash
./ralph.sh -n 50
code=$?

case "$code" in
  0) echo "All tasks complete. Ship it." ;;
  1) echo "Hit the iteration cap. Topping up and rerunning." && ./ralph.sh -n 50 ;;
  2) echo "Blocked. A human needs to clear something." ;;
  3) echo "Decision needed. Check the question and update the spec." ;;
  4) echo "Sandbox failed to start. Check Docker." ;;
  5) echo "Not authenticated. Run ./ralph.sh --login first." ;;
  *) echo "Unexpected exit code: $code" ;;
esac

This is the difference between an agent loop you babysit and one you can automate. The script tells you whether to walk away, top up the budget, or step in.
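If you automate the top-up, bound it. The sketch below reruns a command only on exit code 1 and only a fixed number of times, so an unattended wrapper cannot quietly turn a safety cap into an unbounded run. run_with_topups is a hypothetical helper, not a Ralph feature.

```shell
#!/usr/bin/env bash
# run_with_topups <max_topups> <command...>
# Rerun the command while it exits 1 (iteration cap), at most <max_topups>
# extra times. Any other exit code is returned immediately.
run_with_topups() {
  local max_topups="$1"
  shift
  local attempt=0 code=0
  while :; do
    code=0
    "$@" || code=$?
    if [ "$code" -ne 1 ] || [ "$attempt" -ge "$max_topups" ]; then
      return "$code"
    fi
    attempt=$((attempt + 1))
    printf 'Iteration cap hit, top-up %d of %d\n' "$attempt" "$max_topups" >&2
  done
}

# Usage: at most two extra runs of 50 iterations each.
# run_with_topups 2 ./ralph.sh -n 50
```

Codes 2 and 3 pass straight through, because rerunning a blocked or undecided loop without a human change would only hit the same stop again.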

Branching on the exit code in CI

The same property makes the loop usable in continuous integration or a scheduled job. CI treats exit 0 as success and everything else as failure by default, which is almost right. You usually want to fail the pipeline on a blocker or a decision, succeed on completion, and treat the iteration cap as a soft outcome that needs a human glance.

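# Capture the exit code without tripping `set -e`, which many CI shells enable by default.
code=0
./ralph.sh -n 100 || code=$?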

# Complete is a pass. Blocked and Decide are hard failures.
# Max iterations is a soft outcome we flag but do not hard-fail.
if [ "$code" -eq 0 ]; then
  exit 0
elif [ "$code" -eq 1 ]; then
  echo "::warning::Ralph hit max iterations with work pending"
  exit 0
else
  echo "Ralph stopped with code $code"
  exit "$code"
fi

Running an agent unattended in CI only works because each agent runs inside an isolated Docker Sandbox microVM, so the loop can run in bypass-permissions mode without risking the host. The sandbox is the boundary, and the exit code is how the pipeline learns what happened inside it.

Why machine-verifiable completion beats “looks done”

The reason a Ralph loop emits COMPLETE only after tests pass, and not when the agent feels finished, is that agents are unreliable narrators of their own work. An agent will cheerfully tell you a feature is done while the build is red. “Looks done” is a vibe, and a loop driven by vibes either stops too early on broken code or never stops at all.

Machine-verifiable completion replaces the vibe with a gate. Before the agent is allowed to call a task done, it runs the project’s checks: Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for linting, Prettier for formatting. The repo mantra is blunt: if you didn’t test it, it doesn’t work. A task is not done because the agent says so. It is done because the suite is green.

This is why the promise is trustworthy. COMPLETE is downstream of a passing verification stack, not upstream of it. The agent cannot emit a real completion signal for code that fails its own checks, because the gate sits between “I wrote it” and “I am done”. The deeper version of this argument, including how tests and screenshots feed an agent the signal it needs to self-correct, is in verification loops for AI agents.

Tie this back to the tags. COMPLETE is verified work. BLOCKED is honest failure with a reason. DECIDE is honest uncertainty with a question. None of the three is a guess. That is the whole design goal: every way the loop can stop is a signal you can act on, not a state you have to interpret.

How max iterations interacts with the promise

The promise and the iteration cap are two independent stop conditions, and you want both. The promise stops the loop when the agent reaches a logical endpoint. The cap stops the loop when it has run long enough regardless of what the agent thinks.

You set the cap with -n or --max-iterations. The default is ten:

./ralph.sh -n 50
./ralph.sh --max-iterations 5
./ralph.sh --once

--once is the special case of a cap of one. It runs a single iteration and stops, which is how you smoke test a setup before turning it loose.

Internally the loop is a counted loop, roughly for i in 1..N. On each pass it runs the agent, then checks for COMPLETE, BLOCKED, and DECIDE in that order. If any tag fires, it exits with the matching code immediately, before the cap is relevant. If no tag fires, the loop checks whether i has reached N. If it has, the loop falls out the bottom and exits 1. If not, it increments and runs a fresh agent.
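The shape of that loop fits in a few lines of Bash. This is a simplified sketch of the control flow described above, not Ralph's source: run_loop is a hypothetical name, the agent is any command that prints its output to stdout, and notifications, logging, and error handling are left out.

```shell
#!/usr/bin/env bash
# run_loop <max_iterations> <agent command...>
# Returns 0 on COMPLETE, 2 on BLOCKED, 3 on DECIDE, 1 when the cap is reached.
run_loop() {
  local max="$1"
  shift
  local i=1 output
  while [ "$i" -le "$max" ]; do
    # Each pass runs the agent command afresh and captures its output.
    output="$("$@")" || true
    # case tries patterns top to bottom, which gives the priority order:
    # COMPLETE first, then BLOCKED, then DECIDE.
    case "$output" in
      *'<promise>COMPLETE</promise>'*)     return 0 ;;
      *'<promise>BLOCKED:'*'</promise>'*)  return 2 ;;
      *'<promise>DECIDE:'*'</promise>'*)   return 3 ;;
    esac
    i=$((i + 1))
  done
  # No tag fired within the budget: fall out the bottom.
  return 1
}
```

Putting all three patterns in one case statement is what enforces the priority: the first matching branch wins, so an output that contains both COMPLETE and DECIDE exits 0.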

So the two conditions race, and the promise almost always wins on a healthy run. The cap is the backstop for the cases where it does not: an agent stuck repeating a task without finishing it, a task list with a subtle dependency the agent keeps tripping over, a spec that is too vague to ever satisfy. Without a cap, those situations become a money fire that runs until you notice. With a cap, the worst case is a bounded, inspectable run that exits 1 and waits for you.

A practical pattern is to set the cap above your honest estimate of the task count, then read the exit code. Exit 0 means the agent finished inside the budget. Exit 1 means it did not, which is your cue to inspect the log and figure out whether the loop is making progress slowly or thrashing in place. Thrashing against the cap is one of the classic Ralph loop failure modes, and the cap is precisely the guardrail that keeps it from running up an unbounded bill.

One more interaction worth naming. BLOCKED and DECIDE short-circuit the cap entirely. If the agent hits a blocker on iteration three of a fifty-iteration budget, the loop stops at three with code 2. It does not waste the remaining forty-seven iterations re-discovering the same blocker, because a fresh-context agent would just hit the same wall. Stopping early and telling you the reason is the correct behavior.

Putting it together

The completion promise is a small mechanism with a large payoff. Three tags the agent emits, four primary exit codes the script returns, and a strict priority order between them. That is enough to make a loop both safe to leave running and easy to wire into automation.

The flow, end to end:

The agent works one verified task and prints a status tag.
The loop reads the tag and branches: COMPLETE exits 0, BLOCKED exits 2, DECIDE exits 3.
If no tag fires and budget remains, a fresh agent runs again.
If the budget runs out first, the loop exits 1.
Your script or CI reads the exit code and decides what happens next.

The promise is what stops the loop on a signal. Verification is what makes the COMPLETE signal honest. The iteration cap is what stops the loop when no signal comes. Use all three together and you get an agent you can run overnight and trust the exit code in the morning.

Frequently asked questions

What is a completion promise in a Ralph loop?

It is a machine-readable status tag the agent prints to its output, in the format <promise>TYPE:content</promise>. The loop scans for it after every iteration and uses it to decide whether to continue, stop, or hand control back to you. The three control-flow tags are COMPLETE, BLOCKED, and DECIDE.

What are the Ralph loop exit codes?

There are six. 0 means COMPLETE, all tasks finished and verified. 1 means MAX_ITERATIONS, the loop hit its iteration cap with work pending. 2 means BLOCKED, the agent needs human help. 3 means DECIDE, the agent needs a human decision. 4 means a Docker Sandbox error, and 5 means an authentication error. They are defined in scripts/lib/constants.sh.

Is exit code 1 a failure?

No. Exit code 1 means the loop reached its iteration cap while tasks were still pending, which is the safety cap working as designed. You read the log, decide whether to add more iterations, and run the loop again. It is not an error, just an unfinished run.

How does the loop avoid stopping on broken code?

The agent is only allowed to emit COMPLETE after the verification stack passes: Playwright, Vitest, TypeScript, ESLint, and Prettier. Completion is downstream of a green test suite, not upstream of it, so the agent cannot signal done on code that fails its own checks.

What is the difference between BLOCKED and DECIDE?

BLOCKED means the agent cannot continue without external help, like a missing credential or a failing service it does not control. DECIDE means the agent reached a fork it can technically pass but wants your input on, like choosing between two valid architectures. BLOCKED exits with code 2 and DECIDE exits with code 3.

Observability for Autonomous Coding Agents: Logs, History, and Live Output

Sat, 14 Mar 2026 00:00:00 GMT

You cannot trust what you cannot see. An autonomous coding agent that runs for hours while you sleep is a black box unless you instrument it, so the rule for AI agent observability is simple: make every iteration leave a trail you can read after the fact and a live signal you can read during the run. Ralph does both. It streams a parsed preview of what the agent is doing right now, classifies each line into a named step, writes a clean log per iteration to disk, records a running progress file, captures screenshots per task, and commits after every task so the git log doubles as an audit trail.

This is the observability piece of the larger guide to running an AI coding agent overnight. Long runs fail quietly when you have no visibility into them, so the surfaces below are what let you walk away from a loop and still know exactly what happened when you walk back.

Why observability is the difference between trust and hope

A single prompt finishes in seconds and you read the diff. A loop of 50 iterations runs for an hour or more, edits dozens of files, runs tests, and commits along the way. If the only thing you have at the end is a final message, you are hoping the agent did the right thing 50 times in a row. Hope is not a review process.

Observability replaces hope with evidence. Every claim the agent makes (“tests pass”, “task done”) should have a file you can open to verify it. Every minute the loop spends should map to a step you can name. Every commit should be small enough to read. When those three things are true, you can audit an overnight run in the time it takes to drink a coffee, and you can catch a loop that has gone sideways before it burns another twenty iterations.

The repo mantra applies here too: if you didn’t test it, it doesn’t work. The corollary for observability is that if you cannot see it, you cannot trust it.

What an autonomous coding agent should expose

Ralph wraps your chosen agent (claude, codex, cursor, gemini, copilot, or opencode) and turns its raw output stream into a set of observability surfaces. You start a run the usual way:

./ralph.sh -n 50

From that point on, five surfaces are live or being written. Here is each one and where it lives.

Live stream preview and step detection

While the agent works, Ralph reads its stream-json output line by line, parses out the text and tool calls, and shows two things under a spinner: the current step name and a dimmed rolling preview of the latest line. You are not watching a frozen spinner wondering if the process hung. You are watching a parsed feed of what the agent is touching right now.

The step name comes from a classifier that maps output patterns to one of fourteen named steps:

Thinking, Planning, Reading code, Web research, Implementing,
Debugging, Writing tests, Testing, Linting, Typechecking,
Installing, Verifying, Waiting, Committing

A line that calls a Write or Edit tool with a file path reads as Implementing. A vitest or playwright invocation reads as Testing. A git commit reads as Committing. An eslint or prettier run reads as Linting. The point is not perfect accuracy on every line. The point is that at a glance you know whether the agent is reading code, writing it, or running tests, without parsing raw JSON yourself.

The Waiting step is the one to watch. It fires on patterns like a question prompt or “blocked on”, which on an unattended loop usually means the agent is stuck asking for input that nobody is there to give.
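A classifier like this can be little more than an ordered case statement. The sketch below covers only the patterns named above and is not Ralph's implementation: classify_line is a hypothetical name, the "name":"Write" match is an assumption about how tool calls appear in the stream, and falling back to Thinking is an arbitrary choice.

```shell
#!/usr/bin/env bash
# classify_line <line>: map one line of agent output to a step name.
# Patterns are tried top to bottom and the first match wins.
classify_line() {
  case "$1" in
    *'git commit'*)                         echo "Committing" ;;
    *vitest*|*playwright*)                  echo "Testing" ;;
    *eslint*|*prettier*)                    echo "Linting" ;;
    *'"name":"Write"'*|*'"name":"Edit"'*)   echo "Implementing" ;;
    *'blocked on'*|*'?')                    echo "Waiting" ;;
    *)                                      echo "Thinking" ;;
  esac
}
```

The order matters: a line that mentions both a commit and a test runner is reported as Committing here, which is the kind of approximation the step display tolerates.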

Per-iteration history in .agent/history/

Every iteration writes its full output, with the ANSI color codes stripped out, to a timestamped file:

.agent/history/ITERATION-<session>-<n>.txt

The session id is a YYYYMMDD-HHMMSS stamp taken when the run starts, so a fresh run never overwrites the history of an earlier one. Iteration 7 of a run that started at 02:15:00 lands in ITERATION-20260314-021500-7.txt. To replay what the agent thought and did on a specific iteration, open that file. To watch the history while a run is going, point tail at the files in the directory. The glob matches every history file that exists when the command starts, so tail follows all of them, not only the latest:

tail -f .agent/history/ITERATION-*.txt

This is the most underrated surface. The live preview is ephemeral, but the history file is the full, clean transcript of a single fresh-context iteration. When a task goes wrong three iterations back, this is where you find out why.
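A few helpers make the history directory easier to work with. These are sketches, not part of Ralph, and all three function names are hypothetical. latest_history picks the newest file by modification time, because the iteration number is not zero-padded and a name sort would put iteration 10 before iteration 9.

```shell
#!/usr/bin/env bash
# history_path <session> <n>: build a file name in the documented pattern.
history_path() {
  printf '.agent/history/ITERATION-%s-%s.txt\n' "$1" "$2"
}

# latest_history [dir]: newest history file by modification time.
latest_history() {
  ls -t "${1:-.agent/history}"/ITERATION-*.txt 2>/dev/null | head -n 1
}

# strip_ansi: remove ANSI color sequences (ESC [ ... m) from stdin.
strip_ansi() {
  local esc
  esc="$(printf '\033')"
  sed "s/${esc}\[[0-9;]*m//g"
}

# Follow only the iteration that is currently being written:
# tail -f "$(latest_history)"
```

When a new iteration starts, rerun the tail command, since the newest file is resolved once at launch.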

The progress log: .agent/logs/LOG.md

.agent/logs/LOG.md is the human-readable run journal. Ralph creates it on first run, and the agent adds an entry per task with the date, a brief summary, and the path to the screenshot it captured, with the newest entry at the top. It is the high-level story of the run, where the history files are the line-by-line detail.

Read it in the morning and you get the narrative: what shipped, in what order (newest first, so read from the bottom up for chronological order), and where to look for the visual proof of each step.

# the story of the run, newest first
head -n 40 .agent/logs/LOG.md

Screenshots per task

Step four of every iteration is “complete the task, take a screenshot, update status, and commit.” The agent saves UI screenshots to .agent/screenshots/TASK-<id>-<index>.png and references that path in the log entry. For anything with a UI, this is the difference between trusting a green test and seeing the rendered result.

Screenshots also feed back into the loop. When the agent debugs a regression, it uses earlier screenshots as a reference for what the UI looked like before the change. That makes the screenshot folder both an audit artifact for you and a memory aid for the next fresh-context iteration.

Timing metrics per step and per iteration

Ralph times each iteration and each step inside it. After every iteration it prints the iteration duration, the delta against the previous iteration (green when faster, red when slower, in a stock-ticker style), a running average, and the total elapsed time. It also breaks the iteration down by step, sorted by time spent, so you can see that an iteration spent most of its minutes in Testing and Debugging rather than Implementing.

At the end of the run it prints session totals across all iterations. Timing is a cheap, powerful signal: iterations that keep getting longer, or that spend a growing share of time in Debugging, are a thrashing loop telling on itself before it blows your budget. Watching that trend is the core of cost control for autonomous AI coding agents.
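The arithmetic behind that line is simple enough to reproduce when you post-process your own logs. A minimal sketch, assuming you already have iteration durations in whole seconds, oldest first. timing_line is a hypothetical helper, and the output format is invented for illustration.

```shell
#!/usr/bin/env bash
# timing_line <duration>...: print the last duration, its delta against the
# previous iteration, the running average, and the total (all in seconds).
timing_line() {
  local total=0 count=0 prev="" last="" d
  for d in "$@"; do
    prev="$last"
    last="$d"
    total=$((total + d))
    count=$((count + 1))
  done
  [ "$count" -gt 0 ] || return 1
  local delta="n/a"
  if [ -n "$prev" ]; then
    delta=$((last - prev))
    if [ "$delta" -gt 0 ]; then delta="+$delta"; fi
    delta="${delta}s"
  fi
  printf 'last=%ss delta=%s avg=%ss total=%ss\n' \
    "$last" "$delta" "$((total / count))" "$total"
}

# timing_line 120 150 180
```

A run of positive deltas across several iterations is the trend to act on. A single slow iteration usually just means a bigger task.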

Here is how the surfaces sit around a single iteration of the loop.

flowchart TD
    Agent["Agent runs (fresh context)"] --> Stream["stream-json output"]
    Stream --> Live["Live: spinner step + rolling preview"]
    Stream --> Detect["detect_step: Thinking / Implementing / Testing ..."]
    Stream --> Hist[".agent/history/ITERATION-(session)-(n).txt"]
    Agent --> Shots[".agent/screenshots/TASK-(id)-(index).png"]
    Agent --> Log[".agent/logs/LOG.md (newest first)"]
    Agent --> Commit["git commit per task"]
    Detect --> Timing["Per-step and per-iteration timing"]
    Agent --> Promise{"Promise tag?"}
    Promise -->|"none"| Next["Next iteration"]
    Promise -->|"BLOCKED or DECIDE"| Notify["Desktop notification + sound"]
    Promise -->|"COMPLETE"| Exit["exit 0"]

Git history as the audit trail

The strongest observability surface is one you already know how to read. Ralph follows one rule: one task per invocation. The agent completes exactly one task, commits, and stops, then the next iteration starts fresh. It never batches several tasks into a single commit. That discipline, covered in depth in the pillar on running an agent overnight, means the git log is a clean, chronological record of the whole run, one commit per finished task.

So your morning review is a normal code review:

# what landed overnight, one line per task
git log --oneline --since="12 hours ago"

# the full diff for a single suspicious task
git show <commit>

# everything since you walked away
git diff HEAD@{12.hours.ago}

Small commits keep each diff reviewable, which is the whole reason the one-task rule exists. A loop that commits 40 tiny, well-scoped changes is auditable. A loop that drops one giant commit at the end is not. Git history is also the agent’s memory layer: each fresh-context iteration reads recent commits, the task list, and the progress log to reorient, rather than carrying a bloated transcript forward.

Notifications when the agent needs you

Most of a loop runs without you. The two moments you actually need to know about are when the agent gets stuck or hits a fork it cannot resolve alone. Ralph surfaces both through promise tags the agent emits in its final message:

<promise>COMPLETE</promise> means every task is done. The loop exits with code 0.
<promise>BLOCKED:reason</promise> means the agent needs human help. The loop exits with code 2.
<promise>DECIDE:question</promise> means it needs a decision you have to make. The loop exits with code 3.

Hitting the iteration cap without completing exits with code 1. The full mechanics of how these signals stop a run, and why a loop should stop on an explicit promise rather than a vibe, are in the guide to completion promises and exit codes.

For observability, the important part is that BLOCKED and DECIDE are not silent. When either fires, Ralph plays a notification sound and sends a desktop notification (via osascript on macOS, notify-send on Linux, or PowerShell on Windows) with the reason or the question. You can leave a loop running in another workspace and trust that your machine will get your attention the moment a human is actually needed, instead of finding a stalled run an hour later. When you do get pulled in, you often do not need to stop the loop at all. You can redirect it with a STEERING.md file that injects work into a running agent.
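If you want the same kind of alert from your own wrapper script, the per-platform dispatch looks roughly like this. It is a sketch, not Ralph's notifier: pick_notifier and notify are hypothetical names, the PowerShell branch is left as a plain printed message, and the osascript call passes the text as arguments so quotes in the reason cannot break the AppleScript.

```shell
#!/usr/bin/env bash
# pick_notifier <os-name>: map `uname -s` output to a notification tool.
pick_notifier() {
  case "$1" in
    Darwin) echo "osascript" ;;
    Linux)  echo "notify-send" ;;
    *)      echo "powershell" ;;
  esac
}

# notify <title> <message>: best-effort desktop notification.
notify() {
  local title="$1" message="$2"
  case "$(pick_notifier "$(uname -s)")" in
    osascript)
      osascript -e 'on run argv' \
        -e 'display notification (item 2 of argv) with title (item 1 of argv)' \
        -e 'end run' "$title" "$message" ;;
    notify-send)
      notify-send "$title" "$message" ;;
    *)
      # Windows path omitted from this sketch; fall back to the terminal.
      printf '%s: %s\n' "$title" "$message" ;;
  esac
}

# notify "Ralph" "Loop stopped with exit code 2"
```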

How to read the signals and catch a stuck loop early

Visibility only helps if you know which patterns mean trouble. Here is how the surfaces combine into early warnings.

Iteration time climbing without new commits. If durations grow but git log shows no new tasks landing, the agent is spinning. The timing line and the commit log together catch this faster than either alone.

A step breakdown dominated by Debugging or Testing. A healthy iteration spends real time in Implementing. When the per-step breakdown is mostly Debugging across several iterations, the agent is fighting the same failure. Open the latest .agent/history/ file to see which one.

The Waiting step on an unattended run. Waiting means the agent is asking for input. With nobody at the keyboard, that iteration will not progress. This is a prompt problem: the task lacks a clear completion criterion, so the agent does not know it is allowed to finish.

A BLOCKED or DECIDE notification. This is the agent doing the right thing. It hit a wall it cannot or should not pass alone, emitted the promise, and stopped with a non-zero exit code. Read the reason in the notification and in the on-screen message, fix or decide, then run ./ralph.sh again to resume.

A flat LOG.md. If the progress log stops getting new entries while the run is still going, the agent is not completing tasks. Cross-check against the live step and the history file to see where it is stuck.
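
The flat-log check is easy to script. This is a rough sketch, not something Ralph ships: the log_is_stale helper and the 15-minute threshold are assumptions, and the path is the LOG.md location described above.

```shell
#!/usr/bin/env bash
# Return 0 when FILE exists and was last modified more than MINUTES ago.
# Return 1 when it is fresher than that, and 2 when it does not exist.
log_is_stale() {
  local file=$1 minutes=$2
  [ -f "$file" ] || return 2
  # find prints the path only if the file is older than the threshold
  [ -n "$(find "$file" -mmin "+$minutes" -print)" ]
}

# Usage from the project root while a run is going:
#   if log_is_stale .agent/logs/LOG.md 15; then
#     echo "no LOG.md entry in 15 minutes; recent commits:"
#     git log --oneline -5
#   fi
```

Pairing the stale-log check with the last few commits gives the same cross-check described above: a quiet log plus no new commits points to a spinning agent.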

When the signals point somewhere you need to inspect directly, get inside the sandbox. Ralph runs each agent in an isolated Docker Sandbox, and you can open a shell in the running box to re-run a failing command, read the logs from inside, or check the working tree:

sbx ls
sbx exec -it <sandbox-name> bash

Print the exact sandbox name for your project with ./ralph.sh --print-name. Between the live stream, the per-iteration history, the progress log, the screenshots, the timing metrics, the git log, and the notifications, an autonomous run stops being a black box. You get a system you can audit while it runs and after it finishes, which is the only honest basis for letting an agent code unattended.

Frequently asked questions

What does observability mean for an autonomous coding agent?

It means every iteration leaves a trail you can read and a live signal you can watch. For a Ralph loop that is a parsed live stream with step detection, a clean per-iteration transcript in .agent/history/, a running progress log in .agent/logs/LOG.md, screenshots per task in .agent/screenshots/, timing metrics per step and per iteration, and one git commit per task. Together they let you audit a run while it happens and after it finishes.

Where does Ralph store per-iteration logs and history?

Each iteration writes its full output with ANSI codes stripped to .agent/history/ITERATION-<session>-<n>.txt, where session is a YYYYMMDD-HHMMSS stamp taken at the start of the run so new runs never overwrite old history. The higher-level run journal lives in .agent/logs/LOG.md, which the agent appends to per task with a date, a summary, and a screenshot path, newest entry first.

How do I watch what an AI agent is doing in real time?

Ralph reads the agent's stream-json output line by line and shows a spinner with the current step name plus a dimmed rolling preview of the latest line. The step comes from a classifier that maps output to fourteen named steps such as Thinking, Implementing, Testing, and Committing. To follow the full transcript live, run tail -f on the .agent/history/ file for the current iteration.
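
Finding the file for the current iteration can be scripted. Here is a small sketch that assumes the newest ITERATION-*.txt file by modification time is the one being written. The latest_history_file helper is hypothetical:

```shell
#!/usr/bin/env bash
# Print the most recently modified iteration transcript in DIR
# (default: .agent/history). Prints nothing if there is none yet.
# Parsing ls is acceptable here because the file names contain only
# a timestamp and an iteration number.
latest_history_file() {
  local dir=${1:-.agent/history}
  ls -t "$dir"/ITERATION-*.txt 2>/dev/null | head -n 1
}

# Usage from the project root:
#   tail -f "$(latest_history_file)"
```

When a new iteration starts, tail keeps following the old file, so re-run the command to switch to the new transcript.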

How do I know when an autonomous agent needs me?

The agent emits a promise tag in its final message. BLOCKED means it needs help and exits with code 2, DECIDE means it needs a decision and exits with code 3, COMPLETE means it finished and exits with code 0, and hitting the iteration cap exits with code 1. When BLOCKED or DECIDE fires, Ralph plays a sound and sends a desktop notification with the reason, so you can leave the loop running and still be pulled in only when a human is required.

How do I catch a stuck or thrashing agent loop early?

Watch three things together. Iteration time climbing with no new commits in git log means the agent is spinning. A per-step breakdown dominated by Debugging across iterations means it is fighting the same failure. The Waiting step on an unattended run means the task lacks a clear completion criterion. When any of these show up, open the latest .agent/history/ file or shell into the sandbox with sbx exec to see exactly where it is stuck.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop

Install from npm Star on GitHub Watch the walkthrough

Running the Cursor CLI Agent in a Loop

Tue, 10 Mar 2026 00:00:00 GMT

To run the Cursor CLI agent in an autonomous loop, point Ralph at it with one flag: ./ralph.sh --agent cursor. Ralph wraps the headless cursor-agent in print mode, starts it with a fresh context window every iteration, and keeps re-running it against your task list until the work is done or you hit the iteration cap. Log in once inside the sandbox, pass Cursor’s own flags after a -- separator, and review a finished diff in the morning.

This is the Cursor-specific walkthrough in the larger guide to agentic coding CLIs. The loop mechanics are identical to running Claude Code in a loop; only the agent binary and its flags change.

Run the Cursor CLI agent in a loop with one flag

Ralph is a Bash script you point at a project. The default agent is Claude, so you switch to Cursor explicitly:

./ralph.sh --agent cursor

That runs 10 iterations, the default. Tune the count when you want a longer unattended run or a single dry run:

# 50 iterations
./ralph.sh --agent cursor -n 50

# exactly one iteration (good for a smoke test)
./ralph.sh --agent cursor --once

# explicit cap
./ralph.sh --agent cursor --max-iterations 5

The short form -a works too: ./ralph.sh -a cursor -n 5. Supported agents are claude (default), codex, copilot, cursor, gemini, and opencode, so the same harness drives any of them.

Under the hood, Ralph builds a cursor-agent command and runs it inside a Docker Sandbox. The expansion for ./ralph.sh --agent cursor looks like this:

sbx run --name ralph-cursor-<project>-<hash8> cursor . -- -p "$PROMPT_CONTENT"

The -p flag (long form --print) is Cursor’s headless mode. It prints the agent’s responses to the console for scripts and non-interactive use, and the Cursor CLI parameter reference is clear that print mode still has access to all tools, including write and shell. That is exactly what a loop needs: an agent that can edit files and run commands without a person in the chair. Print mode reads the prompt, does the work, prints a final message, and exits, and that exit is what lets Ralph treat each iteration as a discrete unit.

Set up and log in to Cursor inside the sandbox

Cursor runs inside an isolated Docker Sandbox, not on your host, so it needs credentials in that environment. Start by dropping Ralph into your project:

npx @pageai/ralph-loop

This adds the ralph.sh script and the .agent/ directory that holds the prompt, the task list, and the logs. Then authenticate once with the login action:

./ralph.sh --login --agent cursor

This prints the login command for every supported agent, highlights the one for Cursor, and then drops you into the sandbox shell. Inside, you authenticate Cursor once. Run cursor-agent login and follow the prompt, or provide a key through the CURSOR_API_KEY environment variable. Confirm the session with cursor-agent status. The credential persists in that named sandbox, so later runs attach to the same box and start already logged in.

Each agent gets its own deterministic sandbox name, derived from the agent slug, the project directory, and a hash of the absolute path:

ralph-<agent>-<project-dir>-<hash8>

For Cursor that is ralph-cursor-<project>-<hash8>. Print the exact name for your project without starting a run:

./ralph.sh --print-name --agent cursor

Per-agent names matter because they keep state separate. Your Cursor sandbox and your Claude sandbox never share credentials, history, or installed tools, so you can run both against the same repo without one clobbering the other. If Cursor is not authenticated when the loop starts, Ralph detects the auth failure, stops, and tells you to run ./ralph.sh --login --agent cursor. No silent thrashing on a box that can never make progress.
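
To make the naming scheme concrete, here is an illustrative sketch of how such a name could be derived. The hash function is an assumption (the first 8 hex characters of SHA-256 over the absolute path), and so is the sandbox_name_sketch helper. Ralph's real derivation may differ, so always take the actual name from --print-name:

```shell
#!/usr/bin/env bash
# Build a name of the form ralph-<agent>-<project-dir>-<hash8>.
# ASSUMPTION: hash8 is the first 8 hex characters of SHA-256 over the
# absolute path. This is for illustration only and may not match the
# name Ralph generates.
sandbox_name_sketch() {
  local agent=$1 abs_path=$2 project hash
  project=$(basename "$abs_path")
  if command -v sha256sum >/dev/null 2>&1; then
    hash=$(printf '%s' "$abs_path" | sha256sum | cut -c1-8)
  else
    # macOS ships shasum instead of sha256sum
    hash=$(printf '%s' "$abs_path" | shasum -a 256 | cut -c1-8)
  fi
  printf 'ralph-%s-%s-%s\n' "$agent" "$project" "$hash"
}

# Usage:
#   sandbox_name_sketch cursor "$PWD"
```

The point of the sketch is the property, not the exact hash: the same agent and path always yield the same name, while a different agent or a different checkout of the same project yields a different one.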

Pass a model and flags after the -- separator

Everything to the right of Ralph’s own -- separator is forwarded straight to the agent. For Cursor, Ralph inserts those arguments right after -p, before the prompt. So this:

./ralph.sh --agent cursor -- --model auto

expands to:

sbx run --name ralph-cursor-<project>-<hash8> cursor . -- -p --model auto "$PROMPT_CONTENT"

The --model flag picks the model for the run. Do not guess at model names: run cursor-agent models (or pass --list-models) inside the sandbox to print the exact identifiers Cursor accepts, then pass one of those. The separator works for any valid cursor-agent flag, not just the model. A few you will reach for in a loop:

# pick a model (list them first with cursor-agent models)
./ralph.sh --agent cursor -- --model auto

# never pause to approve a command (alias: --yolo)
./ralph.sh --agent cursor -- --force

# emit newline-delimited JSON events instead of plain text
./ralph.sh --agent cursor -- --output-format stream-json

The rule to remember: everything left of -- configures Ralph (agent, iteration count, login). Everything right of -- configures the agent. Keep them on the correct side and the loop behaves.
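
The left-of-separator and right-of-separator split is a common Bash pattern. The following sketch shows one way to do it. It is not Ralph's actual parser, and the split_at_separator function and both array names are hypothetical:

```shell
#!/usr/bin/env bash
# Split "$@" at the first literal "--".
# Everything before it lands in HARNESS_ARGS, everything after it in
# AGENT_ARGS. Both array names are made up for this example.
split_at_separator() {
  HARNESS_ARGS=()
  AGENT_ARGS=()
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--" ]; then
      shift
      AGENT_ARGS=("$@")   # forward the rest untouched
      return 0
    fi
    HARNESS_ARGS+=("$1")
    shift
  done
}

# Usage:
#   split_at_separator --agent cursor -n 50 -- --model auto --force
#   printf '%s\n' "${HARNESS_ARGS[@]}"   # harness options, one per line
#   printf '%s\n' "${AGENT_ARGS[@]}"     # forwarded agent flags, one per line
```

Keeping the forwarded flags in an array preserves quoting, so a flag value that contains spaces reaches the agent as a single argument.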

Two of those flags deserve a note. -f (long form --force, alias --yolo) tells Cursor to allow commands unless a rule explicitly denies them, which is what keeps an unattended run from stalling on an approval prompt nobody is there to answer. And --output-format (which only works alongside --print) switches the stream from text to json or stream-json, useful when a downstream CI step parses events with jq. Ralph already records each iteration’s cleaned output to .agent/history/ and the running log to .agent/logs/LOG.md, so you get a per-iteration trail regardless of which format you pick.

What happens each iteration

Ralph’s loop is the Bash loop Geoffrey Huntley described in the original Ralph writeup. Each pass is mechanical and identical:

Find the highest-priority incomplete task in .agent/tasks.json.
Work the steps in .agent/tasks/TASK-{ID}.json.
Run tests, linting, and type checking.
Complete the task, take a screenshot, update the task status, and commit.
Repeat until all tasks pass or the iteration cap is reached.

The critical part is that each iteration spawns a fresh cursor-agent -p with a clean context window. The agent does not carry a bloated, hours-long transcript from one task to the next. It reads the current state from disk, does one task, and exits. This is why the loop deliberately does not use Cursor’s own --resume or --continue session flags: continuity is a liability here, not a feature.

flowchart TD
    Start(["./ralph.sh --agent cursor"]) --> Pick["Pick top task from .agent/tasks.json"]
    Pick --> Spawn["sbx run cursor . -- -p (fresh context)"]
    Spawn --> Work["cursor-agent reads state, edits files, runs commands"]
    Work --> Verify["Run tests, lint, type check, screenshot"]
    Verify --> Commit["Commit and update task status"]
    Commit --> Check{"Promise tag emitted?"}
    Check -->|"none"| Pick
    Check -->|"COMPLETE"| Done(["exit 0, all tasks done"])
    Check -->|"BLOCKED or DECIDE"| Stop(["exit 2 or 3, wants a human"])

The filesystem and git history are the memory layer. Progress lives in .agent/tasks.json, .agent/logs/LOG.md, per-task spec files, and the git log, not in a chat transcript. That separation of thinking (ephemeral, per iteration) from state (durable, on disk) is what keeps a fresh-context agent oriented across dozens of iterations. The deeper version of this idea is in the guide to running an AI coding agent overnight.

A loop also needs a stop condition that is a signal, not a vibe. Cursor emits a semantic promise tag in its final message, and Ralph reads it:

<promise>COMPLETE</promise> means every task is finished.
<promise>BLOCKED:reason</promise> means the agent needs human help.
<promise>DECIDE:question</promise> means it needs a decision you have to make.

One rule keeps the whole thing reliable: one task per invocation. Cursor completes exactly one task, commits, and stops. It never batches several tasks into a single iteration, which is what keeps each commit small, each diff reviewable, and each context window focused.
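
The control flow described in this section fits in a few lines of Bash. This is a simplified stand-in for illustration, not ralph.sh itself. The run_loop function is hypothetical, and it assumes the promise tag appears somewhere in the output of each run:

```shell
#!/usr/bin/env bash
# run_loop MAX CMD [ARGS...]
# Runs CMD up to MAX times and inspects each run's output for a promise
# tag. Returns 0 on COMPLETE, 2 on BLOCKED, 3 on DECIDE, and 1 when the
# cap is reached without a stop signal.
run_loop() {
  local max=$1 i output
  shift
  for ((i = 1; i <= max; i++)); do
    # Each pass is a separate invocation, so nothing carries over
    # except what the command wrote to disk.
    output=$("$@" 2>&1) || true
    case "$output" in
      *"<promise>COMPLETE</promise>"*) return 0 ;;
      *"<promise>BLOCKED:"*)           return 2 ;;
      *"<promise>DECIDE:"*)            return 3 ;;
    esac
  done
  return 1
}

# Usage, where run-one-iteration.sh is a hypothetical script that
# starts the agent once and prints its final message:
#   run_loop 10 ./run-one-iteration.sh
```

Any output without one of the three stop tags simply falls through to the next pass, which is how a finished single task leads to the next iteration rather than ending the run.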

Isolate Cursor in a Docker Sandbox and review the diff in the morning

An autonomous agent that can write files and run shell commands is exactly as dangerous as the permissions it inherits. Run cursor-agent --force on your laptop and it can touch your SSH keys, your environment variables, and anything else your user can reach. The fix is not to make the agent more timid. The fix is to change the blast radius.

Ralph runs each agent inside a Docker Sandbox, an isolated microVM managed by the sbx CLI. Inside that boundary, Cursor runs in its --force (YOLO) mode without you previewing every command, because the sandbox is the boundary the agent cannot cross. The microVM has its own kernel, an isolated filesystem, and a network that is deny-by-default.

Cursor also ships its own internal sandbox, toggled with --sandbox enabled or --sandbox disabled. Running a second sandbox inside the microVM buys you nothing, because the microVM already contains the agent, and it can introduce nested-isolation friction. So a practical pairing for a loop is to let the microVM do the isolating and let Cursor focus on the coding:

./ralph.sh --agent cursor -n 50 -- --model auto --force --sandbox disabled

When the agent needs a package, the deny-by-default network blocks it until you allow the domain:

sbx policy allow network ralph-cursor-<project>-<hash8> registry.npmjs.org

That is a feature. The agent can install what the task needs without a path to exfiltrate your source or reach arbitrary hosts. The full argument, including how the microVM compares to a hand-rolled container, lives in how to run AI coding agents in Docker sandboxes safely, and the Docker Sandboxes documentation covers the policy model in detail.

This is the payoff of looping overnight. Because every task is committed separately, the morning review is a git review, not an archaeology dig:

git log --oneline
git diff main...HEAD

You read the commits in order, eyeball the screenshots the agent captured, and accept or revert. The work arrived as small, verified, individually committed units, so a single bad task is one revert, not a tangled mess you have to unwind by hand.

Verify every iteration with the test stack

A loop is only as good as its feedback. If Cursor cannot tell whether its change worked, it will happily mark a broken task done and move on. The repo mantra is blunt: if you didn’t test it, it doesn’t work.

Ralph assumes a verification stack and runs it inside step three of every iteration:

Vitest for unit tests.
Playwright for end-to-end tests.
TypeScript for type checking.
ESLint for linting.
Prettier for formatting.

Most projects wire these into npm scripts the agent calls during each iteration:

{
  "scripts": {
    "test": "vitest run",
    "test:e2e": "playwright test",
    "typecheck": "tsc --noEmit",
    "lint": "eslint .",
    "format": "prettier --check ."
  }
}

Because print mode has full tool access, Cursor can run those commands itself, read the failures, and fix them before it commits. A failed check sends the agent back to fix the work rather than forward to the next task. Screenshots add a second channel: for UI work, a passing suite is necessary but not sufficient, so the agent captures the rendered state as visual evidence you can review later.
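
If you want the same gate in a single command, for example to run by hand inside the sandbox, a fail-fast runner is enough. The run_gates helper below is a hypothetical sketch, and the npm script names in the usage comment are the ones from the example above:

```shell
#!/usr/bin/env bash
# Run each gate command in order and stop at the first failure.
# Every argument is one complete command line.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "gate: $gate"
    if ! bash -c "$gate"; then
      echo "gate failed: $gate" >&2
      return 1
    fi
  done
  echo "all gates passed"
}

# Usage with the npm scripts shown above:
#   run_gates "npm run typecheck" "npm run lint" "npm run format" \
#             "npm test" "npm run test:e2e"
```

Ordering the cheap checks first means a type error or lint failure surfaces before the slower end-to-end suite starts.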

Inspect and debug the Cursor sandbox

When a run stalls or a task keeps failing, get inside the box. The sandbox is a normal container you can poke at. List what exists:

sbx ls

Open a shell in the Cursor sandbox and look around:

sbx exec -it ralph-cursor-<project>-<hash8> bash

From there you can check the working tree, re-run a failing test by hand, confirm auth with cursor-agent status, or read .agent/logs/LOG.md and the per-iteration logs in .agent/history/. Reattach to a sandbox session with:

sbx run ralph-cursor-<project>-<hash8>

Most stalls trace back to one of three things: Cursor was never authenticated inside the box, a network policy is blocking an install, or the prompt in .agent/PROMPT.md lacks a clear completion criterion. The sandbox shell shows you which one it is. If you need to redirect a running loop without killing it, edit .agent/STEERING.md. Ralph folds that critical work into the next iteration before resuming the normal task list. That is steering, not stopping, and it keeps momentum while you correct course.

Putting it together

A real Cursor loop, start to finish, is three commands:

# 1. authenticate once (creates the sandbox, you log in inside it)
./ralph.sh --login --agent cursor

# 2. confirm the sandbox name for network policies and debugging
./ralph.sh --print-name --agent cursor

# 3. run the loop with a model and force mode, inside the sandbox boundary
./ralph.sh --agent cursor -n 50 -- --model auto --force --sandbox disabled

That is the Cursor CLI agent running unattended: fresh context per iteration, state on disk, write access granted because the microVM is the real boundary, and a hard stop on a completion promise. Define your tasks in .agent/tasks.json, write a clear .agent/PROMPT.md, start the loop, and read the commits in the morning.

Frequently asked questions

How do I run the Cursor CLI agent in a loop?

Use Ralph and pass the agent flag: ./ralph.sh --agent cursor. Ralph wraps the headless cursor-agent in print mode, runs it inside a Docker Sandbox, starts a fresh context window each iteration, and repeats until every task in .agent/tasks.json is done or the iteration cap is reached. The default is 10 iterations; raise it with -n 50.

How do I pass a model to the Cursor agent through Ralph?

Put it after the -- separator. Anything to the right of -- is forwarded to cursor-agent, so ./ralph.sh --agent cursor -- --model auto runs cursor-agent in print mode with that model and the Ralph prompt. Run cursor-agent models or --list-models inside the sandbox to see the exact model identifiers Cursor accepts.

How do I log in to Cursor inside the sandbox?

Run ./ralph.sh --login --agent cursor. It drops you into the sandbox shell where you run cursor-agent login, or you can set the CURSOR_API_KEY environment variable. The credential persists in that named sandbox, named ralph-cursor-<project>-<hash8>, so future runs attach to the same box already logged in.

Is it safe to run the Cursor agent in --force or --yolo mode?

It is unsafe on your laptop and reasonable inside a sandbox. Ralph runs cursor-agent in an isolated Docker Sandbox microVM with network denied by default, so force mode lets the agent run commands without pausing for approval while staying unable to touch your real files or exfiltrate data. The sandbox is the boundary, not the agent.

How does the loop know when to stop?

Cursor emits a promise tag in its final message. COMPLETE stops the loop with exit code 0, BLOCKED exits with 2, and DECIDE exits with 3. Hitting the iteration cap without completing exits with 1. You branch on those exit codes in a wrapper script or CI.

How to Write the PROMPT.md File That Drives a Ralph Loop

Fri, 06 Mar 2026 00:00:00 GMT

PROMPT.md is the instruction the loop sends to the agent on every single iteration. It is not setup documentation and it is not a one-time kickoff message. It is the text a fresh agent reads at the top of each pass, with zero memory of the last one, so it has to make that agent reorient from disk and start working within seconds.

Get it right and a stateless agent reads its task list, picks one task, verifies the result, and commits, over and over, until the work is done. Get it wrong and the loop drifts: the agent batches tasks, forgets where state lives, or never signals that it finished. This post shows exactly what to put in .agent/PROMPT.md, with an annotated example you can copy.

If the loop itself is new to you, start with what the Ralph technique is and come back. The pattern was popularized by Geoffrey Huntley, who framed Ralph as a Bash loop that builds software while you sleep. The prompt file is the steering wheel of that loop.

Why does PROMPT.md matter so much in a Ralph loop?

The defining property of a Ralph loop is that each iteration starts the agent with a clean context window. There is no carryover transcript. The agent that runs iteration 31 knows nothing about iterations 1 through 30 except what it can read from disk. This is deliberate. It avoids context rot, where a long session drifts and the agent loses the plot as old output piles up in the window. For the deeper version of this idea, see context engineering for long-running agents.

The trade for that clean window is memory. A fresh agent remembers nothing, so the filesystem and git history become the memory layer. PROMPT.md is the bridge between the two. It is the only text guaranteed to be in front of the agent at the start of every pass, and its whole job is to point an amnesiac agent at where its memory lives and tell it what to do next.

ralph.sh re-sends the same PROMPT.md on each iteration. (Anthropic shipped an official Claude Code plugin that does this with a Stop Hook that re-injects the prompt. Our implementation is the hackable ralph.sh script, which loops the agent directly.) Because the text is identical every pass, it has to be written for an agent that is reading it for the first time, every time. That single constraint drives everything else in this post.

What does every PROMPT.md need to include?

A working prompt file does four jobs. Miss any one of them and the loop stalls in a predictable way.

Tell the agent where its state lives

A fresh agent cannot remember what it did, so the first thing the prompt must do is hand it a map of disk. Reference the files by path so the agent reads them before doing anything:

.agent/prd/SUMMARY.md: the condensed description of what is being built. This is the “why.”
.agent/tasks.json: the task lookup table, the list of work with passes flags.
.agent/tasks/TASK-{ID}.json: the full spec for a single task, including its steps and acceptance criteria.
.agent/logs/LOG.md: the running log of past iterations, newest at the top, so the agent can see what already happened.
.agent/STEERING.md: critical work injected mid-run that the agent handles before resuming the task list.
.agent/STRUCTURE.md: the current directory layout, so the agent does not reinvent paths.

The agent reorients by reading these in order. The summary tells it the goal, the log tells it the recent past, the task table tells it what is left.

Enforce one task per iteration

This is the rule that makes the whole thing reliable. The agent picks the highest-priority task with passes: false, works only that task, commits, and stops. It never batches. Batching is the most common way a loop goes off the rails, because a single commit that touches five tasks is impossible to verify and impossible to revert cleanly. One task per invocation keeps each step small enough to test and small enough to bisect later. The whole rule is worth its own read in one task per iteration.

Define how the agent verifies its own work

A loop is only as good as its verification, because tests are what gate progress without you in the chair. The prompt has to spell out the stack the agent runs before it commits: Vitest for unit tests, Playwright for end-to-end tests, tsc for type checking, ESLint for linting, and Prettier for formatting. For UI work, the prompt should require a Playwright smoke test and a saved screenshot so the agent has visual ground truth. The repo mantra is blunt: if you didn’t test it, it doesn’t work.

Tell the agent how to signal completion

The loop stops on an explicit signal, not a vibe. The prompt defines the promise tags the agent emits so ralph.sh knows what happened:

<promise>TASK-{ID}:DONE</promise>   one task finished, stop this iteration
<promise>COMPLETE</promise>         all tasks finished, exit the loop
<promise>BLOCKED:reason</promise>   needs human help
<promise>DECIDE:question</promise>  needs a decision

Those tags map to exit codes, which is what makes the loop scriptable in CI or a cron job:

./ralph.sh -n 50
echo $?   # 0 COMPLETE, 1 MAX_ITERATIONS, 2 BLOCKED, 3 DECIDE

The per-task DONE tag is what ends a single iteration. The COMPLETE tag is what ends the entire run. Both belong in the prompt, and the loop logic depends on the agent emitting them exactly. The full mechanics live in completion promises and exit codes.
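
To see why exact tags matter, consider how a script might pull the tag out of the agent's final message. This is a sketch under the assumption that each tag sits on a single line. The last_promise helper is hypothetical and is not how ralph.sh is necessarily written:

```shell
#!/usr/bin/env bash
# Print the body of the last <promise>...</promise> tag found on stdin,
# for example "TASK-12:DONE" or "BLOCKED:no network".
# Prints nothing if the input contains no tag.
last_promise() {
  grep -o '<promise>[^<]*</promise>' | tail -n 1 |
    sed -e 's/^<promise>//' -e 's/<\/promise>$//'
}

# Usage, with the agent's final message in $final_message:
#   body=$(printf '%s\n' "$final_message" | last_promise)
#   case "$body" in
#     COMPLETE)    echo "run finished" ;;
#     TASK-*:DONE) echo "one task finished" ;;
#     BLOCKED:*)   echo "needs help: ${body#BLOCKED:}" ;;
#     DECIDE:*)    echo "needs a decision: ${body#DECIDE:}" ;;
#   esac
```

A misspelled or reworded tag matches none of the branches, which is exactly the failure you get when the prompt lets the agent paraphrase its completion signal.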

A concrete annotated PROMPT.md structure

Here is the shape of a real .agent/PROMPT.md, trimmed to the load-bearing parts. Notice how it front-loads the one-task rule, points at state with @ file references, lays out a numbered task flow, then closes with hard rules and help tags.

> ONE TASK PER INVOCATION. Complete one task from @.agent/tasks.json,
> commit, output <promise>TASK-{ID}:DONE</promise>, and STOP.

## Overview
You are implementing the project described in @.agent/prd/SUMMARY.md

## Required Setup
Run `npm run dev` (as a background process) in the `src` directory.

## Before Starting
Check @.agent/STEERING.md for critical work. Handle it in sequence,
remove when done, then proceed to tasks.

## Task Flow
1. Pick the highest-priority task with `passes: false` in tasks.json
2. Read the full spec: .agent/tasks/TASK-{ID}.json
3. Check existing structure in @.agent/STRUCTURE.md
4. Implement step by step and write a unit test
5. UI tasks: Playwright smoke test, save a screenshot, verify it
6. Run eslint --fix, prettier --write, and e2e for affected files
7. Run tsc and unit tests project-wide
8. All tests must pass. Broke an unrelated test? Fix it first.
9. Set `passes: true` in tasks.json for the completed task
10. Log entry to .agent/logs/LOG.md (date, summary, screenshot path)
11. Commit using the Conventional Commit format

## Rules
- Only ONE task per invocation. After committing, output the DONE
  promise and STOP. Do NOT read the next task.
- Kill background processes before the promise tag.
- No git push.
- When ALL tasks pass, output <promise>COMPLETE</promise> and nothing else.

## Help Tags
- BLOCKED: environment issues you cannot fix from the sandbox
- DECIDE: a real decision a human has to make

A few things are doing the heavy lifting here, and they are worth calling out:

The very first line is the one-task rule in a blockquote, and the Rules section repeats it near the end. The agent sees it first and again just before it stops, because batching is the failure mode that wrecks the most runs.
Every state file is referenced with @ so the agent loads it. The prompt does not paste the contents inline. It points, and the agent reads fresh each pass.
The task flow is numbered, and verification runs in steps 5 through 8, before the commit at step 11. Verification is not optional and it is not last.
The Before Starting step makes STEERING.md the first thing checked, so you can steer a running loop by editing one file mid-run.

This is how PROMPT.md feeds each iteration. The same text re-enters a fresh context every pass, and the agent rebuilds its understanding from disk before touching code:

flowchart TD
    P["PROMPT.md (re-sent every iteration)"] --> S["ralph.sh starts an iteration"]
    S --> R["Fresh-context agent reads PROMPT.md"]
    R --> O["Reorient from disk: SUMMARY.md, tasks.json, LOG.md, STEERING.md"]
    O --> T["Pick one task with passes:false"]
    T --> W["Implement, run tests, lint, types"]
    W --> G{"All gates pass?"}
    G -->|"No"| B["Fix next pass, or emit BLOCKED"]
    G -->|"Yes"| C["Commit, set passes:true, emit DONE"]
    C --> Q{"All tasks done?"}
    Q -->|"No"| P
    Q -->|"Yes"| D["Emit COMPLETE, loop exits 0"]
    B --> P

The arrows from the “No” branches back up to PROMPT.md are the entire point. Nothing carries over in the agent’s head. The prompt and the files on disk are what survive between passes.

How do you switch modes by editing PROMPT.md?

The default PROMPT.md is written for implementation: build the next task, test it, commit. But the loop is mode-agnostic. The agent does whatever the prompt tells it to, so you change the loop’s behavior by editing one file. The state files, the one-task rule, and the promise tags stay the same. Only the task flow changes.

A few practical modes:

Implementation (default): pick a passes: false task, build it, verify, commit. This is the shape shown above.
Refactor: keep behavior identical, improve structure. The task flow becomes “refactor one module, prove behavior is unchanged by running the existing tests, commit.” The acceptance gate is that no test changed and all of them still pass.
Review: read the diff from the last run and flag issues instead of writing features. The task flow becomes “read recent commits, check against the PRD and the code standards, write findings to a file, commit the report.” No production code changes.
Test backfill: write missing tests for code that already exists. The task flow becomes “pick an untested module, write unit and e2e coverage, confirm the suite passes, commit.”

Because the mode lives entirely in PROMPT.md, you can keep a few prompt variants in version control and swap them in for a run. Run an implementation pass overnight, then point the same loop at a review prompt in the morning to audit what it built. The loop machinery in ralph.sh does not change. You are only changing the instruction.
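
One way to organize the variants is a small helper that copies the chosen one into place before a run. The .agent/prompts/ directory, the variant names, and the use_prompt function are all hypothetical conventions for this sketch. Ralph itself only reads .agent/PROMPT.md:

```shell
#!/usr/bin/env bash
# use_prompt VARIANT [AGENT_DIR]
# Copy AGENT_DIR/prompts/VARIANT.md over AGENT_DIR/PROMPT.md.
# Leaves PROMPT.md untouched if the variant does not exist.
use_prompt() {
  local variant=$1 dir=${2:-.agent}
  local src="$dir/prompts/$variant.md"
  if [ ! -f "$src" ]; then
    echo "no such prompt variant: $src" >&2
    return 1
  fi
  cp "$src" "$dir/PROMPT.md"
}

# Usage from the project root:
#   use_prompt review && ./ralph.sh -n 5
#   use_prompt implement && ./ralph.sh -n 50
```

Since the variants live in version control, git diff shows exactly which instruction a given run used.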

What are the common PROMPT.md mistakes?

Most broken loops trace back to a prompt that breaks one of the four jobs above. These are the ones that show up most.

Writing a prompt that is too vague

“Build the feature and make it good” gives a fresh agent nothing to act on. It does not know which task, where the spec is, or what “good” means. A vague prompt produces a vague run: the agent improvises, picks an arbitrary task, skips verification, and you wake up to a branch you cannot trust. Be concrete. Name the files, name the rule, name the gates. The agent can only be as precise as the prompt.

Assuming the agent remembers anything

This is the subtle one. It is tempting to write “continue where you left off” or “you already set up the database.” A fresh-context agent did not leave off anywhere and does not remember setting anything up. Every instruction that assumes memory is an instruction the agent cannot follow. Write the prompt as if the agent has never seen the project, because on every iteration, it has not. Push all state to disk and point the prompt at it.

Letting the agent batch tasks

If the prompt does not aggressively enforce one task per iteration, a capable agent will try to be helpful and knock out three tasks in one pass. That produces a fat commit that mixes concerns, fails verification in a way that is hard to localize, and cannot be reverted without losing good work. The fix is to repeat the one-task rule at the top and in the rules section, and to require the DONE promise plus a hard stop after the commit. When batching does slip through, it is usually one of the documented Ralph loop failure modes, and the fix is structural rather than throwing more iterations at it.

Skipping the verification gate or the completion signal

A prompt with no test gate degrades the loop toward one-shot prompting, because there is nothing to stop a bad commit from landing. A prompt with no completion signal means the loop runs to its iteration cap even when the work is done, burning tokens for nothing. Always specify the verification stack and always specify the promise tags. Those two pieces are what let you start a run and walk away.

A short checklist before you commit your PROMPT.md

Run through this before you point ralph.sh at a real project:

Does the prompt name every state file the agent needs (SUMMARY.md, tasks.json, the task spec, LOG.md, STEERING.md)?
Is the one-task rule stated at the top and enforced in the rules?
Does it list the exact verification commands and require them before the commit?
Does it define the promise tags and require a stop after the per-task DONE?
Does it read cleanly to an agent that has never seen the project before?

If all five are yes, a fresh-context agent can reorient and make progress on every pass. That is the entire job of the file.
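Part of the checklist can be roughed out as a shell lint. This is a minimal sketch, not something Ralph ships: check_prompt is a hypothetical helper, and it assumes the state files and the promise tag appear in the prompt as literal strings. It cannot judge whether the prompt reads cleanly to a fresh agent.

```shell
# Rough lint for a PROMPT.md: reports each required reference that is missing.
# Matching on literal strings is an assumption of this sketch.
check_prompt() {
  local file="$1" missing=0 needle
  for needle in "SUMMARY.md" "tasks.json" "LOG.md" "STEERING.md" "<promise>"; do
    if ! grep -qF -- "$needle" "$file"; then
      echo "missing: $needle"
      missing=1
    fi
  done
  # Returns 0 when every reference was found, 1 otherwise.
  return "$missing"
}

# Usage (not run here): check_prompt .agent/PROMPT.md
```

A non-zero return means at least one reference is absent, which is a prompt to re-read the file rather than proof that it is broken.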

Frequently asked questions

What is PROMPT.md in a Ralph loop?

It is the instruction the loop sends to the agent on every iteration. The script ralph.sh re-sends the same PROMPT.md each pass, and because each iteration starts with a fresh context window, the prompt has to make a stateless agent reorient from disk and act. It points the agent at its state files, enforces one task per iteration, defines verification, and specifies the promise tags that signal completion.

Where do I put the PROMPT.md file?

It lives at .agent/PROMPT.md in your project, alongside the other state files the loop reads: tasks.json, the per-task specs in tasks/, prd/SUMMARY.md, logs/LOG.md, and STEERING.md. Running npx @pageai/ralph-loop scaffolds the .agent directory with a default implementation-mode prompt you can edit.

How do I change what the loop does without changing the script?

Edit PROMPT.md. The loop is mode-agnostic, so the agent does whatever the prompt tells it. Swap the task flow to refactor, review, or backfill tests while keeping the one-task rule, the state references, and the promise tags the same. The ralph.sh machinery does not change, only the instruction does.

Why does PROMPT.md have to repeat the one-task rule?

Because a capable agent will try to be helpful and batch several tasks into one pass, which produces a fat commit that is hard to verify and hard to revert. Stating the rule at the top and in the rules section, then requiring a DONE promise and a hard stop after the commit, keeps each iteration to a single verified task.

What happens if my PROMPT.md has no completion signal?

The loop runs until it hits the iteration cap and exits with code 1 (MAX_ITERATIONS) even when the work is actually done, wasting tokens. Always define the promise tags so the agent can emit COMPLETE when all tasks pass, which exits the loop with code 0, and BLOCKED or DECIDE when it needs a human.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop

Install from npm Star on GitHub Watch the walkthrough

Cost Control for Autonomous AI Coding Agents

Mon, 02 Mar 2026 00:00:00 GMT

An autonomous coding loop burns money when it thrashes: it retries the same broken task, drags a bloated context window from one call to the next, or runs a frontier model on work a cheap model could finish. You control spend with three knobs, not a billing dashboard. Cap the iterations so the run has a hard ceiling. Pick the model per agent so mechanical tasks do not pay frontier prices. Gate every iteration on tests, lint, and type checks so the loop stops working a task it cannot finish instead of grinding on it forever.

This is the cost-control chapter of the larger guide to running an AI coding agent overnight. The mechanics below are the same Bash loop Geoffrey Huntley described, with the spend-shaped edges called out.

Why an autonomous loop costs more than one prompt

A single prompt costs one call. A loop costs one call per iteration, and a badly built loop multiplies that in two ways.

First, runaway iterations. If nothing tells the loop to stop, it keeps spawning the agent. Ten iterations is fine. Two hundred iterations on a task that was already done at iteration three is two hundred calls you paid for when three would have done.

Second, context bloat. A naive long-running agent keeps appending to one conversation. Tokens accumulate, every turn re-sends the whole transcript, and the per-call cost climbs as the session drags on. This is also where quality falls apart, because the agent loses the plot in a wall of stale context. The Ralph technique fixes both at once: each iteration starts the agent with a fresh context window and reads state from disk. Cost per iteration stays flat instead of growing with session length. The deeper version of that argument is in the writeup on what the Ralph technique is.
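The arithmetic behind that claim can be sketched in a few lines of shell. The numbers are hypothetical, and the model is deliberately crude: it counts only input tokens, assumes every turn adds the same amount, and ignores output tokens and any provider-side caching.

```shell
# Total input tokens when every turn re-sends the whole transcript.
# Turn k sends k * per_turn tokens, so n turns send per_turn * n * (n + 1) / 2.
tokens_growing() {
  local n="$1" per_turn="$2"
  echo $(( per_turn * n * (n + 1) / 2 ))
}

# Total input tokens when each iteration starts fresh and loads a fixed amount of state.
tokens_flat() {
  local n="$1" per_iter="$2"
  echo $(( per_iter * n ))
}

tokens_growing 40 1000   # prints 820000
tokens_flat 40 1000      # prints 40000
```

Under these assumptions, forty turns of a growing transcript send about twenty times the tokens of forty fresh-context iterations. The gap widens with length because the first total grows quadratically and the second linearly.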

So cost control is not a separate feature you bolt on. It falls out of building the loop correctly. The rest of this post is the specific knobs.

Cap iterations with -n, --once, and --max-iterations

The iteration count is your budget dial. It is the hard ceiling on how many agent calls a single run can make. Ralph defaults to 10:

# 10 iterations, the default
./ralph.sh

Set the cap explicitly when you want a longer unattended run or a tighter leash:

# 50 iterations for an overnight run
./ralph.sh -n 50

# the long form does the same thing
./ralph.sh --max-iterations 5

For a smoke test, run exactly one iteration. This is the cheapest possible way to check that the agent authenticates, reads your prompt, picks up a task, and produces a sane diff before you commit to a long run:

# exactly one iteration
./ralph.sh --once

Treat --once as your dry run. Spend one call, read the diff and the log, and only then scale the count up. A cheap probe at the start saves you from discovering at iteration 40 that the prompt was wrong.

The cap matters because it converts an open-ended process into a bounded one. Without it, “run until done” can mean “run until your bill scares you.” With it, the worst case is exactly the number of iterations you authorized. When the loop hits the cap without finishing, it exits with code 1 (MAX_ITERATIONS). That is a signal, not a failure: read the log, decide whether to raise the cap, and resume.

Match the model to the task

The second knob is model choice, and it is where most of the savings live. Frontier models cost more per token than smaller ones. Plenty of agent work does not need a frontier model: renaming symbols, wiring up boilerplate, fixing a failing lint rule, applying a mechanical refactor across files. Running those on a cheap model and saving the expensive model for genuinely hard reasoning is the single biggest lever on a long run.

Ralph forwards anything after the -- separator straight to the agent, so you choose the model with the agent’s own flag:

# Codex on a specific model
./ralph.sh --agent codex -- --model gpt-5.5

# Gemini on the pro model
./ralph.sh -a gemini -- --model pro

The rule to remember: everything left of -- configures Ralph (which agent, how many iterations, login). Everything right of -- configures the agent (model, approval mode, and so on). Keep them on the correct side and the loop behaves.
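One way to keep the two sides straight is to assemble the command in a small helper. This is an illustrative sketch: ralph_cmd is a hypothetical function, it uses only flags shown in this post, and it prints the command instead of running it.

```shell
# Print a ralph.sh command line with loop flags left of `--` and agent flags right.
# Arguments: agent name, iteration cap, then any flags meant for the agent itself.
ralph_cmd() {
  local agent="$1" iterations="$2"
  shift 2
  if [ "$#" -gt 0 ]; then
    echo "./ralph.sh --agent $agent -n $iterations -- $*"
  else
    echo "./ralph.sh --agent $agent -n $iterations"
  fi
}

ralph_cmd codex 30 --model gpt-5.5
# prints: ./ralph.sh --agent codex -n 30 -- --model gpt-5.5
```

The separator is emitted only when there are agent flags to forward, so nothing meant for the agent can land on Ralph's side of the line.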

This works across the supported agents, which are claude (default), codex, copilot, cursor, gemini, and opencode. Each has its own sandbox and its own credentials, so you can keep separate runs for separate model tiers. A practical pattern: point a cheap-model loop at a queue of mechanical tasks, and a frontier-model loop at the hard architectural ones. The per-agent CLIs are covered in detail in the Codex loop walkthrough and the Claude Code loop walkthrough.

Two things make model selection safe rather than risky. The work is decomposed into small tasks (see below), so a cheaper model is asked to do something small enough that it can actually succeed. And every iteration is verified, so if the cheaper model produces a broken change, the gate catches it instead of letting it ship.

Verification gates stop the loop from thrashing

The most expensive failure mode is not a model that costs too much per call. It is a loop that keeps calling the agent on a task it cannot complete. Verification gates are what turn that infinite drip into a clean stop.

Every iteration runs the verification stack as step three of the loop:

1. Find the highest-priority incomplete task in .agent/tasks.json.
2. Work the steps in .agent/tasks/TASK-{ID}.json.
3. Run tests, linting, and type checking.
4. Complete the task, take a screenshot, update the task status, and commit.
5. Repeat until all tasks pass or the iteration cap is reached.

The stack Ralph assumes is Playwright for end-to-end tests, Vitest for unit tests, TypeScript for type checks, ESLint for linting, and Prettier for formatting. The repo mantra is blunt: if you didn’t test it, it doesn’t work.

Here is why this is a cost control and not just a quality control. A task only flips to done when its checks pass. If the change is broken, the gate fails, the task stays open, and the agent gets honest feedback to fix it on the next pass. Without gates, the agent can mark a broken task done and move on, or worse, keep editing the same file with no signal about whether it is closer to or further from working. That second case is the thrash: real calls, real tokens, zero progress. Gates give the loop a definition of progress, so a task either advances toward passing or you find out fast that it is stuck.

When a task genuinely cannot be finished, you do not want the loop to spend the rest of its cap discovering that one iteration at a time. The agent emits a promise tag instead:

<promise>COMPLETE</promise> means every task is finished.
<promise>BLOCKED:reason</promise> means it needs human help.
<promise>DECIDE:question</promise> means it needs a decision from you.

Those map to exit codes: 0 for COMPLETE, 1 for MAX_ITERATIONS, 2 for BLOCKED, and 3 for DECIDE. A BLOCKED or DECIDE exit ends the run early instead of burning the remaining iterations. You spend on the calls that made progress and stop on the call that hit a wall. The full treatment of why an autonomous agent needs this feedback is in verification loops for AI agents.
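A wrapper script can branch on those codes. A minimal sketch: the exit codes come from the list above, while describe_exit and the wording of each message are illustrative.

```shell
# Map a Ralph exit code to a short description of what to do next.
describe_exit() {
  case "$1" in
    0) echo "COMPLETE: every task finished" ;;
    1) echo "MAX_ITERATIONS: read the log, then raise the cap or fix the task" ;;
    2) echo "BLOCKED: the agent needs human help" ;;
    3) echo "DECIDE: the agent needs a decision" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage (not run here):
#   ./ralph.sh -n 50
#   describe_exit "$?"
```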

Atomic tasks keep each iteration cheap

Cost per iteration tracks how much the agent has to read and reason about in that single call. A vague, sprawling task forces the agent to load a lot of context, take many steps, and produce a large risky diff that is more likely to fail verification and get retried. A small, well-scoped task is cheap to load, fast to finish, and easy to verify.

So the breakdown of work is itself a cost lever. Ralph follows one rule per invocation: the agent completes exactly one task, commits, and stops. It never batches several tasks into a single iteration. That keeps each commit small, each diff reviewable, and each context window focused on one thing.

The state lives on disk, not in chat history. A task lookup table (.agent/tasks.json) points to individual task specs (.agent/tasks/TASK-{ID}.json), and the running log lives in .agent/logs/LOG.md. Because progress is on the filesystem and in the git log, a fresh-context agent reorients at the start of every iteration without re-reading a giant transcript. That is what keeps the token cost of iteration 40 the same as iteration 2.

The completion promise is the other half of this. When the work is genuinely done, the agent emits <promise>COMPLETE</promise> and the loop exits with code 0, even if you authorized 50 iterations and it finished in 12. You only pay for the iterations the work actually needed. The cap is the ceiling; the completion promise is the early exit. Together they bound spend from both ends.

flowchart TD
    Start(["./ralph.sh -n 50"]) --> Cap{"Iteration < cap?"}
    Cap -->|"no"| MaxOut(["exit 1: MAX_ITERATIONS, stop spending"])
    Cap -->|"yes"| Pick["Pick one atomic task from .agent/tasks.json"]
    Pick --> Spawn["Spawn agent with fresh context and chosen model"]
    Spawn --> Work["Read state from disk, do one task"]
    Work --> Gate{"Tests, lint, type check pass?"}
    Gate -->|"no"| Log["Log failure, keep task open"]
    Log --> Cap
    Gate -->|"yes"| Commit["Commit, update task status, screenshot"]
    Commit --> Promise{"Promise tag?"}
    Promise -->|"COMPLETE"| Done(["exit 0: done early, no wasted calls"])
    Promise -->|"BLOCKED or DECIDE"| Stop(["exit 2 or 3: stop and ask a human"])
    Promise -->|"none"| Cap

Read the diagram as a spend story. Three paths end the run: the cap is hit, the work completes, or the agent asks for help. None of them let the loop drip calls into a task that is going nowhere.
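The cap and promise paths can be simulated with a stubbed agent. This is a sketch of the control flow only, not the contents of ralph.sh: agent_step stands in for a real agent call, the verification gate is left out, and the stub is hard-coded to finish on its third call.

```shell
# Stand-in for one agent call. A real loop would spawn the agent here;
# this stub prints COMPLETE on the third call and nothing otherwise.
agent_step() {
  if [ "$1" -eq 3 ]; then echo "COMPLETE"; fi
}

# Bounded loop: returns on the first promise (0, 2, or 3) or at the cap (1).
run_loop() {
  local cap="$1" i=1 promise
  while [ "$i" -le "$cap" ]; do
    promise="$(agent_step "$i")"
    case "$promise" in
      COMPLETE) echo "iterations=$i"; return 0 ;;
      BLOCKED*) echo "iterations=$i"; return 2 ;;
      DECIDE*)  echo "iterations=$i"; return 3 ;;
    esac
    i=$(( i + 1 ))
  done
  echo "iterations=$cap"
  return 1
}

run_loop 50   # prints iterations=3 and returns 0
```

With a cap of 50 the stubbed run stops after three calls. With a cap of 2 it stops at the cap and returns 1. Spend is bounded from both ends.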

Watch the cost per iteration

You cannot control what you cannot see, so the last piece is the per-iteration trail. Ralph records each iteration’s cleaned output to .agent/history/ and appends to the running log at .agent/logs/LOG.md. That history is your audit log for spend: one entry per agent call, in order, with what the agent did and whether the task advanced.

Reading the history tells you the things that actually drive cost. How many iterations did it take to close each task? Is one task failing verification over and over and eating calls? Did the run hit the cap with work still open, or did it exit early on a completion promise? An iteration that did real work and committed is money well spent. A run of iterations that all touch the same failing task is the thrash you want to catch and fix, usually by tightening the task spec or the prompt in .agent/PROMPT.md.
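A quick first pass over the history can be scripted. This sketch makes two assumptions about the on-disk layout that you should check against your own .agent/history/: that there is one file per iteration, and that promise tags appear verbatim in the saved output. summarize_history is a hypothetical helper.

```shell
# Count iteration records, and how many of them contain a BLOCKED or DECIDE promise.
# Assumes one file per iteration under the given directory.
summarize_history() {
  local dir="$1" total stuck
  total="$(find "$dir" -type f | wc -l | tr -d ' ')"
  stuck="$(grep -rlE '<promise>(BLOCKED|DECIDE):' "$dir" | wc -l | tr -d ' ')"
  echo "iterations=$total stuck=$stuck"
}

# Usage (not run here): summarize_history .agent/history
```

The counts are only a starting point. Which task each iteration touched still has to be read from the entries themselves.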

When you need more than the log, get inside the box. Each agent runs in its own Docker Sandbox with a deterministic name, ralph-<agent>-<project-dir>-<hash8>. List them and open a shell:

# list sandboxes
sbx ls

# shell into the running sandbox to inspect logs and history
sbx exec -it ralph-<agent>-<project>-<hash8> bash

From there you can read .agent/history/, re-run a failing test by hand, and figure out why a task is stalling before it costs you more iterations. The Docker Sandboxes documentation covers the sandbox model in full, and the broader practice of making a long run auditable is the subject of observability for AI coding agents.
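If several sandboxes are running, a small filter narrows the listing to one agent and project. A sketch under assumptions: find_ralph_sandbox is a hypothetical helper, and it assumes the project segment of the sandbox name is your directory name as written. It reads the listing on stdin, so it does not depend on the column layout of sbx ls.

```shell
# Print the lines of a sandbox listing that mention a Ralph sandbox
# for the given agent and project directory.
find_ralph_sandbox() {
  local agent="$1" project="$2"
  grep -F -- "ralph-${agent}-${project}-"
}

# Usage (not run here): sbx ls | find_ralph_sandbox codex my-app
```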

If the history shows the loop heading the wrong way and you do not want to kill it, edit .agent/STEERING.md. Ralph treats what you put there as critical work and folds it into the next iteration before resuming the task list. Steering mid-run is cheaper than letting a misdirected loop spend its whole cap and then starting over.

A cost-aware run, end to end

Put the knobs together and a deliberate run looks like this:

# 1. cheap probe: one iteration to confirm the setup is sane
./ralph.sh --once

# 2. mechanical backlog on a cheaper model, bounded cap
./ralph.sh --agent codex -n 30 -- --model gpt-5.5

# 3. read the per-iteration trail to see where the calls went
sbx exec -it ralph-codex-<project>-<hash8> bash   # then read .agent/logs/LOG.md

The probe spends one call to de-risk the run. The bounded loop runs an appropriately priced model with a hard ceiling, gates every iteration on tests so it cannot thrash, works one atomic task at a time so each call stays cheap, and exits early on a completion promise if the work finishes ahead of the cap. The history tells you exactly where the money went so the next run is tighter. That is cost control for an autonomous agent: not a billing alert after the fact, but a loop built so the expensive failure modes cannot happen.

Frequently asked questions

How do I limit how much an autonomous coding agent can spend?

Cap the iterations. The iteration count is the hard ceiling on agent calls per run. Ralph defaults to 10; set it with ./ralph.sh -n 50 or --max-iterations 5, and use ./ralph.sh --once for a one-call dry run. When the loop hits the cap it exits with code 1 instead of running forever.

Should I run a cheaper model for an AI coding agent loop?

Yes, for mechanical work. Renames, boilerplate, lint fixes, and routine refactors do not need a frontier model. Pass the model after the -- separator, for example ./ralph.sh --agent codex -- --model gpt-5.5, and save the expensive model for genuinely hard reasoning. Small tasks plus verification gates make the cheaper model safe to use.

What stops an agent loop from thrashing on a task it cannot finish?

Verification gates and promise tags. Every iteration runs tests, lint, and type checks, so a task only flips to done when its checks pass. If the agent gets stuck it emits BLOCKED or DECIDE, which exits the run early (codes 2 and 3) instead of spending the rest of the iteration cap on a task going nowhere.

Why does a fresh context window per iteration save money?

A single long conversation re-sends a growing transcript on every turn, so the per-call cost climbs as the session drags on. The Ralph technique starts each iteration with a clean context and reads state from disk (.agent/tasks.json, the log, git history), so the token cost of iteration 40 is about the same as iteration 2.

How do I see where the cost went in an autonomous run?

Read the per-iteration trail. Ralph writes each iteration's output to .agent/history/ and appends to .agent/logs/LOG.md, one entry per agent call. Shell into the sandbox with sbx exec -it ralph-<agent>-<project>-<hash8> bash to read it. Look for tasks that fail verification repeatedly, since those are the calls that cost money without making progress.
