RALPH LOOP

How to Run an AI Coding Agent Overnight (For Days) Without Losing the Plot

Image: a terminal running an autonomous AI coding loop overnight, with a phosphor-green progress log.

You can run an AI coding agent overnight, and you can run one that codes for days, but only if you fix three things that kill long runs: context, cost, and crashes. Context rots as a single chat session grows. Cost compounds when an agent thrashes without a stop signal. Crashes lose work when state lives in a transcript instead of on disk. Fix each axis and an agent will grind through a task list while you sleep, then hand you a clean git diff in the morning.

This post is the architecture, not a pep talk. Every command here is real and runs against the ralph.sh loop you install with npx @pageai/ralph-loop. If you want the short version: keep context fresh per iteration, store progress in files and git, cap the loop, gate every change behind tests, and stop on an explicit promise instead of a vibe.

A long run is not one big prompt. It is hundreds of small ones. The naive approach keeps a single agent session alive for hours and feeds it more and more output until the context window is a swamp. That fails in three predictable ways.

flowchart LR
  run["Long autonomous run"] --> ctx["Context axis"]
  run --> cost["Cost axis"]
  run --> crash["Crash axis"]
  ctx --> ctxfail["Context rot: agent forgets goals, repeats work"]
  cost --> costfail["Runaway spend: agent thrashes with no cap"]
  crash --> crashfail["Lost state: progress lived in the transcript"]
  ctxfail --> fix["Fresh context, state on disk, caps, gates, promises"]
  costfail --> fix
  crashfail --> fix

The first axis is context. A language model's output quality degrades as its active context grows. The agent that wrote a careful plan at hour zero is, by hour three, distracted by ten thousand lines of tool output and its own earlier mistakes. This is context rot. The plot gets lost inside the window.

The second axis is cost. An agent with no iteration cap and no verification gate will happily spend tokens forever. It edits a file, breaks a test, edits again, breaks a different test, and circles. Without a hard limit and a feedback signal, that circle runs until your bill says stop.

The third axis is crashes. Machines sleep, networks drop, processes get killed, daemons hang. If the agent’s only memory is the chat transcript, a crash at hour four erases four hours of reasoning. You restart from nothing.

The fix for all three is the same idea applied three ways: stop treating the chat session as the unit of work, and start treating the iteration as the unit of work. Each iteration gets a clean slate, reads durable state, does exactly one thing, verifies it, and commits. The loop is a Bash for loop around a stateless step. That is the Ralph technique, popularized by Geoffrey Huntley in his original Ralph writeup, and it is the reason an agent can run for a night or a week without melting down.
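To make "a loop around a stateless step" concrete, here is a minimal sketch of the shape. It is not the real ralph.sh: the `run_loop` name, the step-as-arguments convention, and the exit-status contract are assumptions for illustration, and it uses a portable `while` loop rather than a Bash-specific `for`.

```shell
# Hypothetical sketch of the loop shape; not the real ralph.sh.
# Usage: run_loop MAX_ITERATIONS STEP_COMMAND [ARGS...]
# Assumed contract: the step exits 0 when every task is done,
# and non-zero when there is still work left.
run_loop() {
  max_iterations="$1"
  shift
  i=1
  while [ "$i" -le "$max_iterations" ]; do
    # Each pass starts a brand-new process, so nothing carries over
    # between iterations except what the step wrote to disk.
    if "$@"; then
      echo "done after $i iteration(s)"
      return 0
    fi
    i=$((i + 1))
  done
  echo "hit the cap of $max_iterations iteration(s)"
  return 1
}
```

The point of the sketch is the boundary: the loop owns the counter and the stop condition, and the step owns nothing but one unit of work.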

Axis one: fix context with fresh windows and state on disk

Context is the axis most people get wrong, because the obvious move (keep the agent talking) is the one that fails. The counterintuitive move wins: throw the context away on purpose, every iteration.

Start every iteration with a clean context window

Each pass of the loop starts the agent fresh. It does not inherit the previous iteration’s chat history. It reads the prompt file, reads the current state, picks up the next task, and works. The window never has time to rot because it never lives long enough.

Here is the cost of getting this wrong. In a single long session, every irrelevant tool result stays in the window and competes for the model’s attention. In a fresh-context loop, iteration 47 starts as crisp as iteration 1. The agent reasons over a small, relevant slice: the prompt, one task spec, and the files it touches.

The loop in ralph.sh rebuilds the prompt on every pass and hands the agent a clean invocation:

Terminal window
PROMPT_CONTENT="PROJECT_ROOT=$SCRIPT_DIR
$(cat $SCRIPT_DIR/.agent/PROMPT.md)"

That .agent/PROMPT.md is the steering wheel. It tells a fresh-context agent how to reorient in seconds: find the next task, do it, verify, commit, stop. Because the agent has no memory of the last iteration, the prompt and the files have to carry everything. That constraint is a feature. It forces all state into durable, inspectable places.
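The prompt assembly above can be wrapped in a small function that fails loudly when the prompt file is missing. This is a sketch under assumptions: `build_prompt` is a hypothetical name, and the missing-file guard is an addition, not something the real script is known to do.

```shell
# Hypothetical helper that mirrors the prompt assembly shown above.
# It prints the prompt for one iteration, or fails if PROMPT.md is missing.
build_prompt() {
  project_root="$1"
  prompt_file="$project_root/.agent/PROMPT.md"
  if [ ! -f "$prompt_file" ]; then
    echo "missing $prompt_file" >&2
    return 1
  fi
  # The first line tells the agent where the project lives;
  # the rest is the steering prompt, re-read from disk on every call.
  printf 'PROJECT_ROOT=%s\n' "$project_root"
  cat "$prompt_file"
}
```

Because the file is read again on each call, an edit to PROMPT.md shows up in the next prompt this function builds.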

Put the memory on disk, not in the transcript

If the chat is not the memory, something has to be. The filesystem and git history are the memory layer. Progress lives in files you can open, diff, and grep, not in a context window you cannot.

The state lives under .agent/:

Terminal window
.agent/
  PROMPT.md              # what the agent does each iteration
  prd/PRD.md             # the spec and acceptance criteria
  prd/SUMMARY.md         # the short version
  tasks.json             # task lookup table
  tasks/TASK-{ID}.json   # one spec file per task
  logs/LOG.md            # running progress log
  history/               # per-iteration captured output
  skills/                # shared skills the agent can use
  STEERING.md            # mid-run instructions you inject

Three things carry the weight overnight. .agent/tasks.json is the lookup table the agent reads at the top of every iteration to find the highest-priority incomplete task. .agent/logs/LOG.md is the running narrative, appended as work completes. And the git log is the real source of truth: every finished task is a commit, so the agent (and you) can reconstruct exactly what happened from git log alone.

This is why a crash at hour four is survivable. The agent that wakes up for iteration 48 does not need to remember iteration 47. It reads tasks.json, sees what is still open, reads git log to see what shipped, and continues. State on disk turns a long run into a sequence of resumable steps.
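Reconstructing "what shipped" from git can be as small as one filter. The sketch below assumes commit messages mention task ids shaped like TASK-12. The source does not specify the commit message format, so treat both the pattern and the `completed_task_ids` name as hypothetical.

```shell
# Hypothetical helper: read `git log --oneline` output on stdin and print
# the unique task ids it mentions, one per line.
# Assumes commit messages contain ids shaped like TASK-12.
completed_task_ids() {
  grep -oE 'TASK-[A-Za-z0-9]+' | sort -u
}

# Example (inside a repository):
#   git log --oneline | completed_task_ids
```

Reading from stdin keeps the filter independent of git itself, so it works equally well on a saved log.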

If you want the deep version of this idea, read context engineering for long-running agents. It covers how to structure the prompt and the task files so a fresh-context agent reorients instantly and never re-does finished work.

Axis two: fix cost with caps, model choice, and verification gates

An autonomous agent is a metered process. Left alone, it spends until something tells it to stop. Three controls keep the meter honest.

The simplest control is a hard iteration cap. The loop runs a fixed number of times and then exits, no matter what. The default is 10 iterations. For an overnight run you raise it:

Terminal window
./ralph.sh -n 50

That runs at most 50 iterations and then stops with a max-iterations exit. The long form is identical:

Terminal window
./ralph.sh --max-iterations 50

When you want a single controlled step (to test the prompt, or to watch one iteration before committing to a long run), run exactly one:

Terminal window
./ralph.sh --once

The cap is your budget ceiling, expressed in iterations instead of dollars. If each iteration finishes one task and you have 40 tasks, -n 50 leaves 10 spare iterations for retries. The loop can never run away to iteration 5000, because it stops at the number you set.
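To turn the iteration cap into a dollar ceiling, multiply it by a per-iteration cost you have measured yourself. The two helpers below are a sketch of that arithmetic. The function names are hypothetical, and any per-iteration figure you plug in is your own estimate, not a number from the loop.

```shell
# Worst-case spend for a capped run, in whole cents.
# Usage: worst_case_cents MAX_ITERATIONS CENTS_PER_ITERATION
# The per-iteration cost is an assumption you measure yourself,
# for example from a few --once runs.
worst_case_cents() {
  echo $(( $1 * $2 ))
}

# Spare iterations left for retries when each task takes one iteration.
# Usage: retry_headroom MAX_ITERATIONS TASK_COUNT
retry_headroom() {
  echo $(( $1 - $2 ))
}
```

With an illustrative 40 cents per iteration, `worst_case_cents 50 40` prints 2000, so a 50-iteration cap bounds the night at 20 dollars, and `retry_headroom 50 40` prints 10.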

Cost per iteration is mostly model cost. The loop runs whichever agent CLI you point it at, and you pick the model. The default agent is claude, and you can switch:

Terminal window
./ralph.sh --agent codex -n 50
./ralph.sh -a cursor -n 5

Supported agents are claude, codex, copilot, cursor, gemini, and opencode. Pass model flags straight through to the underlying CLI after a -- separator:

Terminal window
./ralph.sh --agent codex -- --model gpt-5.5
./ralph.sh -a gemini -- --model pro

The practical rule: match the model to the task difficulty. Mechanical tasks (rename a symbol across files, add a test, wire a component) do not need your most expensive model. Hard reasoning tasks (design a migration, untangle a race condition) do. A long overnight run that uses an expensive model for trivial edits is the single most common way people overspend.

The control that actually stops thrashing is verification. An agent that cannot tell whether its change worked will edit forever. An agent that runs the tests after every change gets a signal: green means move on, red means fix or stop.

Each iteration runs the verification stack the loop assumes: Playwright for end-to-end, Vitest for unit tests, TypeScript for types, ESLint for lint, Prettier for format. The repo mantra is blunt: if you didn’t test it, it doesn’t work. A change that does not pass the gate does not get committed, which means the agent cannot bank broken work and move on. It either fixes the change in the same iteration or the task stays open for a fresh-context retry.

Verification is the brake pedal on cost. Without it, a loop spins on a bad task indefinitely. With it, a bad task fails fast and visibly. The full treatment lives in verification loops for AI agents, and the money-specific tactics (iteration caps, budget thinking, model selection, gates that prevent thrashing) are in cost control for autonomous coding agents.
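A gate is just a list of checks that must all pass, in order, before anything is committed. Here is a minimal sketch. `run_gate` is a hypothetical helper rather than part of ralph.sh, and the example commands assume each tool is installed in the project.

```shell
# Hypothetical gate runner: run each check in order, stop at the first failure.
# Usage: run_gate "CHECK COMMAND" ["CHECK COMMAND" ...]
run_gate() {
  for check in "$@"; do
    echo "running: $check"
    if ! sh -c "$check"; then
      echo "gate failed: $check" >&2
      return 1
    fi
  done
  echo "gate passed"
}

# Example with the stack named above:
#   run_gate "npx tsc --noEmit" "npx eslint ." "npx prettier --check ." \
#            "npx vitest run" "npx playwright test" \
#     && echo "safe to commit"
```

Ordering the cheap checks first (types, lint, format) means a bad change usually fails in seconds, before the slower end-to-end suite starts.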

Axis three: fix crashes with promises, one task per iteration, and commits

Reliability over many hours is about two questions: does the loop know when to stop, and is work safe if something dies mid-run?

A loop that runs until “it feels done” never feels done. The Ralph loop stops on an explicit signal the agent emits, called a promise tag. There are three:

  • <promise>COMPLETE</promise> means every task is finished.
  • <promise>BLOCKED:reason</promise> means the agent needs human help.
  • <promise>DECIDE:question</promise> means the agent needs a decision before it can continue.

The loop watches the agent’s output for these tags and acts on them. This is the difference between an agent that quietly stops making progress and one that tells you precisely why it stopped. A blocked agent pings you with the reason. An agent that needs a decision pings you with the question. An agent that finished tells you it finished, and the loop exits cleanly.
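Watching output for a promise tag can be sketched with a single `sed` expression. How ralph.sh actually parses the tags is not shown here, so the `extract_promise` helper below is an assumption about one reasonable way to do it.

```shell
# Hypothetical parser: read agent output on stdin and print the payload of
# the last promise tag (for example COMPLETE or BLOCKED:reason).
# Prints nothing when no tag is present.
extract_promise() {
  sed -n 's/.*<promise>\([^<]*\)<\/promise>.*/\1/p' | tail -n 1
}
```

An empty result means "no promise yet, keep iterating". Anything else is an explicit reason to stop.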

Promises map to process exit codes, so the loop is scriptable. You can wire it into a larger pipeline and branch on the result:

Terminal window
./ralph.sh -n 50
echo "exit: $?"

The codes are:

  • 0 COMPLETE, every task passed.
  • 1 MAX_ITERATIONS, the loop hit the cap before finishing.
  • 2 BLOCKED, the agent needs you.
  • 3 DECIDE, the agent needs a decision.

In the morning the exit code is the first thing you check. 0 means review and merge. 1 means the work is bigger than 50 iterations, so look at what shipped and run again. 2 or 3 means the agent hit a wall it correctly refused to guess around, which is exactly the behavior you want from something running unattended.
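If you wrap the loop in a larger script, the branch on the exit code can look like the sketch below. The codes come from the list above. The `next_action` name and the wording of each message are illustrative.

```shell
# Map the loop's exit code to a morning action, using the codes listed above.
next_action() {
  case "$1" in
    0) echo "COMPLETE: review and merge" ;;
    1) echo "MAX_ITERATIONS: review what shipped, then run again" ;;
    2) echo "BLOCKED: read the reason and unblock the agent" ;;
    3) echo "DECIDE: answer the question, then run again" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Example:
#   ./ralph.sh -n 50
#   next_action "$?"
```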

One task per iteration, commits as checkpoints

The rule that makes a long run reliable is small and strict: one task per invocation. The agent finds the highest-priority incomplete task, works it, verifies it, commits, and stops. It never batches multiple tasks into one iteration.

This matters for two reasons. First, a single committed task is an atomic, reviewable unit. Your morning review is a sequence of focused commits, not one sprawling diff. Second, every commit is a checkpoint. If the machine dies between task 12 and task 13, you lose at most the in-progress task, because everything before it is already in git. Commits are the save points of an overnight run.

Put the three axes together and a single iteration looks like this:

flowchart TD
  startrun["./ralph.sh -n 50"] --> boot["Sandbox microVM boots (sbx)"]
  boot --> fresh["Iteration i starts with a fresh context window"]
  fresh --> pick["Read .agent/tasks.json, pick highest-priority open task"]
  pick --> work["Work the steps in TASK-ID.json (one task only)"]
  work --> verify["Run Vitest, Playwright, TypeScript, ESLint, Prettier"]
  verify --> gate{"Verification green?"}
  gate -->|no| work
  gate -->|yes| commit["Commit, update task status, append .agent/logs/LOG.md, screenshot"]
  commit --> promise{"Promise tag emitted?"}
  promise -->|COMPLETE| exit0["Exit 0: review the git diff"]
  promise -->|BLOCKED| exit2["Exit 2: agent pings you"]
  promise -->|DECIDE| exit3["Exit 3: agent pings you"]
  promise -->|none| cap{"i less than 50?"}
  cap -->|yes| fresh
  cap -->|no| exit1["Exit 1: MAX_ITERATIONS"]

Here is the actual sequence. It takes a few minutes to set up and then runs unattended.

First, install the loop into your project:

Terminal window
npx @pageai/ralph-loop

Second, make sure your agent is authenticated inside the sandbox. The loop runs each agent in an isolated environment, so you log in there once:

Terminal window
./ralph.sh --login --agent claude

Third, give the loop a task list. The .agent/tasks.json table and the per-task spec files under .agent/tasks/ are what the agent grinds through. The fastest way to build them is the spec-driven path: turn unstructured requirements into a PRD and a task list with the prd-creator skill in plan mode, then let the loop execute. A loop is only as good as its task list, so spend your effort here.

Fourth, kick off the overnight run with a generous cap:

Terminal window
./ralph.sh -n 50

That is the whole launch. The loop boots the sandbox, then runs up to 50 iterations. Each iteration starts fresh, picks the next task, works it, verifies, commits, and either continues or stops on a promise. You close the laptop lid (metaphorically, since a sleeping machine stops the run) and go to bed.

In the morning, the review is a git review. Read the commits in order, because each one is a single task with a clear message. Run the diff:

Terminal window
git log --oneline
git diff main...HEAD

If the exit code was 0, you are reviewing finished work. If it was 1, you are reviewing partial progress and deciding whether to run again. Either way you are reading code, not babysitting a chat.

The same architecture scales from one night to many. An agent that codes for days is just an agent that codes for one night, restarted. Because state lives in tasks.json, LOG.md, and git, you can stop the loop, inspect, and resume without losing progress. Run -n 50 tonight, review in the morning, adjust the task list, and run again. Each run resumes from the committed state, so a multi-day build is a series of overnight runs stitched together by the filesystem.

This is the part that single-session agents cannot do. A three-day chat session is a context disaster. Three nights of fresh-context loops, each one resuming from git, is just normal engineering with a faster typist.

Run it in a sandbox so YOLO mode is actually safe

An agent running unattended needs broad permissions to be useful. It has to edit files, run commands, install packages, and run tests without stopping to ask. On your laptop, that is terrifying: the same agent can also read your SSH keys, your environment variables, and your credentials. The answer is not to weaken the agent. It is to contain it.

The loop runs every agent inside a Docker Sandbox, an isolated microVM managed by the sbx CLI. The sandbox is the boundary, so the agent can run in bypass-permissions mode (Claude Code calls this --dangerously-skip-permissions) without putting your machine at risk. The blast radius is the sandbox, not your home directory. The Docker Sandboxes documentation covers the microVM model in detail.

Each sandbox gets a deterministic name so the loop can find it again: ralph-<agent>-<current-dir>-<hash8>. You can print the name the loop will use before you start:

Terminal window
./ralph.sh --print-name
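To see why a deterministic name lets the loop find its sandbox again, here is a sketch of how a name in that shape could be derived. It is an assumption, not the script's method: the source does not say what is hashed or which hash function is used, so this version hashes the project path with `cksum` purely for illustration.

```shell
# Hypothetical reconstruction of a deterministic sandbox name.
# Assumption: the hash is derived from the absolute project path. The real
# script may hash something else or use a different hash function.
sandbox_name() {
  agent="$1"
  project_path="$2"
  dir_name="$(basename "$project_path")"
  # cksum prints "<crc> <size>"; keep the CRC and render it as 8 hex digits.
  crc="$(printf '%s' "$project_path" | cksum | cut -d ' ' -f 1)"
  printf 'ralph-%s-%s-%08x\n' "$agent" "$dir_name" "$crc"
}
```

The same agent and path always produce the same name, which is the property that matters: a restarted loop lands in the sandbox the previous run used.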

Inspect and enter the sandbox like any container when you want to look around:

Terminal window
sbx ls
sbx exec -it <name> bash
sbx run <name>

Network is deny-by-default. The agent cannot reach the internet until you allow specific domains, which stops an overnight run from exfiltrating anything or pulling something unexpected. Allow what the build needs and nothing else:

Terminal window
sbx policy allow network <name> registry.npmjs.org
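If a build needs more than one domain, keeping the allowlist in one small function makes it easy to review. The wrapper below is a sketch. `allow_domains` and the `SBX` override are hypothetical conveniences, and the only real command it issues is the `sbx policy allow network` form shown above.

```shell
# Hypothetical wrapper: allow a fixed list of domains for one sandbox.
# SBX can be overridden (for example SBX=echo) to preview the commands.
allow_domains() {
  name="$1"
  shift
  for domain in "$@"; do
    "${SBX:-sbx}" policy allow network "$name" "$domain" || return 1
  done
}

# Example:
#   allow_domains "$(./ralph.sh --print-name)" registry.npmjs.org
```

Setting `SBX=echo` first prints each command instead of running it, which is a cheap way to check the list before you open anything up.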

If a run needs to serve a dev port back to your machine (to take screenshots against a running app, for example), publish it:

Terminal window
./ralph.sh --ports

The sandbox is what makes unattended autonomy reasonable. You are not trusting the agent to behave. You are enforcing a boundary it cannot cross. Anthropic’s official Claude Code plugin uses a Stop Hook to re-inject the prompt and keep the agent going, which you can read about in the Claude Code documentation. Our implementation is the hackable ralph.sh script plus the sandbox, so you can read every line of what runs overnight.

Observability: see exactly what happened while you slept

You cannot trust what you cannot see, and an overnight run is hours you did not watch. The loop writes a paper trail so the morning review is forensic, not faith-based.

Three artifacts make a run auditable. The first is per-iteration history. Every iteration’s captured output is saved to .agent/history/, with a session id and iteration number in the filename so runs never overwrite each other:

Terminal window
.agent/history/ITERATION-20260129-031500-12.txt

That file is the full transcript of iteration 12, ANSI codes stripped, ready to read. If something went sideways at 3am, you open the exact iteration and see what the agent was thinking.
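Pulling up the transcript for one iteration is a glob over that filename pattern. The helper below is a sketch that assumes the naming shown above (ITERATION-<session>-<n>.txt). `history_for_iteration` is a hypothetical name.

```shell
# Hypothetical lookup: print the history file(s) for one iteration number.
# Assumes the naming shown above: ITERATION-<session>-<n>.txt
history_for_iteration() {
  history_dir="$1"
  iteration="$2"
  for f in "$history_dir"/ITERATION-*-"$iteration".txt; do
    # An unmatched glob stays literal, so skip anything that is not a file.
    if [ -f "$f" ]; then
      echo "$f"
    fi
  done
  return 0
}

# Example:
#   history_for_iteration .agent/history 12
```

If several sessions reached the same iteration number, the function prints one file per session, and the session id in each filename tells the runs apart.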

The second artifact is the running log. .agent/logs/LOG.md is appended as tasks complete, so it reads as a chronological narrative of the night. The third is git itself: the commit history is the highest-signal log there is, because every commit ties a finished task to an exact diff and message.

On top of the files, the loop prints live status while it runs: a rolling preview of the agent’s current step, per-iteration timing, average iteration time, and which task ids completed. If you happen to look at the terminal at midnight, you can tell at a glance whether iterations are getting slower (a sign the agent is struggling on a hard task) or moving steadily.

The full playbook for making an unattended agent auditable, including screenshots and live stream output, is in observability for autonomous coding agents.

Sometimes you wake up, glance at the run, and realize the agent is about to spend ten iterations on something that no longer matters, or it is missing a constraint you forgot to write down. You do not have to kill the loop and lose momentum. You steer it.

Edit .agent/STEERING.md mid-run. The agent reads it and handles the critical work you injected before returning to its normal task list. It is a side channel for “do this next” that does not require stopping and restarting. You keep the warm loop and redirect it, which is far better than a hard restart that throws away in-flight progress.
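Steering is a file edit, so it can be scripted. The helper below is a sketch that appends one timestamped line per instruction. The line format is an assumption, since all that is stated here is that the agent reads the file, and `steer` is a hypothetical name.

```shell
# Hypothetical helper: append a timestamped instruction to the steering file.
# The file format is an assumption; the source only says the agent reads it.
steer() {
  project_root="$1"
  message="$2"
  mkdir -p "$project_root/.agent"
  printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$message" \
    >> "$project_root/.agent/STEERING.md"
}

# Example:
#   steer . "Skip the billing tasks until the schema is final"
```

Appending rather than overwriting leaves a record of every nudge you gave the run, which helps when you read the history files later.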

Steering is the human-in-the-loop control for a process that is otherwise hands-off. You set the task list at the start, the loop runs autonomously, and when reality changes you nudge it through a file. The complete pattern, including when to steer versus when to let the loop finish, is in how to steer a running AI agent.

The architecture is small enough to hold in your head. Context stays sharp because every iteration is a fresh window reading durable state. Cost stays bounded because the loop has a hard cap, you match the model to the task, and verification gates kill thrashing. Crashes stay survivable because the loop stops on explicit promises, does one task per iteration, and commits each one as a checkpoint. The sandbox makes broad permissions safe, and the history files make the whole run auditable after the fact.

If you are new to the underlying loop and want the conceptual foundation before you run one overnight, start with what the Ralph technique is. It explains the Bash-loop idea and where it came from. Then come back here, point ralph.sh at a good task list, run ./ralph.sh -n 50, and review the diff with your coffee.

A long autonomous run is not magic and it is not a single heroic prompt. It is a stateless step wrapped in a loop, with the boring parts (state, limits, gates, signals, isolation, logs) done properly. Get those right and an agent will code through the night, and through the week, without losing the plot.

Frequently asked questions

How do I run an AI coding agent overnight?

Install the loop with npx @pageai/ralph-loop, give it a task list in .agent/tasks.json, authenticate your agent inside the sandbox with ./ralph.sh --login, then start the run with ./ralph.sh -n 50. Each iteration starts with a fresh context, completes one task, verifies it, and commits. In the morning you review the git diff. The loop exits on a completion promise or when it hits the iteration cap.

Why do long AI agent runs lose the plot?

Three reasons. Context rot: a single long chat session degrades as the window fills with old tool output. Runaway cost: an agent with no cap and no verification gate thrashes on bad tasks. Lost state: if progress lives in the transcript, a crash erases it. The fix is a fresh context per iteration, state stored on disk in tasks.json, LOG.md, and git, a hard iteration cap, and tests that gate every change.

How does the loop avoid burning unlimited money?

Three controls. The -n flag sets a hard iteration cap, so the loop physically stops at the number you choose. You match the model to the task so trivial edits do not use your most expensive model. Verification gates run tests after every change, so a broken change is not committed and the agent cannot thrash forever on a bad task. Together these keep cost bounded and predictable.

Can an AI agent really code for days, not just one night?

Yes, because a multi-day build is just a series of overnight runs that resume from committed state. Since progress lives in tasks.json, LOG.md, and the git history rather than in a chat session, you can stop the loop, inspect the work, adjust the task list, and run ./ralph.sh -n 50 again. Each run picks up the highest-priority incomplete task and continues from where the last one stopped.

Is it safe to run an agent unattended with broad permissions?

It is safe when the agent runs inside a sandbox. The loop runs every agent in an isolated Docker Sandbox microVM managed by the sbx CLI, with network access deny-by-default. The agent can run in bypass-permissions mode because the sandbox is the boundary, so the blast radius is the sandbox and not your laptop. You allow specific domains with sbx policy allow network and inspect the run with sbx exec.