RALPH LOOP

Claude Code vs Codex vs Cursor vs Gemini: Best CLI for Long-Running Agent Loops

Six agentic coding CLIs compared for long-running autonomous loops inside a Ralph sandbox

There is no single best agentic CLI for long-running loops. The honest answer is that you pick per task, and the agent matters less than the loop you wrap around it. Claude Code, OpenAI’s Codex CLI, the Cursor CLI agent, Gemini CLI, GitHub Copilot CLI, and opencode all run inside Ralph behind the same --agent flag, so swapping one for another is a one-word change, not a rewrite. This post compares the six on the dimensions that actually decide a multi-hour run: autonomy quality, cost, model options, tool use, and ecosystem.

I am not going to quote benchmark numbers. Vendor leaderboards move every few weeks and rarely reflect how an agent behaves over dozens of iterations against your codebase. What follows is concrete and opinionated, grounded in how each CLI is invoked and how each one tends to behave when you leave it running.

Why “best” is the wrong question for a multi-hour loop


A one-shot demo and a multi-hour loop are different sports. In a demo, raw reasoning on the first try wins. In a loop, the agent runs the same task family dozens or hundreds of times, sees its own test failures, and fixes them on the next pass. What you actually want is an agent that responds well to feedback, follows a structured prompt, and exits cleanly so the harness can decide whether to loop again.

That reframes the comparison. The strongest model on a leaderboard can still lose a long run if it ignores the prompt structure, never stops on its own, or burns your budget on token-heavy reasoning for mechanical work. A merely good model with tight verification gates will grind toward correct. This is the core point from the field guide to agentic coding CLIs: the harness around the agent carries more of the result than the agent does.

So the real question is not “which CLI is smartest.” It is “which CLI holds up when I leave it alone.” Hold that distinction while you read the comparison.

How Ralph makes the six CLIs interchangeable


Ralph is a Bash script you point at a project. It does not reimplement any agent. It wraps whichever one you pick, runs it inside an isolated Docker Sandbox, watches the output, and loops with a fresh context each iteration. Three design choices are what make the agent swappable.

First, one flag selects the agent. claude is the default, and you switch with --agent (short -a):

# Claude Code (default), 50 iterations
./ralph.sh -n 50
# Swap the agent, nothing else changes
./ralph.sh --agent codex
./ralph.sh -a cursor -n 5

Second, a bare -- separator forwards anything after it straight to the underlying CLI. Ralph parses its own flags first, then hands the rest to the agent untouched. That is how you pin a model without Ralph needing to know each vendor’s model names:

./ralph.sh --agent codex -- --model gpt-5.5
./ralph.sh -a gemini -- --model pro
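
To make the separator behavior concrete, here is a minimal sketch of splitting an argument list at the first bare --. It is not Ralph's actual parser: the function name split_args and the tab-separated output are hypothetical, chosen only to show which side each argument lands on.

```bash
# Hypothetical sketch, not Ralph's real parser.
# Labels every argument as "harness" (before the first bare --)
# or "agent" (after it), one tab-separated pair per line.
split_args() {
  local side="harness"
  local arg
  for arg in "$@"; do
    # Only the FIRST bare -- is the separator; later ones are passed through.
    if [ "$side" = "harness" ] && [ "$arg" = "--" ]; then
      side="agent"
      continue
    fi
    printf '%s\t%s\n' "$side" "$arg"
  done
}
```

Calling split_args --agent codex -- --model gpt-5.5 labels --agent and codex as harness arguments and --model and gpt-5.5 as agent arguments, which is why the harness never needs to understand vendor model names.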

Third, each agent gets its own deterministic sandbox, named ralph-<agent>-<current-dir>-<hash8>. Your Claude sandbox and your Codex sandbox never share credentials, history, or installed tools, so you can compare agents on the same project without them stepping on each other. Print the name without starting a run:

./ralph.sh --print-name --agent codex
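
What "deterministic" buys you is easier to see in code. The sketch below is an illustration only: the post does not say how Ralph computes the hash8 part, so the hash input (the absolute directory path) and the algorithm (the POSIX cksum CRC, printed as 8 hex digits) are assumptions, and sandbox_name is a hypothetical helper.

```bash
# Hypothetical derivation, NOT Ralph's actual code.
# Assumptions: hash8 comes from the absolute path, via the POSIX cksum CRC.
sandbox_name() {
  local agent="$1"
  local dir="$2"
  local base crc hash8
  base="$(basename "$dir")"
  # cksum prints "<crc> <byte-count>"; keep the CRC and render it as 8 hex digits.
  crc="$(printf '%s' "$dir" | cksum | cut -d' ' -f1)"
  hash8="$(printf '%08x' "$crc")"
  printf 'ralph-%s-%s-%s\n' "$agent" "$base" "$hash8"
}
```

The same agent and directory always yield the same name, while a different agent yields a different one, which is the property that keeps the Claude and Codex sandboxes apart.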

Under the hood, Ralph builds a different invocation per agent but keeps the loop identical. The expansions are:

# claude: sbx run ... claude . -- --output-format stream-json --verbose -p "$PROMPT_CONTENT"
# codex: sbx run ... codex . -- exec "$PROMPT_CONTENT"
# copilot: sbx run ... copilot . -- -p "$PROMPT_CONTENT"
# cursor: sbx run ... cursor . -- -p "$PROMPT_CONTENT"
# gemini: sbx run ... gemini . -- -p "$PROMPT_CONTENT"
# opencode: sbx run ... opencode . -- run "$PROMPT_CONTENT"

Every one of those runs the same loop: pick the top task from .agent/tasks.json, work it, run the verification stack, commit, and either continue or stop on a promise tag. Because the loop is shared, you can treat the choice of CLI as a variable. Pick one, run it, and if it stalls on your codebase, change the flag and try another. The mechanics never move.
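
The per-agent expansions listed above amount to a small dispatch table. A minimal sketch, assuming a hypothetical helper named agent_args (Ralph's real implementation may be structured differently), could look like this:

```bash
# Hypothetical helper, not Ralph's source. Prints the agent-side arguments
# that sit between "sbx run ... <agent> . --" and the prompt text.
agent_args() {
  case "$1" in
    claude)                printf '%s\n' "--output-format stream-json --verbose -p" ;;
    codex)                 printf '%s\n' "exec" ;;
    copilot|cursor|gemini) printf '%s\n' "-p" ;;
    opencode)              printf '%s\n' "run" ;;
    *)
      printf 'unknown agent: %s\n' "$1" >&2
      return 64
      ;;
  esac
}
```

Only this one function varies by agent. Everything that follows it (task selection, verification, commit, promise tag) stays the same, which is what makes the agent a variable rather than a rewrite.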

Five things separate these agents once you are running them unattended. I will go through each, then break the agents down one by one.

Autonomy quality is how well the agent works without a human in the chair. Does it follow the prompt structure, stay on one task per invocation, run its own tests, and emit a clean completion signal? A loop has nobody to answer “should I proceed?”, so any agent that pauses for approval will stall unless you put it in a non-interactive mode.

Cost over a loop is dominated by model choice and iteration count, not the per-call price you see in a demo. Fifty iterations on a flagship model add up. The lever is the model flag and the iteration cap, which I cover in depth alongside the overnight-run architecture in how to run an AI coding agent overnight.

Model options decide whether you can match the model to the task. Heavy reasoning (architecture, gnarly refactors) rewards the strongest model. Mechanical work (CRUD wiring, lint fixes, filling in tests) runs fine and cheaper on a mid-tier model. The more model choices a CLI exposes through -- --model, the more you can tune.

Tool use is how the agent edits files, runs shell commands, and reads results. All six edit files and run commands. The difference shows up in how cleanly they run headless and how readable their output is while looping.

Ecosystem is auth, billing, and which world you already live in. A team standardized on GitHub auth has a different default than one paying for an Anthropic or OpenAI plan.

Claude Code is Ralph’s default for a reason. It is steady on long, multi-step tasks and follows a structured prompt closely, which is exactly what a fresh-context loop needs. Ralph runs it with --output-format stream-json --verbose, and that structured stream gives the loop the most readable live, step-by-step view of any of the six. When you want to watch what the agent is doing each iteration, Claude Code shows the most.

It runs in bypass-permissions mode inside the sandbox (--dangerously-skip-permissions, or --permission-mode bypassPermissions), so it never pauses for approval during an unattended run. Model selection goes through -- --model. For autonomy quality and clarity of feedback, this is the smoothest first run. The end-to-end setup is in running Claude Code in a loop. See the Claude Code docs for the flag surface.

Codex is OpenAI’s agent, run through its non-interactive exec mode. It is a strong reasoner on hard, self-contained problems, which makes it a natural pick when the task is logic-heavy rather than sprawling. Pin a model with -- --model gpt-5.5.

The thing to know about Codex in a loop: codex exec runs read-only by default, so a loop using the default mode will spin without ever editing a file. You grant write access deliberately with -- --sandbox workspace-write --ask-for-approval never, or bypass Codex’s own gates entirely with -- --dangerously-bypass-approvals-and-sandbox. The bypass flag is safe here because the microVM is the real boundary, not Codex policing itself. Codex also has a clean --json event stream for CI parsing. The full wiring, including the read-only gotcha and CI flags, is in running the Codex CLI in an autonomous loop.
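
The two ways of granting write access can be kept in one place so a wrapper script picks between them by name. This is a sketch under assumptions: the function codex_write_flags and the mode names workspace and bypass are hypothetical, while the flags themselves are the ones quoted above.

```bash
# Hypothetical helper; the mode names "workspace" and "bypass" are invented here.
# The flags are the ones described in the paragraph above.
codex_write_flags() {
  case "$1" in
    # Let Codex write inside the workspace and never pause for approval.
    workspace) printf '%s\n' "--sandbox workspace-write --ask-for-approval never" ;;
    # Skip Codex's own gates; the microVM is the real boundary.
    bypass)    printf '%s\n' "--dangerously-bypass-approvals-and-sandbox" ;;
    *)
      printf 'unknown mode: %s\n' "$1" >&2
      return 64
      ;;
  esac
}

# Example wrapper call (left commented out so the sketch has no side effects):
# ./ralph.sh --agent codex -- $(codex_write_flags workspace)
```

The command substitution in the example is deliberately unquoted so the flags split into separate words before they reach the agent.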

The Cursor CLI agent brings the headless cursor-agent to the loop, invoked with -p. If your team already lives in Cursor, running the same agent unattended over a sandbox lets you review a finished diff in the morning instead of pair-programming all afternoon. The autonomy quality is solid, and the appeal is continuity: you keep the agent and the mental model you already trust, and you add looping on top. Model and other flags pass through after -- like every other agent.

Gemini CLI is Google’s agent, driven with -p and model selection through -- --model pro. It is worth reaching for when you want a different model family in the mix or you already live in the Google ecosystem. The argument for keeping a non-Anthropic, non-OpenAI option in your rotation is practical: when one agent stalls on a specific task, a different model family sometimes walks straight through it. Ralph makes that switch a one-word change.

Copilot CLI slots in for teams standardized on GitHub’s tooling and auth. Ralph runs it with the same -p prompt pattern and the same sandbox boundary as the rest. The pull here is ecosystem, not raw capability: if your auth, your billing, and your repos already run through GitHub, Copilot is the path of least friction. The loop treats it identically to the others.

opencode is the open-source option, invoked with its run subcommand. It is the pick when you want to avoid vendor lock-in, run an agent you can fully inspect and modify, or route to a provider and model of your own choosing. For cost-sensitive runs where you want maximum control over the model layer, an open agent you can point at any backend is a real advantage. You trade some of the polished, batteries-included feel of the vendor CLIs for control and inspectability.

Here is the practical version, stripped of hedging. Match the agent to the situation rather than hunting for one winner.

  • Pick Claude Code when you want the smoothest first loop and the clearest live view of what the agent is doing. It is the right default, and the right place to start if you are new to running loops.
  • Pick Codex when the task is logic-heavy and self-contained, and you want explicit model pinning plus a clean JSON event stream for CI. Remember to grant write access, or it will not edit anything.
  • Pick Cursor when your team already uses Cursor and you want the same agent to run unattended so you review a diff instead of babysitting.
  • Pick Gemini when you want a different model family in your rotation, or you are already in the Google ecosystem.
  • Pick Copilot when your auth and billing already run through GitHub and you want the least new setup.
  • Pick opencode when you want an open, inspectable agent, no vendor lock-in, or full control over the model and provider behind it.

A decision guide for the common case:

flowchart TD
  Start(["Choosing an agent for a long loop"]) --> Q1{"New to running loops?"}
  Q1 -->|"yes"| Claude["Start with claude (default, richest live view)"]
  Q1 -->|"no"| Q2{"What matters most?"}
  Q2 -->|"hard reasoning task"| Codex["codex (grant write access)"]
  Q2 -->|"already in Cursor"| Cursor["cursor"]
  Q2 -->|"different model family"| Gemini["gemini -- --model pro"]
  Q2 -->|"GitHub-standardized team"| Copilot["copilot"]
  Q2 -->|"open, no lock-in"| Opencode["opencode"]
  Claude --> Swap{"Stalls on your codebase?"}
  Codex --> Swap
  Cursor --> Swap
  Gemini --> Swap
  Copilot --> Swap
  Opencode --> Swap
  Swap -->|"yes"| Change["Change one --agent flag, retry"]
  Swap -->|"no"| Ship["Let the loop run, review in the morning"]

The last edge is the important one. Because the harness is shared, “this agent stalled” is not a dead end. It is a flag change. That is the whole reason Ralph treats the agent as swappable rather than picking one for you.
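
That final edge of the flowchart can be automated. The sketch below is a hypothetical wrapper, not part of Ralph: it tries agents in order and moves on only when a run ends with exit code 1, which this post lists further down as MAX_ITERATIONS. Codes 2 and 3 mean a human is needed, so the wrapper stops instead of burning another agent on the same question. The names try_agents, RALPH, and LAST_AGENT are assumptions.

```bash
# Hypothetical fallback wrapper, not shipped with Ralph.
# RALPH is the command to run; it defaults to the script from this post.
RALPH="${RALPH:-./ralph.sh}"
LAST_AGENT=""

try_agents() {
  local agent status
  for agent in "$@"; do
    LAST_AGENT="$agent"
    status=0
    "$RALPH" --agent "$agent" -n 50 || status=$?
    # Exit code 1 = MAX_ITERATIONS: treat it as a stall and try the next agent.
    # Anything else (0 done, 2 blocked, 3 decision needed) ends the rotation.
    if [ "$status" -ne 1 ]; then
      printf 'stopped on %s with exit code %s\n' "$agent" "$status" >&2
      return "$status"
    fi
  done
  # Every agent hit the iteration cap.
  return 1
}

# Example (commented out so the sketch has no side effects):
# try_agents claude codex gemini
```

Treating only the iteration cap as a stall is a design choice. A wrapper that retried on every non-zero code would keep asking new agents a question only you can answer.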

If you take one thing from this comparison, take this: a weaker agent inside a good loop will out-ship a stronger agent you babysit in a single session. The loop gives the agent fresh context every iteration, keeps state on disk, isolates it in a sandbox, and runs real verification gates. Those four things matter more than the gap between any two of these CLIs.

Fresh context per iteration is what beats context rot. Each pass boots the agent clean, so it does not drag an hours-long transcript from one task to the next. The filesystem and git history are the memory layer: progress lives in .agent/tasks.json, the per-task spec files, .agent/logs/LOG.md, and the git log, not in a chat window. This is the mechanic Geoffrey Huntley described in the original Ralph writeup, and it applies identically to all six agents.
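
The shape of a fresh-context loop fits in a few lines. This is a sketch of the idea, not Ralph's source: run_loop is a hypothetical function, the command passed to it stands in for one agent pass, and the promise tags and exit codes it checks are the ones listed later in this post.

```bash
# Sketch only, not Ralph's source. Usage: run_loop <max-iterations> <command...>
run_loop() {
  local max_iterations="$1"
  shift
  local i=1
  local output
  while [ "$i" -le "$max_iterations" ]; do
    # Each pass is a brand-new process, so no transcript carries over.
    # Whatever the agent must remember has to be on disk already.
    output="$("$@" 2>&1)" || true
    printf '%s\n' "$output"
    case "$output" in
      *"<promise>COMPLETE</promise>"*) return 0 ;;
      *"<promise>BLOCKED:"*)           return 2 ;;
      *"<promise>DECIDE:"*)            return 3 ;;
    esac
    i=$((i + 1))
  done
  # Iteration cap reached without a promise tag.
  return 1
}
```

Nothing in the loop holds agent state. If a pass needs to know what the previous one did, it has to read it from the task file, the log, or git, which is the point of keeping the memory layer on disk.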

Verification is what lets a cheaper model succeed. The loop assumes a stack of Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for lint, and Prettier for format. The repo mantra is blunt: if you didn’t test it, it doesn’t work. The agent does not need to be right on the first try. It needs to write code, run the gates, read the failures, and fix them next pass. When you compare agents for long runs, you are really comparing how each one responds to that feedback, not how clever its first draft looks.
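
A gate runner for that stack can be very small. The sketch below is an assumption about how you might wire it, not the script Ralph ships: run_gates and verify are hypothetical names, and the npx invocations are common defaults for each tool that your project may wrap in package.json scripts instead.

```bash
# Hypothetical gate runner. Each argument is one shell command.
run_gates() {
  local gate
  for gate in "$@"; do
    printf 'gate: %s\n' "$gate"
    # Stop at the first failure so the next pass gets one clear error to fix.
    if ! sh -c "$gate"; then
      printf 'FAILED: %s\n' "$gate" >&2
      return 1
    fi
  done
}

# Assumed default invocations, ordered cheapest first so fast checks fail early.
verify() {
  run_gates \
    "npx tsc --noEmit" \
    "npx eslint ." \
    "npx prettier --check ." \
    "npx vitest run" \
    "npx playwright test"
}
```

Running types, lint, and format before the unit and end-to-end suites is a choice made here for speed. The post only names the five tools, not an order.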

The sandbox is what makes any of this safe to walk away from. Ralph runs every agent inside a Docker Sandbox microVM with deny-by-default networking, so bypass-permissions mode is reasonable: the worst the agent can do is wreck a disposable VM, not read your SSH keys. You open exactly what a task needs with sbx policy allow network <name> <domain>. The full reasoning is in the Docker Sandboxes docs.
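
Opening the network for a task usually means a short allow-list. A minimal sketch, assuming a hypothetical helper named allow_domains and assuming that --print-name prints only the sandbox name, wraps the sbx command quoted above:

```bash
# Hypothetical helper around the sbx command quoted above.
# Usage: allow_domains <sandbox-name> <domain>...
allow_domains() {
  local sandbox="$1"
  shift
  local domain
  for domain in "$@"; do
    # One policy call per domain; stop if any call fails.
    sbx policy allow network "$sandbox" "$domain" || return 1
  done
}

# Example (commented out; the domain is illustrative, not a recommendation):
# allow_domains "$(./ralph.sh --print-name --agent codex)" registry.npmjs.org
```

Keeping the list explicit in a script also documents what each task was allowed to reach.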

And the loop stops on a signal, not a vibe. Each agent emits a promise tag that Ralph reads:

  • <promise>COMPLETE</promise> means every task is finished.
  • <promise>BLOCKED:reason</promise> means the agent needs human help.
  • <promise>DECIDE:question</promise> means it needs a decision you have to make.

Those map to exit codes: 0 for COMPLETE, 1 for MAX_ITERATIONS, 2 for BLOCKED, and 3 for DECIDE. You branch on them in CI or a wrapper script, which means the comparison between agents is also fair: every one of them ends with a verdict you can act on.
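
Branching on those codes in a wrapper is a plain case statement. The function name describe_exit and the message wording are illustrative. The code-to-verdict mapping is the one stated above.

```bash
# Maps a Ralph exit code to a one-line verdict. Wording is illustrative.
describe_exit() {
  case "$1" in
    0) printf '%s\n' "COMPLETE: every task is finished" ;;
    1) printf '%s\n' "MAX_ITERATIONS: iteration cap reached before completion" ;;
    2) printf '%s\n' "BLOCKED: the agent needs human help" ;;
    3) printf '%s\n' "DECIDE: a human decision is required" ;;
    *) printf 'unexpected exit code: %s\n' "$1" ;;
  esac
}

# Example wrapper (commented out so the sketch has no side effects):
# status=0
# ./ralph.sh --agent codex -n 50 || status=$?
# describe_exit "$status"
```

In CI, the same case arms are where you would open an issue for BLOCKED or DECIDE and simply re-queue the job for MAX_ITERATIONS.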

There is no best agentic CLI, so stop looking for one. Start with Claude Code because it is the steady default with the clearest output. Reach for Codex on hard reasoning tasks, Cursor when your team already uses it, Gemini for a different model family, Copilot when you are GitHub-standardized, and opencode when you want an open agent with no lock-in. Then let the harness do the heavy lifting.

The shortest path to running any of them is three commands:

# 1. install
npx @pageai/ralph-loop
# 2. authenticate the agent inside its sandbox
./ralph.sh --login --agent codex
# 3. run the loop on your chosen agent and model
./ralph.sh --agent codex -n 50 -- --model gpt-5.5

Swap codex for any of the six and the loop is identical. Pick the agent, pin the model, and let it work one task per iteration inside the sandbox while you sleep. If it stalls, you already know the fix: change one flag.

Frequently asked questions

Which is the best agentic CLI for long-running loops?

There is no single winner. Claude Code is the steady default with the richest live output, Codex is strong on hard reasoning tasks, Cursor suits teams already using Cursor, Gemini brings a different model family, Copilot fits GitHub-standardized teams, and opencode is the open-source option. The loop you wrap around the agent matters more than the agent, so pick one and swap with the --agent flag if it stalls.

Claude Code vs Codex: which should I use?

Use Claude Code as the default when you want the smoothest unattended run and the clearest live view of each iteration, since Ralph parses its stream-json output. Use Codex for logic-heavy, self-contained problems and CI pipelines that parse its JSON event stream. One gotcha: codex exec is read-only by default, so grant write access with -- --sandbox workspace-write --ask-for-approval never or it will not edit files.

How do I switch between agents in Ralph?

Pass the --agent flag, or -a for short. ./ralph.sh runs Claude Code by default, ./ralph.sh --agent codex runs Codex, and ./ralph.sh -a cursor -n 5 runs Cursor for five iterations. Each agent gets its own sandbox named ralph-<agent>-<dir>-<hash8>, so they do not share credentials or history. The loop itself is identical across all six.

Does the choice of agent matter more than the loop?

No. A weaker agent inside a good loop, with fresh context each iteration, state on disk, sandbox isolation, and real verification gates, will out-ship a stronger agent you babysit in a single session. Those four properties decide the result more than the gap between any two CLIs, which is why Ralph treats the agent as a swappable variable.

How do I pick a model for each agent?

Model selection is an agent flag, not a Ralph setting, so you pass it after the -- separator. For example, ./ralph.sh --agent codex -- --model gpt-5.5 or ./ralph.sh -a gemini -- --model pro. Match the model to the task: a strong model for heavy reasoning, a cheaper mid-tier model for mechanical work, since model choice and iteration count are the biggest cost levers in a long run.