Agentic Coding CLIs: How to Run Claude Code, Codex, Cursor and More in a Loop

[Image: Terminal showing an agentic coding CLI running in a Ralph loop inside a Docker sandbox]

Jan 25, 2026 - 17 min read - 3600 words

Creator of RalphLoop.sh, founder of PageAI

An agentic coding CLI is an AI agent you drive from the terminal. It reads your files, edits them, runs commands, runs your tests, and commits, without you approving each keystroke. Ralph wraps six of them behind a single flag and runs whichever one you pick in a loop with fresh context every iteration. This post is the field guide: what these tools actually are, how ralph.sh calls each one, how to loop any of them safely, and which one holds up over a long autonomous run.

What is an agentic coding CLI?

A plain coding assistant answers a question and stops. An agentic coding CLI takes a goal and acts on it. It plans, runs shell commands, edits files on disk, reads the output, and decides what to do next. The “agentic” part is the action loop: the agent observes the result of its own work and keeps going until it decides the task is done.

Three properties separate an agentic CLI from a chat box:

It edits files directly. No copy and paste. The agent writes to your working tree.
It runs commands. It installs packages, runs the test suite, greps the codebase, and reads stack traces.
It commits. A real agentic run leaves a git history you can review, not a transcript you have to reconstruct.

You run these from the terminal on purpose. The terminal is where your tools already live: git, your package manager, your test runner, your linter. An agent that lives in the same place can use all of them. That is why Claude Code, OpenAI’s Codex CLI, the Cursor CLI agent, Gemini CLI, GitHub Copilot CLI, and opencode all ship as command-line programs first.

The catch is that a single prompt rarely finishes a real task. The agent loses the plot on long sessions, the context window fills with stale tool output, and the model starts repeating itself. The fix is not a bigger prompt. The fix is a loop. You restart the agent with a clean context window over and over, and you keep the actual progress on disk instead of in the chat. That technique is the Ralph loop, popularized by Geoffrey Huntley in his original Ralph writeup. Anthropic later shipped an official Claude Code plugin that re-injects the prompt with a Stop Hook. Ralph is the hackable Bash version of the same idea, and it works across all six agents.

If you want the full conceptual background before the practical part, read what the Ralph technique is and where it came from. The rest of this post assumes you want to run a specific agent and need to know how.

Which agentic coding CLIs does Ralph support?

Ralph supports six agents today. You select one with --agent (or the short -a), and claude is the default if you pass nothing.

# Claude Code (default)
./ralph.sh -n 50

# Pick a different agent
./ralph.sh --agent codex
./ralph.sh -a cursor -n 5

The supported set is claude, codex, copilot, cursor, gemini, and opencode. Each name maps to that vendor’s own CLI:

claude is Anthropic’s Claude Code. See the Claude Code docs.
codex is OpenAI’s Codex CLI, invoked in non-interactive exec mode.
copilot is the GitHub Copilot CLI.
cursor is the Cursor CLI agent (the headless cursor-agent).
gemini is Google’s Gemini CLI.
opencode is the open source opencode agent, invoked with its run subcommand.

Ralph does not reimplement these agents. It wraps them. Each agent already knows how to read a repo and edit files. Ralph’s job is to hand the same prompt to whichever agent you chose, run it inside an isolated sandbox, watch the output, and decide whether to loop again. That separation is the whole point: you keep using the agent you trust, and you get autonomous looping for free.
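To make the wrapping concrete, here is a minimal sketch of how a wrapper could map an agent name to a base command. It is not Ralph's source. The `agent_command` function is hypothetical, and the command strings are taken from the flowchart later in this post.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not Ralph's source: map an --agent value to the
# base command shown in this post's flowchart.
agent_command() {
  case "$1" in
    claude)   echo "claude" ;;
    codex)    echo "codex exec" ;;
    copilot)  echo "copilot -p" ;;
    cursor)   echo "cursor-agent -p" ;;
    gemini)   echo "gemini -p" ;;
    opencode) echo "opencode run" ;;
    *)
      echo "unknown agent: $1" >&2
      return 64   # EX_USAGE from sysexits.h
      ;;
  esac
}

agent_command codex   # prints: codex exec
```

Keeping the mapping in one `case` statement is what makes the agent swappable: the prompt, the sandbox, and the loop never need to know which vendor sits behind the name.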

How do you pass agent-specific flags?

Everything after a bare -- is forwarded to the underlying agent CLI untouched. Ralph parses its own flags before the separator, then appends the rest to the agent invocation. This is how you set the model, the reasoning effort, or any vendor flag Ralph does not know about.

# Pin a Codex model
./ralph.sh --agent codex -- --model gpt-5.5

# Pin a Gemini model
./ralph.sh -a gemini -- --model pro

The pattern is always the same. Ralph’s flags (--agent, -n, --once, --max-iterations) go first. The agent’s own flags go after --. If you have ever run npm run something -- --flag, the mental model is identical.
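If you are curious how a Bash wrapper can implement that separator, the sketch below shows one way. It is an illustration of the convention, not Ralph's actual parser. The `split_args` function and the two array names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "--" convention, not Ralph's actual parser.
# After split_args runs, WRAPPER_ARGS holds everything before the first
# bare "--" and AGENT_ARGS holds everything after it, untouched.
WRAPPER_ARGS=()
AGENT_ARGS=()

split_args() {
  WRAPPER_ARGS=()
  AGENT_ARGS=()
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--" ]; then
      shift
      AGENT_ARGS=("$@")   # the rest is forwarded verbatim
      return 0
    fi
    WRAPPER_ARGS+=("$1")
    shift
  done
}

split_args --agent codex -n 50 -- --model gpt-5.5
echo "wrapper: ${WRAPPER_ARGS[*]}"   # wrapper: --agent codex -n 50
echo "agent:   ${AGENT_ARGS[*]}"     # agent:   --model gpt-5.5
```

Storing the forwarded flags in an array rather than a string preserves quoting, so a value containing spaces reaches the agent as a single argument.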

A couple of agent-specific notes worth knowing. Claude Code is run with --output-format stream-json --verbose so Ralph can parse the structured stream and show you a live, readable step view. The other agents stream their raw output, which Ralph still captures and logs, but the parsed step display is richest with Claude today. None of that changes how you drive them. You still pick the agent with --agent and pass extra flags after --.

How do you run any agent in a loop with fresh context?

The loop is the same regardless of which agent you picked. Here is one iteration, start to finish:

1. Find the highest-priority incomplete task in .agent/tasks.json.
2. Work the steps in .agent/tasks/TASK-{ID}.json.
3. Run tests, linting, and type checking.
4. Complete the task, take a screenshot, update task status, and commit.
5. Repeat until every task passes or the iteration cap is hit.

The loop is bounded by an iteration cap, which defaults to 10. You set your own with -n or --max-iterations, and you run exactly one pass with --once.

# Default: up to 10 iterations
./ralph.sh

# Up to 50 iterations
./ralph.sh -n 50

# A single iteration (great for a smoke test)
./ralph.sh --once

# Same as -n 5
./ralph.sh --max-iterations 5

The critical word in step 5 is “repeat,” and the critical detail is what does not carry over between iterations: the agent’s context window. Each iteration starts the chosen agent with a clean context. This is deliberate. Long single sessions suffer from context rot, where the window fills with old tool output and half-finished reasoning until the agent forgets what it was doing. A fresh context every iteration sidesteps that failure mode entirely.

So where does the progress live if not in the chat history? On disk. The filesystem and git history are the memory layer. State lives in .agent/tasks.json, in the per-task spec files under .agent/tasks/, in .agent/logs/LOG.md, in the per-iteration logs under .agent/history/, and in the git log itself. When a new iteration begins, the agent reorients by reading those files, not by recalling a conversation it no longer has.
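A toy model can make the trade visible. In the sketch below, each iteration runs in a subshell, so anything it holds in memory disappears when it exits, while a counter file survives. The counter file is a hypothetical stand-in for .agent/tasks.json and the git log, and none of this is Ralph's code.

```shell
#!/usr/bin/env bash
# Toy model of "fresh context, memory on disk". The counter file is a
# hypothetical stand-in for .agent/tasks.json and the git log.
state_file="$(mktemp)"
echo 0 > "$state_file"

one_iteration() {
  # The parentheses run this body in a subshell. Variables set inside
  # vanish when it exits, much like a discarded context window.
  (
    scratch="notes held only in memory"
    completed="$(cat "$1")"
    echo "$(( completed + 1 ))" > "$1"   # durable progress
  )
}

for _ in 1 2 3; do
  one_iteration "$state_file"
done

echo "completed on disk: $(cat "$state_file")"   # completed on disk: 3
echo "scratch after loop: '${scratch:-}'"        # scratch after loop: ''
```

The in-memory variable is gone after every pass, yet the count keeps climbing, because each pass reads the file before it writes.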

That is the trade that makes long runs work. You give up conversational continuity and you get durability. A crash, a restart, or a fresh context never loses progress, because progress was written down. If you want to go deeper on this specific mechanic, the guide on running Claude Code in a loop walks through the prompt file and completion criteria step by step.

How does the loop know when to stop?

It stops on an explicit signal, not a vibe. The agent emits a promise tag in its output, and Ralph reads it:

<promise>COMPLETE</promise> means all tasks are finished.
<promise>BLOCKED:reason</promise> means the agent needs human help.
<promise>DECIDE:question</promise> means the agent needs a decision before it can continue.

Those map to process exit codes so you can script around a run: 0 for COMPLETE, 1 for MAX_ITERATIONS, 2 for BLOCKED, and 3 for DECIDE. A run that exits 2 woke you up for a reason. A run that exits 0 finished the work. This is what lets you start a loop, walk away, and trust the exit code in the morning.
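Scripting around those codes can be as small as a `case` statement. The sketch below uses the four exit codes listed above. The `describe_exit` name and the message wording are illustrative, not output from Ralph itself.

```shell
#!/usr/bin/env bash
# Translate the exit codes listed above into a next action.
# The messages are illustrative, not output from Ralph itself.
describe_exit() {
  case "$1" in
    0) echo "COMPLETE: all tasks finished" ;;
    1) echo "MAX_ITERATIONS: cap reached, review progress and rerun" ;;
    2) echo "BLOCKED: the agent needs human help" ;;
    3) echo "DECIDE: the agent needs a decision" ;;
    *) echo "UNKNOWN: unexpected exit code $1" ;;
  esac
}

# Typical use after a run (commented out so the sketch runs standalone):
#   ./ralph.sh -n 50; describe_exit "$?"
describe_exit 2   # prints: BLOCKED: the agent needs human help
```

From here it is a short step to sending yourself a notification on 2 or 3 and staying quiet on 0.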

One rule keeps the loop honest: one task per iteration. The agent picks the top task, finishes it, commits, and stops that iteration. It never batches several tasks into one pass. Small, committed, verified units of work are what make a long run recoverable instead of a giant uncommitted mess.
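Put together, the stopping rules fit in a few lines of Bash. This is a simplified sketch under stated assumptions, not Ralph's implementation: `run_loop` and `fake_agent` are hypothetical names, and the stand-in agent simply announces completion on its third call.

```shell
#!/usr/bin/env bash
# Minimal loop sketch, not Ralph's source.
# run_loop MAX CMD [ARGS...] runs CMD up to MAX times as a fresh
# invocation each time and stops early on a promise tag.
run_loop() {
  local max="$1"; shift
  local i output
  for (( i = 1; i <= max; i++ )); do
    output="$("$@" 2>&1)" || true   # a failing agent run does not end the loop
    printf '%s\n' "$output"
    case "$output" in
      *"<promise>COMPLETE</promise>"*) return 0 ;;
      *"<promise>BLOCKED:"*)           return 2 ;;
      *"<promise>DECIDE:"*)            return 3 ;;
    esac
  done
  return 1   # iteration cap reached without a promise tag
}

# Demo with a stand-in "agent" that finishes on its third call.
counter="$(mktemp)"
echo 0 > "$counter"
fake_agent() {
  local n=$(( $(cat "$counter") + 1 ))
  echo "$n" > "$counter"
  if [ "$n" -ge 3 ]; then
    echo "<promise>COMPLETE</promise>"
  else
    echo "task $n done"
  fi
}

run_loop 10 fake_agent
echo "exit code: $?"   # exit code: 0
```

The stand-in agent remembers its place only through the counter file, which is the same division of labor the real loop relies on.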

Here is the full picture, from the agent you select through the sandbox and into the loop:

flowchart TD
  Start["./ralph.sh --agent codex -n 50"] --> Select{"Which agent?"}
  Select -->|"claude (default)"| Claude["claude"]
  Select -->|codex| Codex["codex exec"]
  Select -->|cursor| Cursor["cursor-agent -p"]
  Select -->|gemini| Gemini["gemini -p"]
  Select -->|copilot| Copilot["copilot -p"]
  Select -->|opencode| Opencode["opencode run"]
  Claude --> Sandbox
  Codex --> Sandbox
  Cursor --> Sandbox
  Gemini --> Sandbox
  Copilot --> Sandbox
  Opencode --> Sandbox
  Sandbox["sbx microVM: ralph-agent-dir-hash8"] --> Loop
  subgraph Loop["Loop with fresh context each iteration"]
    Task["Pick top task from .agent/tasks.json"] --> Work["Edit files, run commands"]
    Work --> Verify["Tests, lint, types, screenshot"]
    Verify --> Commit["Commit and update status"]
    Commit --> Promise{"promise tag?"}
    Promise -->|none| Task
    Promise -->|"COMPLETE / BLOCKED / DECIDE"| Exit["Exit with code 0, 2, or 3"]
  end

Why does sandbox isolation matter for autonomy?

Autonomy and your laptop’s filesystem are a dangerous combination. An agent running with your user permissions can read your SSH keys, your .env files, your cloud credentials, and your shell history. It can also run any command, and in bypass-permissions mode it does so without asking. The honest answer to “is it safe to let an agent run unattended on my machine” is no, not directly.

Ralph runs every agent inside a Docker Sandbox, an isolated microVM managed by the sbx CLI. The sandbox is the boundary, not the agent’s good judgment. Inside that boundary, the agent can run in full bypass-permissions mode (Claude Code calls this --dangerously-skip-permissions, or --permission-mode bypassPermissions) because the worst it can do is wreck a disposable VM, not your host. You can read more about the underlying tech in the Docker Sandboxes docs.

Each sandbox gets a deterministic name so the same project and agent always reattach to the same VM: ralph-<agent>-<current-dir>-<hash8>. You can see it without starting a run:

./ralph.sh --print-name

That determinism matters for a loop. Iteration one typically creates the sandbox, and iteration two onward reattaches to it. If you manually remove the sandbox between runs, Ralph re-probes and recreates it, so the loop self-heals. You inspect what is running with sbx ls, you shell in with sbx exec -it <name> bash, and you reattach a session with sbx run <name>.
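A deterministic name of that shape is easy to reproduce. The sketch below is a hypothetical reconstruction: this post does not say how Ralph computes the hash8 part, so hashing the absolute project path with SHA-256 and keeping the first 8 hex characters is an assumption, as are the `hash8` and `sandbox_name` helpers.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the naming scheme. How Ralph computes
# <hash8> is not stated in this post; SHA-256 of the absolute project
# path, truncated to 8 hex characters, is an assumption.
hash8() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -c1-8
  else
    printf '%s' "$1" | shasum -a 256 | cut -c1-8   # macOS fallback
  fi
}

sandbox_name() {
  local agent="$1" dir="$2"
  printf 'ralph-%s-%s-%s\n' "$agent" "$(basename "$dir")" "$(hash8 "$dir")"
}

sandbox_name codex "$PWD"
```

Under this assumption, the hash suffix is what keeps two projects with the same folder name in different parent directories from sharing a sandbox.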

Network access inside the sandbox is deny-by-default. The agent cannot phone home or pull arbitrary domains unless you allow them. You open up exactly what a task needs:

# Allow one domain for one sandbox
sbx policy allow network <name> registry.npmjs.org

# -g applies a rule to every sandbox, and the "**" pattern matches every domain.
# Together they open all network access, so use this one sparingly.
sbx policy allow network -g "**"

This is the piece that turns “scary” into “fine.” The agent runs flat out with no permission prompts, and the blast radius is a microVM with an allowlist you control. The full reasoning, including how Docker Sandboxes compare to a hand-rolled container, lives in how to run AI coding agents in Docker sandboxes safely. If you only adopt one habit from this whole post, make it this one: never run an autonomous agent without a sandbox.
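If a project needs several domains, it can help to generate the allow rules and review them before applying anything. The sketch below only prints commands in the sbx policy form shown earlier. The `print_allow_rules` helper, the sandbox name, and the domain list are placeholders.

```shell
#!/usr/bin/env bash
# Print (not run) one allow rule per domain, using the
# `sbx policy allow network <name> <domain>` form shown earlier.
# The sandbox name and domain list below are placeholders.
print_allow_rules() {
  local sandbox="$1"; shift
  local domain
  for domain in "$@"; do
    printf 'sbx policy allow network %s %s\n' "$sandbox" "$domain"
  done
}

print_allow_rules ralph-codex-shop-1a2b3c4d registry.npmjs.org github.com
```

Printing first keeps the allowlist reviewable: you read the exact rules, then run them yourself once they look right.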

Two more conveniences come from the same sbx integration. You authenticate an agent inside its sandbox once with ./ralph.sh --login (add --agent X for a specific one), and you publish a dev server port out of the sandbox with ./ralph.sh --ports so you can hit the app the agent is building in your own browser.

How do you choose a model per agent?

Model selection is not a Ralph setting. It is an agent flag, and you pass it after --. Ralph stays out of the way on purpose: each vendor names and versions its models differently, so forwarding the flag is more honest than maintaining a lookup table that goes stale every month.

# Codex on a specific model
./ralph.sh --agent codex -- --model gpt-5.5

# Gemini on its pro tier
./ralph.sh -a gemini -- --model pro

Claude Code accepts its own --model flag the same way, and so do the others. The rule does not change: whatever flag the agent’s own CLI uses to pick a model, you put it after the -- separator and Ralph hands it through.

A few practical heuristics for picking a model when you are running a loop rather than a single prompt:

Bigger is not always better for loops. A long run is dozens or hundreds of iterations. A slightly cheaper, faster model that nails the verification gates beats a flagship model that is twice the cost and only marginally smarter on routine tasks.
Match the model to the task type. Heavy reasoning tasks (architecture, gnarly refactors) reward the strongest model. Mechanical tasks (wiring up CRUD, fixing lint, filling out tests) often run fine and far cheaper on a mid-tier model.
Watch your spend across iterations, not per call. A single iteration looks cheap. Fifty of them on a flagship model do not. Model choice is the biggest lever you have on the cost of an overnight run.
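The arithmetic behind that last point is worth doing once. The sketch below multiplies a per-iteration cost by an iteration count. The per-iteration prices are made-up placeholders, not measured or published figures, and the two helper functions are hypothetical.

```shell
#!/usr/bin/env bash
# Back-of-envelope run cost in whole cents. The per-iteration prices
# are made-up placeholders, not measured or published figures.
run_cost_cents() {
  local iterations="$1" cents_per_iteration="$2"
  echo "$(( iterations * cents_per_iteration ))"
}

format_dollars() {
  printf '$%d.%02d\n' "$(( $1 / 100 ))" "$(( $1 % 100 ))"
}

format_dollars "$(run_cost_cents 50 90)"   # hypothetical flagship: $45.00
format_dollars "$(run_cost_cents 50 25)"   # hypothetical mid-tier: $12.50
```

Swap in your own observed cost per iteration to see how the gap scales with the cap you pass to -n.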

If cost control is the thing keeping you from letting a loop run for hours, the model flag is where you start, but it is not the whole story. Iteration caps and the verification gates matter just as much.

Which agentic CLI is best for long-running loops?

There is no single winner. The right agent depends on what you are optimizing for, and the honest comparison is about behavior over a multi-hour run, not a one-shot demo. Here is how the six break down in practice.

Claude Code is the default in Ralph for a reason. It is steady on long, multi-step tasks, it follows a structured prompt closely, and its stream-json output gives Ralph the richest live view of what the agent is doing each iteration. If you want the smoothest first run and the clearest feedback while it works, start here. The dedicated guide on running Claude Code in a loop covers its setup end to end.

Codex CLI is OpenAI’s agent, run through its non-interactive exec mode. It is a strong reasoner on hard, self-contained problems and pairs well with explicit model pinning via -- --model. It is a natural pick when the task is logic-heavy rather than sprawling. See how to run the Codex CLI in an autonomous loop for the wiring details and model flags.

Cursor CLI agent brings the headless cursor-agent to the loop. If your team already lives in Cursor, running the same agent unattended over a sandbox lets you review a finished diff in the morning instead of pair-programming all afternoon. The walkthrough is in running the Cursor CLI agent in a loop.

Gemini CLI is Google’s agent, driven with a -p prompt and model selection through -- --model pro. It is worth a look when you want a different model family in the mix or you are already in the Google ecosystem. The setup, including verification with tests and screenshots, is in running the Gemini CLI in a loop.

GitHub Copilot CLI slots in for teams standardized on GitHub’s tooling and auth. Ralph runs it with the same -p prompt pattern and the same sandbox boundary as the rest.

opencode is the open source option, invoked with its run subcommand. It is the pick when you want to avoid vendor lock-in or run an agent you can fully inspect and modify.

The pattern across all six is that the agent matters less than the harness around it. A weaker agent inside a good loop, with fresh context, on-disk memory, sandbox isolation, and real verification gates, will out-ship a stronger agent you babysit in a single session. That is why Ralph treats the agent as swappable. Pick one, run it, and if it stalls on your codebase, change one flag and try another.

For a deeper, side-by-side breakdown of how each CLI holds up over a multi-hour run, including where each one tends to stall, read the comparison of the best agentic CLI for long-running tasks. That post is the place to settle the Claude Code versus Codex versus Cursor versus Gemini question with the specifics.

What does verification have to do with the agent choice?

More than you would think. The loop assumes a verification stack: Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for lint, and Prettier for format. The repo mantra is blunt: if you didn’t test it, it doesn’t work.

Verification is what lets a weaker or cheaper model succeed in a loop. The agent does not need to be right on the first try. It needs to write code, run the gates, see the failures, and fix them on the next iteration. The tests are the feedback. An agent with strong reasoning but no verification gates will confidently ship broken code. An agent with modest reasoning and tight gates will grind toward correct. When you compare agents for long runs, you are really comparing how well each one responds to that feedback loop, not how clever its first draft is.
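As a rough illustration, the gates can be chained so the first failure stops the pass and becomes the feedback for the next iteration. The tools are the ones named above. The `run_gate` helper, the `verify` function, and the cheapest-first ordering are assumptions rather than Ralph's own configuration.

```shell
#!/usr/bin/env bash
# Sketch of a fail-fast verification pass. The tools are the ones named
# above; the ordering and the run_gate helper are assumptions.
run_gate() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Run from a project root that has these tools installed.
verify() {
  run_gate format npx prettier --check . &&
    run_gate lint npx eslint . &&
    run_gate types npx tsc --noEmit &&
    run_gate unit npx vitest run &&
    run_gate e2e npx playwright test
}

# Standalone demo with stand-in commands instead of the real tools:
run_gate always-passes true
run_gate always-fails false || echo "stopping: fix this gate before moving on"
```

Because `verify` returns non-zero at the first failing gate, an agent or a wrapper script can treat its exit status as a single yes-or-no answer before committing.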

Putting it together

The shortest path to an autonomous run is three commands. Install Ralph, authenticate your agent inside its sandbox, and start the loop.

# 1. Install
npx @pageai/ralph-loop

# 2. Authenticate the agent inside its sandbox
./ralph.sh --login --agent codex

# 3. Run the loop on your chosen agent and model
./ralph.sh --agent codex -n 50 -- --model gpt-5.5

From there, the loop reads .agent/tasks.json, works one task per iteration inside the sbx microVM, verifies with your test stack, commits, and either keeps going or exits on a promise tag. You pick the agent. You pick the model. The harness handles the rest, and the sandbox keeps it safe to walk away from.

If you are new to the underlying idea, start with what the Ralph technique is. If you already know it and just want to run a specific tool, jump straight to the guide for your agent above. The mechanics are shared. The flag is the only thing that changes.

Frequently asked questions

What is an agentic coding CLI?

An agentic coding CLI is an AI agent you run from the terminal that reads your files, edits them, runs commands and tests, and commits, all without you approving each step. It differs from a chat assistant because it acts on a goal in a loop instead of just answering a question.

Which agents does Ralph support?

Ralph supports six: claude (the default, Anthropic Claude Code), codex (OpenAI Codex CLI), copilot (GitHub Copilot CLI), cursor (the Cursor CLI agent), gemini (Google Gemini CLI), and opencode. You select one with the --agent flag, or -a for short.

How do I pass a model or other flags to the agent?

Put Ralph flags first, then a bare double dash, then the agent flags. Anything after the double dash is forwarded to the underlying CLI untouched. For example, ./ralph.sh --agent codex -- --model gpt-5.5 pins the Codex model, and ./ralph.sh -a gemini -- --model pro pins the Gemini model.

Why does each agent run in a Docker sandbox?

A sandbox is the boundary that makes unattended autonomy safe. Ralph runs every agent inside an isolated Docker Sandbox microVM with deny-by-default networking, so the agent can run in full bypass-permissions mode without risking your host machine, your credentials, or your SSH keys.

Which agentic CLI is best for long-running loops?

There is no single winner. Claude Code is the steady default with the richest live output, Codex is strong on hard reasoning tasks, Cursor suits teams already using Cursor, Gemini brings a different model family, Copilot fits GitHub-standardized teams, and opencode is the open source option. The harness matters more than the agent, so pick one and swap if it stalls.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop
