Observability for Autonomous Coding Agents: Logs, History, and Live Output

Terminal showing a Ralph loop live stream with step detection, timing metrics, and per-iteration history

Mar 14, 2026 - 14 min read - 2900 words

Creator of RalphLoop.sh, founder of PageAI

You cannot trust what you cannot see. An autonomous coding agent that runs for hours while you sleep is a black box unless you instrument it, so the rule for AI agent observability is simple: make every iteration leave a trail you can read after the fact and a live signal you can read during the run. Ralph does both. It streams a parsed preview of what the agent is doing right now, classifies each line into a named step, writes a clean log per iteration to disk, records a running progress file, captures screenshots per task, and commits after every task so the git log doubles as an audit trail.

This is the observability piece of the larger guide to running an AI coding agent overnight. Long runs fail quietly when you have no visibility into them, so the surfaces below are what let you walk away from a loop and still know exactly what happened when you walk back.

Why observability is the difference between trust and hope

A single prompt finishes in seconds and you read the diff. A loop of 50 iterations runs for an hour or more, edits dozens of files, runs tests, and commits along the way. If the only thing you have at the end is a final message, you are hoping the agent did the right thing 50 times in a row. Hope is not a review process.

Observability replaces hope with evidence. Every claim the agent makes (“tests pass”, “task done”) should have a file you can open to verify it. Every minute the loop spends should map to a step you can name. Every commit should be small enough to read. When those three things are true, you can audit an overnight run in the time it takes to drink a coffee, and you can catch a loop that has gone sideways before it burns another twenty iterations.

The repo mantra applies here too: if you didn’t test it, it doesn’t work. The corollary for observability is that if you cannot see it, you cannot trust it.

What an autonomous coding agent should expose

Ralph wraps your chosen agent (claude, codex, cursor, gemini, copilot, or opencode) and turns its raw output stream into a set of observability surfaces. You start a run the usual way:

./ralph.sh -n 50

From that point on, five surfaces are live or being written. Here is each one and where it lives.

Live stream preview and step detection

While the agent works, Ralph reads its stream-json output line by line, parses out the text and tool calls, and shows two things under a spinner: the current step name and a dimmed rolling preview of the latest line. You are not watching a frozen spinner wondering if the process hung. You are watching a parsed feed of what the agent is touching right now.

The step name comes from a classifier that maps output patterns to one of fourteen named steps:

Thinking, Planning, Reading code, Web research, Implementing,
Debugging, Writing tests, Testing, Linting, Typechecking,
Installing, Verifying, Waiting, Committing

A line that calls a Write or Edit tool with a file path reads as Implementing. A vitest or playwright invocation reads as Testing. A git commit reads as Committing. An eslint or prettier run reads as Linting. The point is not perfect accuracy on every line. The point is that at a glance you know whether the agent is reading code, writing it, or running tests, without parsing raw JSON yourself.

The Waiting step is the one to watch. It fires on patterns like a question prompt or “blocked on”, which on an unattended loop usually means the agent is stuck asking for input that nobody is there to give.
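
A classifier of this kind can be sketched as a single shell function. This is a minimal illustration, not Ralph's actual rules: the function name detect_step is borrowed from the diagram in this article, the JSON fragment used to spot a Write or Edit tool call is an assumption about the stream format, the fallback to Thinking is an assumption, and only a handful of the fourteen steps are covered.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map one line of agent output to a step name.
# Patterns are illustrative assumptions; the first matching pattern wins.
detect_step() {
  local line="$1"
  case "$line" in
    *"git commit"*)        echo "Committing" ;;
    *vitest*|*playwright*) echo "Testing" ;;
    *eslint*|*prettier*)   echo "Linting" ;;
    # Assumed shape of a tool call in the JSON stream.
    *'"name":"Write"'*|*'"name":"Edit"'*) echo "Implementing" ;;
    *"blocked on"*)        echo "Waiting" ;;
    # Assumed default when nothing else matches.
    *)                     echo "Thinking" ;;
  esac
}
```

Calling detect_step 'npx vitest run' prints Testing. Because the first match wins, pattern order is a design choice: a line that mentions both a file edit and a test runner is labeled by whichever pattern comes first.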

Per-iteration history in .agent/history/

Every iteration writes its full output, with the ANSI color codes stripped out, to a timestamped file:

.agent/history/ITERATION-<session>-<n>.txt

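The session id is a YYYYMMDD-HHMMSS stamp taken when the run starts, so a fresh run never overwrites the history of an earlier one. Iteration 7 of a run that started at 02:15:00 lands in ITERATION-20260314-021500-7.txt. To replay what the agent thought and did on a specific iteration, open that file. To follow the latest one while a run is going, point tail at the newest file in the directory. A bare glob would follow every matching file, including those from earlier sessions, so pick the most recently modified one and rerun the command when the next iteration starts:

# follow the most recently modified transcript
tail -f "$(ls -t .agent/history/ITERATION-*.txt | head -n 1)"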

This is the most underrated surface. The live preview is ephemeral, but the history file is the full, clean transcript of a single fresh-context iteration. When a task goes wrong three iterations back, this is where you find out why.
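
The naming scheme and the ANSI stripping described above could look roughly like this. It is a sketch under stated assumptions, not Ralph's code: the helper names new_session_id, history_path, and strip_ansi are hypothetical, and the filter only removes color sequences of the form ESC [ ... m.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the history naming scheme; helper names are invented.
history_dir=".agent/history"

# Session id: one timestamp taken when the run starts.
new_session_id() { date +%Y%m%d-%H%M%S; }

# Build the transcript path from a session id and an iteration number.
history_path() { printf '%s/ITERATION-%s-%s.txt\n' "$history_dir" "$1" "$2"; }

# Remove ANSI color sequences (ESC [ ... m) from stdin.
# The ESC byte comes from printf so the filter works with GNU and BSD sed.
strip_ansi() {
  local esc
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g"
}
```

A loop would take the session id once, then pipe each iteration's output through strip_ansi into the file named by history_path "$session" "$n".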

The progress log: .agent/logs/LOG.md

.agent/logs/LOG.md is the human-readable run journal. Ralph creates it on first run, and the agent adds an entry per task at the top of the file, with the date, a brief summary, and the path to the screenshot it captured. It is the high-level story of the run; the history files are the line-by-line detail.

Read it top to bottom in the morning and you get the narrative, most recent task first: what shipped, in what order, and where to look for the visual proof of each step.

# the story of the run, newest first
head -n 40 .agent/logs/LOG.md

Screenshots per task

Step four of every iteration is “complete the task, take a screenshot, update status, and commit.” The agent saves UI screenshots to .agent/screenshots/TASK-<id>-<index>.png and references that path in the log entry. For anything with a UI, this is the difference between trusting a green test and seeing the rendered result.

Screenshots also feed back into the loop. When the agent debugs a regression, it uses earlier screenshots as a reference for what the UI looked like before the change. That makes the screenshot folder both an audit artifact for you and a memory aid for the next fresh-context iteration.
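
Since each log entry points at a screenshot, one cheap audit is to confirm that every referenced file exists. The helper below is a hypothetical sketch: missing_screenshots is not part of Ralph, and it assumes the paths appear literally in the log in the .agent/screenshots/TASK-<id>-<index>.png form with ids made of letters, digits, underscores, or hyphens.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper: print screenshot paths that the log mentions
# but that are missing on disk. Paths are resolved from the current directory.
missing_screenshots() {
  local log="${1:-.agent/logs/LOG.md}" path
  grep -oE '\.agent/screenshots/TASK-[A-Za-z0-9_-]+\.png' "$log" | sort -u |
    while IFS= read -r path; do
      [ -f "$path" ] || echo "$path"
    done
}
```

Run it from the project root. Empty output means every screenshot the log claims is actually there.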

Timing metrics per step and per iteration

Ralph times each iteration and each step inside it. After every iteration it prints the iteration duration, the delta against the previous iteration (green when faster, red when slower, in a stock-ticker style), a running average, and the total elapsed time. It also breaks the iteration down by step, sorted by time spent, so you can see that an iteration spent most of its minutes in Testing and Debugging rather than Implementing.

At the end of the run it prints session totals across all iterations. Timing is a cheap, powerful signal: iterations that keep getting longer, or that spend a growing share of time in Debugging, are a thrashing loop telling on itself before it blows your budget. Watching that trend is the core of cost control for autonomous AI coding agents.
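
The arithmetic behind that timing line is small enough to sketch. The function below is illustrative only: timing_summary is a hypothetical name, the output format is an assumption, and the green and red coloring is left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a per-iteration timing line (whole seconds).
# Usage: timing_summary <previous_seconds> <current_seconds> <total_seconds> <iterations>
timing_summary() {
  local prev="$1" cur="$2" total="$3" count="$4"
  local delta=$(( cur - prev )) avg=$(( total / count )) trend
  if   [ "$delta" -lt 0 ]; then trend="faster"
  elif [ "$delta" -gt 0 ]; then trend="slower"
  else trend="same"
  fi
  # %+d prints the delta with an explicit sign, like a stock ticker.
  printf 'iteration %ss (%+ds %s) avg %ss total %ss\n' \
    "$cur" "$delta" "$trend" "$avg" "$total"
}
```

In bash, the built-in SECONDS variable is a convenient clock: store its value before the agent starts and subtract afterwards to get the iteration duration.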

Here is how the surfaces sit around a single iteration of the loop.

flowchart TD
    Agent["Agent runs (fresh context)"] --> Stream["stream-json output"]
    Stream --> Live["Live: spinner step + rolling preview"]
    Stream --> Detect["detect_step: Thinking / Implementing / Testing ..."]
    Stream --> Hist[".agent/history/ITERATION-(session)-(n).txt"]
    Agent --> Shots[".agent/screenshots/TASK-(id)-(index).png"]
    Agent --> Log[".agent/logs/LOG.md (newest first)"]
    Agent --> Commit["git commit per task"]
    Detect --> Timing["Per-step and per-iteration timing"]
    Agent --> Promise{"Promise tag?"}
    Promise -->|"none"| Next["Next iteration"]
    Promise -->|"BLOCKED or DECIDE"| Notify["Desktop notification + sound"]
    Promise -->|"COMPLETE"| Exit["exit 0"]

Git history as the audit trail

The strongest observability surface is one you already know how to read. Ralph follows one rule: one task per invocation. The agent completes exactly one task, commits, and stops, then the next iteration starts fresh. It never batches several tasks into a single commit. That discipline, covered in depth in the main guide to running an agent overnight, means the git log is a clean, chronological record of the whole run, one commit per finished task.

So your morning review is a normal code review:

# what landed overnight, one line per task
git log --oneline --since="12 hours ago"

# the full diff for a single suspicious task
git show <commit>

# everything since you walked away
git diff HEAD@{12.hours.ago}

Small commits keep each diff reviewable, which is the whole reason the one-task rule exists. A loop that commits 40 tiny, well-scoped changes is auditable. A loop that drops one giant commit at the end is not. Git history is also the agent’s memory layer: each fresh-context iteration reads recent commits, the task list, and the progress log to reorient, rather than carrying a bloated transcript forward.

Notifications when the agent needs you

Most of a loop runs without you. The two moments you actually need to know about are when the agent gets stuck or hits a fork it cannot resolve alone. Ralph surfaces both through promise tags the agent emits in its final message:

<promise>COMPLETE</promise> means every task is done. The loop exits with code 0.
<promise>BLOCKED:reason</promise> means the agent needs human help. The loop exits with code 2.
<promise>DECIDE:question</promise> means it needs a decision you have to make. The loop exits with code 3.

Hitting the iteration cap without completing exits with code 1. The full mechanics of how these signals stop a run, and why a loop should stop on an explicit promise rather than a vibe, are in the guide to completion promises and exit codes.
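
The mapping from promise tag to exit code can be sketched as follows. The exit codes come from the list above; the parsing itself is an assumption, and promise_exit_code is a hypothetical helper rather than Ralph's implementation. Exit code 1 for the iteration cap would be handled by the surrounding loop, not here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the exit code implied by the promise tag in an
# agent's final message, or "none" when the loop should keep going.
promise_exit_code() {
  local message="$1" tag
  # Take the last <promise>...</promise> tag in the message, if any.
  tag=$(printf '%s\n' "$message" | grep -oE '<promise>[^<]*</promise>' | tail -n 1 || true)
  case "$tag" in
    "<promise>COMPLETE</promise>") echo 0 ;;
    "<promise>BLOCKED:"*)          echo 2 ;;
    "<promise>DECIDE:"*)           echo 3 ;;
    *)                             echo "none" ;;
  esac
}
```

Matching on an explicit tag rather than on phrases like "all done" keeps the stop condition unambiguous: prose that merely sounds finished does not end the run.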

For observability, the important part is that BLOCKED and DECIDE are not silent. When either fires, Ralph plays a notification sound and sends a desktop notification (via osascript on macOS, notify-send on Linux, or PowerShell on Windows) with the reason or the question. You can leave a loop running in another workspace and trust that your machine will get your attention the moment a human is actually needed, instead of finding a stalled run an hour later. When you do get pulled in, you often do not need to stop the loop at all. You can redirect it with a STEERING.md file that injects work into a running agent.
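
A cross-platform notification like the one described above can be approximated in a few lines. Treat this as a sketch, not Ralph's code: notifier_for and notify are hypothetical names, the Windows PowerShell path and the sound are omitted, and the stderr fallback is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choose a desktop notifier from the OS name.
notifier_for() {
  case "$1" in
    Darwin) echo "osascript" ;;
    Linux)  echo "notify-send" ;;
    *)      echo "none" ;;
  esac
}

# Send one notification with a title and a body.
notify() {
  local title="$1" body="$2"
  case "$(notifier_for "$(uname -s)")" in
    osascript)
      # Pass the text as arguments so quotes in the message cannot break the script.
      osascript -e 'on run argv' \
                -e 'display notification (item 2 of argv) with title (item 1 of argv)' \
                -e 'end run' "$title" "$body" ;;
    notify-send) notify-send "$title" "$body" ;;
    *)           printf '%s: %s\n' "$title" "$body" >&2 ;;
  esac
}
```

A wrapper could call notify "Agent blocked" "$reason" whenever the loop exits with code 2 or 3.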

How to read the signals and catch a stuck loop early

Visibility only helps if you know which patterns mean trouble. Here is how the surfaces combine into early warnings.

Iteration time climbing without new commits. If durations grow but git log shows no new tasks landing, the agent is spinning. The timing line and the commit log together catch this faster than either alone.

A step breakdown dominated by Debugging or Testing. A healthy iteration spends real time in Implementing. When the per-step breakdown is mostly Debugging across several iterations, the agent is fighting the same failure. Open the latest .agent/history/ file to see which one.

The Waiting step on an unattended run. Waiting means the agent is asking for input. With nobody at the keyboard, that iteration will not progress. This is a prompt problem: the task lacks a clear completion criterion, so the agent does not know it is allowed to finish.

A BLOCKED or DECIDE notification. This is the agent doing the right thing. It hit a wall it cannot or should not pass alone, emitted the promise, and stopped with a non-zero exit code. Read the reason in the notification and in the on-screen message, fix or decide, then run ./ralph.sh again to resume.

A flat LOG.md. If the progress log stops getting new entries while the run is still going, the agent is not completing tasks. Cross-check against the live step and the history file to see where it is stuck.
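
The first of these signals, rising iteration time with no new commits, is easy to check mechanically. The helpers below are a hypothetical heuristic, not something Ralph ships: durations_climbing and recent_commit_count are invented names, and the thresholds in the example are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: succeed (exit 0) when the durations passed as
# arguments, in seconds and oldest first, are strictly increasing.
durations_climbing() {
  [ "$#" -ge 2 ] || return 1
  local prev="$1" d
  shift
  for d in "$@"; do
    [ "$d" -gt "$prev" ] || return 1
    prev="$d"
  done
  return 0
}

# Count commits that landed within a window such as "30 minutes ago".
recent_commit_count() {
  git log --oneline --since="$1" | wc -l | tr -d ' '
}

# Example: flag a loop whose last three iterations took 310s, 420s, and 560s
# while nothing was committed in the past half hour.
# if durations_climbing 310 420 560 &&
#    [ "$(recent_commit_count '30 minutes ago')" -eq 0 ]; then
#   echo "loop may be spinning"
# fi
```

Neither check is conclusive alone. A hard task can legitimately take longer, which is why the example only raises a flag when both conditions hold.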

When the signals point somewhere you need to inspect directly, get inside the sandbox. Ralph runs each agent in an isolated Docker Sandbox, and you can open a shell in the running box to re-run a failing command, read the logs from inside, or check the working tree:

sbx ls
sbx exec -it <sandbox-name> bash

Print the exact sandbox name for your project with ./ralph.sh --print-name. Between the live stream, the per-iteration history, the progress log, the screenshots, the timing metrics, the git log, and the notifications, an autonomous run stops being a black box. You get a system you can audit while it runs and after it finishes, which is the only honest basis for letting an agent code unattended.

Frequently asked questions

What does observability mean for an autonomous coding agent?

It means every iteration leaves a trail you can read and a live signal you can watch. For a Ralph loop that is a parsed live stream with step detection, a clean per-iteration transcript in .agent/history/, a running progress log in .agent/logs/LOG.md, screenshots per task in .agent/screenshots/, timing metrics per step and per iteration, and one git commit per task. Together they let you audit a run while it happens and after it finishes.

Where does Ralph store per-iteration logs and history?

Each iteration writes its full output with ANSI codes stripped to .agent/history/ITERATION-<session>-<n>.txt, where session is a YYYYMMDD-HHMMSS stamp taken at the start of the run so new runs never overwrite old history. The higher-level run journal lives in .agent/logs/LOG.md, to which the agent adds an entry per task with a date, a summary, and a screenshot path, newest entry first.

How do I watch what an AI agent is doing in real time?

Ralph reads the agent's stream-json output line by line and shows a spinner with the current step name plus a dimmed rolling preview of the latest line. The step comes from a classifier that maps output to fourteen named steps such as Thinking, Implementing, Testing, and Committing. To follow the full transcript live, run tail -f on the .agent/history/ file for the current iteration.

How do I know when an autonomous agent needs me?

The agent emits a promise tag in its final message. BLOCKED means it needs help and exits with code 2, DECIDE means it needs a decision and exits with code 3, COMPLETE means it finished and exits with code 0, and hitting the iteration cap exits with code 1. When BLOCKED or DECIDE fires, Ralph plays a sound and sends a desktop notification with the reason, so you can leave the loop running and still be pulled in only when a human is required.

How do I catch a stuck or thrashing agent loop early?

Watch three things together. Iteration time climbing with no new commits in git log means the agent is spinning. A per-step breakdown dominated by Debugging across iterations means it is fighting the same failure. The Waiting step on an unattended run means the task lacks a clear completion criterion. When any of these show up, open the latest .agent/history/ file or shell into the sandbox with sbx exec to see exactly where it is stuck.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop
