RALPH LOOP

Verification Loops: Why Autonomous Agents Need Tests and Screenshots

Diagram of an autonomous coding agent running tests and screenshots, reading the failures, and feeding them back into the next iteration.

If you didn’t test it, it doesn’t work. That one rule is what makes autonomy safe. An AI agent that writes code with no way to check the result is just generating plausible text. The same agent wired to a verification loop (run the tests, read the failures, fix them, run again) can grade its own work and only mark a task done when the evidence says so. Verification is not a nice extra on top of an autonomous loop. It is the feedback signal the loop runs on.

This post is about that signal: what the verification stack looks like, why type checks and lint are the cheap gates you run first, why screenshots are the proof for UI work, and how the agent feeds a failing test back into the next iteration instead of declaring victory. The short version: every task ends with a result a machine can confirm, and the agent reads that result before it moves on.

Why verification is what makes autonomy safe

The danger of an autonomous agent is not that it writes bad code once. It is that it writes bad code, believes the code is fine, marks the task done, and builds the next ten tasks on top of the broken one. By the time you wake up, the loop has compounded a small mistake into a tangled diff. Verification is the thing that stops compounding. It forces the agent to confront reality at the end of every task.

An agent left to self-assess on vibes will tell you it is confident. Confidence is not evidence. A failing assertion is evidence. A red type error is evidence. A screenshot that shows the button in the wrong place is evidence. The job of a verification loop is to replace the agent’s opinion of its work with a result that came from running the work.

This is why the loop architecture and the verification stack are inseparable. The overnight run pillar covers how a Ralph loop keeps an agent productive for hours by resetting context and storing state on disk. None of that matters if the recorded state is a lie. Verification is what keeps status: done honest, so a fresh agent in the next iteration can trust the disk instead of re-checking everything. The mantra in the repo is blunt: if you didn’t test it, it doesn’t work.

In a Ralph loop, each iteration follows the same shape, and verification sits in the middle of it:

  1. Find the highest-priority incomplete task in .agent/tasks.json.
  2. Work the steps in .agent/tasks/TASK-{ID}.json.
  3. Run tests, linting, and type checking.
  4. Complete the task, take a screenshot, update task status, and commit.
  5. Repeat until all tasks pass or the iteration cap is reached.

Step 3 is the gate. A task does not reach step 4 until step 3 is green. That single ordering is what separates an autonomous loop you can leave running from a code generator you have to babysit.
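Steps 1 and 4 can be sketched as two small pure functions. This is only an illustration: the post does not show the shape of .agent/tasks.json, so the `TaskEntry` fields, the status values, and the rule that a lower number means higher priority are all assumptions.

```typescript
// Hypothetical shape of one entry in .agent/tasks.json (assumed, not from the repo).
type TaskStatus = "todo" | "in_progress" | "done";

interface TaskEntry {
  id: string;
  priority: number; // assumption: lower number means higher priority
  status: TaskStatus;
}

// Step 1: pick the highest-priority task that is not done yet.
// Returns undefined when every task is done, which ends the loop.
function nextTask(tasks: TaskEntry[]): TaskEntry | undefined {
  return tasks
    .filter((task) => task.status !== "done")
    .sort((a, b) => a.priority - b.priority)[0];
}

// Step 4 only happens after step 3 is green: the task is marked done
// when the gates passed, and returned unchanged otherwise.
function completeTask(task: TaskEntry, gatesGreen: boolean): TaskEntry {
  return gatesGreen ? { ...task, status: "done" } : task;
}
```

The point of `completeTask` taking `gatesGreen` as an argument is that there is no code path that sets the status to done without a gate result in hand.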

The verification stack: tests, types, lint, format, screenshots

The loop assumes a verification stack of five tools, each catching a different class of mistake: TypeScript, ESLint, Prettier, Vitest, and Playwright, which also captures the screenshots. You run them as gates, fastest and cheapest first, so the agent gets a signal in seconds instead of waiting on a full browser run for a problem a type check would have caught.

Run the static checks first because they are fast and they catch the dumbest mistakes. A type error or a lint failure tells the agent the code is wrong before a single test boots a runtime.

# Type check: no emit, just verify the types hold
npx tsc --noEmit
# Lint: catch unused vars, bad imports, banned patterns
npx eslint .

These run in seconds and they fail loud. An agent that renamed a function but missed a caller gets a type error pointing at the exact file and line. That is a precise signal the agent can act on without guessing. Cheap gates first means the expensive gates (the browser tests) only run on code that already passes the basics.

Formatting is not about taste in an autonomous loop. It is about keeping the morning diff readable. If every iteration reformats the file its own way, your git diff fills with noise and you cannot see what actually changed.

# Verify formatting without writing changes
npx prettier --check .

Run this as a gate and the agent is forced to leave the code in the canonical format. The reviewer (you) gets a diff that shows logic changes, not whitespace churn.

Unit tests are where the agent proves the logic does what the task said. Vitest is fast enough to run on every iteration, which is the property that matters. A test suite you only run nightly is not part of the feedback loop.

# Run the unit suite once, no watch mode
npx vitest run

A unit test failure gives the agent the most actionable signal of all: an expected value, an actual value, and the exact assertion that broke. The agent reads that diff and knows precisely what its change got wrong. This is the difference between “something is off” and “the function returned 3 when the test expected 4”.

For anything a user clicks through, unit tests are not enough. Playwright drives a real browser, so the agent verifies the flow end to end: navigate, fill the form, submit, assert the result on screen.

# Run the end-to-end suite headless
npx playwright test

End-to-end tests catch the class of bug that passes every unit test and still breaks in the browser: a missing prop, a broken route, a handler wired to the wrong element. They are slower, which is exactly why they sit last in the gate order. By the time Playwright runs, the cheap checks have already filtered out the obvious failures.
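The gate order described so far can be sketched as a small runner that stops at the first failure. This is a minimal sketch, not the repo's implementation: the `Gate`, `ExecResult`, and `GateReport` types and the injected `exec` function are hypothetical, and only the five commands come from the text above.

```typescript
// One gate: a label plus the command this post runs for it.
interface Gate {
  name: string;
  command: string;
}

// Result of running one command: exit code plus captured output.
interface ExecResult {
  exitCode: number;
  output: string;
}

interface GateReport {
  green: boolean;
  failedGate?: string; // name of the first gate that failed
  output?: string; // raw output of that gate, for the agent to read
}

// Cheapest gates first, browser tests last.
const GATES: Gate[] = [
  { name: "types", command: "npx tsc --noEmit" },
  { name: "lint", command: "npx eslint ." },
  { name: "format", command: "npx prettier --check ." },
  { name: "unit", command: "npx vitest run" },
  { name: "e2e", command: "npx playwright test" },
];

// Run gates in order and stop at the first failure, so a slow gate
// never runs on code that already failed a cheap one.
// `exec` is injected (a hypothetical helper) to keep the sketch testable.
function runGates(gates: Gate[], exec: (command: string) => ExecResult): GateReport {
  for (const gate of gates) {
    const result = exec(gate.command);
    if (result.exitCode !== 0) {
      return { green: false, failedGate: gate.name, output: result.output };
    }
  }
  return { green: true };
}
```

In a real loop, `exec` would wrap something that actually runs the command and captures its output, for example Node's `child_process.spawnSync`. Stopping at the first failure is a design choice: it gives the agent one precise failure to read rather than five overlapping ones.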

A test that passes tells you the DOM is correct. It does not tell you the page looks right. For UI work, the agent takes a screenshot and that screenshot is the proof. It is the one artifact that lets a human (or a vision-capable agent) confirm the thing actually renders the way the task described.

Playwright captures screenshots as part of a test run, so this folds into the same gate:

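import { test, expect } from '@playwright/test';
test('dashboard shows Save and captures proof', async ({ page }) => {
  // Relative path assumes baseURL is set in the Playwright config
  await page.goto('/dashboard');
  await expect(page.getByRole('button', { name: 'Save' })).toBeVisible();
  // Capture proof of the rendered state
  await page.screenshot({ path: 'artifacts/dashboard.png', fullPage: true });
});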

Two reasons screenshots earn their place in the loop. First, they catch what assertions miss. A button can be present in the DOM, pass every toBeVisible check, and still sit behind a modal or off the edge of the viewport. The screenshot shows it. Second, they are the audit trail. When you review a long autonomous run in the morning, the screenshots are how you confirm each UI task landed without re-running anything yourself. That auditability is the same idea covered in observability for autonomous coding agents: you cannot trust what you cannot see, and a screenshot is the cheapest way to see it.

Screenshots and type checks sit at opposite ends of the cost spectrum. Type and lint checks are nearly free and run constantly. A full browser screenshot is expensive and runs once per UI task at the end. You want both: the cheap gates to fail fast on logic, the screenshot to confirm the pixels.

How verification results feed the next iteration

Here is the part that turns verification from a checkbox into a loop. The agent does not just run the gates and pass or fail. It reads the failing output and feeds it back into the next attempt. A stack trace, a failed assertion, a type error: each is structured feedback the agent uses to make the specific fix, then it runs the gates again.

flowchart TD
  Task["Read task spec and acceptance criteria"]
  Implement["Implement the change"]
  Verify["Run gates: tsc, eslint, prettier, vitest, playwright"]
  Pass{"All gates green?"}
  ReadFail["Read failing output: stack trace, assertion, type error"]
  Fix["Make the targeted fix"]
  Screenshot["Capture screenshot for UI work"]
  Commit["Update tasks.json, commit"]
  Task --> Implement --> Verify --> Pass
  Pass -->|no| ReadFail --> Fix --> Verify
  Pass -->|yes| Screenshot --> Commit

The inner cycle (verify, read failure, fix, verify) is the verification loop proper. It can run several times inside a single task before the gates go green. This is the same reason iteration beats a single shot in general: the agent sees its own mistake and corrects it instead of guessing once and hoping. The Ralph loop vs one-shot prompting comparison makes that case directly. A one-shot prompt has no failing test to read, so it cannot self-correct. A verification loop hands the agent a precise error message and a chance to act on it.
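The inner cycle can be sketched as a bounded loop. This is an illustration under stated assumptions: `verify` and `fix` are hypothetical stand-ins for the real gate runner and the agent's edit step, and the `maxAttempts` cap is an assumption about how a single task might be bounded.

```typescript
// Outcome of one pass through the gates (hypothetical shape).
interface GateOutcome {
  green: boolean;
  failureOutput?: string; // stack trace, assertion diff, or type error
}

interface LoopResult {
  green: boolean;
  attempts: number;
}

// Inner cycle: verify, read the failure, fix, verify again.
// Stops as soon as the gates are green or the attempt cap is reached.
function verifyUntilGreen(
  verify: () => GateOutcome,
  fix: (failureOutput: string) => void,
  maxAttempts: number,
): LoopResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = verify();
    if (outcome.green) {
      return { green: true, attempts: attempt };
    }
    // Feed the failing output into the next attempt, but do not make
    // an edit that would never be verified on the final attempt.
    if (attempt < maxAttempts) {
      fix(outcome.failureOutput ?? "");
    }
  }
  // Cap reached without a green run: nothing is committed.
  return { green: false, attempts: maxAttempts };
}
```

The caller commits only when the result is green. A one-shot prompt is the degenerate case of this loop with a single attempt and no `fix` step.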

The quality of the feedback decides how well this works. Good test output names the file, the line, and the difference between expected and actual. That precision is what lets a fresh-context agent fix a bug it has never seen, because the failure itself carries enough information to locate and correct the problem. Vague output (a generic “something failed” with no detail) starves the loop of signal and the agent thrashes. Invest in error messages and assertions that say exactly what went wrong.

This also connects to how the loop keeps its memory honest. Because each task only commits after the gates pass, the git history and .agent/tasks.json become a trustworthy record. A later iteration that reads status: done does not need to re-verify; the commit is the receipt. That is the discipline described in context engineering for long-running agents, where the filesystem and git log are the memory layer. Verification is what makes that memory worth trusting.

How do you design machine-checkable acceptance criteria?

Verification only works if the task tells the agent what “done” means in terms a machine can check. An acceptance criterion like “the login page should work well” is useless to a loop. There is nothing to run. A criterion like “submitting valid credentials redirects to /dashboard and the Vitest suite for auth passes” is checkable. The agent can run it and get a yes or no.

The rule of thumb: every acceptance criterion in a .agent/tasks/TASK-{ID}.json spec should map to a gate. Write criteria that a test, a type check, or a screenshot can confirm. If you cannot point at the command that proves a criterion, the criterion is too vague to belong in an autonomous task.

What machine-checkable criteria look like in practice:

  • Bind to a named test. “The new parsePrice unit test passes” beats “prices parse correctly”. The agent runs npx vitest run and reads the result.
  • Bind to an end-to-end flow. “A user can add an item to the cart and the cart count shows 1” maps to a Playwright spec the agent can run.
  • Bind to a screenshot. “The settings page renders with the dark-mode toggle visible” is confirmed by capturing the page and checking the toggle is in frame.
  • Bind to the static gates. “No type errors, no lint errors, formatting clean” is the floor every task clears.
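One way to enforce the rule of thumb is to lint the task spec itself before the loop starts. The sketch below assumes a hypothetical `Criterion` shape with an optional `command` field. The post does not show the real TASK-{ID}.json schema, so treat the field names as placeholders.

```typescript
// Hypothetical acceptance criterion inside a TASK-{ID}.json spec.
// Field names are assumptions; the source does not show the schema.
interface Criterion {
  description: string;
  command?: string; // the command that proves the criterion, if any
}

// Return the criteria a machine cannot check:
// no command attached, or only whitespace.
function uncheckable(criteria: Criterion[]): Criterion[] {
  return criteria.filter((c) => !c.command || c.command.trim() === "");
}

const criteria: Criterion[] = [
  { description: "The new parsePrice unit test passes", command: "npx vitest run" },
  { description: "Cart count shows 1 after adding an item", command: "npx playwright test" },
  { description: "No type errors", command: "npx tsc --noEmit" },
  { description: "The login page should work well" }, // nothing to run
];

// Only the last criterion is flagged, because it has no command.
const vague = uncheckable(criteria).map((c) => c.description);
```

A spec with a non-empty `vague` list would be sent back for rewriting before any agent picks it up.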

Criteria written this way are what let the agent finish a task without asking you. It knows it is done because the gates it was given are green. This is the same property that makes one task per iteration reliable: an atomic task with checkable criteria is a unit the agent can verify on its own, commit, and move past. A task with fuzzy criteria forces the agent to guess at completion, which is exactly when an autonomous loop goes off the rails.

A useful habit is to write the test first as part of the spec. When the task spec includes the failing test the agent must make pass, the acceptance criterion and the verification command are the same thing. The agent’s whole job becomes “turn this red test green”, and the loop can confirm completion mechanically.

Putting it together: verification as the loop’s nervous system

Verification is not a phase that runs after the work. It is the nervous system the autonomous loop runs through. Strip it out and every other part of the architecture loses its meaning. Fresh context per iteration is pointless if the agent records unverified work. Disk-based memory is a liability if the state on disk is wrong. Atomic tasks do not help if “done” is a guess.

The stack does the catching, cheapest gate first:

  • TypeScript and ESLint fail in seconds on the obvious mistakes.
  • Prettier keeps the diff clean so the morning review is fast.
  • Vitest proves the logic with precise, actionable assertions.
  • Playwright proves the user flow in a real browser.
  • Screenshots prove the UI renders the way the task described.

The loop does the correcting: run the gates, read the failures, make the targeted fix, run again, and only commit when everything is green. Acceptance criteria that bind to those gates are what let the agent grade itself and a human trust the result without redoing it.

Get this right and an autonomous run stops being a leap of faith. Every task in the morning diff is backed by a green suite and a screenshot. You are not trusting the agent. You are trusting the evidence it was forced to produce. That is what makes it safe to close the laptop and let the loop run.

Frequently asked questions

What is a verification loop for an AI coding agent?

A verification loop is the cycle where an agent implements a change, runs automated gates like type checks, lint, unit tests, and end-to-end tests, reads any failures, makes a targeted fix, and runs the gates again until they pass. Only then does it mark the task done and commit. The verification result, not the agent’s confidence, decides when a task is finished.

Which tests should an autonomous agent run on every iteration?

Run the cheap static gates first because they are fast: TypeScript with no emit to catch type errors, ESLint for code issues, and Prettier in check mode for formatting. Then run Vitest for unit logic and Playwright for end-to-end flows. Run them in that order so the slow browser tests only execute on code that already passes the basics.

Why do AI agents take screenshots when they finish UI work?

A passing test confirms the DOM is correct but not that the page looks right. A screenshot is proof that the UI actually renders the way the task described, catching things like an element hidden behind a modal or pushed off the viewport. Screenshots also serve as an audit trail, so a human reviewing a long run can confirm each UI task landed without re-running anything.

How do verification results feed back into the next iteration?

The agent reads the failing output: a stack trace, a failed assertion with expected and actual values, or a type error pointing at a file and line. That precise signal tells the agent what to fix. It makes the targeted change and runs the gates again. Because a task only commits after the gates pass, later iterations can trust the recorded status instead of re-checking the work.

What makes an acceptance criterion machine-checkable?

A machine-checkable criterion maps to a command that returns a yes or no. Bind each criterion to a named test, an end-to-end flow, a screenshot, or the static gates. For example, “submitting valid credentials redirects to /dashboard and the auth Vitest suite passes” is checkable, while “the login page should work well” is not. If you cannot point at the command that proves a criterion, it is too vague for an autonomous task.