Spec-Driven Development With AI: PRDs, Task Lists, and Breakdown

Diagram of spec-driven development with AI: a PRD feeds a task list that drives an autonomous coding loop.

Feb 2, 2026 - 18 min read - 3800 words

Creator of RalphLoop.sh, founder of PageAI

Spec-driven development with AI means you hand a coding agent a written specification instead of a vague prompt. The spec is three artifacts: a PRD that states what to build and why, a task list that breaks the work into atomic units, and acceptance criteria that define how each unit gets verified. Give an agent that and it has something concrete to check its own work against. Give it vibes and it guesses, drifts, and ships code nobody asked for.

This post covers the full loop: how to specify, plan, break down, and implement, and how the Ralph technique runs that loop autonomously from files on disk. If you write production code with agents, this is the part of the workflow that decides whether you trust the diff in the morning.

What spec-driven development with AI actually means

A specification is a contract between you and the agent. The contract has three layers, and each one answers a different question.

The PRD answers “what are we building and why.” It captures goals, constraints, the target user, the technical stack you already committed to, and the things that are explicitly out of scope. It is the document a human reads to understand the project in five minutes.

The task list answers “what is the next unit of work.” It decomposes the PRD into small, independently shippable pieces. Each piece is something an agent can finish in one sitting without holding the whole project in its head.

The acceptance criteria answer “how do we know this unit is done.” They are verifiable conditions, not opinions. “Login works” is not a criterion. “Invalid email shows the error text Please enter a valid email” is. The agent can run the app, check the condition, and get a clear yes or no.
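To make the distinction concrete, the email criterion above can be written as a check with a yes-or-no outcome. This is a minimal sketch, not code from any real project: `validateEmail` and its deliberately loose pattern are assumptions made for illustration.

```typescript
// Hypothetical validator for the example criterion. The pattern is a loose
// illustration, not a complete email grammar.
const INVALID_EMAIL_MESSAGE = "Please enter a valid email";

function validateEmail(input: string): string | null {
  const looksValid = /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(input.trim());
  // Return the exact error text the criterion names, or null when valid.
  return looksValid ? null : INVALID_EMAIL_MESSAGE;
}

// The criterion is checkable: a given input either yields the text or not.
const criterionMet = validateEmail("not-an-email") === INVALID_EMAIL_MESSAGE;
console.log(criterionMet);
```

"Login works" cannot be turned into a comparison like this, which is why it fails as a criterion.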

Vibe coding skips all three. You type a sentence, the model produces a wall of code, and you eyeball it. That is fine for a throwaway script. For a feature that ships to users, the missing contract is exactly where bugs, scope creep, and rework come from. The trade-offs between the two styles are worth their own discussion, which is why there is a dedicated breakdown of spec-driven development versus vibe coding.

The Specify, Plan, Tasks, Implement loop

The cleanest framing of this workflow comes from GitHub Spec Kit, which names four phases: Specify, Plan, Tasks, Implement. The order matters because each phase produces an artifact the next phase consumes.

flowchart LR
  Specify["Specify: what and why"] --> Plan["Plan: technical approach"]
  Plan --> Tasks["Tasks: atomic units"]
  Tasks --> Implement["Implement: build and verify"]
  Implement -->|"gaps found"| Specify

Specify is where you write down intent. Not implementation, intent. What does the user need? What does success look like? What is off limits? You are forcing the fuzzy idea in your head into prose that another person (or an agent) can read without you in the room.

Plan turns intent into a technical approach. Which framework, which data model, which existing modules you reuse instead of rewriting. This is where you make the architectural calls a coding agent should not be guessing at on its own.

Tasks decomposes the plan into a list of units, each with its own acceptance criteria. This is the phase most people rush, and it is the one that decides whether the agent succeeds. A good task is small, has clear inputs and outputs, and can be verified on its own.

Implement is where the agent actually writes code. It picks one task, builds it, verifies it against the criteria, and moves on. If implementation surfaces a gap (a requirement nobody thought of, a constraint that does not hold), you go back and amend the spec rather than letting the agent improvise.

The loop is the point. Specs are not a waterfall document you write once and freeze. You learn things during implementation, and you feed them back into the PRD and the task list. The artifacts stay alive.

How Ralph turns a spec into an autonomous run

Ralph is a Bash loop that runs an agentic coding CLI against your project until the task list is done. The technique was popularized by Geoffrey Huntley, whose original Ralph writeup described it bluntly as a Bash loop. What makes the loop work is not the shell script. It is the spec sitting on disk that the agent reads on every iteration.

Here is how the four Spec Kit phases map onto files Ralph actually reads.

flowchart TD
  Requirements["Unstructured requirements"] --> PRD["prd/PRD.md and prd/SUMMARY.md"]
  PRD --> Tasks["tasks.json lookup table"]
  Tasks --> Specs["tasks/TASK-{ID}.json specs"]
  Specs --> Loop["ralph.sh loop, fresh context each iteration"]
  Loop --> Pick["Pick highest-priority incomplete task"]
  Pick --> Work["Work the steps in one TASK-{ID}.json"]
  Work --> Verify["Run tests, lint, types, screenshot"]
  Verify --> Commit["Commit and set passes true"]
  Commit --> Done{"All tasks pass?"}
  Done -->|"no"| Loop
  Done -->|"yes"| Complete["promise COMPLETE, exit 0"]

Every iteration starts the agent with a fresh context window. It does not remember the last iteration from chat history. It rebuilds its understanding from the files. That design choice is the whole reason this scales: the spec is the memory, not the conversation.

The PRD lives in .agent/prd/PRD.md and SUMMARY.md

Ralph keeps the product specification in two files. PRD.md is the full document: app overview, target audience, success metrics, core features, technical stack, prerequisites, and security considerations. SUMMARY.md is a short executive overview that gets sent to the agent each iteration so it reorients fast without rereading the entire PRD.

The split is deliberate. The long PRD is for depth. The summary is for the agent’s working context on every pass, where tokens cost money and attention is finite. Writing a PRD an agent can actually build from is a skill of its own, covered in how to write a PRD for an AI agent.

The task lookup table: .agent/tasks.json

The root tasks.json is an index, not the detail. It is a flat list where each entry points to a spec file. Keeping it lean matters because the agent scans this list every iteration to find the next thing to do.

[
  {
    "id": "TASK-1",
    "title": "Verify project prerequisites and access",
    "category": "setup",
    "specFilePath": ".agent/tasks/TASK-1.json",
    "passes": false
  },
  {
    "id": "TASK-2",
    "title": "User table with authentication fields",
    "category": "data-model",
    "specFilePath": ".agent/tasks/TASK-2.json",
    "passes": false
  },
  {
    "id": "TASK-3",
    "title": "POST /api/auth/register creates new user account",
    "category": "api-endpoint",
    "specFilePath": ".agent/tasks/TASK-3.json",
    "passes": false
  }
]

Each task carries a passes flag. It starts false and only flips to true after the agent verifies the work. The loop reads this flag to decide what is left. A lookup table like this is what lets a run grind through dozens or hundreds of tasks without losing track, and the pattern scales further than you might expect, which is the subject of task lookup tables for agents.
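The way a loop might read that flag can be sketched in a few lines. This is an illustration under assumptions, not Ralph's actual code: the helper names are hypothetical, and "next" is taken to mean the first incomplete entry in file order, since the index shown above carries no explicit priority field.

```typescript
// Shape of one entry in the lookup table, mirroring the fields shown above.
interface TaskEntry {
  id: string;
  title: string;
  category: string;
  specFilePath: string;
  passes: boolean;
}

// Everything still left to do: entries whose flag is still false.
function remainingTasks(table: TaskEntry[]): TaskEntry[] {
  return table.filter((task) => !task.passes);
}

// Assumption: "next" means the first incomplete entry in file order.
function nextTask(table: TaskEntry[]): TaskEntry | undefined {
  return remainingTasks(table)[0];
}

// The run is finished only when no entry is left with passes: false.
function isRunComplete(table: TaskEntry[]): boolean {
  return remainingTasks(table).length === 0;
}

const lookupTable: TaskEntry[] = [
  { id: "TASK-1", title: "Verify project prerequisites and access", category: "setup", specFilePath: ".agent/tasks/TASK-1.json", passes: true },
  { id: "TASK-2", title: "User table with authentication fields", category: "data-model", specFilePath: ".agent/tasks/TASK-2.json", passes: false },
];

console.log(nextTask(lookupTable)?.id, isRunComplete(lookupTable));
```

Because the scan only touches the small index entries, it stays cheap even when the table grows to hundreds of rows.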

TASK-1 is always reserved for prerequisite verification. Before any feature work happens, the agent confirms that environment variables exist as placeholders, database access works, required tools are authenticated, and any open gaps have an explicit proceed or block decision. You do not want an agent discovering halfway through a 50-task run that it never had database credentials.

Per-task specs: .agent/tasks/TASK-{ID}.json

This is where the real detail lives. Each TASK-{ID}.json file is a complete contract for one unit of work: a description, the acceptance criteria, the ordered steps, dependencies on other tasks, an estimated complexity, and technical notes.

{
  "id": "TASK-3",
  "title": "POST /api/auth/register creates new user account",
  "category": "api-endpoint",
  "description": "Implement the registration endpoint that validates input, hashes the password, stores the user, and returns a success response.",
  "acceptanceCriteria": [
    "Endpoint accepts POST with email and password",
    "Invalid email format returns 400 with a clear error message",
    "Password shorter than 8 characters returns 400",
    "Successful registration returns 201 with the user id and email",
    "Password is stored as a bcrypt hash, never plaintext"
  ],
  "steps": [
    {
      "step": 1,
      "description": "Create the register route handler",
      "details": "Add POST /api/auth/register. Extract email and password from the body, validate with a zod schema, hash with bcrypt, insert into the users table.",
      "pass": false
    },
    {
      "step": 2,
      "description": "Write tests for the endpoint",
      "details": "Add Vitest cases for valid registration, invalid email, short password, and duplicate email. Confirm the stored hash starts with $2b$ and is not the plaintext value.",
      "pass": false
    }
  ],
  "dependencies": ["TASK-1", "TASK-2"],
  "estimatedComplexity": "medium",
  "technicalNotes": [
    "Never log passwords, even in error branches",
    "Return 409 on duplicate email rather than a generic 500"
  ]
}

Notice that the tests are steps inside a task, not tasks of their own. Verification is part of the work, not an afterthought you schedule for later. The acceptance criteria are specific and checkable. “Password is stored as a bcrypt hash” can be confirmed by querying the row and reading the prefix. That is the difference between a spec and a wish.
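As a rough illustration of "reading the prefix", the check below tests whether a stored value has the shape of a bcrypt hash and differs from the plaintext. The helper name is hypothetical, and the sample value is a synthetic placeholder with the right shape, not a real hash of anything.

```typescript
// A bcrypt hash string is 60 characters: "$2b$", a two-digit cost, "$",
// then 53 characters of salt and digest from the alphabet ./A-Za-z0-9.
const BCRYPT_2B_PATTERN = /^\$2b\$\d{2}\$[./A-Za-z0-9]{53}$/;

// Hypothetical helper: true when the stored value has the bcrypt shape
// and is not simply the plaintext password.
function isStoredAsBcrypt(storedValue: string, plaintext: string): boolean {
  return storedValue !== plaintext && BCRYPT_2B_PATTERN.test(storedValue);
}

// Synthetic placeholder with the right shape, not a real hash.
const placeholderHash = "$2b$10$" + "a".repeat(53);

console.log(isStoredAsBcrypt(placeholderHash, "correct horse"));
console.log(isStoredAsBcrypt("correct horse", "correct horse"));
```

A shape check only shows that the value is not plaintext. Confirming that the hash matches the password would still need the bcrypt library's own compare function.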

This structure also keeps the agent honest about order. The dependencies array tells it that registration cannot start before the users table exists. The loop respects that ordering, so the agent never tries to build on a foundation that is not there yet. Decomposing a PRD into packets this clean is the heart of the work, and there is a full guide to breaking a PRD into atomic agent tasks.
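Dependency-aware selection can be sketched as follows. The sketch is an assumption-laden illustration rather than the loop's real logic: it merges the passes flag and the dependencies array into one hypothetical `TaskSpec` shape for brevity, and it picks the first ready task in list order.

```typescript
// Minimal view of a task: only the fields this sketch needs.
interface TaskSpec {
  id: string;
  dependencies: string[];
  passes: boolean;
}

// A task is ready when it is incomplete and every dependency already passes.
function isReady(task: TaskSpec, all: TaskSpec[]): boolean {
  if (task.passes) return false;
  return task.dependencies.every((depId) =>
    all.some((other) => other.id === depId && other.passes)
  );
}

// Assumption: among ready tasks, take the first in list order.
function nextReadyTask(all: TaskSpec[]): TaskSpec | undefined {
  return all.filter((task) => isReady(task, all))[0];
}

const specs: TaskSpec[] = [
  { id: "TASK-1", dependencies: [], passes: true },
  { id: "TASK-3", dependencies: ["TASK-1", "TASK-2"], passes: false },
  { id: "TASK-2", dependencies: ["TASK-1"], passes: false },
];

// TASK-3 is listed first but blocked on TASK-2, so TASK-2 is selected.
console.log(nextReadyTask(specs)?.id);
```

A dependency that names a task missing from the list is treated as unmet, so a typo in an ID blocks the task instead of letting it run early.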

Generating the spec: the prd-creator skill in plan mode

You do not have to write all of this by hand. Ralph ships a prd-creator skill designed to turn unstructured requirements into a PRD plus a task list. Run it in plan mode, where the agent is read-only and focused on asking questions rather than writing code.

The flow is a two-part conversation. First the skill interviews you to fill the gaps in your description, then it produces the PRD. The instinct most people have is to dump a paragraph and expect a finished plan. The skill instead pushes back, asks clarifying questions one at a time, researches the competitive landscape, and only then writes PRD.md. When the codebase can answer a question, it reads the codebase instead of asking you.

After the PRD is approved, the same skill generates the task list. It analyzes the PRD, writes TASK-1 as the prerequisite gate, then produces a comprehensive set of small tasks, each initialized with passes: false. For a typical project that is dozens to hundreds of entries, not five. Small tasks are the design goal: if a task is too complex to finish in one short sitting, the skill splits it.

A prompt to kick this off looks like plain language:

Use the prd-creator skill in plan mode. I want to build a link shortener
with accounts, custom slugs, and click analytics. Interview me, write the
PRD to .agent/prd/PRD.md, then generate the task list in .agent/tasks.json.

The skill writes three top-level files when it finishes: .agent/prd/PRD.md, .agent/prd/SUMMARY.md, and .agent/tasks.json, plus one .agent/tasks/TASK-{ID}.json spec per task. That is the complete spec the loop needs. From there you run the loop.

npx @pageai/ralph-loop
./ralph.sh -n 50

You can keep amending later. When you want to add a feature or fix a bug mid-project, you run the prd-creator skill again to update the PRD and append tasks. The spec grows with the project instead of going stale the moment you start coding.

A worked example: from one sentence to a finished feature

Walk a real feature through the pipeline so the abstraction has edges. Say you want a link shortener with accounts, custom slugs, and click analytics. That sentence is the requirement. It is not a spec, because nobody can verify it.

In the Specify phase, the prd-creator interview pins down the parts you left implicit. Are slugs globally unique or unique per account? What happens on a slug collision? Do analytics count unique visitors or raw hits? Is there a free tier limit on links? Each answer becomes a line in the PRD, and the out-of-scope section captures what you decided not to build, like custom domains or team accounts.

In the Plan phase, the agent commits to an approach that fits your existing stack. A links table keyed by slug, a redirect route that records a click row, an analytics query that aggregates by day. This is where you reuse what already exists instead of inventing parallel systems, which is exactly the kind of architectural call you do not want an agent improvising during implementation.
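The "aggregates by day" part of that plan might look like the sketch below. Everything here is illustrative: the `ClickRow` shape and field names are assumptions, the sample rows are made up for the example, and the function counts raw hits rather than unique visitors, which is one of the choices the PRD interview would have to settle.

```typescript
// Hypothetical row shape for recorded clicks; field names are illustrative.
interface ClickRow {
  slug: string;
  clickedAt: string; // ISO 8601 timestamp in UTC, e.g. "2026-02-02T09:15:00Z"
}

// Aggregate raw clicks for one slug into counts per UTC calendar day.
function clicksPerDay(rows: ClickRow[], slug: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    if (row.slug !== slug) continue;
    const day = row.clickedAt.slice(0, 10); // the "YYYY-MM-DD" part
    counts[day] = (counts[day] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample rows, for illustration only.
const sampleClicks: ClickRow[] = [
  { slug: "launch", clickedAt: "2026-02-02T09:15:00Z" },
  { slug: "launch", clickedAt: "2026-02-02T17:40:00Z" },
  { slug: "launch", clickedAt: "2026-02-03T08:05:00Z" },
  { slug: "pricing", clickedAt: "2026-02-02T10:00:00Z" },
];

console.log(clicksPerDay(sampleClicks, "launch"));
// prints { '2026-02-02': 2, '2026-02-03': 1 }
```

In a real stack this would more likely be a grouped database query. The point is that the plan fixes the shape of the data before any task is written against it.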

In the Tasks phase, that plan becomes a list. The prerequisite check is TASK-1. The links table is a data-model task. The slug generator is a functional task with criteria like “a generated slug is 7 characters of base62” and “a collision retries up to 3 times before returning an error.” The redirect endpoint is an api-endpoint task that depends on the table. The analytics view is a ui-ux task that depends on the click data existing. Every task names its dependencies, so the loop never builds the dashboard before the data it reads exists.
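The two slug criteria are specific enough to sketch directly. This is one possible reading, not the implementation an agent would necessarily produce: the function names are hypothetical, and "retries up to 3 times" is assumed to mean one initial attempt plus at most three retries.

```typescript
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const SLUG_LENGTH = 7;
const MAX_RETRIES = 3;

// Build a 7-character base62 slug. `random` must return values in [0, 1);
// it is injectable so tests can be deterministic. Math.random is not
// cryptographically secure.
function generateSlug(random: () => number = Math.random): string {
  let slug = "";
  for (let i = 0; i < SLUG_LENGTH; i++) {
    slug += BASE62.charAt(Math.floor(random() * BASE62.length));
  }
  return slug;
}

// Assumption: one initial attempt plus at most three retries, so four
// attempts in total before giving up with an error.
function createUniqueSlug(
  isTaken: (slug: string) => boolean,
  random: () => number = Math.random
): string {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const candidate = generateSlug(random);
    if (!isTaken(candidate)) return candidate;
  }
  throw new Error(`Slug collision persisted after ${MAX_RETRIES} retries`);
}

const existing = ["0000000"];
const slug = createUniqueSlug((candidate) => existing.indexOf(candidate) !== -1);
console.log(slug.length);
```

Writing the criterion forces the ambiguity into the open. Whether three retries means three or four total attempts is the kind of question the spec should answer before the agent does.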

In the Implement phase, the loop runs. The first iteration verifies prerequisites and stops. The second builds the table, runs the migration, confirms the schema, and stops. The third builds the slug generator with its tests, confirms the base62 length and the collision retry, and stops. Each iteration is a small, verified, committed step. By the time the loop emits its completion promise, you have a feature with a clean commit per task and a test for every criterion. You did not write a line of it, and you can still read exactly why each piece exists.

The contrast with a single vibe prompt is stark. “Build me a link shortener” produces something that looks right in the happy path and falls over on the collision case nobody specified, because nobody specified it. The spec made the edge cases explicit before any code existed, which is the only time fixing them is cheap.

One task per iteration, with verifiable acceptance criteria

The single rule that makes this reliable: one task per invocation. The agent completes exactly one task, commits, and stops. It never batches. Each iteration follows the lifecycle below.

flowchart TD
  Start["Fresh context"] --> Read["Read SUMMARY.md and tasks.json"]
  Read --> Steer{"STEERING.md has critical work?"}
  Steer -->|"yes"| Critical["Handle steering first"]
  Steer -->|"no"| Select["Select highest-priority incomplete task"]
  Critical --> Select
  Select --> Spec["Open TASK-{ID}.json"]
  Spec --> Build["Work the steps"]
  Build --> Gate["Tests, lint, types, screenshot"]
  Gate -->|"fail"| Build
  Gate -->|"pass"| Update["Set passes true, commit"]
  Update --> Stop["Stop. Next iteration starts clean"]

Why not let the agent power through ten tasks in one context? Because context rot is real. The longer a single session runs, the more the agent loses the plot: it forgets earlier decisions, contradicts itself, and starts editing files it already finished. A fresh context per task keeps each unit of work crisp. The reasoning and the data behind this rule are laid out in one task per iteration.

Verification is the other half. After working the steps, the agent runs the verification stack the loop assumes: Playwright for end-to-end tests, Vitest for unit tests, TypeScript for types, ESLint for lint, Prettier for format. The repo mantra is direct: if you didn’t test it, it doesn’t work. A UI task is not done until a Playwright run and a screenshot confirm it. Only after the checks pass, screenshot included, does the agent flip passes to true and commit.
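The gate logic itself is simple enough to sketch. The code below is a hedged illustration, not part of Ralph: `Check`, `runGate`, and `markIfVerified` are hypothetical names, and the checks are stand-in functions where a real project would shell out to the tools listed above.

```typescript
// One named check in the gate; `run` returns true when the check passes.
// These are stand-ins for real invocations of the test and lint tools.
interface Check {
  label: string;
  run: () => boolean;
}

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Run every check and collect the labels of those that failed.
function runGate(checks: Check[]): GateResult {
  const failures = checks
    .filter((check) => !check.run())
    .map((check) => check.label);
  return { passed: failures.length === 0, failures };
}

// Flip the flag only when the gate is fully green; otherwise return the
// task unchanged so it stays incomplete.
function markIfVerified<T extends { passes: boolean }>(task: T, gate: GateResult): T {
  return gate.passed ? { ...task, passes: true } : task;
}

const checks: Check[] = [
  { label: "unit tests", run: () => true },
  { label: "types", run: () => true },
  { label: "lint", run: () => false }, // simulate a lint failure
];

const gate = runGate(checks);
const verifiedTask = markIfVerified({ id: "TASK-3", passes: false }, gate);
console.log(gate.failures, verifiedTask.passes);
```

With the simulated lint failure, the task keeps passes set to false, so the next iteration picks it up again.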

The loop does not stop on a feeling. It stops on an explicit signal. The agent emits a promise tag, and the script maps it to an exit code.

<promise>COMPLETE</promise>        all tasks finished      exit 0
<promise>BLOCKED:reason</promise>  needs human help        exit 2
<promise>DECIDE:question</promise> needs a decision        exit 3

If the loop hits its iteration cap first, it exits with code 1 (MAX_ITERATIONS). You control the cap. The default is 10 iterations, ./ralph.sh -n 50 runs up to 50, and ./ralph.sh --once runs exactly one iteration when you want to watch a single task closely before turning it loose.
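The mapping from promise tags to exit codes can be expressed compactly. The real loop is a Bash script, so the TypeScript below is only a sketch of the mapping described above: the function names are hypothetical, and `runIteration` stands in for one agent invocation.

```typescript
// Exit codes as described in the text.
const EXIT_COMPLETE = 0;
const EXIT_MAX_ITERATIONS = 1;
const EXIT_BLOCKED = 2;
const EXIT_DECIDE = 3;

// Find a promise tag in the agent's output and map it to an exit code.
// Returns null when no tag is present, meaning the loop should continue.
function exitCodeForOutput(output: string): number | null {
  const match = /<promise>(COMPLETE|BLOCKED|DECIDE)(?::[^<]*)?<\/promise>/.exec(output);
  if (match === null) return null;
  switch (match[1]) {
    case "COMPLETE":
      return EXIT_COMPLETE;
    case "BLOCKED":
      return EXIT_BLOCKED;
    case "DECIDE":
      return EXIT_DECIDE;
    default:
      return null;
  }
}

// Hypothetical driver: runs up to `maxIterations` and stops at the first
// promise tag. Hitting the cap without a tag yields the MAX_ITERATIONS code.
function runLoop(maxIterations: number, runIteration: (n: number) => string): number {
  for (let n = 1; n <= maxIterations; n++) {
    const code = exitCodeForOutput(runIteration(n));
    if (code !== null) return code;
  }
  return EXIT_MAX_ITERATIONS;
}

console.log(exitCodeForOutput("done <promise>COMPLETE</promise>"));
console.log(runLoop(3, () => "still working"));
```

Distinct exit codes let a wrapper script or CI job react differently to "finished", "needs a human", and "ran out of iterations" without parsing logs.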

If you need to redirect a running loop, edit .agent/STEERING.md. The agent checks that file each iteration and handles critical work there before resuming the task list. That is how you inject “stop and fix the failing migration” without killing the run and losing momentum.

Why this beats vibe coding for production work

Vibe coding optimizes for the first thirty seconds. You get code on screen fast. Spec-driven development optimizes for the next thirty hours, where the cost of a project actually lives.

Here is the concrete difference, point by point.

Drift. A vibe prompt gives the agent no fixed target, so it interprets, and its interpretation wanders across a long session. A spec pins the target. When the agent finishes a task, there is a checkable definition of done it cannot argue with.

Rework. Without acceptance criteria, “done” is whatever the model decided. You review, you find it built the wrong thing, you re-prompt. With criteria, the agent self-checks before it ever hands you the diff. Bad work fails the gate and never reaches your review queue.

Scope. A PRD has an explicit out-of-scope section. Vibe coding has no such boundary, so agents happily add features you never asked for and now have to maintain. The spec is a fence.

Resumability. A vibe session lives in a chat window. Close it and the context is gone. A spec lives in tasks.json and git history, so any fresh agent on any machine can pick up exactly where the last one stopped. This is what makes overnight and multi-day runs possible at all.

Auditability. With a spec, every commit maps to a task with stated criteria. Six months later you can read why a change exists. Vibe commits are archaeology.

None of this means vibe coding is useless. For a quick spike, a one-off script, or exploring an idea you will throw away, the overhead of a full spec is not worth it. The honest version of this comparison, including when to pick each one per task, is in spec-driven development versus vibe coding.

The deeper point is that an autonomous loop amplifies whatever you feed it. Feed it a vibe and it amplifies ambiguity across 50 iterations, which is how you wake up to a branch full of confident nonsense. Feed it a spec and it amplifies a clear plan, which is how you wake up to a feature with passing tests and a clean commit history. The loop is the same. The input is the variable you control.

Where to go next

If you are building this workflow, read down through the spec-driven cluster in order:

How to write a PRD an AI agent can actually build from: the goals, constraints, and acceptance criteria a coding agent can execute against.
Breaking a PRD into atomic agent tasks: decomposition into independently verifiable packets.
One task per iteration: the rule that keeps long runs reliable.
Task lookup tables for agents: scaling to hundreds of tasks.
Spec-driven development versus vibe coding: choosing the right approach per task.

For the broader mechanics of the loop itself, the fresh-context design, and where the technique came from, start with what is the Ralph technique.

Frequently asked questions

What is spec-driven development with AI?

It is a workflow where you hand an AI coding agent a written specification instead of a vague prompt. The spec has three parts: a PRD that states what to build and why, a task list that breaks the work into small units, and acceptance criteria that define how each unit is verified. The agent checks its own work against the criteria instead of guessing.

How is spec-driven development different from vibe coding?

Vibe coding means typing a sentence and reviewing whatever the model produces. It is fast for throwaway scripts and risky for production because there is no fixed definition of done. Spec-driven development gives the agent verifiable acceptance criteria, an out-of-scope boundary, and a task list, which reduces drift, rework, and scope creep on real features.

What is the Specify, Plan, Tasks, Implement loop?

It is the four-phase framing named by GitHub Spec Kit. Specify captures intent, Plan turns intent into a technical approach, Tasks decomposes the plan into atomic units with acceptance criteria, and Implement builds and verifies them. It is a loop rather than a waterfall because gaps found during implementation feed back into the spec.

How does Ralph store the spec on disk?

The PRD lives in .agent/prd/PRD.md with a short overview in .agent/prd/SUMMARY.md. The task lookup table is .agent/tasks.json, and each task has a detailed spec file at .agent/tasks/TASK-{ID}.json with description, acceptance criteria, ordered steps, and dependencies. The agent reads these files fresh on every iteration, so the filesystem is the memory.

How do I turn requirements into a PRD and task list?

Use the prd-creator skill in plan mode. It interviews you to fill gaps, writes PRD.md and SUMMARY.md, then generates a comprehensive tasks.json starting with a prerequisite verification task. Every task starts with passes set to false. You can run the skill again later to amend the PRD and append new tasks as the project grows.

Run your own Ralph loop

Ralph is a hackable script you point at your project. Install it and let an agent work through your task list.

npx @pageai/ralph-loop
