Resumability

Smithers persists all run state to SQLite. When a process crashes, times out, or is cancelled, you can resume the exact same run and Smithers will skip tasks that already wrote valid output rows. This guide explains how resumability works, what gets skipped, and how to use it from the CLI and programmatically.

How It Works

Every task output is written to SQLite keyed by (runId, nodeId, iteration). When you resume a run, Smithers re-renders the JSX tree with the persisted outputs already available in ctx. Tasks that already have valid output rows are marked finished and skipped. Pending tasks are scheduled as usual, and tasks that were in-progress are retried from the start once their stale attempt has been cleaned up. The resume flow:

Load existing state — Smithers reads _smithers_runs, _smithers_nodes, and _smithers_attempts for the given runId.
Stale attempt cleanup — Any in-progress attempts older than 15 minutes are automatically cancelled. This prevents zombie tasks from blocking forward progress. The associated nodes are reset to pending.
Re-render — The JSX tree is rendered with the current ctx, which includes all previously persisted outputs. Completed tasks are naturally skipped because their output exists.
Resume execution — The engine schedules and executes any remaining runnable tasks.
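The skip decision in the flow above can be sketched as a lookup against the persisted output keys. This is a minimal illustration, not Smithers's actual implementation: the OutputKey type and the planResume function are hypothetical names, and the real engine also consults node and attempt state.

```typescript
// Hypothetical shape: one persisted output row, identified by the
// (runId, nodeId, iteration) triple described above.
type OutputKey = { runId: string; nodeId: string; iteration: number };

// Encode the triple as one string so it can be stored in a Set.
// JSON.stringify avoids collisions that a plain delimiter could cause.
function outputKey({ runId, nodeId, iteration }: OutputKey): string {
  return JSON.stringify([runId, nodeId, iteration]);
}

// Split the tasks produced by the current render into those that already
// have an output row (skipped) and those that still need to run (runnable).
function planResume(
  runId: string,
  renderedNodeIds: string[],
  persisted: OutputKey[],
  iteration = 0,
): { skipped: string[]; runnable: string[] } {
  const done = new Set(persisted.map(outputKey));
  const skipped: string[] = [];
  const runnable: string[] = [];
  for (const nodeId of renderedNodeIds) {
    const key = outputKey({ runId, nodeId, iteration });
    (done.has(key) ? skipped : runnable).push(nodeId);
  }
  return { skipped, runnable };
}

// "analyze" wrote its output before the crash; the other two did not.
const plan = planResume(
  "my-run",
  ["analyze", "implement", "report"],
  [{ runId: "my-run", nodeId: "analyze", iteration: 0 }],
);
console.log(plan);
```

Because the lookup is keyed on nodeId, a task whose id changed between runs would land in runnable even though an output row for its old id still exists.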

Deterministic Node IDs

Resumability relies on stable, deterministic node identity. A task’s identity comes from its id prop:

<Task id="analyze" output="analysis" agent={analyst}>
  Analyze the codebase.
</Task>

The nodeId in the database is "analyze". If you rename the id prop between runs, Smithers treats it as a new task and the old output is orphaned. Rules for stable IDs:

Use fixed, descriptive strings for static tasks: id="analyze", id="report".
For dynamic tasks, derive the ID from a stable identifier, for example id={`${ticket.id}:implement`}, where ticket.id stands in for whatever stable key your data provides.
Never use array indices or timestamps as IDs — they change between renders.
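To see why index-based IDs break resume, the sketch below compares the two strategies when a new item is inserted at the front of a list between runs. The Ticket type and both helper functions are hypothetical and exist only for this illustration.

```typescript
// Hypothetical data item that dynamic tasks are generated from.
type Ticket = { id: string; title: string };

// Stable: the task ID depends only on the ticket's own identifier.
function stableTaskId(ticket: Ticket): string {
  return `${ticket.id}:implement`;
}

// Unstable: the task ID depends on the ticket's position in the list.
function indexTaskId(_ticket: Ticket, index: number): string {
  return `implement-${index}`;
}

const firstRender: Ticket[] = [
  { id: "T-1", title: "Fix login" },
  { id: "T-2", title: "Fix logout" },
];
// Before the resume, a new ticket is added at the front of the list.
const secondRender: Ticket[] = [{ id: "T-0", title: "Urgent" }, ...firstRender];

// Every stable ID from the first render still exists in the second.
const stableBefore = firstRender.map((t) => stableTaskId(t));
const stableAfter = secondRender.map((t) => stableTaskId(t));
const stableSurvives = stableBefore.every((id) => stableAfter.includes(id));

// Find which ticket a given index-based task ID refers to in a render.
const ownerOf = (tickets: Ticket[], taskId: string): string | undefined =>
  tickets.find((t, i) => indexTaskId(t, i) === taskId)?.id;

console.log(stableSurvives);
console.log(ownerOf(firstRender, "implement-0"), ownerOf(secondRender, "implement-0"));
```

With index-based IDs, the persisted output for "implement-0" was produced for T-1, but after the insertion the same ID refers to T-0, so a resume would attach the wrong output to the wrong task.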

Resume via CLI

Start a run, then resume it later:

# Start the run
bunx smithers run workflow.tsx --run-id my-run --input '{"description": "Fix auth bugs"}'

# Process crashes or is cancelled...

# Resume the same run
bunx smithers resume workflow.tsx --run-id my-run

On resume, the input row must already exist in the database. Smithers will throw an error if it is missing. You do not need to pass --input again.

Resume Programmatically

import { runWorkflow } from "smithers-orchestrator";
import workflow from "./workflow";

// Initial run
const result1 = await runWorkflow(workflow, {
  runId: "my-run",
  input: { description: "Fix auth bugs" },
});

// result1.status might be "failed" or "waiting-approval"

// Resume the same run later
const result2 = await runWorkflow(workflow, {
  runId: "my-run",
  resume: true,
});

// result2 picks up from where result1 left off

When resume: true is set, Smithers loads the existing run state instead of creating a new run.

What Gets Skipped on Resume

Node state before resume	Behavior on resume
`finished`	Skipped. Output row exists and is valid.
`skipped`	Remains skipped.
`failed` (retries exhausted)	Stays failed unless the workflow code changed to allow more retries.
`in-progress` (stale)	Attempt older than 15 minutes is cancelled; the node is reset to `pending` and retried.
`in-progress` (recent)	Left in-progress. If the process died, a resume after the 15-minute threshold has passed cleans up the attempt.
`pending`	Scheduled for execution.
`waiting-approval`	Stays waiting. Approve or deny to unblock.
`cancelled`	Stays cancelled.
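The table can be read as a single decision function. The sketch below restates it in TypeScript; the ResumeAction names and the two option flags are hypothetical labels chosen for this example, not values Smithers exposes.

```typescript
// Node states listed in the table above.
type NodeState =
  | "finished"
  | "skipped"
  | "failed"
  | "in-progress"
  | "pending"
  | "waiting-approval"
  | "cancelled";

// Hypothetical labels for what a resume does with a node in each state.
type ResumeAction =
  | "skip"
  | "stay-failed"
  | "retry"
  | "leave-in-progress"
  | "schedule"
  | "wait-for-approval"
  | "stay-cancelled";

function resumeAction(
  state: NodeState,
  opts: { attemptIsStale?: boolean; retriesRemain?: boolean } = {},
): ResumeAction {
  switch (state) {
    case "finished":
    case "skipped":
      return "skip";
    case "failed":
      // Retries only remain if the workflow code now allows more of them.
      return opts.retriesRemain ? "retry" : "stay-failed";
    case "in-progress":
      // A stale attempt is cancelled and the node goes back to pending.
      return opts.attemptIsStale ? "retry" : "leave-in-progress";
    case "pending":
      return "schedule";
    case "waiting-approval":
      return "wait-for-approval";
    case "cancelled":
      return "stay-cancelled";
  }
}

console.log(resumeAction("finished"));
console.log(resumeAction("in-progress", { attemptIsStale: true }));
```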

Stale Attempt Recovery

If a process crashes mid-execution, some tasks may be stuck in in-progress state with no process to complete them. Smithers handles this automatically:

On resume, any in-progress attempt with a started_at_ms older than 15 minutes is marked cancelled.
The associated node is reset to pending.
The task will be picked up on the next scheduling pass.

The 15-minute threshold keeps Smithers from cancelling attempts that are still legitimately running (for example, long agent calls).
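As a rough sketch of the cleanup rule described above, the function below cancels in-progress attempts older than the threshold and reports which nodes should be reset to pending. The Attempt type and the function name are assumptions made for illustration; startedAtMs mirrors the started_at_ms column.

```typescript
// Hypothetical in-memory view of a row in _smithers_attempts.
type Attempt = {
  nodeId: string;
  status: "in-progress" | "finished" | "failed" | "cancelled";
  startedAtMs: number;
};

const STALE_THRESHOLD_MS = 15 * 60 * 1000; // 15 minutes

// Returns a new attempt list with stale attempts cancelled, plus the IDs
// of the nodes that should go back to pending. The input is not mutated.
function cleanUpStaleAttempts(
  attempts: Attempt[],
  nowMs: number,
): { attempts: Attempt[]; resetToPending: string[] } {
  const resetToPending: string[] = [];
  const next = attempts.map((attempt): Attempt => {
    const stale =
      attempt.status === "in-progress" &&
      nowMs - attempt.startedAtMs > STALE_THRESHOLD_MS;
    if (!stale) return attempt;
    resetToPending.push(attempt.nodeId);
    return { ...attempt, status: "cancelled" };
  });
  return { attempts: next, resetToPending };
}

const now = 1_700_000_000_000;
const cleaned = cleanUpStaleAttempts(
  [
    { nodeId: "implement", status: "in-progress", startedAtMs: now - 20 * 60 * 1000 },
    { nodeId: "review", status: "in-progress", startedAtMs: now - 5 * 60 * 1000 },
    { nodeId: "analyze", status: "finished", startedAtMs: now - 60 * 60 * 1000 },
  ],
  now,
);
console.log(cleaned.resetToPending);
```

In this example only "implement" is reset: "review" started 5 minutes ago and may still be running, and "analyze" is old but already finished.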

Common Resume Scenarios

Crash during execution

# Start a run -- crashes midway through "implement"
bunx smithers run workflow.tsx --run-id run-1 --input '{"repo": "/my-project"}'

# "analyze" finished, "implement" was in-progress, "report" was pending
# Resume picks up from "implement"
bunx smithers resume workflow.tsx --run-id run-1

Waiting for approval

# Run pauses at an approval gate
bunx smithers run workflow.tsx --run-id run-2 --input '{"repo": "/my-project"}'
# Status: waiting-approval

# Approve the pending node
bunx smithers approve workflow.tsx --run-id run-2 --node-id deploy

# Resume to continue execution
bunx smithers resume workflow.tsx --run-id run-2

Fixing a bug and retrying

If a task failed because of a bug in your workflow code:

Fix the code in your workflow file.
Resume the run, or start a fresh one. On resume, a task whose retries are exhausted stays failed unless the updated workflow allows more retries, so a fresh run is often the simpler option.

# Original run failed at "analyze" because of a prompt bug
# Fix the prompt in workflow.tsx, then start a new run
bunx smithers run workflow.tsx --input '{"repo": "/my-project"}'

Note: Resuming a run with changed workflow code is supported but may produce unexpected results if task IDs changed or the execution graph is significantly different. For major code changes, prefer starting a fresh run.

Database Tables

Smithers uses these internal tables for resume state. You can query them for debugging:

# View run status
sqlite3 smithers.db "SELECT run_id, status, created_at_ms FROM _smithers_runs WHERE run_id = 'my-run';"

# View node states
sqlite3 smithers.db "SELECT node_id, status, iteration FROM _smithers_nodes WHERE run_id = 'my-run' ORDER BY updated_at_ms;"

# View attempts
sqlite3 smithers.db "SELECT node_id, attempt, status, started_at_ms FROM _smithers_attempts WHERE run_id = 'my-run' ORDER BY started_at_ms;"

Tips

Always use stable task IDs. Changing IDs between runs breaks resume because the engine cannot match old output rows to new task nodes.
Test resume in development. Run your workflow, cancel it partway through, and resume to verify it picks up correctly.
Check for stale runs. Use bunx smithers list workflow.tsx --status running to find runs that may need to be resumed or cancelled.
Input immutability. Once a run starts, the input is persisted. Passing different input on resume is an error.

Next Steps

Debugging — Inspect run state and diagnose resume issues.
Execution Model — Understand the render-schedule-execute loop that drives resume.
VCS Integration — Revert filesystem changes to a specific attempt.
