How It Works
Every task output is written to SQLite, keyed by (runId, nodeId, iteration). When you resume, Smithers re-renders the JSX tree with the persisted outputs already available in ctx. Tasks with valid output rows are marked finished and skipped. Tasks that were in-progress or pending are picked up from where they left off.
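The keying scheme can be pictured with a small in-memory model. This is only a sketch: the OutputStore class and shouldSkip helper are hypothetical names, Smithers itself persists rows to SQLite, and the "valid output" check is reduced here to "a row exists".

```typescript
// Hypothetical in-memory stand-in for the SQLite output table.
type OutputKey = { runId: string; nodeId: string; iteration: number };

class OutputStore {
  private rows = new Map<string, unknown>();

  // Encode the composite key as a JSON array so no delimiter can collide.
  private static encode(key: OutputKey): string {
    return JSON.stringify([key.runId, key.nodeId, key.iteration]);
  }

  put(key: OutputKey, output: unknown): void {
    this.rows.set(OutputStore.encode(key), output);
  }

  has(key: OutputKey): boolean {
    return this.rows.has(OutputStore.encode(key));
  }
}

// On resume, a task whose key already has an output row is skipped.
function shouldSkip(store: OutputStore, key: OutputKey): boolean {
  return store.has(key);
}

const store = new OutputStore();
store.put({ runId: "run-1", nodeId: "analyze", iteration: 0 }, { summary: "ok" });

// "analyze" finished before the interruption; "report" never produced output.
const skipAnalyze = shouldSkip(store, { runId: "run-1", nodeId: "analyze", iteration: 0 });
const skipReport = shouldSkip(store, { runId: "run-1", nodeId: "report", iteration: 0 });
```

Here skipAnalyze is true and skipReport is false, which mirrors how a resumed run only executes the tasks that never wrote a row.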
The resume flow, step by step:
- Load existing state: Smithers reads _smithers_runs, _smithers_nodes, and _smithers_attempts for the given runId.
- Metadata check: The stored workflow path, workflow file hash, and VCS metadata are compared against the current environment. If they changed, resume fails fast. This prevents you from accidentally running new code against old state.
- Stale attempt cleanup: Any in-progress attempts older than 15 minutes are automatically cancelled. This prevents zombie tasks from blocking forward progress. The associated nodes are reset to pending.
- Re-render: The JSX tree is rendered with the current ctx, which includes all previously persisted outputs. Completed tasks are naturally skipped because their output already exists.
- Resume execution: The engine schedules and executes any remaining runnable tasks.
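The metadata check in the second step can be sketched as a plain field-by-field comparison. The RunMetadata shape, its field names, and both helper functions are assumptions made for illustration; they are not Smithers' schema or API.

```typescript
// Hypothetical shape of the metadata stored with a run (field names are illustrative).
interface RunMetadata {
  workflowPath: string;
  workflowHash: string;
  vcsRevision: string | null;
}

// Return the names of the fields that differ between the stored and current metadata.
function findMetadataMismatches(stored: RunMetadata, current: RunMetadata): (keyof RunMetadata)[] {
  const fields: (keyof RunMetadata)[] = ["workflowPath", "workflowHash", "vcsRevision"];
  return fields.filter((field) => stored[field] !== current[field]);
}

// Fail fast before any task runs if anything changed since the run started.
function assertResumable(stored: RunMetadata, current: RunMetadata): void {
  const mismatches = findMetadataMismatches(stored, current);
  if (mismatches.length > 0) {
    throw new Error(`Cannot resume: ${mismatches.join(", ")} changed since the run started`);
  }
}

const storedMeta: RunMetadata = { workflowPath: "workflow.tsx", workflowHash: "abc123", vcsRevision: "rev-1" };
// Editing the workflow file changes its hash, so this resume attempt is rejected.
const editedMeta: RunMetadata = { ...storedMeta, workflowHash: "def456" };
const mismatches = findMetadataMismatches(storedMeta, editedMeta);
```

In this example mismatches contains only "workflowHash", and assertResumable(storedMeta, editedMeta) throws before anything is scheduled.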
Deterministic Node IDs
Resumability lives or dies by stable, deterministic node identity. A task’s identity comes from its id prop. For a task declared with id="analyze", the nodeId in the database is "analyze". If you rename the id prop between runs, Smithers treats it as a new task and the old output is orphaned: it sits in the database, unused, while the “new” task starts from scratch.
Rules for stable IDs:
- Use fixed, descriptive strings for static tasks: id="analyze", id="report".
- For dynamic tasks, derive the ID from a stable identifier: id={$:implement}.
- Never use array indices or timestamps as IDs. They change between renders.
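The difference between the last two rules shows up as soon as the list of dynamic tasks changes between renders. The sketch below uses a hypothetical Ticket type and two ID functions, one derived from a stable key and one from the array index, and reports which existing tickets end up with a different ID on the second render.

```typescript
// Hypothetical input items for dynamically generated tasks.
interface Ticket {
  key: string;
  title: string;
}

// Stable: derived from an identifier that belongs to the item itself.
const stableId = (ticket: Ticket, _index: number): string => `${ticket.key}:implement`;
// Unstable: derived from the item's position in the array.
const indexId = (_ticket: Ticket, index: number): string => `implement-${index}`;

const firstRender: Ticket[] = [
  { key: "T-1", title: "Add login" },
  { key: "T-2", title: "Add logout" },
];
// A new ticket is inserted at the front before the run is resumed.
const secondRender: Ticket[] = [{ key: "T-0", title: "Set up auth" }, ...firstRender];

// Keys of tickets present in both renders whose task ID changed.
function changedIds(makeId: (ticket: Ticket, index: number) => string): string[] {
  const before = new Map(firstRender.map((t, i) => [t.key, makeId(t, i)] as const));
  return secondRender
    .map((t, i) => ({ key: t.key, id: makeId(t, i) }))
    .filter(({ key, id }) => before.has(key) && before.get(key) !== id)
    .map(({ key }) => key);
}

const changedWithStableIds = changedIds(stableId);
const changedWithIndexIds = changedIds(indexId);
```

With key-derived IDs nothing changes, so the persisted outputs still match. With index-derived IDs both T-1 and T-2 shift by one position and would no longer line up with their own output rows.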
Resume via CLI
Start a run, then resume it later. You do not need to pass --input again: it was persisted on the first run.
Resume Programmatically
When resume: true is set, Smithers loads the existing run state instead of creating a new run.
What Gets Skipped on Resume
| Node state before resume | Behavior on resume |
|---|---|
| finished | Skipped. Output row exists and is valid. |
| skipped | Remains skipped. |
| failed (retries exhausted) | Stays failed unless the workflow code changed to allow more retries. |
| in-progress (stale) | Cancelled after 15 minutes, then retried as pending. |
| in-progress (recent) | Left in-progress. If the process died, the attempt will time out and be cleaned up on the next resume. |
| pending | Scheduled for execution. |
| waiting-approval | Stays waiting. Approve or deny to unblock. |
| cancelled | Stays cancelled. |
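The table reads naturally as a lookup from node state to resume behavior. The following sketch encodes it as an exhaustive switch; the NodeState and ResumeAction names are hypothetical labels chosen for this example, and the "more retries allowed" exception for failed nodes is deliberately left out.

```typescript
type NodeState =
  | "finished"
  | "skipped"
  | "failed"
  | "in-progress"
  | "pending"
  | "waiting-approval"
  | "cancelled";

type ResumeAction =
  | "skip"
  | "keep-failed"
  | "cancel-and-retry"
  | "leave-in-progress"
  | "schedule"
  | "wait-for-approval"
  | "keep-cancelled";

// Map each row of the table to an action. Staleness only matters for in-progress nodes.
function resumeAction(state: NodeState, attemptIsStale = false): ResumeAction {
  switch (state) {
    case "finished":
    case "skipped":
      return "skip";
    case "failed":
      return "keep-failed";
    case "in-progress":
      return attemptIsStale ? "cancel-and-retry" : "leave-in-progress";
    case "pending":
      return "schedule";
    case "waiting-approval":
      return "wait-for-approval";
    case "cancelled":
      return "keep-cancelled";
    default: {
      // Compile-time exhaustiveness check: a new state must be handled above.
      const unreachable: never = state;
      throw new Error(`Unknown node state: ${String(unreachable)}`);
    }
  }
}

const staleAction = resumeAction("in-progress", true);
const recentAction = resumeAction("in-progress", false);
```

The never-typed default branch means that adding a new state to NodeState without handling it becomes a type error rather than a silent fall-through.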
Stale Attempt Recovery
If a process crashes mid-execution, some tasks may be stuck in the in-progress state with no process to complete them. Smithers handles this automatically:
- On resume, any in-progress attempt with a started_at_ms older than 15 minutes is marked cancelled.
- The associated node is reset to pending.
- The task will be picked up on the next scheduling pass.
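The three steps above can be sketched as a single pass over the attempt rows. This is an illustration under assumptions: the Attempt and NodeRow shapes are hypothetical (only the started_at_ms column and the 15-minute threshold come from the text), and the real cleanup runs against SQLite rather than in-memory objects.

```typescript
const STALE_AFTER_MS = 15 * 60 * 1000; // 15 minutes, as described above

// Hypothetical row shapes; startedAtMs stands in for the started_at_ms column.
interface Attempt {
  nodeId: string;
  state: "in-progress" | "finished" | "failed" | "cancelled";
  startedAtMs: number;
}

interface NodeRow {
  nodeId: string;
  state: string;
}

// Cancel stale in-progress attempts and reset their nodes to pending.
// Returns the IDs of the nodes that were reset.
function cleanupStaleAttempts(attempts: Attempt[], nodes: Map<string, NodeRow>, nowMs: number): string[] {
  const resetNodeIds: string[] = [];
  for (const attempt of attempts) {
    const isStale = attempt.state === "in-progress" && nowMs - attempt.startedAtMs > STALE_AFTER_MS;
    if (!isStale) continue;
    attempt.state = "cancelled";
    const node = nodes.get(attempt.nodeId);
    if (node) {
      node.state = "pending";
      resetNodeIds.push(node.nodeId);
    }
  }
  return resetNodeIds;
}

const nowMs = 1_700_000_000_000;
const staleAttempt: Attempt = { nodeId: "analyze", state: "in-progress", startedAtMs: nowMs - 20 * 60 * 1000 };
const recentAttempt: Attempt = { nodeId: "report", state: "in-progress", startedAtMs: nowMs - 5 * 60 * 1000 };
const nodeRows = new Map<string, NodeRow>([
  ["analyze", { nodeId: "analyze", state: "in-progress" }],
  ["report", { nodeId: "report", state: "in-progress" }],
]);

const resetNodes = cleanupStaleAttempts([staleAttempt, recentAttempt], nodeRows, nowMs);
```

Only the 20-minute-old attempt is cancelled and its node reset; the 5-minute-old attempt is left alone until a later resume finds it past the threshold.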
Common Resume Scenarios
Crash during execution
Waiting for approval
Fixing a bug and retrying
If a task failed because of a bug in your workflow code, you have two options:
- Fix the code and start a fresh run.
- Fix the code and resume, but only if the workflow file hash has not changed. If the fix touched the workflow file, the hash has changed and resume will be rejected.
Smithers stores the workflow path, workflow file hash, and VCS metadata in _smithers_runs and requires them to match on resume. This is intentional: it keeps resume deterministic. Running changed code against old state is a recipe for subtle bugs.
Database Tables
Smithers uses internal tables for resume state, including _smithers_runs, _smithers_nodes, and _smithers_attempts. You can query them directly for debugging.
Tips
- Always use stable task IDs. This is worth repeating. Changing IDs between runs breaks resume because the engine cannot match old output rows to new task nodes.
- Test resume in development. Run your workflow, cancel it partway through, and resume to verify it picks up correctly. Do this before your first production run, not after.
- Check for stale runs. Use bunx smithers ps --status running to find runs that may need to be resumed or cancelled.
- Input immutability. Once a run starts, the input is persisted. Passing different input on resume is an error. This is by design: the input is part of the run’s identity.
Next Steps
- Debugging — Inspect run state and diagnose resume issues.
- Execution Model — Understand the render-schedule-execute loop that drives resume.
- VCS Integration — Revert filesystem changes to a specific attempt.