Post-Failure Autopsies

When a run fails, Smithers starts investigating on its own. The moment bunx smithers-orchestrator up or bunx smithers-orchestrator workflow run reports a failed run, the CLI launches the post-failure system workflow in the background against that run. The workflow gathers the run’s state, events, and workflow source, has an agent investigate the root cause, and produces a verdict: what broke, how confident the agent is, and the exact command(s) to run next. The trigger prints one line so you can follow along:

[smithers] Run failed. Post-failure autopsy launched: post-failure-abc123. Watch it with `bunx smithers-orchestrator inspect post-failure-abc123` (opt out with --no-post-failure or SMITHERS_POST_FAILURE=0).

Read the verdict once the autopsy finishes:

bunx smithers-orchestrator output post-failure-abc123 output
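If you script around the trigger, you may want the autopsy run id without copying it by hand. The sketch below is a hypothetical helper, not part of Smithers: it assumes the trigger line keeps the "Post-failure autopsy launched: <id>." wording shown above and that run ids contain only letters, digits, hyphens, and underscores.

```typescript
// Hypothetical helper (not a Smithers API): extract the autopsy run id
// from the one-line trigger message printed when a run fails.
// Assumption: ids consist of letters, digits, "-" and "_" and are
// followed by a period, as in the sample line above.
function parseAutopsyRunId(line: string): string | null {
  const match = /Post-failure autopsy launched: ([A-Za-z0-9_-]+)\./.exec(line);
  return match?.[1] ?? null;
}

const sample =
  "[smithers] Run failed. Post-failure autopsy launched: post-failure-abc123. " +
  "Watch it with `bunx smithers-orchestrator inspect post-failure-abc123` " +
  "(opt out with --no-post-failure or SMITHERS_POST_FAILURE=0).";

console.log(parseAutopsyRunId(sample)); // post-failure-abc123
```

The returned id is the value you would pass to bunx smithers-orchestrator inspect or bunx smithers-orchestrator output.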

What the autopsy does

The post-failure workflow runs four steps against the failed run:

Gather (deterministic): collects the inspect output, events, and workflow source for the failed run, plus the Smithers version.
Investigate (agent, read-only tools): digs through the evidence, re-runs read-only CLI commands, and reads the workflow. It never mutates anything: no retries, no rewinds, no edits.
Bug gate (only when the investigator suspects Smithers itself): an Approval pauses the run and asks you before anything is reported.
Verdict: a stable output row with the failure class, root cause, suggestion, and commands.

Failure classes

The investigator classifies every failure as one of:

Class	Meaning
`workflow-bug`	The workflow script or prompts are at fault (bad schema, wrong `deps`/`needs`, a compute task throwing).
`environment`	A missing CLI, auth, network, or disk problem on this machine.
`agent-flake`	A transient provider fault (rate limit, 5xx, timeout) that a re-run would likely clear.
`smithers-bug`	Smithers itself misbehaved: an engine/CLI/component defect. Chosen conservatively, only when the evidence points into Smithers code.
`unknown`	Evidence too thin to say.

Suggestions

The verdict carries one suggestion plus the exact commands implementing it:

retry: the failure is transient; re-run the failed task (bunx smithers-orchestrator retry-task) or the whole run.
resume: the run can continue from where it stopped.
rewind: the run’s state is bad but an earlier frame is good; rewind to that frame (bunx smithers-orchestrator rewind).
edit-workflow-and-reset: the workflow needs a fix first; the verdict names the exact edit. Never edit a script while its run is resumable (that causes RESUME_METADATA_MISMATCH); make the edit, then start a fresh run.
fix-environment: the machine needs a fix first; the verdict gives the exact install, auth, or config step.
escalate: a human must decide; the verdict says what to look at.

The autopsy only ever suggests. It never retries, rewinds, or edits the failed run itself.
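The classes and suggestions above can be modeled as closed sets when you consume a verdict in your own tooling. This is a minimal sketch under stated assumptions: the field names (failureClass, rootCause, suggestion, commands) are hypothetical, since this page lists the values but not the column names of the output row, and the confidence field is left out for the same reason.

```typescript
// The string values come from the tables above; the field names of the
// Verdict interface are ASSUMED for illustration and may not match the
// real output row.
const FAILURE_CLASSES = [
  "workflow-bug", "environment", "agent-flake", "smithers-bug", "unknown",
] as const;
const SUGGESTIONS = [
  "retry", "resume", "rewind", "edit-workflow-and-reset", "fix-environment", "escalate",
] as const;

type FailureClass = (typeof FAILURE_CLASSES)[number];
type Suggestion = (typeof SUGGESTIONS)[number];

interface Verdict {
  failureClass: FailureClass;
  rootCause: string;
  suggestion: Suggestion;
  commands: string[];
}

// Validate an untrusted value (for example, parsed JSON) against the sketch.
// Returns null instead of throwing so callers can fall back gracefully.
function parseVerdict(raw: unknown): Verdict | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { failureClass, rootCause, suggestion, commands } = raw as Record<string, unknown>;
  if (typeof failureClass !== "string") return null;
  if (!(FAILURE_CLASSES as readonly string[]).includes(failureClass)) return null;
  if (typeof suggestion !== "string") return null;
  if (!(SUGGESTIONS as readonly string[]).includes(suggestion)) return null;
  if (typeof rootCause !== "string") return null;
  if (!Array.isArray(commands) || !commands.every((c) => typeof c === "string")) return null;
  return {
    failureClass: failureClass as FailureClass,
    rootCause,
    suggestion: suggestion as Suggestion,
    commands: commands as string[],
  };
}

const example = parseVerdict({
  failureClass: "agent-flake",
  rootCause: "Provider returned a rate-limit error",
  suggestion: "retry",
  commands: ["bunx smithers-orchestrator retry-task"],
});
console.log(example?.suggestion); // retry
```

Rejecting unknown classes and suggestions keeps downstream automation from acting on a value this page does not define.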

Reporting Smithers bugs (approval-gated)

When the investigation concludes the failure is a bug in Smithers itself (not your workflow or environment), the autopsy pauses on an Approval gate explaining what it thinks the bug is. Nothing is sent anywhere without your explicit approval. If you approve, the workflow files the report with bunx smithers-orchestrator bug, which POSTs to https://bug.smithers.sh/api/bugs and records the returned bug id and URL in the verdict. If you deny, the verdict is kept and nothing is reported. You can also file a report by hand at any time:

bunx smithers-orchestrator bug --run <failed-run-id>

bunx smithers-orchestrator bug attaches the run’s workflow name, status, error, and recent events (secrets scrubbed) along with the Smithers version and platform. See the CLI catalog for its flags.
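The approval rule described in this section can be sketched as a small decision function. Everything here is illustrative: resolveBugGate, askApproval, and fileReport are hypothetical names, and the real workflow pauses on an Approval rather than calling a synchronous callback. The point is only the ordering: no report is filed unless the class is smithers-bug and you approve.

```typescript
// Illustrative model of the approval gate; not the Smithers implementation.
interface BugReceipt {
  id: string;
  url: string;
}

interface GateOutcome {
  reported: boolean;
  receipt?: BugReceipt; // recorded in the verdict only when a report was filed
}

function resolveBugGate(
  failureClass: string,
  askApproval: () => boolean,   // hypothetical stand-in for the Approval pause
  fileReport: () => BugReceipt, // hypothetical stand-in for filing the report
): GateOutcome {
  // Other failure classes never reach the gate.
  if (failureClass !== "smithers-bug") return { reported: false };
  // Denied: the verdict is kept and nothing is reported.
  if (!askApproval()) return { reported: false };
  // Approved: file the report and keep the returned id and URL.
  return { reported: true, receipt: fileReport() };
}

let filed = 0;
const fakeReport = (): BugReceipt => {
  filed += 1;
  return { id: "example-id", url: "https://example.invalid/bug" };
};

console.log(resolveBugGate("smithers-bug", () => false, fakeReport).reported); // false
console.log(filed); // 0
```

In the denied case fileReport is never invoked, which mirrors the guarantee that nothing is sent without explicit approval.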

Opting out

The trigger is on by default. Turn it off with either:

the --no-post-failure flag on bunx smithers-orchestrator up / bunx smithers-orchestrator workflow run, or
the SMITHERS_POST_FAILURE=0 environment variable.

The trigger also skips itself automatically:

when the failing workflow is post-failure itself or another ops workflow (triage-run, monitor, monitor-smithers, init), so autopsies never recurse; the launched autopsy runs with SMITHERS_POST_FAILURE=0 in its environment for the same reason;
when the post-failure workflow is not installed. Then it prints the manual command instead:

bunx smithers-orchestrator workflow run post-failure --input '{"targetRunId":"<failed-run-id>"}'

Run bunx smithers-orchestrator init to install the workflow pack, which includes post-failure as a hidden system workflow (it does not appear in bunx smithers-orchestrator workflow list without --system, but is always runnable by id).
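The opt-out and auto-skip rules in this section amount to a short decision procedure. The sketch below is a hypothetical model, not the CLI's code: decideTrigger and the TriggerContext fields are invented for illustration, and the order in which the checks run is an assumption.

```typescript
// Workflow ids named on this page as never triggering an autopsy.
const OPS_WORKFLOWS = new Set([
  "post-failure", "triage-run", "monitor", "monitor-smithers", "init",
]);

// Hypothetical inputs to the decision; field names are assumptions.
interface TriggerContext {
  workflowId: string;
  noPostFailureFlag: boolean; // --no-post-failure was passed
  env: Record<string, string | undefined>;
  postFailureInstalled: boolean;
}

type TriggerDecision = "launch" | "skip" | "print-manual-command";

function decideTrigger(ctx: TriggerContext): TriggerDecision {
  // Explicit opt-outs: the flag or SMITHERS_POST_FAILURE=0.
  if (ctx.noPostFailureFlag || ctx.env["SMITHERS_POST_FAILURE"] === "0") return "skip";
  // Ops workflows are skipped so autopsies never recurse.
  if (OPS_WORKFLOWS.has(ctx.workflowId)) return "skip";
  // Without the workflow installed, only the manual command is printed.
  if (!ctx.postFailureInstalled) return "print-manual-command";
  return "launch";
}

const base: TriggerContext = {
  workflowId: "deploy", // hypothetical user workflow id
  noPostFailureFlag: false,
  env: {},
  postFailureInstalled: true,
};

console.log(decideTrigger(base)); // launch
console.log(decideTrigger({ ...base, workflowId: "post-failure" })); // skip
```

Because the launched autopsy runs with SMITHERS_POST_FAILURE=0 in its environment, a failing autopsy is covered by two of these checks at once: the environment variable and the workflow id.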
