Agent Operating Playbook

This page is for the AI agent operating Smithers on a human’s behalf. It belongs to the Technical API docs set, which is written for agents, not humans. The human-facing docs are the Product API, starting at What Smithers Is; send humans there. Agents should read this page and the generated llms.txt / llms-full.txt bundles before driving Smithers for a user.

The human does not use Smithers by memorizing CLI commands or authoring .tsx workflows. The human talks to you. You decide when Smithers is the right tool, you run the commands, you watch the run, you ask for account-gated decisions, and you return a clear report with evidence.

If you remember one rule, remember this:

Do not ask the human to run Smithers commands. The human’s job is to state the outcome, answer product questions, approve gates, and provide credentials or account access when needed. Your job is to operate the harness.

And one more rule that is just as important:

You are an orchestrator, not an implementer. Do the background work through Smithers, not through your own ad-hoc subagents. For anything long-running, multi-step, retryable, or run-while-the-human-is-away, launch a Smithers workflow. Smithers spawns the worker agents and persists every step. Spend your time observing the run, clearing gates, and reporting. If you want parallel help, point your own subagents at monitoring the Smithers run (tailing events, summarizing, flagging gates), never at re-doing the work a workflow should own. The moment you’re tempted to spawn a subagent to “go build/fix/research this in the background,” that is the signal to run a workflow instead.

And one rule about the operator boundary:

The workspace Gateway is the run control plane. A controller, Bun cron job, monitor, bot, or custom client must use smithers-orchestrator/gateway-client or Gateway RPC/REST for run discovery, health, events, launch/resume/cancel, approvals, signals, scores, and node output. Never open SQLite, PGlite, or Postgres from an operator script; never query _smithers_*; never import openSmithersStore or CLI-internal findAndOpenDb; and never probe stores by adding --backend to ps or inspect. Direct store access is reserved for runtime implementation, migration, and maintainer diagnostics. One-shot public CLI commands are fine; durable automation goes through the Gateway.

And a rule about how hard to push before involving the human:

Drive workflows to completion; fix what you can yourself. When a run is launched to run to completion (especially “in the background” or while the human is away), it is your job to make it succeed end-to-end. Try your hardest. If the run fails or stalls on anything you are capable of fixing (a bad model id, a non-executable binary, a stale/polluted baseline, an over-strict or mis-wired gate, a verdict that is written to disk but not captured into the workflow’s done-check, a bug in a generated workflow), fix it and resume the run. Keep a keeper/supervisor loop alive so the run survives owner-exits, and re-run after each fix. Only stop and involve the human for (a) an explicit human approval gate in the workflow, or (b) something only a human can do (credentials, account access, irreversible outward actions). Do not hand a self-fixable issue back to the human and wait. A human who said “run it to completion” and walked away expects a finished result, not a report asking permission to apply an obvious fix. Diagnosing the problem is not the finish line; a working run is.

The operating loop

Use this loop for broad, ambiguous, risky, long-running, or multi-agent work:

Capture the word barf. Let the human describe the outcome in messy language.
Grill for missing context. Ask focused questions only when the answer cannot be discovered safely from the repo, docs, services, or prior artifacts.
Convert the request into a goal-based spec. Define done, non-goals, acceptance criteria, risks, and the evidence the human needs to see.
Design the Smithers run. Decide the workflow, agents, gates, retry loops, observability, assumption tests, and report artifacts before you start.
Validate the workflow shape. Before launching expensive or destructive work, render the graph with bunx smithers-orchestrator graph <workflow> or dry-run the evals. There is no up --dry-run; the graph verb is the dry-run path, and it renders the graph without executing or persisting anything.
Run with observability. Use hot reload while authoring, inspect the run while it executes, and suggest the UI when a visual state would help the human.
Report with evidence. Produce a concise Markdown or HTML report that links to outputs, tests, traces, screenshots, GIFs, and the run ID.

This is the “make a harness that makes the app” pattern: the first deliverable is not just code. It is a durable system that can plan, build, verify, observe, and explain the code.

Translate human prompts into Smithers work

Human prompt	What you should do
“Build this product idea start to finish. I have thoughts but not a spec.”	Run an interview or `grill-me` flow first. Produce a product spec, design spec, engineering spec, and acceptance criteria. Add a gate before implementation.
“Add rate limiting and don’t stop until it is production-ready.”	Run an implementation workflow with a test and review loop. Define production-ready as passing tests, review approval, docs updates, and an evidence report.
“Figure out whether Privy server wallets can deposit into a Morpho vault on Tempo.”	Treat it as an assumption-probe workflow. Write a tiny reproducible test against testnet or documented APIs before any product work depends on it. Report exact evidence and remaining unknowns.
“Make the UI look like the design and show me it actually works.”	Build the UI, run browser or simulator checks, capture screenshots or GIFs for each important screen, then ask an independent reviewer agent to compare against the design language.
“Keep working on flaky tests while I am away.”	Start a durable loop such as `ralph`, `debug`, or a local workflow with a clear cap or cancellation path. Monitor progress, summarize failures, and stop only when the finish line is reached or the cap is hit.
“Migrate this subsystem, but show me the plan first.”	Run research and planning first, then pause on an approval gate. After approval, execute milestones in worktrees and merge only validated chunks.
“Something went wrong in the run. What happened?”	Run `why`, inspect events and node output, summarize the blocker, propose options, and continue operating. Do not ask the human to debug from the terminal.

Make the most trivial edits, those that take only a few turns, directly. Route any clear single-goal task through smithers oneshot, which launches quickly in the background with a live chat and diff UI. One strong agent can finish a repo-wide goal of up to roughly 300k tokens in a single oneshot run, so a large goal alone never justifies a workflow. Use a full workflow when the task is genuinely multi-goal in shape: staged phases, approval gates, durable loops, parallel work, or a need for reuse. Clarify ambiguous goals before launching either form.
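The routing rule above can be written down as a small decision function. This is only a sketch: the TaskShape fields, the chooseRoute name, and the TRIVIAL_TURNS cut-off are hypothetical illustrations, not part of Smithers. Note that size is deliberately not an input, because a large goal alone never justifies a workflow.

```typescript
// Hypothetical task description; none of these names come from Smithers.
type Route = "clarify-first" | "direct" | "oneshot" | "workflow";

interface TaskShape {
  goalIsClear: boolean;   // false: ask the human before launching anything
  estimatedTurns: number; // rough guess at the agent turns needed
  stagedPhases: boolean;
  approvalGates: boolean;
  durableLoop: boolean;
  parallelWork: boolean;
  needsReuse: boolean;
}

// Assumed cut-off for "only a few turns"; the text gives no exact number.
const TRIVIAL_TURNS = 3;

function chooseRoute(task: TaskShape): Route {
  // Ambiguous goals are clarified before either form is launched.
  if (!task.goalIsClear) return "clarify-first";

  // Multi-goal shape is what justifies a full workflow.
  const multiGoal =
    task.stagedPhases ||
    task.approvalGates ||
    task.durableLoop ||
    task.parallelWork ||
    task.needsReuse;
  if (multiGoal) return "workflow";

  // Trivial edits are done directly; everything else is a oneshot.
  return task.estimatedTurns <= TRIVIAL_TURNS ? "direct" : "oneshot";
}
```

A large but single-goal refactor therefore routes to "oneshot", while a small task with an approval gate routes to "workflow".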

Context engineering

Context engineering is the work of turning a vague request into a runnable, auditable job. Start by writing down:

Outcome: what should exist when the run is done.
Finish line: how you will know the work is done.
Evidence: what the human needs to see to trust the result.
Constraints: files, platforms, budgets, style, deadlines, and non-goals.
Unknowns: assumptions that must be proven before you build on them.
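One way to keep these five items honest is to hold them in a typed record and refuse to launch while the essential ones are empty. The GoalSpec shape and the missingSpecFields helper below are hypothetical names for illustration, not a Smithers schema.

```typescript
// Hypothetical spec record mirroring the five items above.
interface GoalSpec {
  outcome: string;       // what should exist when the run is done
  finishLine: string[];  // how you will know the work is done
  evidence: string[];    // what the human needs to see to trust the result
  constraints: string[]; // files, platforms, budgets, style, deadlines, non-goals
  unknowns: string[];    // assumptions to prove before building on them
}

// Returns the names of required fields that are still empty.
// Assumption of this sketch: constraints and unknowns may be empty lists,
// since "none known" is itself a recorded answer.
function missingSpecFields(spec: GoalSpec): string[] {
  const missing: string[] = [];
  if (spec.outcome.trim() === "") missing.push("outcome");
  if (spec.finishLine.length === 0) missing.push("finishLine");
  if (spec.evidence.length === 0) missing.push("evidence");
  return missing;
}

const rateLimitSpec: GoalSpec = {
  outcome: "Account-level rate limiting for API routes",
  finishLine: ["Existing tests pass", "New tests prove per-account and per-IP limits"],
  evidence: ["Final report with changed files, behavior, and rollout risks"],
  constraints: [],
  unknowns: ["Behavior of the existing rate-limit library"],
};
```

Calling missingSpecFields(rateLimitSpec) returns an empty list, so this spec is ready to be stored and handed to later agents.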

Then gather context before executing:

Read repo docs, README files, package scripts, tests, issue trackers, design docs, and previous Smithers outputs.
Inspect relevant source files and architecture before making a plan.
Read third-party docs or APIs when behavior could have changed.
Prefer small probes over confident guesses for external services.
Store the resulting spec somewhere durable, such as .smithers/specs/, docs/, or an artifact directory, so later agents can consume it.

Good Smithers prompts are goal-based, not instruction soup:

Implement account-level rate limiting for API routes.

Finish line:
- Existing tests pass.
- New tests prove per-account and per-IP limits.
- The review approves the diff.
- The final report explains changed files, behavior, and rollout risks.

Verification:
- Run lint/typecheck/unit tests.
- Add an assumption test if the existing rate-limit library behavior is unclear.
- Capture failure output and feed it back into the next iteration.

Use explicit stop conditions. “Keep going until tests pass” should also carry a cap, a fallback, and a report path. Infinite effort is not a finish line.
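A stop condition with a cap, a fallback, and a report path might look like the following sketch. The runUntilGreen function and its types are hypothetical and synchronous for brevity; a real durable loop would live in a Smithers workflow, not in an ad-hoc script.

```typescript
interface StopConditions {
  maxIterations: number; // the cap: never loop forever
  reportPath: string;    // where the outcome is written either way
}

interface LoopResult {
  status: "passed" | "cap-reached"; // "cap-reached" is the fallback outcome
  iterations: number;
  failures: string[];
  reportPath: string;
}

// attempt() returns null on success, or a failure message that is fed
// back into the next iteration.
function runUntilGreen(
  attempt: (lastFailure: string | undefined) => string | null,
  stop: StopConditions,
): LoopResult {
  const failures: string[] = [];
  for (let i = 1; i <= stop.maxIterations; i++) {
    const failure = attempt(failures[failures.length - 1]);
    if (failure === null) {
      return { status: "passed", iterations: i, failures, reportPath: stop.reportPath };
    }
    failures.push(failure);
  }
  // Cap hit: stop and report instead of spending unbounded effort.
  return {
    status: "cap-reached",
    iterations: stop.maxIterations,
    failures,
    reportPath: stop.reportPath,
  };
}
```

Both outcomes carry the collected failures and the report path, so hitting the cap still produces evidence for the human.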

Backpressure verification

Backpressure means the workflow pushes evidence back against the agent’s claim that the task is done. Do not accept “looks good” as verification. Encode checks that can fail.

Use these Smithers patterns:

<CheckSuite> for parallel command or agent checks with one pass/fail verdict.
<ScanFixVerify> for scan -> fix -> verify -> report loops.
<ReviewLoop> or <LoopUntilScored> when the exit condition is reviewer approval or a score threshold.
Eval suites for repeatable workflow-level regressions with JSON reports.
Task scorers for telemetry such as schema adherence, faithfulness, relevance, latency, and custom LLM-judge checks.

A strong run defines backpressure before execution:

Before implementing:
- Identify which tests should fail before the fix.
- Add or update the smallest regression test that proves the behavior.
- Define an independent reviewer prompt that can reject the diff.
- Define a report schema: changed files, commands run, failures, fixes, evidence.

Backpressure should be independent where possible. The agent that wrote the code should not be the only judge. Use a second reviewer agent, command-based tests, eval cases, or real service probes.
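The idea of checks that can fail, judged by someone other than the author, can be sketched in plain TypeScript. This is not the <CheckSuite> component; Check, Verdict, and evaluate are hypothetical names that only illustrate how several checks collapse into one verdict and how independence can be made visible.

```typescript
interface Check {
  name: string;
  judge: string;      // who or what decides: "unit-tests", "reviewer-agent", ...
  run: () => boolean; // must be able to return false
}

interface Verdict {
  passed: boolean;
  failed: string[];     // names of the checks that rejected the work
  independent: boolean; // true if at least one judge is not the author
}

function evaluate(checks: Check[], author: string): Verdict {
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  const independent = checks.some((c) => c.judge !== author);
  // An empty suite proves nothing, so it does not count as a pass.
  return { passed: checks.length > 0 && failed.length === 0, failed, independent };
}
```

Whether independent === false should block the run is left to the caller in this sketch; the text above asks for independence where possible, not always.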

Assumption tests

Assumption tests are small probes that prove third-party libraries, APIs, cloud services, entitlements, or chains behave the way the plan assumes. Write them before the main build when the assumption is expensive to unwind.

Examples:

Assumption	Probe before building on it
“This SDK supports the chain we need.”	Write a tiny script that imports the SDK, constructs the target chain, reads a known contract, and records the result.
“The testnet faucet funds the account we will use.”	Generate a throwaway address, call the faucet or RPC method, poll balance, and save the transaction or response.
“A vault exists with real liquidity.”	Query the vault contract or API, check assets, total assets, curator identity, deposit limits, and share math.
“The mobile entitlement allows this alarm behavior.”	Build the smallest native sample or simulator test that schedules and observes the alarm path.
“The payment provider gives us idempotent retries.”	Run a local or sandbox integration test that retries the same idempotency key and proves no duplicate charge path.
“The media API can generate the assets we need.”	Call the sandbox API with one prompt, validate format, duration, latency, and failure handling, then store the output.

Keep assumption probes narrow. They should answer one question and produce evidence. If the probe fails, report that the product plan must change before implementation continues.
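A narrow probe answers one question and leaves evidence behind. The sketch below shows that shape with hypothetical Probe and runProbes names; the probe bodies are stand-ins, and a real probe would call the SDK, faucet, or sandbox API in question.

```typescript
interface Probe {
  assumption: string; // the single question this probe answers
  run: () => { holds: boolean; evidence: string };
}

interface ProbeResult {
  assumption: string;
  holds: boolean;
  evidence: string; // what was observed, kept for the final report
}

function runProbes(probes: Probe[]): { canProceed: boolean; results: ProbeResult[] } {
  const results = probes.map((p): ProbeResult => {
    try {
      const r = p.run();
      return { assumption: p.assumption, holds: r.holds, evidence: r.evidence };
    } catch (err) {
      // A probe that crashes has not proven the assumption.
      return { assumption: p.assumption, holds: false, evidence: `probe threw: ${String(err)}` };
    }
  });
  // Any failed probe means the plan must change before implementation continues.
  return { canProceed: results.every((r) => r.holds), results };
}
```

When canProceed is false, the results list is the evidence to report back before any product work builds on the assumption.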

Observability-first runs

If you cannot see the run, you cannot operate it well. For local and development work, use the CLI surfaces yourself:

bunx smithers-orchestrator ps
bunx smithers-orchestrator inspect RUN_ID --watch
bunx smithers-orchestrator events RUN_ID --watch
bunx smithers-orchestrator node NODE_ID --runId RUN_ID
bunx smithers-orchestrator scores RUN_ID
bunx smithers-orchestrator why RUN_ID

For any long-lived observer or controller, start from the workspace singleton and use its typed API:

bunx smithers-orchestrator gateway status --format json

The status response provides the verified url for the current workspace. If no singleton is running, start bunx smithers-orchestrator gateway under the controller’s service manager, then create a SmithersGatewayClient with that URL. Do not assume port 7331 and do not parse the Gateway runtime state file; gateway status performs workspace and process identity verification for you. Use getRun/listRuns for snapshots and streamRunEventsResilient for live health instead of polling storage files.

If Gateway startup reports SMITHERS_MIGRATION_REQUIRED, stop there and perform the explicit smithers migrate operation (after preserving the legacy store), then restart the Gateway. Do not delete the database, pin a different backend in a monitoring script, or pass --backend to read/control commands to make the error disappear. Backend selection is a Gateway boot/deployment concern; once the Gateway is healthy, every operator uses the same API regardless of whether the store behind it is SQLite, PGlite, or Postgres.

Use serve mode when you need HTTP status, SSE events, remote approvals, or Prometheus metrics:

bunx smithers-orchestrator up workflow.tsx --serve --metrics --port 7331
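For a long-lived controller, the "do not assume a port" rule for Gateway discovery can be enforced in code. The sketch below takes the text printed by gateway status --format json and extracts the URL. It assumes the JSON carries the verified URL in a top-level url field; the text only says the status response provides the url, so check the real output before relying on this shape. The gatewayUrlFromStatus name is hypothetical.

```typescript
// ASSUMPTION: the status JSON has a top-level string field named "url".
function gatewayUrlFromStatus(statusJson: string): string {
  const parsed: unknown = JSON.parse(statusJson);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("gateway status did not return a JSON object");
  }
  const url = (parsed as Record<string, unknown>)["url"];
  if (typeof url !== "string" || url === "") {
    // No fallback to a guessed port: start the Gateway and ask again.
    throw new Error("no verified Gateway url in status; start the Gateway instead of guessing");
  }
  return url;
}
```

The returned string is what a controller would hand to its Gateway client; failing loudly here is the point, since a guessed port could reach a different workspace's Gateway.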

Use the observability stack when the work needs traces, metrics, dashboards, or a reviewer evidence bundle:

bunx smithers-orchestrator observability

Enable OpenTelemetry export when you need trace-level proof, then include the Grafana, Loki, Tempo, or Prometheus links and query results in the final report. For debugging, correlate run ID, node ID, attempt, event stream, agent trace, and any application logs.

When the human would benefit from seeing the work, suggest the UI and operate it for them:

bunx smithers-orchestrator monitor [RUN_ID] opens the Smithers Monitor: a zero-setup live view over every run in the workspace (grouped runs, execution tree, event log, approvals inbox), optionally focused on one run. It observes only; it launches nothing.
bunx smithers-orchestrator gui <path> opens the workspace view.
bunx smithers-orchestrator ui RUN_ID opens a workflow custom UI when the Gateway is running and the workflow has a registered UI.
Gateway and custom UI streams expose run state, frames, approvals, node output, and DevTools snapshots for richer visual monitoring.

Phrase this as: “I can open the Smithers UI for this run so you can watch the plan, gates, and evidence live.” Do not phrase it as homework for the human.

Hot validation loop

Use hot mode while authoring or tuning a workflow:

bunx smithers-orchestrator graph workflow.tsx
bunx smithers-orchestrator up workflow.tsx --hot true --input '{"prompt":"..."}'

The graph command validates the rendered shape without executing the whole job. Hot mode lets workflow and prompt edits apply on the next render frame while finished tasks stay persisted.

Rules of thumb:

Use --hot true for prompt wording, task body, and non-schema workflow edits.
Restart fresh when output schemas or task ID shapes change.
Keep task IDs stable and data-derived so resume and hot reload can preserve completed work.
After a hot edit, inspect the graph or next frame to confirm the workflow now does what you intended.

Do not treat hot reload as magic. Validate that the new frame mounted the right tasks, the old completed tasks stayed completed, and any changed prompt actually reached the next agent attempt.
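The post-edit validation can be expressed as a comparison between two snapshots of task state. The Frame shape here (a map from task ID to status) is a hypothetical simplification, not the real Smithers frame format, and taskId only illustrates what a stable, data-derived ID looks like.

```typescript
type TaskStatus = "pending" | "running" | "finished" | "failed";
type Frame = Record<string, TaskStatus>; // task ID -> status (hypothetical shape)

// Stable, data-derived ID: the same input always yields the same ID,
// so completed work can be matched across reloads.
const taskId = (kind: string, key: string): string => `${kind}:${key}`;

function hotReloadProblems(before: Frame, after: Frame, expectedNew: string[]): string[] {
  const problems: string[] = [];
  // Old completed tasks must stay completed.
  for (const [id, status] of Object.entries(before)) {
    if (status === "finished" && after[id] !== "finished") {
      problems.push(`completed task lost: ${id}`);
    }
  }
  // The new frame must have mounted the tasks the edit was meant to add.
  for (const id of expectedNew) {
    if (!(id in after)) problems.push(`expected task not mounted: ${id}`);
  }
  return problems;
}
```

An empty result covers the first two checks only; confirming that a changed prompt reached the next agent attempt still requires inspecting that attempt's input.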

Reports for the human

End every substantial Smithers run with a human-readable report. Markdown is fine; HTML is better when screenshots, GIFs, traces, or tables make the result clearer.

Write it as an artifact, for example:

artifacts/smithers-report.md
artifacts/smithers-report.html
artifacts/screenshots/
artifacts/gifs/
artifacts/evals/
artifacts/traces/

The report should include:

Summary: what changed, what shipped, and what did not.
Run metadata: workflow name, run ID, branch or worktree, key node IDs.
Prompt and spec: the interpreted goal, acceptance criteria, and non-goals.
Verification: commands, tests, evals, scorers, reviewer verdicts, and failures.
Assumption tests: probes run, outputs captured, and open risks.
Observability: event excerpts, metrics/traces, logs, screenshots of dashboards.
Visual evidence: screenshots, GIFs per major screen, and walkthrough video for UI or product work.
Human decisions: approvals requested, decisions made, and remaining gates.
Next steps: exact options, tradeoffs, and what you recommend.

For UI work, the minimum visual report is screenshots for each important state. The stronger report includes GIFs for interactions and a walkthrough video that clicks through every user-visible flow. If you cannot capture visuals, say why and include the command or environment blocker you observed.
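A report is easier to keep complete when it is rendered from a structured record rather than written freehand. The sketch below renders a subset of the sections listed above to Markdown; RunReport and renderReport are hypothetical names, and the field set is trimmed for brevity.

```typescript
interface RunReport {
  summary: string;
  workflow: string;
  runId: string;
  verification: { command: string; passed: boolean }[];
  openRisks: string[];
  nextSteps: string[];
}

// Render a bullet list, stating emptiness explicitly instead of omitting it.
function bullets(items: string[], whenEmpty: string): string[] {
  return items.length > 0 ? items.map((x) => `- ${x}`) : [`- ${whenEmpty}`];
}

function renderReport(r: RunReport): string {
  return [
    "# Smithers run report",
    "",
    "## Summary",
    r.summary,
    "",
    "## Run metadata",
    `- Workflow: ${r.workflow}`,
    `- Run ID: ${r.runId}`,
    "",
    "## Verification",
    ...bullets(
      r.verification.map((v) => `[${v.passed ? "pass" : "FAIL"}] ${v.command}`),
      "No checks were run.",
    ),
    "",
    "## Open risks",
    ...bullets(r.openRisks, "None recorded."),
    "",
    "## Next steps",
    ...bullets(r.nextSteps, "None."),
  ].join("\n");
}
```

Empty sections are rendered explicitly, so a report with no verification says so instead of silently omitting the heading.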

Failure protocol

When a run fails or pauses unexpectedly, stay in the operator role:

Inspect the run with why, inspect, events, node, and logs.
Identify whether the blocker is code, tests, credentials, an approval gate, a third-party service, rate limits, missing context, or a workflow bug.
If it is fixable by you, fix it or resume from the correct frame.
If it needs the human, ask for the smallest decision or credential needed.
Report what happened, what evidence supports that diagnosis, and what you are doing next.
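The triage step in this protocol reduces to one question: can only a human resolve the blocker? A minimal sketch, using the blocker categories named above, with hypothetical triage and HUMAN_ONLY names. It assumes that only credentials and approval gates require the human; your own workflow may add irreversible outward actions to that set.

```typescript
type Blocker =
  | "code"
  | "tests"
  | "credentials"
  | "approval-gate"
  | "third-party-service"
  | "rate-limits"
  | "missing-context"
  | "workflow-bug";

type Action = "fix-and-resume" | "ask-human";

interface Triage {
  action: Action;
  note: string; // plain-English status for the human
}

// Assumption of this sketch: these are the only human-only blockers.
const HUMAN_ONLY: ReadonlySet<Blocker> = new Set<Blocker>(["credentials", "approval-gate"]);

function triage(blocker: Blocker, evidence: string): Triage {
  if (HUMAN_ONLY.has(blocker)) {
    const ask =
      blocker === "credentials"
        ? "the missing credential or account access"
        : "your decision on the approval gate";
    return { action: "ask-human", note: `Blocked on ${blocker}. Evidence: ${evidence}. I need ${ask}.` };
  }
  // Everything else is the operator's job: fix it, then resume the run.
  return {
    action: "fix-and-resume",
    note: `Diagnosed ${blocker}. Evidence: ${evidence}. Fixing and resuming.`,
  };
}
```

Either way the note carries the evidence, so the human gets a diagnosis and a next step rather than a request to debug.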

Bad response:

Run smithers inspect and tell me what it says.

Good response:

The run is paused at the deployment approval gate. I inspected the node output:
tests passed, the review approved, and the only remaining action is your
approval to deploy. I recommend approving because the diff is limited to the
rate-limit middleware and the rollback path is unchanged.

The human should feel like they are talking to a careful operator, not like they were handed a control plane manual.

Minimal checklist

Before launching:

Outcome, finish line, and evidence are written down.
Missing context has been researched or asked for.
Third-party assumptions have probes or are explicitly marked as risks.
Workflow graph (bunx smithers-orchestrator graph <workflow>, the dry-run path) or eval dry-run has been checked.
Backpressure checks exist and can fail.
Observability path is chosen.
Report artifact path is chosen.

While running:

Watch the run.
Use the UI when visual state, approvals, or steering would help.
Feed failures back into the workflow instead of manually papering over them.
Keep the human updated in plain English.

Before closing:

Regenerate or collect the final evidence.
Write the report.
Include screenshots, GIFs, videos, logs, traces, eval reports, and reviewer verdicts when they exist.
Explain remaining risk honestly.
Commit or open the review artifact only after verification is complete.
