0.25.0 - Smithers

Smithers 0.25.0 is the largest release since 0.23.0: roughly 450 commits since 0.24.2. The durable control plane becomes backend-pluggable, the Gateway grows a full sync layer, workflow output reads become typed, agents gain new tool primitives, the real UI enters preview, and a wide correctness and testing sweep hardens the whole spine. The persistence layer now runs on SQLite, PGlite, or PostgreSQL behind a SQL dialect seam, with a one-command bunx smithers-orchestrator migrate and a fail-loud SMITHERS_MIGRATION_REQUIRED gate so a legacy store is never silently abandoned. The Gateway gains read and write RPCs (listDocs, listAccounts, listPrompts, listScores, listMemoryFacts, getSchemaSignature, and ticket CRUD) plus a reactive TanStack DB sync layer with useGateway* hooks, opt-in SQLite-WASM/OPFS persistence, and a cloud Electric sync source. ctx.output, ctx.outputMaybe, and ctx.latest now infer their row types from the table you pass. The @smithers-orchestrator/agents package adds five vendor-neutral tool primitives (grounded web search, generic HTTP, transcription, image generation, and document/OCR parsing), and the CLI folds its tui command into up --interactive. The proof-of-concept UI apps were removed ahead of the real UI, now in preview at ui-preview.smithers.sh. Alongside: the <Sidecar> composite and a family of validation-first audit workflows, the feature-eval-factory and SWE-EVO upgrades, roughly 160 fixes across the CLI, gateway, scorers, OpenAPI, and control-plane, and a testing sweep that proves the exactly-once durability guarantee against the real engine across OS processes.

PostgreSQL, Migrations & the Durable Backend

The durable control plane is no longer tied to bun:sqlite. 0.25.0 adds a backend abstraction (SQLite, PGlite, or Postgres), a one-command migration path that copies legacy run history forward, and a fail-loud gate so an old store is never silently abandoned. New schema-signature and DB-backed docs primitives, plus a SIGKILL/resume durability fault case, round out the backend hardening.

A Smithers run crashing and resuming from its last durable checkpoint

Backend choice resolution with a fail-loud migration gate. resolveSmithersBackendChoice in packages/smithers/src/resolveSmithersBackendChoice.js resolves the storage backend in precedence order (explicit option, then SMITHERS_BACKEND, then backend in .smithers/smithers.config.ts, then the default pglite). When a legacy bun:sqlite store still holds runs and the resolved backend is pglite or postgres with no .smithers/migrated.json marker, it throws SMITHERS_MIGRATION_REQUIRED rather than silently switching stores. createSmithers in packages/smithers/src/create.js rejects an explicit pglite or postgres backend with INVALID_INPUT (pointing at the async openSmithersBackend factory) rather than degrading to bun:sqlite; createSmithersPostgres is the PGlite/Postgres factory.
New migrate command copies legacy run history forward. bunx smithers-orchestrator migrate (apps/cli/src/index.js) copies the legacy bun:sqlite smithers.db into PGlite or Postgres and writes the .smithers/migrated.json marker. It accepts --to pglite|postgres (default pglite), --url for a Postgres connection string, and --keepSqlite (default true), backed by migrateSmithersStore in packages/smithers/src/migrateSmithersStore.js which orders tables so _smithers_runs is copied first and emits per-table progress events. A SmithersError during migration exits with code 4 (other failures exit 1), and a --backend flag is honored on read commands so the SMITHERS_MIGRATION_REQUIRED remediation works.
Postgres SQL dialect translation layer. packages/db/src/dialect.js exports SQLITE and POSTGRES constants plus translatePlaceholders, columnType, translateDdl, quoteIdentifier, beginTransactionSql, and jsonExtractText, so the adapter in packages/db/src/adapter.js branches on internalStorage.dialect === POSTGRES for upserts, RETURNING clauses, and transaction control across both engines from one code path. translatePlaceholders rewrites ? to $1, $2, ... while skipping string literals, quoted identifiers, and comments.
Schema-signature versioning for client/server drift detection. getSmithersSchemaSignature in packages/db/src/getSmithersSchemaSignature.js returns the durable schema head (schemaVersion, the leading digits of the latest _smithers_schema_migrations id) plus a sha256 signature over the sorted internal table catalog and per-table components, so clients can gate persistence on the server schema head and detect a same-head table-shape drift.
DB-backed _smithers_docs table for Smithers markdown artifacts. Migration 0018_add_docs creates the _smithers_docs table (its index migrations parse the target table name so a ledger predating the table is handled), served through upsertDoc/upsertDocRow, getDoc, listDocs, and a softDeleteDoc tombstone path in packages/db/src/adapter.js, exposing Smithers docs (tickets/plans/specs/proposals) as durable rows rather than loose files.
New gateway RPCs, an Electric write endpoint, and run-event backpressure. getSchemaSignature and listDocs are added to the gateway RPC surface (packages/gateway/src/rpc/index.ts, typed in packages/gateway-client/src/GatewayRpcTypeMap.ts). packages/server/src/gateway.js imports getSmithersSchemaSignature, serves a POST /v1/electric/write endpoint (handleElectricWrite) that runs the launch through the normal gateway RPC path and returns a null txid, and disconnects a slow streamRunEvents subscriber with a BackpressureDisconnect once its per-subscriber outbound queue overflows (the shared backpressure bounds live in packages/server/src/GatewayExtensions.js). A local Electric stack ships under deploy/electric/ (docker-compose.yml, initdb/001_smithers_min.sql, smoke.ts, README.md), and the work is documented at /cli/overview, /deployment/production-hardening, /rpc/get-schema-signature, and /rpc/list-docs.
Real-engine SIGKILL and resume durability fault case. e2e/faults/case31-real-engine-kill-resume.test.ts spawns a real engine child process against an on-disk SQLite DB, SIGKILLs it mid-node (polling a B.started marker, no fixed-sleep race), then resumes in a fresh process and asserts the run reaches “finished”, each committed node has exactly one output row (no double-commit), and the pre-kill node A appears exactly once in the execution-counter file (exactly-once side effect).
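
The precedence order and the fail-loud gate described in this section can be sketched as follows. This is a minimal illustration, not the real resolveSmithersBackendChoice: the BackendInputs shape, the resolveBackendSketch name, and the choice to ignore an unrecognized SMITHERS_BACKEND value are all assumptions made for the example.

```typescript
// Hypothetical sketch: these types and functions are illustrative stand-ins,
// not the real resolveSmithersBackendChoice signature.
type Backend = "sqlite" | "pglite" | "postgres";

interface BackendInputs {
  explicit?: Backend; // option passed directly by the caller
  envBackend?: string; // value of SMITHERS_BACKEND, if set
  configBackend?: Backend; // `backend` from .smithers/smithers.config.ts
  legacyStoreHasRuns: boolean; // the legacy bun:sqlite store still holds runs
  hasMigratedMarker: boolean; // .smithers/migrated.json exists
}

class MigrationRequiredError extends Error {
  readonly code = "SMITHERS_MIGRATION_REQUIRED";
}

function isBackend(value: string | undefined): value is Backend {
  return value === "sqlite" || value === "pglite" || value === "postgres";
}

function resolveBackendSketch(inputs: BackendInputs): Backend {
  // Assumption: an unrecognized env value is ignored; the real resolver may reject it.
  const fromEnv = isBackend(inputs.envBackend) ? inputs.envBackend : undefined;

  // Precedence: explicit option, then env, then config, then the default pglite.
  const backend: Backend =
    inputs.explicit ?? fromEnv ?? inputs.configBackend ?? "pglite";

  // Fail loud instead of silently abandoning a legacy store that still has runs.
  const leavesLegacyStore = backend === "pglite" || backend === "postgres";
  if (leavesLegacyStore && inputs.legacyStoreHasRuns && !inputs.hasMigratedMarker) {
    throw new MigrationRequiredError(
      "legacy store holds runs; migrate first or select the sqlite backend",
    );
  }
  return backend;
}
```

Under these assumptions, selecting sqlite explicitly always succeeds, which is why honoring a backend override on read commands matters for the remediation path.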

Gateway: New RPCs & the Sync Layer

0.25.0 turns the Gateway into a control plane for live UIs. Workspace surfaces (runs, approvals, crons, memory, scores, tickets, prompts, accounts, docs) each get a typed RPC plus a reactive TanStack DB collection, the same collections can persist to the browser across reloads, and they can be fed either by the local RPC+WS transport or by a hardened cloud Electric proxy with no change to the consuming component.

A live run tree and node inspector served over the Smithers Gateway

New read RPCs expose every workspace surface. listMemoryFacts (scope memory:read), listScores (score:read), listPrompts (prompt:read), listAccounts (account:read), listDocs and getSchemaSignature (run:read) were added to GATEWAY_RPC_DEFINITIONS in packages/gateway/src/rpc/index.ts, bringing the contract to 29 methods (the check-docs gate asserts exactly 29). listAccounts reads the real ~/.smithers/accounts.json registry via @smithers-orchestrator/accounts/listAccounts and redacts secrets over the wire, carrying hasApiKey/hasConfigDir posture flags instead of the plaintext key.
Tickets became a full CRUD surface backed by a docs table. listTickets/createTicket/updateTicket/deleteTicket (scopes ticket:read/ticket:write, with a TicketNotFound 404 declared on update/delete) write to a new _smithers_docs table (migration 0018_add_docs) holding tickets, plans, specs, and proposals with a status column and a deleted_at_ms soft-delete tombstone. SmithersDb gained listDocs/getDoc/upsertDoc/softDeleteDoc in packages/db/src/adapter.js, and listDocs filters tombstones rather than returning them.
Reactive TanStack DB collections and useGateway* hooks ship for each surface. packages/gateway-client/src/sync/gatewayCollectionDefs.ts adds collection defs (crons, memoryFacts, scores, tickets, prompts) keyed by stable identifiers (scores by the composite runId:nodeId:iteration:scorerId, memoryFacts by ${namespace}:${key}, prompts by entryFile), and packages/gateway-react exposes useGatewayCrons, useGatewayMemoryFacts, useGatewayScores, useGatewayTickets, and useGatewayPrompts over the sync-client registry built by createGatewayCollections, so a surface renders live gateway data with no in-app seed. The crons collection rides the existing cronList RPC.
Collections persist across reloads via SQLite-WASM and OPFS. packages/gateway-react adds opt-in client persistence: createGatewayPersistence builds a PersistentCollectionStore over createSqliteWasmBackend, which uses an OPFS SAHPool VFS (durable without cross-origin isolation or SharedArrayBuffer). withPersistence hydrates a collection’s first sync commit synchronously from cache (no fetch flash) and writes through live changes, and a schemaVersion mismatch drops the whole cache. @sqlite.org/sqlite-wasm is injected by the consuming bundler and is only a devDependency of gateway-react, so the package carries no hard wasm runtime dependency.
A cloud Electric sync source feeds the same collections behind a switch. createElectricCollection<TRow,TKey> in packages/gateway-client/src/sync/createElectricCollection.ts produces the same TanStack DB CollectionConfig (matching collection-id fingerprint and getKey) from an ElectricSQL shape (GET /v1/shape?table=… snapshot plus live long-poll tail) instead of the RPC+WS transport. createGatewayCollections in packages/gateway-react gains a syncSource of "gateway" or "electric"; it engages Electric only when the source is electric AND an electric config is present, otherwise it falls back to the gateway path. ShapeStream is loaded by dynamic import("@electric-sql/client") so gateway-only bundles tree-shake it out.
The new electric-proxy package enforces grant-scoped shapes. @smithers-orchestrator/electric-proxy fronts a real electricsql/electric service (SMITHERS_ELECTRIC_URL) with auth, scope, and grant-based shape filtering, deriving each caller’s granted run ids from the gateway (SMITHERS_GATEWAY_URL). A run- or workspace-scoped shape with no concrete grant fails closed. Run it via the smithers-electric-proxy bin (port SMITHERS_ELECTRIC_PROXY_PORT, default 8443); the package exports createSmithersElectricProxy, serveSmithersElectricProxy, smithersElectricShapeCatalog, a metrics factory, and an observer.
Run-event streaming is bounded by backpressure on both ends. In packages/server/src/gateway.js the server caps each subscriber’s outbound queue at RUN_EVENT_STREAM_OUTBOUND_QUEUE_LIMIT (1000 frames) drained against the WS socket’s bufferedAmount, disconnecting a slow consumer with a BackpressureDisconnect error and incrementing the new gatewayRunEventBackpressureDisconnectTotal metric. On the client, packages/gateway-client/src/sync/createGatewayCollection.ts replaces its unbounded buffer with a bounded queue that sheds the oldest unapplied frame past maxBufferedFrames (default max(maxRows, 1024)).
Docs sync from disk into the DB via a file watcher. packages/engine adds syncDocsFromDisk, createDocWatcher, and startDocFileSync so tickets, plans, specs, and proposals under .smithers/ are mirrored one-directionally (last-write-wins on a content_hash mismatch) into _smithers_docs, exposed through the listDocs RPC. Separately, the bunx smithers-orchestrator gui shortcut and the bunx smithers-orchestrator <dir> directory shortcut now route to the Gateway UI instead of launching the native macOS app (#435).
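
The client-side half of the backpressure story above (a bounded queue that sheds the oldest unapplied frame past maxBufferedFrames) can be sketched like this. The BoundedFrameBuffer class and its method names are hypothetical and are not taken from createGatewayCollection.ts; only the shedding rule and the max(maxRows, 1024) default come from these notes.

```typescript
// Hypothetical sketch of a bounded frame buffer that sheds the oldest frame.
class BoundedFrameBuffer<T> {
  private frames: T[] = [];
  private shed = 0;
  private readonly limit: number;

  constructor(maxBufferedFrames: number) {
    this.limit = maxBufferedFrames;
  }

  // Append a frame, dropping the oldest unapplied frames once over the limit.
  push(frame: T): void {
    this.frames.push(frame);
    while (this.frames.length > this.limit) {
      this.frames.shift();
      this.shed += 1;
    }
  }

  // Hand all buffered frames to the consumer and reset the buffer.
  drain(): T[] {
    const pending = this.frames;
    this.frames = [];
    return pending;
  }

  // Number of frames dropped so far.
  get shedCount(): number {
    return this.shed;
  }
}

// Default documented above: max(maxRows, 1024).
function defaultMaxBufferedFrames(maxRows: number): number {
  return Math.max(maxRows, 1024);
}
```

The server takes the opposite choice for the same problem: rather than shedding frames, it disconnects the slow subscriber, so a client never silently receives a stream with gaps from the server side.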

The Real UI, Removed POC UI, and a Green Pipeline

The product UI moved to its own repository, so 0.25.0 deletes the local POC UI apps that were dead weight and never passed CI on a clean, browser-free box. This release also pays down a large dead-code backlog and hardens the build, test, and publish gates so a single CI run reports every failing package at once.

The upcoming Smithers UI: a home for your coding agents

The upcoming UI time-travel scrubber for rewind, replay, and fork

The upcoming UI diff review of an agent run

The proof-of-concept UI apps are gone. Removed apps/smithers (the Cerebras PWA POC), apps/smithers-studio-2 (studio shell POC), and apps/smithers-demo in one commit, then apps/smithers-tui-demo in a follow-up once its only remaining importer (smithers-demo) was deleted. The app-e2e CI job that ran their browser e2e (it installed playwright/chromium on the runner, which contradicts the repo’s clean-box, no-browser CI philosophy) was dropped, along with the dev, dev:studio, demo:smithers, demo:smithers:tui, and test:e2e:apps scripts and their package-configuration.mdx rows. All four apps were "private": true and were never published to npm. The real UI is previewable at https://ui-preview.smithers.sh.
A 1,764-line legacy engine body was deleted. runWorkflowBodyLegacy in packages/engine/src/engine.js was the pre-driver implementation, reachable only via the __smithersEngineMode === "legacy" option and the SMITHERS_LEGACY_ENGINE=1 env var, neither set in production nor documented anywhere. runWorkflowBody now passes straight through to runWorkflowBodyDriver, dropping engine.js from 7,775 to 6,011 lines at the time of the change, with the full 646-test engine suite green. The engine’s legacy-mode tests still pass that option, so it is now a silent no-op that resolves to the driver path.
A dead-code audit removed roughly 36 orphaned modules. Audit ticket 0048 (#301) deleted files with zero product or test consumers across many packages: the in-memory db storage module, duplicate db output/ and frame-codec/ barrels, the protocol ProtocolError and outputs.ts types, scorers and memory react-types.ts, the smithers src/ide/ subtree (16 files), vcs WorkspaceSnapshot.ts, the non-durable engine deferred-bridge.js, and the cli serveSemanticMcpServer wrapper function, each verified by grep plus per-package typecheck and test runs.
The CI test job no longer hides a backlog of failures. The root test script now runs pnpm -r --no-bail test so one CI run reports every failing package at once instead of stopping at the first (which had been masking failures behind packages/smithers when rg was missing). Ripgrep is now installed on the runner (apt-get install -y ripgrep) so packages/smithers’ grep tool tests do real searches, and pnpm lint (oxlint), typecheck:examples, and typecheck:evals are now enforced gates per audit 0047 (#300).
CI gained a workspace coverage report. A new pnpm coverage step (scripts/coverage.mjs) runs in the test job and uploads a coverage-workspace artifact, and selected examples/ bun tests (the porting-rules and context-handoff suites) now run in CI so the structural porting rules are gated.
Publish gains an LLM-backed feature-doc-sync gate. The release workflow (.smithers/workflows/release.tsx) runs a tool-enabled feature-doc-sync agent (agents.smartTool) that, per .smithers/prompts/feature-doc-sync-audit.mdx, diffs from the last release tag (git describe --tags --match 'v*' --abbrev=0) and verifies every new or changed user-facing feature is recorded in both .smithers/specs/features.ts and docs/, then a feature-doc-sync-gate task throws an actionable message listing the missing entries on drift. A skipFeatureDocSync input flag bypasses it intentionally, mirroring the existing changelog and marketing checks.
The release drift guard is now robust to non-deterministic declarations. scripts/publish.mjs previously failed the post-build drift check on any changed file, but rollup-plugin-dts emits non-deterministic *.d.ts for large packages so a committed copy can never byte-match a fresh build. The guard now excludes *.d.ts (the same pnpm -r build regenerates every declaration immediately before pnpm publish packs the tree) while still failing on stale deterministic artifacts like openapi.yaml, the llms bundles, and the seeded workflow pack.
The release changelog now diffs since the previous release tag. probeRelease in .smithers/lib/release-content/git.ts diffed <latest tag>..HEAD, which is empty once pnpm version has tagged the new version at HEAD, producing a nearly blank 0.25.0 changelog despite ~450 commits since v0.24.2. Its new previousReleaseTag helper treats pkg.version as the target when v<pkg.version> already exists and diffs from the highest vMAJOR.MINOR.PATCH tag strictly below it.
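
The tag selection behind the changelog fix can be sketched as below. This is an illustration under assumptions, not the previousReleaseTag helper itself: the function name, its (tags, pkgVersion) signature, and the decision to ignore any tag that is not exactly vMAJOR.MINOR.PATCH are hypothetical.

```typescript
// Hypothetical sketch: pick the highest vMAJOR.MINOR.PATCH tag strictly below
// the target version, comparing numerically rather than as strings.
type SemVer = [number, number, number];

function parseTag(tag: string): SemVer | null {
  const m = /^v(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

function compareSemVer(a: SemVer, b: SemVer): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

function previousReleaseTagSketch(tags: string[], pkgVersion: string): string | null {
  const target = parseTag(`v${pkgVersion}`);
  if (!target) return null;

  let best: { tag: string; version: SemVer } | null = null;
  for (const tag of tags) {
    const version = parseTag(tag);
    // Skip non-release tags and anything at or above the target version.
    if (!version || compareSemVer(version, target) >= 0) continue;
    if (!best || compareSemVer(version, best.version) > 0) {
      best = { tag, version };
    }
  }
  return best ? best.tag : null;
}
```

Comparing the parsed numbers matters here: a plain string sort would rank v0.24.2 above v0.24.10.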

Engine, Typed Outputs & Observability

This release sharpens the durable control plane from the type layer down to the metrics. Workflow output reads are now strongly typed against their table schema, the engine parks and revives runs around provider quota limits instead of failing them, and the observability surface stops reporting bogus model ids and unbounded event streams.

Typed output rows from the table argument. ctx.output, ctx.outputMaybe, and ctx.latest previously returned an untyped OutputRow, so reads like ctx.outputMaybe(outputs.research, ...).summary resolved to unknown and .smithers workflows failed tsc with about 85 errors. A new ResolveOutputRow<Schema, T> helper (packages/driver/src/ResolveOutputRow.ts) plus per-method overloads in packages/driver/src/SmithersCtx.js now infer the row from a string table name (keyed into the workflow Schema), a Zod schema object (z.infer), or a Drizzle table ($inferSelect), falling back to OutputRow for widened or unknown args so loosely-typed callers are unchanged.
Quota-aware pause and resume. Runs that hit a provider quota limit now park as waiting-quota (added to RunStatus in packages/driver/src/RunStatus.ts, DB_RUN_ALLOWED_STATUSES, and the suspending-status checks) instead of failing, persisting quotaBlockedCount and resetAtMs in the run errorJson and surfacing them through deriveRunState (packages/db/src/runState/deriveRunState.js) as blocked.kind = quota. classifyQuotaError in packages/agents/src/BaseCliAgent/BaseCliAgent.js was hardened to parse ordinal dates and retry-after-N-seconds windows across all eight quota patterns, and the new status is documented in /reference/types and /reference/errors.
.smithers docs synced to the DB by an optional file watcher. New syncDocsFromDisk, createDocWatcher, and startDocFileSync (packages/engine/src/*) mirror tickets, plans, specs, and proposals under .smithers into the _smithers_docs table, gated behind SMITHERS_DOCS_FILE_SYNC=1. Paths are validated to stay inside .smithers and dropped edits emit a docs.watcher.dropped warning under the docs:file-sync log span. The gateway also serves a matching listDocs read RPC and a getSchemaSignature RPC, both defined in packages/gateway/src/rpc/index.ts and dispatched by packages/server/src/gateway.js.
Re-render trigger reasons for frame observability. RenderContext now carries a trigger with a RenderTriggerReason (task-finished, timer-fired, cache-resolved, loop-advanced, deadlock-check, stability-check) threaded through makeWorkflowSession (packages/scheduler/src/RenderContext.ts, makeWorkflowSession.js), so every re-render frame records why it fired. The new requireRerenderOnOutputChange flag on RunOptions (packages/driver/src/RunOptions.ts) opts a run into per-output-change re-renders.
Honest per-model token attribution. TokenUsageReported computed the model as effectiveAgent.model ?? effectiveAgent.id ?? "unknown", so SDK agents (which leave model unset) and unnamed CLI agents fell through to a random UUID id, making per-model cost impossible and exploding the metric model tag cardinality. The engine now prefers the authoritative result.response.modelId (set for both SDK and CLI agents) and refreshes the agent span tag attemptMeta.agentModel from the same resolved id (packages/engine/src/engine.js).
Run-event stream backpressure disconnects. The gateway now bounds each subscriber’s outbound run-event queue at RUN_EVENT_STREAM_OUTBOUND_QUEUE_LIMIT = 1000 frames; a slow consumer that overflows is disconnected with a BackpressureDisconnect error and increments the new smithers.gateway.run_event_backpressure_disconnect_total counter (packages/server/src/gateway.js, apps/observability/src/metrics/gatewayRunEventBackpressureDisconnectTotal.js).
Correlation-context footgun fixed and dead code removed. withCorrelationContext (apps/observability/src/_coreCorrelation/withCorrelationContext.js) must run via Effect.runPromise/runFork, never runSync, because its AsyncLocalStorage.enterWith otherwise leaks ALS async-hooks onto the caller’s context; this is now documented and exercised that way. The obsolete legacy engine body (1,764 deleted lines in packages/engine/src/engine.js) and the dead in-memory MetricsService implementation (makeInMemoryMetricsService in apps/observability/src/_coreMetrics.js) were removed as part of audit 0048.
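
To make the quota-aware pause above concrete, here is a minimal sketch of turning one retry-after-N-seconds message into a reset timestamp. The regex, the QuotaBlock shape, and the function name are hypothetical; the real classifyQuotaError covers eight patterns whose exact form is not given in these notes.

```typescript
// Hypothetical sketch: one illustrative pattern, not the real classifier.
interface QuotaBlock {
  kind: "quota";
  resetAtMs: number; // when the run may be revived
}

// Matches e.g. "retry after 30 seconds" or "retry-after: 5s".
const RETRY_AFTER_SECONDS =
  /retry[\s-]*after[:\s]+(\d+)\s*(?:s|sec|secs|seconds?)\b/i;

function classifyRetryAfterSketch(message: string, nowMs: number): QuotaBlock | null {
  const match = RETRY_AFTER_SECONDS.exec(message);
  if (!match) return null; // not a quota error this sketch recognizes
  return { kind: "quota", resetAtMs: nowMs + Number(match[1]) * 1000 };
}
```

A run parked this way would carry the computed resetAtMs alongside its waiting-quota status, so a resume can be scheduled once the provider window has passed.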

Agents & New Tools

0.25.0 turns the agents package into a real toolbox. Five new AI SDK-compatible tool factories ship in packages/agents/src, all built on a provider-boundary pattern so a workflow never hard-codes a single vendor, and the agent runtime itself gets preflight checks, quota-aware suspension, and honest error reporting.

Smithers running any coding agent behind the same durable workflow engine

Grounded multi-provider web search. createGroundedWebSearchToolset({ providers, maxResultsPerProvider }) (exported from @smithers-orchestrator/agents, source at packages/agents/src/web-search/createGroundedWebSearchToolset.js) exposes a single grounded_web_search tool that fans a query across Exa semantic retrieval plus a fresh/SERP provider (Tavily, Brave, or Serper, built via createExaSearchProvider/createTavilySearchProvider/createBraveSearchProvider/createSerperSearchProvider). It throws unless given Exa as the semantic provider plus at least one fresh provider, runs them with Promise.allSettled so one failure does not sink the call, dedupes by normalized URL (hash stripped), and returns numbered citation results with a freshness filter of day/week/month/year (#319).
Generic HTTP escape-hatch tool. createHttpTool(options) (packages/agents/src/http/createHttpTool.js, re-exported from the smithers-orchestrator facade at packages/smithers/src/index.js) lets an agent call any REST endpoint with no OpenAPI spec. The Zod input accepts method, url, headers, query, body, an optional timeoutMs (wired to an AbortController), and a discriminated auth union keyed on type with bearer, basic, or header variants; non-string JSON bodies auto-set content-type: application/json, and it returns { ok, status, statusText, headers, body } (#318).
Audio transcription via Whisper or Deepgram. createTranscriptionTool({ provider, apiKey, model?, baseUrl? }) (packages/agents/src/transcription/createTranscriptionTool.js) accepts either an audioUrl or audioBase64 (with mimeType, language, and prompt hints) and normalizes both providers to { text, provider, language?, durationSeconds? }. Whisper defaults to model whisper-1 against https://api.openai.com/v1/audio/transcriptions; Deepgram defaults to nova-3 with smart_format against https://api.deepgram.com/v1/listen (#313). Documented at docs/integrations/sdk-agents.mdx.
Image generation and document/OCR parsing primitives. createImageGenerationTool(provider, options) (packages/agents/src/image-generation/createImageGenerationTool.js) wraps a pluggable ImageGenerationProvider.generateImage(request) behind a generate_image tool that takes prompt, model, size, count, seed, and style. createDocumentParsingToolset(options) (packages/agents/src/document-parsing/createDocumentParsingToolset.js) exposes a parse_document tool whose source accepts url/base64/text and turns documents into text/markdown via a Firecrawl default provider, with mistral-ocr and llamaparse selectable by string or a custom provider object (#317, #320).
Gemini CLI agent sunset for Antigravity, plus agent preflight. AntigravityAgent (packages/agents/src/AntigravityAgent.js) is added and exported from the package and the facade; GeminiAgent is marked @deprecated (the export still ships) and the gemini subscription provider is dropped from the accounts provider union. BaseCliAgent.preflight runs launchDiagnostics before the first generation; the engine invokes it once per agent per run (cached via a WeakMap in runAgentPreflightOnce) and a failed check throws a non-retryable AGENT_CONFIG_INVALID error (#443).
AmpAgent session resume. AmpAgent’s buildCommand (packages/agents/src/AmpAgent.js) now emits amp threads continue <id> at the front of its args when a task passes options.resumeSession or this.opts.resume, and skips the new-thread-only flags --visibility and --archive on resume; the new resume option lands on AmpAgentOptions (#302).
Quota-aware pause and resume. The engine, scheduler, and agents now classify provider quota/rate-limit errors and park a run in a new waiting-quota status (packages/driver/src/RunStatus.ts) instead of failing it. classifyQuotaError covers eight regex patterns (including retry-after-N-seconds and ordinal reset dates) and emits AGENT_QUOTA_EXCEEDED; the engine persists quotaBlockedCount and resetAtMs into the run’s errorJson, and deriveRunState surfaces a blocked.kind: "quota" reason so the run resumes once quota resets.
Real provider errors instead of a fake JSON-schema failure. streamResultToGenerateResult.js previously dropped the AI SDK error stream part, letting a masking NoOutputGeneratedError propagate so a bad model id looked like the agent returning invalid JSON for the declared output schema. It now captures the first stream error part (and consumeStream’s onError) and re-throws the genuine provider error, so a 404 bad-model request fails with the real provider APICallError.
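
The fan-out and dedupe behavior described for the grounded web search tool can be sketched as follows. This is not the createGroundedWebSearchToolset implementation: the SearchProvider type, the result shape, and the function names are assumptions, and the real normalization may do more than strip the hash.

```typescript
// Hypothetical sketch: fan a query across providers, tolerate failures,
// dedupe by normalized URL, and number the surviving results.
interface SearchResult {
  url: string;
  title: string;
}

type SearchProvider = (query: string) => Promise<SearchResult[]>;

// Assumption: normalization here only strips the #fragment.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = "";
  return url.toString();
}

function dedupeAndNumber(
  batches: SearchResult[][],
): Array<SearchResult & { citation: number }> {
  const seen = new Set<string>();
  const results: Array<SearchResult & { citation: number }> = [];
  for (const batch of batches) {
    for (const result of batch) {
      const key = normalizeUrl(result.url);
      if (seen.has(key)) continue;
      seen.add(key);
      results.push({ ...result, url: key, citation: results.length + 1 });
    }
  }
  return results;
}

async function groundedSearchSketch(query: string, providers: SearchProvider[]) {
  // allSettled: one rejected provider does not sink the whole call.
  const settled = await Promise.allSettled(providers.map((p) => p(query)));
  const batches: SearchResult[][] = [];
  for (const outcome of settled) {
    if (outcome.status === "fulfilled") batches.push(outcome.value);
  }
  return dedupeAndNumber(batches);
}
```

Keeping the dedupe step synchronous and separate from the network fan-out makes it easy to test without stubbing any provider.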

Workflows & Components

This release pushes on the orchestration surface: a new composite component for measuring whether a cheaper model is good enough, a set of long-running workflows that prove their own work instead of self-reporting, and a round of correctness fixes across the structural components so authoring mistakes fail loud at compile or graph time rather than silently dropping tasks.

New <Sidecar> composite for cheap-model shadow scoring. <Sidecar agent={...} sidecar={...} output={...} scorers={...}> renders a <Parallel> (id ${id}-parallel) wrapping two <Task> children over the same prompt: the primary keeps the component id so downstream needs can consume it, while the shadow task runs at ${id}-sidecar with continueOnFail: true and writes its own scorer rows. The companion computeSidecarDelta(rows, { primaryNodeId, sidecarNodeId, scorerId }) reads persisted scorer rows and derives primaryScore, sidecarScore, delta, and a cheaperWins boolean (true when the sidecar score is at least the primary score). Both are exported from smithers-orchestrator; see /components/sidecar.
audit-burndown, a multi-week validation-first workflow that proves its own work. .smithers/workflows/audit-burndown.tsx burns down .smithers/tickets/smithers/*.md one open - [ ] checkbox at a time in isolated <Worktree> nodes under a bounded <Parallel>. The authoritative gate is a deterministic completeness-oracle compute task that, after each batch merges to local main, runs the real pnpm typecheck then pnpm test on the real main checkout (agents cannot fake a spawnSync exit code) and writes a completeness row queryable via bunx smithers-orchestrator node oracle -r <run>. An outer <Loop> stops only when the open count hits zero, the full gate is green, and a push fence confirms no agent moved origin.
bulletproof-audit and audit-fix-train for production-readiness scoring. .smithers/workflows/bulletproof-audit.tsx runs one read-only Codex (GPT-5.5) auditor per feature group in .smithers/specs/features.ts, scoring each across 10 dimensions (e2e, unit, observability, architecture, JSDoc, docs, durability, type-safety, security, evals), then a deterministic writer emits .smithers/audits/bulletproof-audit.md. .smithers/workflows/audit-fix-train.tsx turns that backlog into landed code: a jj-native plan, implement, review loop per finding, serialized through a <MergeQueue maxConcurrency={1}> onto local main, with a final jj fetch, rebase-onto-origin, and push (the only workflow here that pushes, gated behind a push input).
coverage-codex-swarm and plan-implement-review-issues join the init pack. .smithers/workflows/coverage-codex-swarm.tsx fans out per-package coverage work in isolated worktrees and now accepts an optional packageInstructions record (z.record(z.string(), z.string())) so callers thread per-package guidance into each prompt without editing the workflow. .smithers/workflows/plan-implement-review-issues.tsx discovers every open GitHub issue, treats each unchecked - [ ] checkbox as a work item, drops items that already have an open PR, groups related items, and opens exactly one deduped PR per group with Closes #N / Relates to #N linkage, with an agent-array rate-limit failover chain on every role (plan, implement, validate, review, PR).
Interactive run mode replaces the standalone tui command. The tui command is removed; its clack picker plus live status card now runs behind bunx smithers-orchestrator up --interactive and bunx smithers-orchestrator workflow run --interactive. A bare up or workflow run with no workflow arg on a TTY also launches it, while passing a path or ID preselects and skips the picker. Non-TTY --interactive fails with INTERACTIVE_REQUIRES_TTY (exit 4); a missing arg without a TTY fails with WORKFLOW_REQUIRED; -i stays bound to --input.
Structural components now fail loud instead of silently dropping work. <Branch> resolves its subtree from then / else and never read children, so <Branch>...</Branch> used to drop those tasks silently; it now throws an INVALID_INPUT SmithersError and types children?: never for a compile-time error. SagaStep is now a real named value export (so import { SagaStep } works, alongside <Saga.Step>). The kanban workflow’s implement prompt and result task now commit work on the worktree branch before merge, fixing a bug where converged work was dropped as already up to date (only ~17% of tickets landed).
<SuperSmithers> applies edits for real, and composite docs embed verifiable source. <SuperSmithers> previously left its apply path as a no-op compute that returned a literal { applied: true } without writing anything; the non-dry-run path now directs the agent to make file edits on disk with its editing tools, and the compute step is demoted to an honest dependency barrier ahead of the report. Separately, scripts/generate-component-source.mjs (run via pnpm docs:components) injects a tabbed ## Source <CodeGroup> into each composite component doc, delimited by GENERATED:COMPONENT-SOURCE markers and stripped from the llms-*.txt bundles.
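
The score comparison behind <Sidecar> can be sketched as below. The ScorerRow shape, the choice to use the latest matching row per node, and the sign of delta (sidecar minus primary) are assumptions made for illustration; only the option names and the cheaperWins rule come from these notes, and the function is deliberately named differently from the real computeSidecarDelta.

```typescript
// Hypothetical sketch of a sidecar-vs-primary score comparison.
interface ScorerRow {
  nodeId: string;
  scorerId: string;
  score: number;
}

interface SidecarDeltaOptions {
  primaryNodeId: string;
  sidecarNodeId: string;
  scorerId: string;
}

// Assumption: when several rows match, the last one in the array wins.
function latestScore(rows: ScorerRow[], nodeId: string, scorerId: string): number | null {
  let found: number | null = null;
  for (const row of rows) {
    if (row.nodeId === nodeId && row.scorerId === scorerId) found = row.score;
  }
  return found;
}

function computeSidecarDeltaSketch(rows: ScorerRow[], opts: SidecarDeltaOptions) {
  const primaryScore = latestScore(rows, opts.primaryNodeId, opts.scorerId);
  const sidecarScore = latestScore(rows, opts.sidecarNodeId, opts.scorerId);
  if (primaryScore === null || sidecarScore === null) return null;
  return {
    primaryScore,
    sidecarScore,
    delta: sidecarScore - primaryScore, // assumption: positive means the sidecar scored higher
    cheaperWins: sidecarScore >= primaryScore, // true when the sidecar is at least as good
  };
}
```

For a component with id research, the rows of interest would sit at research and research-sidecar, matching the node ids described above.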

CLI: Interactive Runs, Migrate & Memory

Smithers 0.25.0 reshapes the run verbs around an interactive picker, hardens the SQLite-to-PGlite migration path, and rounds out cross-run memory. The CLI now reads a paused run as a decision point instead of a failure, and tears runs down cleanly on signal.

--interactive replaces the standalone tui command. bunx smithers-orchestrator up --interactive and bunx smithers-orchestrator workflow run --interactive launch the clack flow (fuzzy workflow picker, input prompts, live status card with inline approval/human gates, gate-then-resume), and a bare up/workflow run with no workflow arg on a TTY launches it too. Passing a workflow path (up) or ID (workflow run) preselects it and skips the picker. Non-TTY --interactive fails with INTERACTIVE_REQUIRES_TTY (exit 4); -i stays bound to --input. The runner lives in apps/cli/src/tui.js (runTuiCommand).
Approval and human gates resolve inline inside the interactive card. apps/cli/src/tui-gates.js drives handleApprovals and handleHumanRequests via clack select/confirm/text, calling approveNode/denyNode from @smithers-orchestrator/engine/approvals through Effect.runPromise with the smithers:tui source. It re-checks each approval is still pending before deciding (so out-of-band resolutions are skipped) and leaves the run paused if you cancel the approval prompt.
smithers migrate copies the legacy bun:sqlite store into PGlite or Postgres. bunx smithers-orchestrator migrate (with --to pglite|postgres, default pglite, and --keepSqlite, default true) imports migrateSmithersStore from smithers-orchestrator, streams per-table [smithers] migrated <table>: <targetRows>/<sourceRows> rows progress to stderr, and writes the migrated.json marker. Read commands in apps/cli/src/find-db.js now run assertSmithersReadBackend before opening the legacy store, which calls resolveSmithersBackendChoice so a pglite/postgres resolution with no marker fails loud instead of silently reading stale SQLite.
--backend is honored on read commands so the migration remediation actually works. up/gateway/monitor/workflow run register --backend sqlite|pglite|postgres, but read commands (ps, inspect, output, …) did not, so the SMITHERS_MIGRATION_REQUIRED error’s own --backend sqlite advice was rejected as an unknown flag. extractBackendFlag in apps/cli/src/argv-utils.js now lifts --backend <value> (or --backend=value) out of argv on any command and sets SMITHERS_BACKEND so the resolver honors it everywhere.
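The flag-lifting step can be pictured with a minimal sketch. This is not the code in apps/cli/src/argv-utils.js: the return shape is an assumption, and the sketch hands the value back to the caller instead of writing SMITHERS_BACKEND itself.

```typescript
// Hypothetical sketch of lifting --backend out of argv; the real
// extractBackendFlag may differ in shape and validation.
function extractBackendFlag(argv: readonly string[]): { argv: string[]; backend: string | undefined } {
  const queue = [...argv];
  const rest: string[] = [];
  let backend: string | undefined;
  while (queue.length > 0) {
    const arg = queue.shift() as string;
    if (arg === "--backend") {
      backend = queue.shift(); // consume the following value, if any
    } else if (arg.startsWith("--backend=")) {
      backend = arg.slice("--backend=".length);
    } else {
      rest.push(arg); // everything else passes through untouched
    }
  }
  return { argv: rest, backend };
}
```

A caller would then export the lifted value (for example as SMITHERS_BACKEND) before dispatching the remaining argv to the command parser, so the command never sees a flag it has not registered.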
bunx smithers-orchestrator memory gains get, set, and rm. The memory group previously only had list; get, set, and rm now wrap the memory store’s getFact/setFact/deleteFact over a parsed namespace plus key (e.g. workflow:my-flow), with set taking an optional --ttl in milliseconds. get prints the fact’s valueJson and reports a missing fact cleanly. Failures surface as MEMORY_GET_FAILED/MEMORY_SET_FAILED/MEMORY_RM_FAILED. Code in apps/cli/src/index.js (#302).
A paused up now prints decision CTAs instead of generic failure hints. When up exits 3 in a waiting-approval/waiting-event/waiting-timer state, the result block prepends a ‘Run is paused (exit 3 = awaiting a decision, not a failure)’ header plus pauseCtas: approve/deny/why for approvals, signal/why for events, and why for timers, ahead of the usual inspect/logs CTAs.
SIGINT and SIGTERM now durably cancel the run. setupSqliteCleanup previously registered process.on("SIGINT"/"SIGTERM") handlers that called process.exit(130/143) before the graceful abort, leaving the run status:"running" with a frozen heartbeat until the 30s stale lease (RUN_HEARTBEAT_STALE_MS) expired. Those signal handlers are dropped (process.on("exit", closeSqlite) still cleans up), so the abort path runs and durably writes status:"cancelled", with a second-signal/5s-unref force-exit backstop in setupAbortSignal and the gateway shutdown.

Evals & Benchmarks

This release turns the eval corpus into something that generates and repairs itself, and hardens the SWE-EVO benchmark into a full-suite, crash-safe run with honest scoring on a real x86 backend. Both tracks ship as runnable Smithers workflows you can drive end to end.

The feature-eval-factory workflow authors and optimizes the whole eval corpus. evals/feature-eval-factory.tsx is a durable two-phase Smithers workflow. Phase 1 fans out one authoring agent per FeatureGroup parsed from .smithers/specs/features.ts (a rotated mix of AntigravityAgent, CodexAgent, and ClaudeCodeAgent with array failover, where Antigravity only leads when the agy CLI is on PATH) that writes non-trivial, multi-feature eval tasks to per-group shards under evals/_inventory/generated/<GROUP>.jsonl, then a single merge step folds them into evals/_inventory/curated-tasks.jsonl and regenerates the per-suite cases.jsonl via evals/harness/generate-cases.ts. Run it with bunx smithers-orchestrator up evals/feature-eval-factory.tsx --detach.
Phase 2 closes the loop by fixing docs until a weak model one-shots the tasks. A <Loop> runs scoped suites against a weak candidate model (default haiku), scores pass-rate plus one-shot-rate, aggregates the worst features and the agents’ own friction themes, and hands them to a fixer agent (Codex-led) that edits the root-cause docs/*.mdx and regenerates the bundles with pnpm docs:llms, iterating toward the targetPass 1.0 / targetOneShot 0.99 goal.
The feature-coverage corpus now spans every feature group. The factory generated source tasks across the non-empty feature groups in .smithers/specs/features.ts, expanding curated-tasks.jsonl and seeding new authoring and knowledge suites under evals/suites/ (including knowledge-gateway, authoring-workflows, authoring-agents, and authoring-approvals).
SWE-EVO got a decoupled native-x86 scorer for emulation-incompatible instances. Some instances do not reproduce their gold patch under Docker emulation on Apple Silicon, so examples/swe-evo/score-x86.ts provisions a native x86 Linux + Docker VM on Freestyle, uploads the self-contained harness, and scores candidates (or the gold patch with --gold) one image at a time, pruning between images to respect the 32 GB plan cap. Agents still generate patches on the Mac (where the CLIs and auth live); extract-candidates.ts pulls those diffs out of the run DB for re-scoring. A follow-up fix moved to one fresh VM per instance so a single oversized image can no longer exhaust the rootfs and poison later instances.
A rate-limit-aware suite supervisor runs the full benchmark to completion. examples/swe-evo/run-suite.ts supervises at the instance level: it recomputes the remaining unscored targets from the DB each round (so it is crash-safe and idempotent), pauses with escalating backoff capped at the Claude 5-hour window on a global block, and drops an instance only after --max-instance-retries (default 3) genuine failures. Flags include --subset, --all, --concurrency, and --workflow panel|baseline.
A Panel + ReviewLoop orchestration variant measures what multi-agent structure buys. examples/swe-evo/workflow/swe-evo-panel.tsx replaces the flat two-agent baseline with a <Panel> of three parallel planners (Opus + Codex + Gemini) synthesized by an Opus moderator, then a <ReviewLoop> where Codex (gpt-5.5) implements against the plan while Opus and Gemini review until approved. Fairness is preserved: no agent sees the hidden tests or the gold patch.
Results compare against the official SWE-EVO leaderboard. examples/swe-evo/report.ts computes Resolved Rate and Fix Rate per repo and diffs runs against the public leaderboard in official-results.json (sourced from arXiv:2512.18470 Table 2), and make-showcase.ts builds a self-contained HTML page of per-repo and per-instance coverage across all 48 official instances. The final showcase reports 34/48 scored and 23 resolved (67.6% Resolved Rate over the scored set), with native-x86 scores overriding non-reproducible Mac sentinels for the x86 instances.

Reliability & Correctness

Roughly 160 fixes landed this release. The notable, user-impacting ones, grouped by area:

CLI, Review & e2e

A large fix pass this cycle hardened the parts users touch every day: clean shutdown and durable run state in the CLI, scoped action tokens and time-travel tools over MCP, an OIDC verification gap and atomic spend caps in the review proxy, and a rewrite of the fault e2e suite so the tests exercise real backends instead of fabricated stand-ins.

Runs now mark themselves cancelled on Ctrl-C instead of dangling as running. setupSqliteCleanup in apps/cli/src/index.js had registered SIGINT/SIGTERM handlers that called process.exit before the graceful abort path, so the run’s status:"cancelled" write never happened and the run sat at running with a frozen heartbeat until its lease went stale. Those handlers are dropped (the process exit event still closes SQLite), and setupAbortSignal now aborts gracefully, force-exits on a second signal, and force-exits via a 5s unref’d timer backstop so a hung shutdown never needs kill -9.
The --backend sqlite migration remediation actually works now. --backend was a registered option only on up/gateway/monitor/workflow, so read commands such as bunx smithers-orchestrator ps and inspect rejected it as an unknown flag, which was the exact remediation the migration hard-fail told users to run. A new extractBackendFlag in apps/cli/src/argv-utils.js lifts --backend <value> (and --backend=value) out of argv for every command outside the NATIVE_BACKEND_COMMANDS set and sets SMITHERS_BACKEND so the store resolver honors it everywhere.
bunx smithers-orchestrator gateway keeps stdout clean and exits 0 on shutdown. The long-running Gateway only resolved after SIGINT/SIGTERM, at which point c.ok(result) printed a completion descriptor plus a skills CTA to stdout, breaking the empty-stdout contract that autoStartGateway relies on (it polls /health over HTTP and reports status on stderr). The shutdown path now calls process.exit(0) without emitting a trailing result frame.
Scoped action tokens can be brokered without breaking gateway auth. apps/cli/src/token-store.js gains issueSmithersBrokerToken (issues a brokered handle alongside the main bearer, scoped to a single actionId), resolveSmithersActionTokenFromStore (validates the handle’s scopes/expiry/revocation and returns the raw bearer without storing it at rest), and revokeSmithersToken with a capped audit trail. The store bumps to version 2 with normalizeStore for safe forward/backward reads, and the CLI wires bunx smithers-orchestrator token issue --actionId/--revealToken plus a new token exec subcommand that resolves a handle and injects the bearer into a child process env var (default SMITHERS_API_KEY) (#321).
The review proxy meters the full SSE body and enforces spend caps atomically. apps/review/src/server/proxy/handleAnthropic.ts dropped a 1 MiB accumulation cap (const cap = 1 << 20) in teeForMetering so large streamed responses are fully read before usage is parsed, and apps/review/src/server/proxy/recordUsage.ts now does a conditional UPDATE sessions SET spent_usd = spent_usd + ? WHERE hash = ? AND spent_usd + ? <= spend_cap_usd before inserting the usage_events row, returning recorded: false when the cap would be exceeded so spend can never be double-counted. A pre-request check in handleAnthropic.ts still rejects with HTTP 402 when the session is already over its cap (#439).
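The ordering matters here: the guarded UPDATE runs first, and the usage row is inserted only if that UPDATE changed a row. A rough sketch, assuming a hypothetical Db interface that reports the changed-row count and hypothetical usage_events column names:

```typescript
// Hypothetical minimal DB interface; the real recordUsage.ts uses its own client.
interface Db {
  run(sql: string, params: readonly (string | number)[]): { changes: number };
}

function recordUsage(db: Db, hash: string, costUsd: number): { recorded: boolean } {
  // The cap check and the increment happen in one statement, so two
  // concurrent requests cannot both pass the check and overshoot the cap.
  const update = db.run(
    "UPDATE sessions SET spent_usd = spent_usd + ? WHERE hash = ? AND spent_usd + ? <= spend_cap_usd",
    [costUsd, hash, costUsd],
  );
  if (update.changes === 0) return { recorded: false };
  // Column names below are assumptions for illustration.
  db.run("INSERT INTO usage_events (session_hash, cost_usd) VALUES (?, ?)", [hash, costUsd]);
  return { recorded: true };
}
```

Because the event row is written only after the conditional UPDATE succeeds, spent_usd and usage_events cannot drift apart when the cap rejects a request.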
OIDC verification no longer accepts a mismatched single-key JWKS. apps/review/src/server/sessions/verifyOidc.ts had a single-key fallback (?? (keys.length === 1 ? keys[0] : undefined)) that used a JWKS key even when its kid did not match the token header’s kid, so an attacker who could rotate their own JWKS to contain exactly one key could bypass the kid check. verifyOidc now strictly requires k.kid === header.kid and returns { ok: false, reason: "unknown-key" } when nothing matches (#386).
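The strict lookup reduces to a single find with no fallback. A minimal sketch, with a hypothetical Jwk shape and helper name (the real verifyOidc does more, including signature verification):

```typescript
// Hypothetical shapes for illustration only.
interface Jwk {
  kid?: string;
  kty: string;
}

type KeyLookup = { ok: true; key: Jwk } | { ok: false; reason: "unknown-key" };

function selectVerificationKey(keys: readonly Jwk[], headerKid: string): KeyLookup {
  // No single-key fallback: a key is usable only when its kid matches exactly.
  const key = keys.find((k) => k.kid === headerKid);
  return key ? { ok: true, key } : { ok: false, reason: "unknown-key" };
}
```

With the fallback gone, a JWKS that contains exactly one key no longer counts as a match unless that key's kid equals the token header's kid.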
Server rerun loads input portably instead of via raw SQL. packages/server/src/gateway.js previously reran with a hardcoded SELECT payload FROM input WHERE run_id = ? against a raw better-sqlite3 client, which broke for schema-backed workflows whose input table name differs and mishandled the payload envelope. It now derives the real table name via resolveSchema from @smithers-orchestrator/engine, fetches the row with loadInput from @smithers-orchestrator/db/snapshot, and a new normalizeRerunInput strips runId and unwraps the payload envelope before starting the new run (#330).
The full time-travel surface is exposed over the semantic MCP, with typed errors preserved. apps/cli/src/mcp/semantic-tools.js and SemanticToolName.ts now register fork_run, replay_run, rewind_run, restore_checkpoint, list_snapshots, get_timeline, and time_travel (previously only revert_attempt was wired). Separately, the layer stopped re-wrapping typed SmithersErrors as generic INTERNAL_ERROR: executeSemanticTool switched Effect.tryPromise to its two-argument { catch: (error) => error } form so the original error passes through and codes like WORKFLOW_MISSING_DEFAULT, RUN_NOT_FOUND, and CLI_DB_NOT_FOUND reach the caller (#433, #409).
The gui shortcut routes to the Gateway UI, and replay surfaces VCS restore errors. The gui command and the bunx smithers-orchestrator <dir> shortcut now route to the Gateway UI via runUiCommand, with rewriteGuiShortcutArgv using a statSync.isDirectory guard so only directories trigger the route (#435); and replay now surfaces VCS restore errors through a dedicated apps/cli/src/reportReplayResult.js that prints result.vcsError instead of leaving the user with a replay that silently did nothing (#388).
Fault e2e cases now drive real backends instead of fabricated stand-ins. The mock-SQLite and hand-rolled WebSocket fakes were replaced with real engine/Gateway paths: case03 starts a real Approval run and submits via the /v1/rpc/submitApproval HTTP endpoint (#412), case14/case17 boot a real Gateway for RPC round-trips and bad-HMAC webhook signatures, dropping ~1256 lines of mock infrastructure (#415), and case12 (rewind/VCS) and case26 (diff review) were rewritten through the real product path (#416, #417). case14 also learned to wait for the node to park at waiting-approval (not just the approval row flipping to requested) to avoid a 500 race on submit.
Smaller fixes. Agent detection walks PATH with accessSync(join(entry, binary), constants.X_OK) instead of shelling out to /bin/bash -c "command -v" (#444); the cron scheduler spawns the real entrypoint via process.execPath and an import.meta.url-resolved absolute path rather than bun run src/index.js (#422); snapshots/timeline JSON output is emitted through writeStdoutSync (so a piped reader never truncates it) and wrapped in a { timeline:... } envelope, with a -j alias for --json (#427); the TUI re-polls pending human requests when only an approval is visible so a HumanTask routes to its real question; OpenAPI tool generation imports createOpenApiToolsSync from the public smithers-orchestrator/openapi entry (#441); apps/review/src/github/runGh.ts was made CI-deterministic with a checked-in executable apps/review/tests/github/fixtures/fake-gh, env: process.env propagation, node:child_process spawnSync, and a SMITHERS_GH_BIN override; and the workflow-ui e2e resolves Playwright via require("playwright") from apps/cli’s own devDependency rather than the retired studio-2 POC.

Gateway, DB, OpenAPI & Control-Plane

0.25.0 lands 30-plus fixes across the durable control plane. The headline repairs are data-integrity ones in packages/db and the cloud-sync path, plus trust-boundary hardening in the generated OpenAPI HTTP tools.

Closed a concurrent sequence-allocation race in the event/signal log on the Postgres and pglite backends. The non-bun:sqlite fallback in packages/db/src/adapter.js (insertEventWithNextSeq / insertSignalWithNextSeq) read MAX(seq) then insertIgnore’d at lastSeq+1 with no serialization, so two concurrent writers collided on the (run_id, seq) primary key and insertIgnore silently dropped the loser, losing an event or signal from the ordering backbone that deterministic replay, live-stream tailing, and reconnect-after-seq all depend on. The fallback now acquires the same per-run transaction turn the bun:sqlite path uses, making read-MAX-then-insert atomic; new tests assert 50 concurrent event inserts and 40 concurrent signal inserts stay gapless with no drops or dups.
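To see why a per-run turn closes the race, consider a sketch where an in-memory array stands in for the table and a promise chain per run plays the role of the transaction turn. All names except insertEventWithNextSeq are hypothetical, and starting seq at 1 is an assumption.

```typescript
type EventRow = { runId: string; seq: number; payload: string };

class InMemoryEventLog {
  private rows: EventRow[] = [];
  private turns = new Map<string, Promise<unknown>>();

  // Queue `work` behind every earlier call for the same run.
  private withRunTurn<T>(runId: string, work: () => Promise<T>): Promise<T> {
    const previous = this.turns.get(runId) ?? Promise.resolve();
    const next = previous.then(work);
    // Store a never-rejecting tail so one failure does not block later turns.
    this.turns.set(runId, next.catch(() => undefined));
    return next;
  }

  private async maxSeq(runId: string): Promise<number> {
    await Promise.resolve(); // stands in for a DB round-trip
    return this.rows
      .filter((row) => row.runId === runId)
      .reduce((max, row) => Math.max(max, row.seq), 0);
  }

  insertEventWithNextSeq(runId: string, payload: string): Promise<number> {
    return this.withRunTurn(runId, async () => {
      const seq = (await this.maxSeq(runId)) + 1; // read MAX ...
      await Promise.resolve(); // simulated write latency
      this.rows.push({ runId, seq, payload }); // ... then insert, inside the turn
      return seq;
    });
  }
}
```

Without withRunTurn, every concurrent caller would read the same MAX(seq) before any of them inserted, which is the collision the fix removes.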
User output schemas can no longer collide with reserved key columns. Output tables reserve run_id/node_id/iteration and input tables reserve run_id, but schema fields were appended to the same namespace unchecked, so a field named nodeId, runId, or iteration either crashed DDL with a raw “duplicate column name” (SQL path) or silently overwrote the internal NOT NULL key column and corrupted the composite primary key (drizzle path). A new context-aware assertNoReservedColumns guard (packages/db/src/assertNoReservedColumns.js), wired into zodToTable and zodToCreateTableSQL, now fails fast with an INVALID_INPUT error naming the offending field and suggesting a rename.
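A guard of this kind might look like the sketch below. The reserved-column lists come from the description above; the camelCase-to-snake_case mapping, the error class, and the message wording are assumptions, not the code in packages/db/src/assertNoReservedColumns.js.

```typescript
type TableKind = "input" | "output";

const RESERVED_COLUMNS: Record<TableKind, readonly string[]> = {
  input: ["run_id"],
  output: ["run_id", "node_id", "iteration"],
};

// Assumed naming rule: nodeId -> node_id.
function toColumnName(field: string): string {
  return field.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
}

// Hypothetical error type carrying the INVALID_INPUT code.
class InvalidInputError extends Error {
  readonly code = "INVALID_INPUT";
}

function assertNoReservedColumns(kind: TableKind, fields: readonly string[]): void {
  for (const field of fields) {
    const column = toColumnName(field);
    if (RESERVED_COLUMNS[kind].includes(column)) {
      throw new InvalidInputError(
        `Schema field "${field}" collides with reserved ${kind} column "${column}"; rename the field.`,
      );
    }
  }
}
```

The check is context-aware in the sense that nodeId is rejected for an output table but allowed for an input table, where only run_id is reserved.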
Electric live deletes are now applied instead of being silently dropped. An Electric delete change message carries only the primary-key columns ({ namespace, key }), never value_json, but createElectricCollection in packages/gateway-client routed the delete value through mapRow, which requires the non-PK columns and returned undefined, so every live delete was dropped and the row was stranded in the collection forever (the cloud live tail never removed a Postgres-deleted fact). A new optional getKeyFromRaw on the collection def derives the delete key directly from its PK columns; memoryFacts supplies one over namespace/key. A companion fix reconciles the Electric snapshot on snapshot-end rather than on every up-to-date frame.
OpenAPI generated tools no longer let the model overwrite injected auth, and non-2xx responses now surface as errors. Two trust and correctness bugs in the HTTP tool executor (packages/openapi/src/tool-factory/_helpers.js): an LLM-supplied header parameter could override the integrator’s injected auth header (the model could overwrite Authorization), and a non-2xx response was returned as a success result. Injected auth headers are now spread after the model’s header params so the operator secret always wins, and a non-2xx response throws an error carrying status and the response body (the error never embeds the RequestInit, so the injected header cannot leak).
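Both repairs are small. The sketch below shows the spread order and the non-2xx throw with hypothetical helper names; lower-casing header names before merging is an assumption of this sketch (so that Authorization and authorization cannot coexist), not something the release notes state.

```typescript
type HeaderMap = Record<string, string>;

// HTTP header names are case-insensitive, so normalize before merging.
function lowerCaseKeys(headers: HeaderMap): HeaderMap {
  const out: HeaderMap = {};
  for (const [key, value] of Object.entries(headers)) out[key.toLowerCase()] = value;
  return out;
}

// Injected auth is spread last, so the operator secret always wins.
function mergeHeaders(modelHeaders: HeaderMap, injectedAuth: HeaderMap): HeaderMap {
  return { ...lowerCaseKeys(modelHeaders), ...lowerCaseKeys(injectedAuth) };
}

// Carries only status and body, never the request that held the secret.
class HttpToolError extends Error {
  readonly status: number;
  readonly body: string;
  constructor(status: number, body: string) {
    super(`HTTP ${status}`);
    this.status = status;
    this.body = body;
  }
}

function unwrapResponse(status: number, body: string): string {
  if (status < 200 || status > 299) throw new HttpToolError(status, body);
  return body;
}
```

Keeping the request options out of the error object is what prevents the injected header from leaking into a tool result that the model can read.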
Several more OpenAPI parsing and execution fixes. buildUrl now uses replaceAll so a path template with a repeated param (e.g. /orgs/{id}/aliases/{id}) substitutes every occurrence (#429); non-JSON request bodies are supported in buildOperationSchema and the executor (#431); request-body argument-name collisions with other params are resolved via getRequestBodyArgName in both schema and executor (#440); jsonSchemaToZod now enforces numeric and typeless enums (#408) and validates typed additionalProperties via catchall (#420); and the curated-tools factory rejects duplicate tool names with a descriptive error instead of silently overwriting entries (#315).
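The repeated-parameter case is easy to reproduce. A minimal sketch of the substitution, where the signature and the URL-encoding of values are assumptions rather than the package's actual buildUrl:

```typescript
// Hypothetical signature; only the replaceAll behaviour is taken from the fix.
function buildUrl(baseUrl: string, template: string, params: Record<string, string>): string {
  let path = template;
  for (const [key, value] of Object.entries(params)) {
    // replaceAll substitutes every {key}, not just the first occurrence.
    path = path.replaceAll(`{${key}}`, encodeURIComponent(value));
  }
  return baseUrl + path;
}
```

With a single replace, /orgs/{id}/aliases/{id} would keep a literal {id} in its second segment.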
Control-plane validation now rejects junk instead of storing it. setUsageLimit and checkUsageLimit in packages/control-plane/src/index.js previously accepted any non-empty period string; a new USAGE_LIMIT_PERIODS map (daily/weekly/monthly to ms) and usageLimitPeriod validator throw INVALID_INPUT for unknown values, and checkUsageLimit now derives the window from the period (#397). Duplicate slugs map to a typed SmithersError (#393), usage and audit events validate their project (#394), and malformed metadata_json is tolerated with a warn rather than crashing reads (#328).
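The period validation can be sketched as a closed map plus a validator. The millisecond values below are assumptions (in particular, treating monthly as 30 days), and the error class and windowStart helper are hypothetical.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed window lengths; the real USAGE_LIMIT_PERIODS values may differ.
const USAGE_LIMIT_PERIODS = {
  daily: DAY_MS,
  weekly: 7 * DAY_MS,
  monthly: 30 * DAY_MS,
} as const;

type UsageLimitPeriod = keyof typeof USAGE_LIMIT_PERIODS;

// Hypothetical error type carrying the INVALID_INPUT code.
class InvalidInputError extends Error {
  readonly code = "INVALID_INPUT";
}

function usageLimitPeriod(value: string): UsageLimitPeriod {
  // hasOwnProperty keeps inherited keys such as "toString" from passing.
  if (Object.prototype.hasOwnProperty.call(USAGE_LIMIT_PERIODS, value)) {
    return value as UsageLimitPeriod;
  }
  throw new InvalidInputError(`Unknown usage limit period "${value}"; expected daily, weekly, or monthly.`);
}

// Derive the start of the accounting window from the validated period.
function windowStart(period: string, nowMs: number): number {
  return nowMs - USAGE_LIMIT_PERIODS[usageLimitPeriod(period)];
}
```

Deriving the window from the validated period means a stored limit can no longer carry a period string that the checker does not know how to interpret.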
Usage tracking no longer reports impossible numbers or hits the network with dead tokens. parseAnthropicRateLimitHeaders and parseOpenAiRateLimitHeaders clamp the computed used count with Math.max(0, limit - remaining) so a burst window resetting mid-flight (remaining > limit) can no longer produce negative usage (#434), and claudeOauthUsage now checks creds.expiresAt and returns a descriptive error before sending a request that would 401 with an expired token (#390).
Gateway run-tree traversal is cycle-safe and prompts resolve from the workspace root. flattenGatewayRunNode (in gateway-client) now tracks visited node IDs in a Set and dedupes child IDs, and buildGatewayRunTree was extracted (in gateway-react) with cycle-safe recursion, guarding against infinite traversal on cyclic or duplicated node references (#436). Separately, the gateway resolves .smithers/prompts from a registered GatewayOptions.workspaceRoot (stored as this.workspaceRoot on the gateway and used by the listPrompts RPC) rather than process.cwd, so launch modes whose cwd differs from the workspace return the right prompts; and the prompts collection is keyed by entryFile so foo.md and foo.mdx no longer collapse to one id and drop a prompt.
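The visited-set technique is simple to show in isolation. The node shape and function name below are hypothetical; the real flattenGatewayRunNode works on the gateway's own run-node type.

```typescript
// Hypothetical node shape for illustration.
interface RunNode {
  id: string;
  children: RunNode[];
}

function flattenRunNode(rootNode: RunNode): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  const visit = (node: RunNode): void => {
    if (visited.has(node.id)) return; // cycle or duplicated reference
    visited.add(node.id);
    order.push(node.id);
    for (const child of node.children) visit(child);
  };
  visit(rootNode);
  return order;
}
```

Each node ID is emitted once, so a child listed twice or a child that points back at an ancestor terminates instead of recursing forever.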
Gateway React and client streaming reconnection is more robust. useGatewayExtensionStream clears error once a frame arrives after a failed attempt and makes its reconnect backoff abort-aware (via an AbortController) so an unmount during backoff resolves immediately; the gateway-client gap-resync check now reads the canonical top-level frame.event (not frame.payload.event) so bare replay frames are detected correctly (#419); a failed WebSocket open closes the socket before rejecting (#387); and useGatewayApprovals/useGatewayRuns/useGatewayWorkflows include params in their refetch dependencies so a param change actually refetches (#392).
Smaller fixes. A new waiting-event run state in deriveRunState, backed by approval-decided-resume-required and external-trigger variants on ReasonBlocked, disambiguates runs parked on an awaited event from generic blocked states (#410); the Zod-to-DDL generator maps float numbers to REAL columns instead of forcing INTEGER/TEXT (#312); the gateway RPC-contract test now types its expectedScopes map as GatewayScope so the scope assertions typecheck; eight duplicate isUniqueConstraintError/throwDuplicateSlugError definitions (four copies of each) were collapsed to one of each in the control plane; and the @tanstack/db version is pinned (exact 0.6.8) for gateway-client.

Agents, Scorers, Engine & Components

The 0.25.0 cycle hardened the durable spine. This section curates the fixes that change correctness, data integrity, security, and debuggability across agents, scorers, the engine and scheduler, the react-reconciler, memory, and components.

Recording an error no longer corrupts the durable log. The engine writes failed tasks with JSON.stringify(errorToJson(error)) on its durable path, but errorToJson copied cause/details/context by raw reference, so a circular cause or a BigInt detail threw inside the error-recording code. errorToJson in packages/errors/src/errorToJson.js now runs its output through a cycle-aware sanitizer (BigInt to string, non-finite numbers to null, functions/symbols/undefined dropped, throwing getters skipped, cycles broken with a [Circular] sentinel, nested Errors serialized to {name, message, stack,...}). toTaggedErrorPayload.js coerces TaskTimeout/TaskHeartbeatTimeout numerics (attempt, timeoutMs, iteration, etc.) to a finite fallback so the attempt/timeout values the retry/backoff logic reads back after a JSON round-trip are never NaN.
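The sanitizer rules listed above can be sketched as one recursive function. This is an illustration under assumptions, not the code in packages/errors/src/errorToJson.js: the function name, the Json type, and the choice to track only ancestors (so shared but non-cyclic references are serialized normally) are all hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function sanitize(value: unknown, ancestors: Set<object> = new Set<object>()): Json | undefined {
  if (value === null) return null;
  if (typeof value === "string" || typeof value === "boolean") return value;
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "bigint") return value.toString();
  if (typeof value !== "object") return undefined; // undefined, function, symbol
  const obj = value as object;
  if (ancestors.has(obj)) return "[Circular]";
  ancestors.add(obj);
  try {
    if (Array.isArray(obj)) return obj.map((item: unknown) => sanitize(item, ancestors) ?? null);
    // Error fields are mostly non-enumerable, so list them explicitly.
    const keys = obj instanceof Error
      ? ["name", "message", "stack", "cause", ...Object.keys(obj)]
      : Object.keys(obj);
    const out: { [key: string]: Json } = {};
    for (const key of keys) {
      let child: unknown;
      try {
        child = (obj as Record<string, unknown>)[key];
      } catch {
        continue; // skip getters that throw
      }
      const clean = sanitize(child, ancestors);
      if (clean !== undefined) out[key] = clean;
    }
    return out;
  } finally {
    ancestors.delete(obj);
  }
}
```

The result is always safe to pass to JSON.stringify, which is the property the durable error-recording path depends on.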
Scorer aggregation closed a SQL injection vector. aggregateScores in packages/scorers/src/aggregate.js interpolated filter values (runId, nodeId, scorerId) into SQL via an escapeSql helper. It now uses an addFilter helper that pushes column = ? placeholders and collects a params array (run_id, node_id, scorer_id). rawQuery in packages/db/src/adapter.js was extended to forward params to queryAllRaw (Postgres) and stmt.all(…params) (SQLite).
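The placeholder approach can be sketched as follows. The addFilter helper and the three column names come from the description above; the surrounding function, the SELECT list, and the score column are assumptions for illustration.

```typescript
interface ScoreFilter {
  runId?: string;
  nodeId?: string;
  scorerId?: string;
}

function buildAggregateQuery(filter: ScoreFilter): { sql: string; params: string[] } {
  const clauses: string[] = [];
  const params: string[] = [];
  // Values never reach the SQL text; only a placeholder does.
  const addFilter = (column: string, value: string | undefined): void => {
    if (value === undefined) return;
    clauses.push(`${column} = ?`);
    params.push(value);
  };
  addFilter("run_id", filter.runId);
  addFilter("node_id", filter.nodeId);
  addFilter("scorer_id", filter.scorerId);
  const where = clauses.length > 0 ? ` WHERE ${clauses.join(" AND ")}` : "";
  // Hypothetical SELECT list; the real aggregate query differs.
  const sql = `SELECT scorer_id, AVG(score) AS mean FROM _smithers_scorers${where} GROUP BY scorer_id`;
  return { sql, params };
}
```

Because the column names are fixed literals and the values travel in params, a hostile runId is treated as data by the driver rather than parsed as SQL.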
Scores are clamped, validated, and replay-deterministic. packages/scorers/src/run-scorers.js had three durability bugs. NaN/Infinity/out-of-range scores poisoned DB aggregates and are now clamped to [0,1] and validated. A missing or non-numeric score silently dropped the row while the event log said finished, and it now throws SCORER_FAILED with no split-brain. Ratio sampling used unseeded Math.random so replay or fork could flip skip versus run, and it now derives the decision from a SHA-256 over the durable identity (runId, nodeId, iteration, attempt, scorerId).
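Deterministic sampling of this kind can be sketched with Node's crypto module. The identity fields are the ones named above; how the digest is turned into a decision (first four bytes read as a fraction in [0, 1) and compared with the ratio) is an assumption of this sketch, as are the function and type names.

```typescript
import { createHash } from "node:crypto";

interface ScorerIdentity {
  runId: string;
  nodeId: string;
  iteration: number;
  attempt: number;
  scorerId: string;
}

function shouldSample(identity: ScorerIdentity, ratio: number): boolean {
  // Serialize the durable identity in a fixed field order.
  const key = JSON.stringify([
    identity.runId,
    identity.nodeId,
    identity.iteration,
    identity.attempt,
    identity.scorerId,
  ]);
  const digest = createHash("sha256").update(key).digest();
  // Assumed mapping: first 4 bytes as an unsigned int, scaled into [0, 1).
  const fraction = digest.readUInt32BE(0) / 2 ** 32;
  return fraction < ratio;
}
```

The same identity always yields the same decision, so a replay or fork of a run samples exactly the scorers the original run sampled.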
Scorer events now flow through the durable stream. The runner emitted ScorerStarted/Finished/Failed via the bare EventEmitter.emit, which notifies only live in-process listeners and skips the persist hop, so scorer events were invisible to bunx smithers-orchestrator events, logs, tui, and the gateway even though scores landed in _smithers_scorers. They now route through emitEventWithPersist, falling back to a bare emit for a third-party bus that exposes only emit.
The scheduler tolerates stale completions for tasks that left the graph. taskCompleted/taskFailed in packages/scheduler/src/makeWorkflowSession.js failed the whole run when a completion arrived for a conditionally-rendered task whose parent re-rendered it out while it was still running, which discarded every other in-flight item in a fan-out run. A completion for a task no longer in the graph is now a stale no-op (record the output and re-decide on the current graph); a stale failure is ignored and re-decided.
render rejects on uncaught errors and preserves all top-level children. In packages/react-reconciler/src/reconciler.js an uncaught throw during render rethrew out-of-band (via defaultOnUncaughtError) while render resolved with a stale partial graph, so callers got a corrupt result. render now captures the error synchronously through a custom onUncaughtError callback and rejects, clearing the field so the instance is not poisoned for the next render. Separately, appendChildToContainer/insertInContainerBefore/removeChildFromContainer overwrote a single container.root, silently dropping all but the last child of a multi-root Fragment; the container now keeps an ordered roots array (HostContainer.ts) and derives container.root from it, wrapping multiple roots in a synthetic smithers:fragment.
A bad model id surfaces the real provider error instead of a JSON-schema error. When a provider rejects a request (for example a 404 for a bad model id) the AI SDK emits an error stream part and rejects the derived promises with a generic NoOutputGeneratedError. streamResultToGenerateResult in packages/agents/src/streamResultToGenerateResult.js dropped that part, so the engine misclassified it as the agent did not return valid JSON for the declared output schema. It now captures the first stream error part (and consumeStream’s onError) and re-throws the genuine provider error, so a bad model id fails with the real 404/APICallError.
Memory summarization is idempotent and lossless, and token limiting actually runs. saveMessage in packages/memory/src/store/MemoryStoreLive.js was a bare insert that crashed on a UNIQUE/PK conflict during replay or resume; it now upserts on id via onConflictDoUpdate. The Summarizer (packages/memory/src/Summarizer.js) deleted old messages before saving the summary, so a failed save lost them; it now saves the summary first and deletes after, so the state where the old messages are gone and no summary exists is never reachable. The TokenLimiter and Summarizer processors, previously no-op placeholders, are now implemented atop new listThreadsEffect/deleteMessagesEffect store effects.
Token-usage events attribute the real model id. TokenUsageReported in packages/engine/src/engine.js fell back to effectiveAgent.model, which is often unset for SDK agents and resolves to a random-UUID id for CLI agents, breaking per-model cost attribution and exploding the metric model tag cardinality. It now prefers the authoritative result.response.modelId and refreshes the agent span tag (attemptMeta.agentModel) from the same resolved id.
Smaller fixes. <Branch> now throws INVALID_INPUT (and types children?: never in BranchProps.ts) when JSX children are passed instead of silently dropping them (packages/components/src/components/Branch.js); SagaStep is exported as a value from Saga.js so import { SagaStep } works; SuperSmithers uses stable output keys (super-smithers-read/propose/apply) and its apply path now instructs the agent to edit files on disk rather than only describe the changes; the scheduler decide depth guard (guarded at 10) returns a Failed result with SCHEDULER_ERROR instead of a silent Wait (#385); the engine stops advancing EventBus seq for DB-assigned events (#398); RESUME_METADATA_MISMATCH gained actionable remediation that prints the real bunx smithers-orchestrator fork <workflow> --run-id RUN_ID --frame <n> shape (or bunx smithers-orchestrator up <workflow> to start fresh); and observability’s withCorrelationContext is documented to run via Effect.runPromise/runFork (never runSync) to avoid wedging AsyncLocalStorage.

Testing & Quality

This release pushed hard on the repo’s no-mocks, real-backend testing philosophy: fault cases that used to seed raw SQL rows or hand-roll WebSocket fakes now boot a real engine and a real Gateway, the exactly-once SIGKILL-survival guarantee is finally proven across processes, and a broad unit-coverage sweep plus new CI gates lock in regressions that were previously discoverable only one slow round-trip at a time.

Real-engine SIGKILL/resume durability is now proven across processes. e2e/faults/case31-real-engine-kill-resume.test.ts spawns the engine in a separate OS process (e2e/harness/engineChildRunner.ts) against an on-disk SQLite DB, polls a sidecar marker to time the kill, SIGKILLs the real engine pid mid-node, then resumes the same run from the same DB in a fresh process. It asserts the exactly-once guarantee: the run reaches finished, each node commits exactly one output row, a node that committed before the kill is not re-executed, and the interrupted node re-runs and completes. This is the first time the #1 durability guarantee is verified against the real engine across processes (the prior crash cases, case01-06, only simulated it with a stale heartbeat in an in-memory table), and a manually injected regression that deletes a committed checkpoint makes the test fail as designed.
The gateway, approval, rewind, and diff-review fault cases were rewritten to drive the real product path. case14 and case17 dropped 1,256 lines of hand-rolled WebSocket/SQLite fakes to spin up a real Gateway and drive RPC round-trips (launchRun then submitApproval, plus a viewer-scope rejection) and bad-signature webhook payloads through the actual HTTP stack (#415). case03 replaced raw-SQL row seeding with createSmithers + runWorkflow, then submits the decision via the real /v1/rpc/submitApproval HTTP endpoint (#412); case12 drives the real rewind path via the time-travel jumpToFrame and rewindAudit APIs (#416) and case26 the real diff-review path via createSmithers + executeSandbox + getNodeDiffRoute (#417). These complete the no-mocks remediation in the fault-injection e2e matrix (epic 0022).
Workspace coverage and new gate checks are wired into CI. scripts/coverage.mjs (run via pnpm coverage) produces a workspace coverage report uploaded as the coverage-workspace artifact in .github/workflows/ci.yml. The root test script now runs check-single-effect-version, check-dependency-boundaries, check-docs, check-llms, and the new check-smithers-test-script gate (scripts/check-smithers-test-script.mjs, which asserts every workspace that has runtime test files declares a test script) before pnpm -r --no-bail test, so a single CI run reports every failing package at once instead of bailing at the first.
The dependency-boundary check now actually scans the e2e workspace. scripts/check-dependency-boundaries.mjs filesForPackage only looked under src/, so it scanned zero files for e2e (whose sources live in faults/, exports/, harness/). It now falls back to the package root when src/ is absent, which surfaced real undeclared imports (react, zod, smithers-orchestrator, @smithers-orchestrator/time-travel) that were only resolving via hoisting; those are now declared in e2e/package.json. The check passes for 38 packages and completes audit epics 0047 and 0052.
A broad unit-coverage sweep closed branch-level blind spots across the published packages. New suites cover the engine, gateway, server, driver, scheduler, vcs, usage, sandbox, agents, review, observability, time-travel, gateway-client, gateway-react, and pi-plugin, targeting workflow-session decide and deadlock branches (packages/scheduler/tests/workflowSession-decision-depth.test.js and workflowSession-service-branches.test.js), agent stream-json interpreters (Antigravity, Gemini, Hermes, Vibe), and OTLP severity/correlation edge cases (apps/observability/tests/otel-severity.test.js and correlation.test.js). The gateway RPC scope contract test now pins getSchemaSignature and listDocs as run:read in packages/gateway/tests/rpc-contract.test.ts so new RPC methods cannot ship without a declared scope.
A z.number fractional round-trip regression is now locked against a real column. Following the #312 fix that maps plain numbers to a SQLite REAL column (so 0.95 is no longer truncated to 0 by INTEGER affinity), packages/db/tests/db-output-roundtrip.test.js asserts getSQLType returns real and that 0.95 and 0.0123 round-trip losslessly through a real insert+select, so the fix cannot silently regress and users no longer need the z.string workaround.
Real-CLI e2e tests were brought onto a single, honest skip convention. The agents e2e suites dropped the undocumented SMITHERS_REAL_CLI_E2E=1 opt-in guard, matching opencode-e2e and vibe-agent-e2e: they now skip only when the agent binary is absent or lacks required flags. Skipped e2e fault cases were promoted where possible and every remaining skip was given a tracking link, and the chat-create e2e was made to run hermetically by honoring OPENAI_BASE_URL in the OpenAI diagnostics path (packages/agents/src/diagnostics/getDiagnosticStrategy.js).

Upgrade

bunx smithers-orchestrator@0.25.0
