bunx smithers-orchestrator migrate and a fail-loud SMITHERS_MIGRATION_REQUIRED gate so a legacy store is never silently abandoned. The Gateway gains read and write RPCs (listDocs, listAccounts, listPrompts, listScores, listMemoryFacts, getSchemaSignature, and ticket CRUD) plus a reactive TanStack DB sync layer with useGateway* hooks, opt-in SQLite-WASM/OPFS persistence, and a cloud Electric sync source. ctx.output, ctx.outputMaybe, and ctx.latest now infer their row types from the table you pass. The @smithers-orchestrator/agents package adds five vendor-neutral tool primitives (grounded web search, generic HTTP, transcription, image generation, and document/OCR parsing), and the CLI folds its tui command into up --interactive. The proof-of-concept UI apps were removed ahead of the real UI, now in preview at ui-preview.smithers.sh. Alongside: the <Sidecar> composite and a family of validation-first audit workflows, the feature-eval-factory and SWE-EVO upgrades, roughly 160 fixes across the CLI, gateway, scorers, OpenAPI, and control-plane, and a testing sweep that proves the exactly-once durability guarantee against the real engine across OS processes.
PostgreSQL, Migrations & the Durable Backend
The durable control plane is no longer tied to bun:sqlite. 0.25.0 adds a backend abstraction (SQLite, PGlite, or Postgres), a one-command migration path that copies legacy run history forward, and a fail-loud gate so an old store is never silently abandoned. New schema-signature and DB-backed docs primitives, plus a SIGKILL/resume durability fault case, round out the backend hardening.
- Backend choice resolution with a fail-loud migration gate. resolveSmithersBackendChoice in packages/smithers/src/resolveSmithersBackendChoice.js resolves the storage backend in precedence order (explicit option, then SMITHERS_BACKEND, then backend in .smithers/smithers.config.ts, then the default pglite). When a legacy bun:sqlite store still holds runs and the resolved backend is pglite or postgres with no .smithers/migrated.json marker, it throws SMITHERS_MIGRATION_REQUIRED rather than silently switching stores. createSmithers in packages/smithers/src/create.js rejects an explicit pglite or postgres backend with INVALID_INPUT (pointing at the async openSmithersBackend factory) rather than degrading to bun:sqlite; createSmithersPostgres is the PGlite/Postgres factory.
- New migrate command copies legacy run history forward. bunx smithers-orchestrator migrate (apps/cli/src/index.js) copies the legacy bun:sqlite smithers.db into PGlite or Postgres and writes the .smithers/migrated.json marker. It accepts --to pglite|postgres (default pglite), --url for a Postgres connection string, and --keepSqlite (default true), backed by migrateSmithersStore in packages/smithers/src/migrateSmithersStore.js, which orders tables so _smithers_runs is copied first and emits per-table progress events. A SmithersError during migration exits with code 4 (other failures exit 1), and a --backend flag is honored on read commands so the SMITHERS_MIGRATION_REQUIRED remediation works.
- Postgres SQL dialect translation layer. packages/db/src/dialect.js exports SQLITE and POSTGRES constants plus translatePlaceholders, columnType, translateDdl, quoteIdentifier, beginTransactionSql, and jsonExtractText, so the adapter in packages/db/src/adapter.js branches on internalStorage.dialect === POSTGRES for upserts, RETURNING clauses, and transaction control across both engines from one code path. translatePlaceholders rewrites ? to $1, $2, ... while skipping string literals, quoted identifiers, and comments.
- Schema-signature versioning for client/server drift detection. getSmithersSchemaSignature in packages/db/src/getSmithersSchemaSignature.js returns the durable schema head (schemaVersion, the leading digits of the latest _smithers_schema_migrations id) plus a sha256 signature over the sorted internal table catalog and per-table components, so clients can gate persistence on the server schema head and detect a same-head table-shape drift.
- DB-backed _smithers_docs table for Smithers markdown artifacts. Migration 0018_add_docs creates the _smithers_docs table (its index migrations parse the target table name so a ledger predating the table is handled), served through upsertDoc/upsertDocRow, getDoc, listDocs, and a softDeleteDoc tombstone path in packages/db/src/adapter.js, exposing Smithers docs (tickets/plans/specs/proposals) as durable rows rather than loose files.
- New gateway RPCs, an Electric write endpoint, and run-event backpressure. getSchemaSignature and listDocs are added to the gateway RPC surface (packages/gateway/src/rpc/index.ts, typed in packages/gateway-client/src/GatewayRpcTypeMap.ts). packages/server/src/gateway.js imports getSmithersSchemaSignature, serves a POST /v1/electric/write endpoint (handleElectricWrite) that runs the launch through the normal gateway RPC path and returns a null txid, and disconnects a slow streamRunEvents subscriber with a BackpressureDisconnect once its per-subscriber outbound queue overflows (the shared backpressure bounds live in packages/server/src/GatewayExtensions.js). A local Electric stack ships under deploy/electric/ (docker-compose.yml, initdb/001_smithers_min.sql, smoke.ts, README.md), and the work is documented at /cli/overview, /deployment/production-hardening, /rpc/get-schema-signature, and /rpc/list-docs.
- Real-engine SIGKILL and resume durability fault case. e2e/faults/case31-real-engine-kill-resume.test.ts spawns a real engine child process against an on-disk SQLite DB, SIGKILLs it mid-node (polling a B.started marker, no fixed-sleep race), then resumes in a fresh process and asserts the run reaches “finished”, each committed node has exactly one output row (no double-commit), and the pre-kill node A appears exactly once in the execution-counter file (exactly-once side effect).
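The placeholder rewrite described for translatePlaceholders can be illustrated with a small standalone scanner. This is a minimal sketch, not the packages/db source: the function name translatePlaceholdersSketch and its exact handling of doubled quotes and unterminated literals are assumptions made for illustration.

```typescript
// Hypothetical sketch of the "? -> $1, $2, ..." rewrite; not the real dialect.js code.
function translatePlaceholdersSketch(sql: string): string {
  let out = "";
  let index = 0; // number of placeholders emitted so far
  let i = 0;
  while (i < sql.length) {
    const ch = sql.charAt(i);
    const next = sql.charAt(i + 1);
    if (ch === "'" || ch === '"') {
      // String literal or quoted identifier: copy through the closing quote.
      // A doubled quote ('' or "") is an escaped quote, not a terminator.
      let j = i + 1;
      while (j < sql.length) {
        if (sql.charAt(j) === ch) {
          if (sql.charAt(j + 1) === ch) {
            j += 2;
            continue;
          }
          break;
        }
        j += 1;
      }
      out += sql.slice(i, j + 1);
      i = j + 1;
    } else if (ch === "-" && next === "-") {
      // Line comment: copy up to (not including) the newline.
      const newline = sql.indexOf("\n", i);
      const j = newline === -1 ? sql.length : newline;
      out += sql.slice(i, j);
      i = j;
    } else if (ch === "/" && next === "*") {
      // Block comment: copy through the closing */.
      const end = sql.indexOf("*/", i + 2);
      const j = end === -1 ? sql.length : end + 2;
      out += sql.slice(i, j);
      i = j;
    } else if (ch === "?") {
      index += 1;
      out += "$" + index;
      i += 1;
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}
```

Only a bare ? outside quotes and comments is numbered, so a literal such as '?' passes through untouched.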
Gateway: New RPCs & the Sync Layer
0.25.0 turns the Gateway into a control plane for live UIs. Workspace surfaces (runs, approvals, crons, memory, scores, tickets, prompts, accounts, docs) each get a typed RPC plus a reactive TanStack DB collection, the same collections can persist to the browser across reloads, and they can be fed either by the local RPC+WS transport or by a hardened cloud Electric proxy with no change to the consuming component.
- New read RPCs expose every workspace surface. listMemoryFacts (scope memory:read), listScores (score:read), listPrompts (prompt:read), listAccounts (account:read), listDocs and getSchemaSignature (run:read) were added to GATEWAY_RPC_DEFINITIONS in packages/gateway/src/rpc/index.ts, bringing the contract to 29 methods (the check-docs gate asserts exactly 29). listAccounts reads the real ~/.smithers/accounts.json registry via @smithers-orchestrator/accounts/listAccounts and redacts secrets over the wire, carrying hasApiKey/hasConfigDir posture flags instead of the plaintext key.
- Tickets became a full CRUD surface backed by a docs table. listTickets/createTicket/updateTicket/deleteTicket (scopes ticket:read/ticket:write, with a TicketNotFound 404 declared on update/delete) write to a new _smithers_docs table (migration 0018_add_docs) holding tickets, plans, specs, and proposals with a status column and a deleted_at_ms soft-delete tombstone. SmithersDb gained listDocs/getDoc/upsertDoc/softDeleteDoc in packages/db/src/adapter.js, and listDocs filters tombstones rather than returning them.
- Reactive TanStack DB collections and useGateway* hooks ship for each surface. packages/gateway-client/src/sync/gatewayCollectionDefs.ts adds collection defs (crons, memoryFacts, scores, tickets, prompts) keyed by stable identifiers (scores by the composite runId:nodeId:iteration:scorerId, memoryFacts by ${namespace}:${key}, prompts by entryFile), and packages/gateway-react exposes useGatewayCrons, useGatewayMemoryFacts, useGatewayScores, useGatewayTickets, and useGatewayPrompts over the sync-client registry built by createGatewayCollections, so a surface renders live gateway data with no in-app seed. The crons collection rides the existing cronList RPC.
- Collections persist across reloads via SQLite-WASM and OPFS. packages/gateway-react adds opt-in client persistence: createGatewayPersistence builds a PersistentCollectionStore over createSqliteWasmBackend, which uses an OPFS SAHPool VFS (durable without cross-origin isolation or SharedArrayBuffer). withPersistence hydrates a collection’s first sync commit synchronously from cache (no fetch flash) and writes through live changes, and a schemaVersion mismatch drops the whole cache. @sqlite.org/sqlite-wasm is injected by the consuming bundler and is only a devDependency of gateway-react, so the package carries no hard wasm runtime dependency.
- A cloud Electric sync source feeds the same collections behind a switch. createElectricCollection<TRow, TKey> in packages/gateway-client/src/sync/createElectricCollection.ts produces the same TanStack DB CollectionConfig (matching collection-id fingerprint and getKey) from an ElectricSQL shape (GET /v1/shape?table=… snapshot plus live long-poll tail) instead of the RPC+WS transport. createGatewayCollections in packages/gateway-react gains a syncSource of "gateway" or "electric"; it engages Electric only when the source is electric AND an electric config is present, otherwise it falls back to the gateway path. ShapeStream is loaded by dynamic import("@electric-sql/client") so gateway-only bundles tree-shake it out.
- The new electric-proxy package enforces grant-scoped shapes. @smithers-orchestrator/electric-proxy fronts a real electricsql/electric service (SMITHERS_ELECTRIC_URL) with auth, scope, and grant-based shape filtering, deriving each caller’s granted run ids from the gateway (SMITHERS_GATEWAY_URL). A run- or workspace-scoped shape with no concrete grant fails closed. Run it via the smithers-electric-proxy bin (port SMITHERS_ELECTRIC_PROXY_PORT, default 8443); the package exports createSmithersElectricProxy, serveSmithersElectricProxy, smithersElectricShapeCatalog, a metrics factory, and an observer.
- Run-event streaming is bounded by backpressure on both ends. In packages/server/src/gateway.js the server caps each subscriber’s outbound queue at RUN_EVENT_STREAM_OUTBOUND_QUEUE_LIMIT (1000 frames) drained against the WS socket’s bufferedAmount, disconnecting a slow consumer with a BackpressureDisconnect error and incrementing the new gatewayRunEventBackpressureDisconnectTotal metric. On the client, packages/gateway-client/src/sync/createGatewayCollection.ts replaces its unbounded buffer with a bounded queue that sheds the oldest unapplied frame past maxBufferedFrames (default max(maxRows, 1024)).
- Docs sync from disk into the DB via a file watcher. packages/engine adds syncDocsFromDisk, createDocWatcher, and startDocFileSync so tickets, plans, specs, and proposals under .smithers/ are mirrored one-directionally (last-write-wins on a content_hash mismatch) into _smithers_docs, exposed through the listDocs RPC. Separately, the bunx smithers-orchestrator gui shortcut and the bunx smithers-orchestrator <dir> directory shortcut now route to the Gateway UI instead of launching the native macOS app (#435).
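The client-side bound described above (shed the oldest unapplied frame once the buffer passes maxBufferedFrames) can be sketched as a small queue. This is an illustration under assumptions: the class name BoundedFrameQueue, its droppedCount counter, and the drain method are hypothetical and are not taken from createGatewayCollection.ts; only the shed-oldest policy and the max(maxRows, 1024) default come from the notes.

```typescript
// Hypothetical bounded buffer that drops the oldest frame on overflow.
class BoundedFrameQueue<T> {
  private frames: T[] = [];
  private readonly maxBufferedFrames: number;
  droppedCount = 0; // how many frames were shed so far (illustrative bookkeeping)

  constructor(maxBufferedFrames: number) {
    this.maxBufferedFrames = maxBufferedFrames;
  }

  push(frame: T): void {
    this.frames.push(frame);
    // Shed from the front so the newest frames are always retained.
    while (this.frames.length > this.maxBufferedFrames) {
      this.frames.shift();
      this.droppedCount += 1;
    }
  }

  // Hand all pending frames to the consumer and reset the buffer.
  drain(): T[] {
    const pending = this.frames;
    this.frames = [];
    return pending;
  }
}

// The documented default bound: max(maxRows, 1024).
function defaultMaxBufferedFrames(maxRows: number): number {
  return Math.max(maxRows, 1024);
}
```

Shedding the oldest frame keeps memory flat on a stalled tab at the cost of losing intermediate updates, which is the opposite trade from the server side, where a slow subscriber is disconnected outright.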
The Real UI, Removed POC UI, and a Green Pipeline
The product UI moved to its own repository, so 0.25.0 deletes the local POC UI apps that were dead weight and never passed CI on a clean, browser-free box. This release also pays down a large dead-code backlog and hardens the build, test, and publish gates so a single CI run reports every failing package at once.


- The proof-of-concept UI apps are gone. Removed apps/smithers (the Cerebras PWA POC), apps/smithers-studio-2 (studio shell POC), and apps/smithers-demo in one commit, then apps/smithers-tui-demo in a follow-up once its only remaining importer (smithers-demo) was deleted. The app-e2e CI job that ran their browser e2e (it installed playwright/chromium on the runner, which contradicts the repo’s clean-box, no-browser CI philosophy) was dropped, along with the dev, dev:studio, demo:smithers, demo:smithers:tui, and test:e2e:apps scripts and their package-configuration.mdx rows. All four apps were "private": true and were never published to npm. The real UI is previewable at https://ui-preview.smithers.sh.
- A 1,764-line legacy engine body was deleted. runWorkflowBodyLegacy in packages/engine/src/engine.js was the pre-driver implementation, reachable only via the __smithersEngineMode === "legacy" option and the SMITHERS_LEGACY_ENGINE=1 env var, neither set in production nor documented anywhere. runWorkflowBody now passes straight through to runWorkflowBodyDriver, dropping engine.js from 7,775 to 6,011 lines at the time of the change, with the full 646-test engine suite green. The engine’s legacy-mode tests still pass that option, so it is now a silent no-op that resolves to the driver path.
- A dead-code audit removed roughly 36 orphaned modules. Audit ticket 0048 (#301) deleted files with zero product or test consumers across many packages: the in-memory db storage module, duplicate db output/ and frame-codec/ barrels, the protocol ProtocolError and outputs.ts types, scorers and memory react-types.ts, the smithers src/ide/ subtree (16 files), vcs WorkspaceSnapshot.ts, the non-durable engine deferred-bridge.js, and the cli serveSemanticMcpServer wrapper function, each verified by grep plus per-package typecheck and test runs.
- The CI test job no longer hides a backlog of failures. The root test script now runs pnpm -r --no-bail test so one CI run reports every failing package at once instead of stopping at the first (which had been masking failures behind packages/smithers when rg was missing). Ripgrep is now installed on the runner (apt-get install -y ripgrep) so packages/smithers’ grep tool tests do real searches, and pnpm lint (oxlint), typecheck:examples, and typecheck:evals are now enforced gates per audit 0047 (#300).
- CI gained a workspace coverage report. A new pnpm coverage step (scripts/coverage.mjs) runs in the test job and uploads a coverage-workspace artifact, and selected examples/ bun tests (the porting-rules and context-handoff suites) now run in CI so the structural porting rules are gated.
- Publish gains an LLM-backed feature-doc-sync gate. The release workflow (.smithers/workflows/release.tsx) runs a tool-enabled feature-doc-sync agent (agents.smartTool) that, per .smithers/prompts/feature-doc-sync-audit.mdx, diffs from the last release tag (git describe --tags --match 'v*' --abbrev=0) and verifies every new or changed user-facing feature is recorded in both .smithers/specs/features.ts and docs/, then a feature-doc-sync-gate task throws an actionable message listing the missing entries on drift. A skipFeatureDocSync input flag bypasses it intentionally, mirroring the existing changelog and marketing checks.
- The release drift guard is now robust to non-deterministic declarations. scripts/publish.mjs previously failed the post-build drift check on any changed file, but rollup-plugin-dts emits non-deterministic *.d.ts for large packages so a committed copy can never byte-match a fresh build. The guard now excludes *.d.ts (the same pnpm -r build regenerates every declaration immediately before pnpm publish packs the tree) while still failing on stale deterministic artifacts like openapi.yaml, the llms bundles, and the seeded workflow pack.
- The release changelog now diffs since the previous release tag. probeRelease in .smithers/lib/release-content/git.ts diffed <latest tag>..HEAD, which is empty once pnpm version has tagged the new version at HEAD, producing a nearly blank 0.25.0 changelog despite ~450 commits since v0.24.2. Its new previousReleaseTag helper treats pkg.version as the target when v<pkg.version> already exists and diffs from the highest vMAJOR.MINOR.PATCH tag strictly below it.
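The tag selection behind the changelog fix can be sketched as a pure function over a tag list. This is a hedged illustration, not the code in .smithers/lib/release-content/git.ts: previousReleaseTagSketch, parseReleaseTag, and compareSemVer are hypothetical names, and the sketch only models the case where v<pkg.version> is already tagged.

```typescript
type SemVer = [number, number, number];

// Accept only plain vMAJOR.MINOR.PATCH tags; anything else is ignored.
function parseReleaseTag(tag: string): SemVer | null {
  const match = /^v(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  return match ? [Number(match[1]), Number(match[2]), Number(match[3])] : null;
}

// Numeric comparison, so v0.10.0 sorts above v0.9.0.
function compareSemVer(a: SemVer, b: SemVer): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Hypothetical helper: highest release tag strictly below the target version.
function previousReleaseTagSketch(tags: string[], pkgVersion: string): string | null {
  const target = parseReleaseTag("v" + pkgVersion);
  if (target === null) return null;
  let best: { tag: string; version: SemVer } | null = null;
  for (const tag of tags) {
    const version = parseReleaseTag(tag);
    if (version === null) continue; // skip non-release tags
    if (compareSemVer(version, target) >= 0) continue; // must be strictly below
    if (best === null || compareSemVer(version, best.version) > 0) {
      best = { tag, version };
    }
  }
  return best === null ? null : best.tag;
}
```

With tags v0.24.1, v0.24.2, and v0.25.0 and a package version of 0.25.0, the sketch selects v0.24.2, so the diff range covers the whole release instead of an empty v0.25.0..HEAD.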
Engine, Typed Outputs & Observability
This release sharpens the durable control plane from the type layer down to the metrics. Workflow output reads are now strongly typed against their table schema, the engine parks and revives runs around provider quota limits instead of failing them, and the observability surface stops reporting bogus model ids and unbounded event streams.
- Typed output rows from the table argument. ctx.output, ctx.outputMaybe, and ctx.latest previously returned an untyped OutputRow, so reads like ctx.outputMaybe(outputs.research,...).summary resolved to unknown and .smithers workflows failed tsc with about 85 errors. A new ResolveOutputRow<Schema, T> helper (packages/driver/src/ResolveOutputRow.ts) plus per-method overloads in packages/driver/src/SmithersCtx.js now infer the row from a string table name (keyed into the workflow Schema), a Zod schema object (z.infer), or a Drizzle table ($inferSelect), falling back to OutputRow for widened or unknown args so loosely-typed callers are unchanged.
- Quota-aware pause and resume. Runs that hit a provider quota limit now park as waiting-quota (added to RunStatus in packages/driver/src/RunStatus.ts, DB_RUN_ALLOWED_STATUSES, and the suspending-status checks) instead of failing, persisting quotaBlockedCount and resetAtMs in the run errorJson and surfacing them through deriveRunState (packages/db/src/runState/deriveRunState.js) as blocked.kind = quota. classifyQuotaError in packages/agents/src/BaseCliAgent/BaseCliAgent.js was hardened to parse ordinal dates and retry-after-N-seconds windows across all eight quota patterns, and the new status is documented in /reference/types and /reference/errors.
- .smithers docs synced to the DB by an optional file watcher. New syncDocsFromDisk, createDocWatcher, and startDocFileSync (packages/engine/src/*) mirror tickets, plans, specs, and proposals under .smithers into the _smithers_docs table, gated behind SMITHERS_DOCS_FILE_SYNC=1. Paths are validated to stay inside .smithers and dropped edits emit a docs.watcher.dropped warning under the docs:file-sync log span. The gateway also serves a matching listDocs read RPC and a getSchemaSignature RPC, both defined in packages/gateway/src/rpc/index.ts and dispatched by packages/server/src/gateway.js.
- Re-render trigger reasons for frame observability. RenderContext now carries a trigger with a RenderTriggerReason (task-finished, timer-fired, cache-resolved, loop-advanced, deadlock-check, stability-check) threaded through makeWorkflowSession (packages/scheduler/src/RenderContext.ts, makeWorkflowSession.js), so every re-render frame records why it fired. The new requireRerenderOnOutputChange flag on RunOptions (packages/driver/src/RunOptions.ts) opts a run into per-output-change re-renders.
- Honest per-model token attribution. TokenUsageReported computed the model as effectiveAgent.model ?? effectiveAgent.id ?? "unknown", so SDK agents (which leave model unset) and unnamed CLI agents fell through to a random UUID id, making per-model cost impossible and exploding the metric model tag cardinality. The engine now prefers the authoritative result.response.modelId (set for both SDK and CLI agents) and refreshes the agent span tag attemptMeta.agentModel from the same resolved id (packages/engine/src/engine.js).
- Run-event stream backpressure disconnects. The gateway now bounds each subscriber’s outbound run-event queue at RUN_EVENT_STREAM_OUTBOUND_QUEUE_LIMIT = 1000 frames; a slow consumer that overflows is disconnected with a BackpressureDisconnect error and increments the new smithers.gateway.run_event_backpressure_disconnect_total counter (packages/server/src/gateway.js, apps/observability/src/metrics/gatewayRunEventBackpressureDisconnectTotal.js).
- Correlation-context footgun fixed and dead code removed. withCorrelationContext (apps/observability/src/_coreCorrelation/withCorrelationContext.js) must run via Effect.runPromise/runFork, never runSync, because its AsyncLocalStorage.enterWith otherwise leaks ALS async-hooks onto the caller’s context; this is now documented and exercised that way. The obsolete legacy engine body (1,764 deleted lines in packages/engine/src/engine.js) and the dead in-memory MetricsService implementation (makeInMemoryMetricsService in apps/observability/src/_coreMetrics.js) were removed as part of audit 0048.
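The token-attribution change is easiest to see side by side. The sketch below contrasts the old fallback chain quoted above with a resolver that prefers the response's model id. The interfaces and function names are hypothetical, and the fallback used when modelId is absent (agent.model, then "unknown", skipping the agent id) is an assumption, not something these notes specify.

```typescript
// Hypothetical minimal shapes; the real engine types carry far more fields.
interface AgentLike {
  id?: string;
  model?: string;
}
interface GenerateResultLike {
  response?: { modelId?: string };
}

// Old behavior as described: an unset model falls through to the agent id,
// which for unnamed agents is a random UUID.
function legacyModelTag(agent: AgentLike): string {
  return agent.model ?? agent.id ?? "unknown";
}

// Sketch of the new preference order. ASSUMPTION: after modelId we fall back
// to agent.model and then "unknown", never to the agent id.
function resolvedModelTag(agent: AgentLike, result: GenerateResultLike): string {
  return result.response?.modelId ?? agent.model ?? "unknown";
}
```

For an unnamed agent with a UUID id and no model, the legacy chain tags usage with the UUID, while the resolver tags it with whatever model id the provider response reported, which keeps the model tag cardinality bounded by the number of real models.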
Agents & New Tools
0.25.0 turns the agents package into a real toolbox. Five new AI SDK-compatible tool factories ship in packages/agents/src, all built on a provider-boundary pattern so a workflow never hard-codes a single vendor, and the agent runtime itself gets preflight checks, quota-aware suspension, and honest error reporting.

- Grounded multi-provider web search. createGroundedWebSearchToolset({ providers, maxResultsPerProvider }) (exported from @smithers-orchestrator/agents, source at packages/agents/src/web-search/createGroundedWebSearchToolset.js) exposes a single grounded_web_search tool that fans a query across Exa semantic retrieval plus a fresh/SERP provider (Tavily, Brave, or Serper, built via createExaSearchProvider/createTavilySearchProvider/createBraveSearchProvider/createSerperSearchProvider). It throws unless given Exa as the semantic provider plus at least one fresh provider, runs them with Promise.allSettled so one failure does not sink the call, dedupes by normalized URL (hash stripped), and returns numbered citation results with a freshness filter of day/week/month/year (#319).
- Generic HTTP escape-hatch tool. createHttpTool(options) (packages/agents/src/http/createHttpTool.js, re-exported from the smithers-orchestrator facade at packages/smithers/src/index.js) lets an agent call any REST endpoint with no OpenAPI spec. The Zod input accepts method, url, headers, query, body, an optional timeoutMs (wired to an AbortController), and a discriminated auth union keyed on type with bearer, basic, or header variants; non-string JSON bodies auto-set content-type: application/json, and it returns { ok, status, statusText, headers, body } (#318).
- Audio transcription via Whisper or Deepgram. createTranscriptionTool({ provider, apiKey, model?, baseUrl? }) (packages/agents/src/transcription/createTranscriptionTool.js) accepts either an audioUrl or audioBase64 (with mimeType, language, and prompt hints) and normalizes both providers to { text, provider, language?, durationSeconds? }. Whisper defaults to model whisper-1 against https://api.openai.com/v1/audio/transcriptions; Deepgram defaults to nova-3 with smart_format against https://api.deepgram.com/v1/listen (#313). Documented at docs/integrations/sdk-agents.mdx.
- Image generation and document/OCR parsing primitives. createImageGenerationTool(provider, options) (packages/agents/src/image-generation/createImageGenerationTool.js) wraps a pluggable ImageGenerationProvider.generateImage(request) behind a generate_image tool that takes prompt, model, size, count, seed, and style. createDocumentParsingToolset(options) (packages/agents/src/document-parsing/createDocumentParsingToolset.js) exposes a parse_document tool whose source accepts url/base64/text and turns documents into text/markdown via a Firecrawl default provider, with mistral-ocr and llamaparse selectable by string or a custom provider object (#317, #320).
- Gemini CLI agent sunset for Antigravity, plus agent preflight. AntigravityAgent (packages/agents/src/AntigravityAgent.js) is added and exported from the package and the facade; GeminiAgent is marked @deprecated (the export still ships) and the gemini subscription provider is dropped from the accounts provider union. BaseCliAgent.preflight runs launchDiagnostics before the first generation; the engine invokes it once per agent per run (cached via a WeakMap in runAgentPreflightOnce) and a failed check throws a non-retryable AGENT_CONFIG_INVALID error (#443).
- AmpAgent session resume. AmpAgent’s buildCommand (packages/agents/src/AmpAgent.js) now emits amp threads continue <id> at the front of its args when a task passes options.resumeSession or this.opts.resume, and skips the new-thread-only flags --visibility and --archive on resume; the new resume option lands on AmpAgentOptions (#302).
- Quota-aware pause and resume. The engine, scheduler, and agents now classify provider quota/rate-limit errors and park a run in a new waiting-quota status (packages/driver/src/RunStatus.ts) instead of failing it. classifyQuotaError covers eight regex patterns (including retry-after-N-seconds and ordinal reset dates) and emits AGENT_QUOTA_EXCEEDED; the engine persists quotaBlockedCount and resetAtMs into the run’s errorJson, and deriveRunState surfaces a blocked.kind: "quota" reason so the run resumes once quota resets.
- Real provider errors instead of a fake JSON-schema failure. streamResultToGenerateResult.js previously dropped the AI SDK error stream part, letting a masking NoOutputGeneratedError propagate so a bad model id looked like the agent returning invalid JSON for the declared output schema. It now captures the first stream error part (and consumeStream’s onError) and re-throws the genuine provider error, so a 404 bad-model request fails with the real provider APICallError.
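The fan-out and dedupe behavior described for grounded_web_search follows a common pattern that can be sketched without any vendor SDK. Everything named here (SearchHit, SearchProvider, normalizeUrl, dedupeAndNumber, fanOutSearch) is hypothetical and stands in for the real toolset; the only normalization applied is stripping the URL hash, since that is the only rule these notes state.

```typescript
// Hypothetical result and provider shapes for illustration only.
interface SearchHit {
  url: string;
  title: string;
}
type SearchProvider = (query: string) => Promise<SearchHit[]>;

// Strip the fragment so /a#intro and /a#usage count as the same page.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = "";
  return url.toString();
}

// Keep the first hit per normalized URL and number the survivors from 1.
function dedupeAndNumber(hits: SearchHit[]): Array<SearchHit & { citation: number }> {
  const seen = new Set<string>();
  const out: Array<SearchHit & { citation: number }> = [];
  for (const hit of hits) {
    const key = normalizeUrl(hit.url);
    if (seen.has(key)) continue;
    seen.add(key);
    out.push({ ...hit, citation: out.length + 1 });
  }
  return out;
}

// Promise.allSettled lets one rejected provider be skipped rather than
// rejecting the whole search.
async function fanOutSearch(
  providers: SearchProvider[],
  query: string,
): Promise<Array<SearchHit & { citation: number }>> {
  const settled = await Promise.allSettled(providers.map((search) => search(query)));
  const hits = settled.flatMap((result) => (result.status === "fulfilled" ? result.value : []));
  return dedupeAndNumber(hits);
}
```

Using Promise.allSettled rather than Promise.all is the design choice that matters: with Promise.all, a single provider outage would reject the entire tool call.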
Workflows & Components
This release pushes on the orchestration surface: a new composite component for measuring whether a cheaper model is good enough, a set of long-running workflows that prove their own work instead of self-reporting, and a round of correctness fixes across the structural components so authoring mistakes fail loud at compile or graph time rather than silently dropping tasks.
- New <Sidecar> composite for cheap-model shadow scoring. <Sidecar agent={...} sidecar={...} output={...} scorers={...}> renders a <Parallel> (id ${id}-parallel) wrapping two <Task> children over the same prompt: the primary keeps the component id so downstream needs can consume it, while the shadow task runs at ${id}-sidecar with continueOnFail: true and writes its own scorer rows. The companion computeSidecarDelta(rows, { primaryNodeId, sidecarNodeId, scorerId }) reads persisted scorer rows and derives primaryScore, sidecarScore, delta, and a cheaperWins boolean (true when the sidecar score is at least the primary). Both are exported from smithers-orchestrator; see /components/sidecar.
- audit-burndown, a multi-week validation-first workflow that proves its own work. .smithers/workflows/audit-burndown.tsx burns down .smithers/tickets/smithers/*.md one open - [ ] checkbox at a time in isolated <Worktree> nodes under a bounded <Parallel>. The authoritative gate is a deterministic completeness-oracle compute task that, after each batch merges to local main, runs the real pnpm typecheck then pnpm test on the real main checkout (agents cannot fake a spawnSync exit code) and writes a completeness row queryable via bunx smithers-orchestrator node oracle -r <run>. An outer <Loop> stops only when the open count hits zero, the full gate is green, and a push fence confirms no agent moved origin.
- bulletproof-audit and audit-fix-train for production-readiness scoring. .smithers/workflows/bulletproof-audit.tsx runs one read-only Codex (GPT-5.5) auditor per feature group in .smithers/specs/features.ts, scoring each across 10 dimensions (e2e, unit, observability, architecture, JSDoc, docs, durability, type-safety, security, evals), then a deterministic writer emits .smithers/audits/bulletproof-audit.md. .smithers/workflows/audit-fix-train.tsx turns that backlog into landed code: a jj-native plan, implement, review loop per finding, serialized through a <MergeQueue maxConcurrency={1}> onto local main, with a final jj fetch, rebase-onto-origin, and push (the only workflow here that pushes, gated behind a push input).
- coverage-codex-swarm and plan-implement-review-issues join the init pack. .smithers/workflows/coverage-codex-swarm.tsx fans out per-package coverage work in isolated worktrees and now accepts an optional packageInstructions record (z.record(z.string(), z.string())) so callers thread per-package guidance into each prompt without editing the workflow. .smithers/workflows/plan-implement-review-issues.tsx discovers every open GitHub issue, treats each unchecked - [ ] checkbox as a work item, drops items that already have an open PR, groups related items, and opens exactly one deduped PR per group with Closes #N/Relates to #N linkage, with an agent-array rate-limit failover chain on every role (plan, implement, validate, review, PR).
- Interactive run mode replaces the standalone tui command. The tui command is removed; its clack picker plus live status card now runs behind bunx smithers-orchestrator up --interactive and bunx smithers-orchestrator workflow run --interactive. A bare up or workflow run with no workflow arg on a TTY also launches it, while passing a path or ID preselects and skips the picker. Non-TTY --interactive fails with INTERACTIVE_REQUIRES_TTY (exit 4); a missing arg without a TTY fails with WORKFLOW_REQUIRED; -i stays bound to --input.
- Structural components now fail loud instead of silently dropping work. <Branch> resolves its subtree from then/else and never reads children, so <Branch>...</Branch> used to drop those tasks silently; it now throws an INVALID_INPUT SmithersError and types children?: never for a compile-time error. SagaStep is now a real named value export (so import { SagaStep } works, alongside <Saga.Step>). The kanban workflow’s implement prompt and result task now commit work on the worktree branch before merge, fixing a bug where converged work was dropped as already up to date (only ~17% of tickets landed).
- <SuperSmithers> applies edits for real, and composite docs embed verifiable source. <SuperSmithers> previously left its apply path as a no-op compute that returned a literal { applied: true } without writing anything; the non-dry-run path now directs the agent to make file edits on disk with its editing tools, and the compute step is demoted to an honest dependency barrier ahead of the report. Separately, scripts/generate-component-source.mjs (run via pnpm docs:components) injects a tabbed ## Source <CodeGroup> into each composite component doc, delimited by GENERATED:COMPONENT-SOURCE markers and stripped from the llms-*.txt bundles.
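The comparison that computeSidecarDelta performs can be approximated with a few lines over plain rows. This is a sketch under loudly stated assumptions, not the exported helper: the row shape (nodeId, scorerId, score), averaging multiple rows per node, the sign of delta (sidecar minus primary), and the null handling are all hypothetical. Only the cheaperWins rule (sidecar score at least the primary) comes from the notes above.

```typescript
// Hypothetical persisted scorer row; the real schema may differ.
interface ScorerRowLike {
  nodeId: string;
  scorerId: string;
  score: number;
}

interface SidecarDeltaSketch {
  primaryScore: number | null;
  sidecarScore: number | null;
  delta: number | null;
  cheaperWins: boolean;
}

// ASSUMPTION: multiple rows for one node are averaged.
function meanScore(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

function sidecarDeltaSketch(
  rows: ScorerRowLike[],
  opts: { primaryNodeId: string; sidecarNodeId: string; scorerId: string },
): SidecarDeltaSketch {
  const scoresFor = (nodeId: string): number[] =>
    rows.filter((row) => row.scorerId === opts.scorerId && row.nodeId === nodeId).map((row) => row.score);
  const primaryScore = meanScore(scoresFor(opts.primaryNodeId));
  const sidecarScore = meanScore(scoresFor(opts.sidecarNodeId));
  if (primaryScore === null || sidecarScore === null) {
    // ASSUMPTION: with a missing side there is no verdict.
    return { primaryScore, sidecarScore, delta: null, cheaperWins: false };
  }
  return {
    primaryScore,
    sidecarScore,
    delta: sidecarScore - primaryScore, // ASSUMPTION: positive means the sidecar scored higher
    cheaperWins: sidecarScore >= primaryScore, // stated rule: at least the primary
  };
}
```

Under these assumptions, a primary at node research and a shadow at research-sidecar (the ${id}-sidecar convention) with equal scores yields cheaperWins true, because a tie means the cheaper model matched the primary.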
CLI: Interactive Runs, Migrate & Memory
Smithers 0.25.0 reshapes the run verbs around an interactive picker, hardens the SQLite-to-PGlite migration path, and rounds out cross-run memory. The CLI now reads a paused run as a decision point instead of a failure, and tears runs down cleanly on signal.
- --interactive replaces the standalone tui command. bunx smithers-orchestrator up --interactive and bunx smithers-orchestrator workflow run --interactive launch the clack flow (fuzzy workflow picker, input prompts, live status card with inline approval/human gates, gate-then-resume), and a bare up/workflow run with no workflow arg on a TTY launches it too. Passing a workflow path (up) or ID (workflow run) preselects it and skips the picker. Non-TTY --interactive fails with INTERACTIVE_REQUIRES_TTY (exit 4); -i stays bound to --input. The runner lives in apps/cli/src/tui.js (runTuiCommand).
- Approval and human gates resolve inline inside the interactive card. apps/cli/src/tui-gates.js drives handleApprovals and handleHumanRequests via clack select/confirm/text, calling approveNode/denyNode from @smithers-orchestrator/engine/approvals through Effect.runPromise with the smithers:tui source. It re-checks that each approval is still pending before deciding (so out-of-band resolutions are skipped) and leaves the run paused if you cancel the approval prompt.
- smithers migrate copies the legacy bun:sqlite store into PGlite or Postgres. bunx smithers-orchestrator migrate (with --to pglite|postgres, default pglite, and --keepSqlite, default true) imports migrateSmithersStore from smithers-orchestrator, streams per-table [smithers] migrated <table>: <targetRows>/<sourceRows> rows progress to stderr, and writes the migrated.json marker. Read commands in apps/cli/src/find-db.js now run assertSmithersReadBackend before opening the legacy store, which calls resolveSmithersBackendChoice so a pglite/postgres resolution with no marker fails loud instead of silently reading stale SQLite.
- --backend is honored on read commands so the migration remediation actually works. up/gateway/monitor/workflow run register --backend sqlite|pglite|postgres, but read commands (ps, inspect, output, …) did not, so the SMITHERS_MIGRATION_REQUIRED error’s own --backend sqlite advice was rejected as an unknown flag. extractBackendFlag in apps/cli/src/argv-utils.js now lifts --backend <value> (or --backend=value) out of argv on any command and sets SMITHERS_BACKEND so the resolver honors it everywhere.
- bunx smithers-orchestrator memory gains get, set, and rm. The memory group previously only had list; get, set, and rm now wrap the memory store’s getFact/setFact/deleteFact over a parsed namespace plus key (e.g. workflow:my-flow), with set taking an optional --ttl in milliseconds. get prints the fact’s valueJson and reports a missing fact cleanly. Failures surface as MEMORY_GET_FAILED/MEMORY_SET_FAILED/MEMORY_RM_FAILED. Code in apps/cli/src/index.js (#302).
- A paused up now prints decision CTAs instead of generic failure hints. When up exits 3 in awaiting-approval/waiting-event/waiting-timer state, the result block prepends a ‘Run is paused (exit 3 = awaiting a decision, not a failure)’ header plus pauseCtas: approve/deny/why for approvals, signal/why for events, and why for timers, ahead of the usual inspect/logs CTAs.
- SIGINT and SIGTERM now durably cancel the run. setupSqliteCleanup previously registered process.on("SIGINT"/"SIGTERM") handlers that called process.exit(130/143) before the graceful abort, leaving the run status:"running" with a frozen heartbeat until the 30s stale lease (RUN_HEARTBEAT_STALE_MS) expired. Those signal handlers are dropped (process.on("exit", closeSqlite) still cleans up), so the abort path runs and durably writes status:"cancelled", with a second-signal/5s-unref force-exit backstop in setupAbortSignal and the gateway shutdown.
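The --backend lifting described above can be pictured with a short sketch. This is not the code in apps/cli/src/argv-utils.js: the function name extractBackendFlagSketch, the result shape, and the choice to leave an unrecognized value in argv are all assumptions made for illustration.

```typescript
// Hypothetical sketch of lifting `--backend <value>` / `--backend=value` out of argv.
// The real extractBackendFlag may differ in shape and edge-case handling.
type Backend = "sqlite" | "pglite" | "postgres";

const BACKENDS: readonly string[] = ["sqlite", "pglite", "postgres"];

function isBackend(value: string | undefined): value is Backend {
  return value !== undefined && BACKENDS.indexOf(value) !== -1;
}

interface ExtractResult {
  argv: string[]; // argv with the recognized --backend flag removed
  backend?: Backend; // the lifted value, when a valid one was present
}

function extractBackendFlagSketch(argv: readonly string[]): ExtractResult {
  const queue = argv.slice();
  const rest: string[] = [];
  let backend: Backend | undefined;
  const prefix = "--backend=";

  while (queue.length > 0) {
    const arg = queue.shift();
    if (arg === undefined) break;
    if (arg === "--backend" && isBackend(queue[0])) {
      // Space-separated form: consume the flag and its value.
      backend = queue.shift() as Backend;
      continue;
    }
    if (arg.indexOf(prefix) === 0) {
      const value = arg.slice(prefix.length);
      if (isBackend(value)) {
        // Equals form: consume the single token.
        backend = value;
        continue;
      }
    }
    // Anything else (including an unrecognized backend value) is passed through
    // untouched so the command's own parser can report it. This is an assumption.
    rest.push(arg);
  }
  return backend === undefined ? { argv: rest } : { argv: rest, backend };
}
```

A caller would then export the lifted value, for example by assigning it to the SMITHERS_BACKEND environment variable before dispatching the command, which is the step the notes attribute to the real helper.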
Evals & Benchmarks
This release turns the eval corpus into something that generates and repairs itself, and hardens the SWE-EVO benchmark into a full-suite, crash-safe run with honest scoring on a real x86 backend. Both tracks ship as runnable Smithers workflows you can drive end to end.
- The feature-eval-factory workflow authors and optimizes the whole eval corpus. evals/feature-eval-factory.tsx is a durable two-phase Smithers workflow. Phase 1 fans out one authoring agent per FeatureGroup parsed from .smithers/specs/features.ts (a rotated mix of AntigravityAgent, CodexAgent, and ClaudeCodeAgent with array failover, where Antigravity only leads when the agy CLI is on PATH) that writes non-trivial, multi-feature eval tasks to per-group shards under evals/_inventory/generated/<GROUP>.jsonl, then a single merge step folds them into evals/_inventory/curated-tasks.jsonl and regenerates the per-suite cases.jsonl via evals/harness/generate-cases.ts. Run it with bunx smithers-orchestrator up evals/feature-eval-factory.tsx --detach.
- Phase 2 closes the loop by fixing docs until a weak model one-shots the tasks. A <Loop> runs scoped suites against a weak candidate model (default haiku), scores pass-rate plus one-shot-rate, aggregates the worst features and the agents’ own friction themes, and hands them to a fixer agent (Codex-led) that edits the root-cause docs/*.mdx and regenerates the bundles with pnpm docs:llms, iterating toward the targetPass 1.0 / targetOneShot 0.99 goal.
- The feature-coverage corpus now spans every feature group. The factory generated source tasks across the non-empty feature groups in .smithers/specs/features.ts, expanding curated-tasks.jsonl and seeding new authoring and knowledge suites under evals/suites/ (including knowledge-gateway, authoring-workflows, authoring-agents, and authoring-approvals).
- SWE-EVO got a decoupled native-x86 scorer for emulation-incompatible instances. Some instances do not reproduce their gold patch under Docker emulation on Apple Silicon, so examples/swe-evo/score-x86.ts provisions a native x86 Linux + Docker VM on Freestyle, uploads the self-contained harness, and scores candidates (or the gold patch with --gold) one image at a time, pruning between images to respect the 32 GB plan cap. Agents still generate patches on the Mac (where the CLIs and auth live); extract-candidates.ts pulls those diffs out of the run DB for re-scoring. A follow-up fix moved to one fresh VM per instance so a single oversized image can no longer exhaust the rootfs and poison later instances.
- A rate-limit-aware suite supervisor runs the full benchmark to completion. examples/swe-evo/run-suite.ts supervises at the instance level: it recomputes the remaining unscored targets from the DB each round (so it is crash-safe and idempotent), pauses with escalating backoff capped at the Claude 5-hour window on a global block, and drops an instance only after --max-instance-retries (default 3) genuine failures. Flags include --subset, --all, --concurrency, and --workflow panel|baseline.
- A Panel + ReviewLoop orchestration variant measures what multi-agent structure buys. examples/swe-evo/workflow/swe-evo-panel.tsx replaces the flat two-agent baseline with a <Panel> of three parallel planners (Opus + Codex + Gemini) synthesized by an Opus moderator, then a <ReviewLoop> where Codex (gpt-5.5) implements against the plan while Opus and Gemini review until approved. Fairness is preserved: no agent sees the hidden tests or the gold patch.
- Results compare against the official SWE-EVO leaderboard. examples/swe-evo/report.ts computes Resolved Rate and Fix Rate per repo and diffs runs against the public leaderboard in official-results.json (sourced from arXiv:2512.18470 Table 2), and make-showcase.ts builds a self-contained HTML page of per-repo and per-instance coverage across all 48 official instances. The final showcase reports 34/48 scored and 23 resolved (67.6% Resolved Rate over the scored set), with native-x86 scores overriding non-reproducible Mac sentinels for the x86 instances.
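The showcase figures above depend on which denominator is used, so the arithmetic is worth spelling out. The helper below is a hypothetical illustration (formatRate is not a function from examples/swe-evo/report.ts); only the counts 48, 34, and 23 come from the notes.

```typescript
// Hypothetical helper: format a ratio as a percentage with one decimal place.
function formatRate(numerator: number, denominator: number): string {
  if (denominator <= 0) return "n/a";
  return ((numerator / denominator) * 100).toFixed(1) + "%";
}

const officialInstances = 48; // all official SWE-EVO instances
const scoredInstances = 34; // instances with a score in the final showcase
const resolvedInstances = 23; // instances resolved among those scored

// Share of the official suite that was scored: 34 / 48.
const scoredCoverage = formatRate(scoredInstances, officialInstances); // "70.8%"

// Resolved Rate over the scored set, the figure the notes report: 23 / 34.
const resolvedOverScored = formatRate(resolvedInstances, scoredInstances); // "67.6%"
```

Because 14 of the 48 instances are unscored, the 67.6% figure should be read as a rate over the scored set, exactly as the notes state, rather than over the whole suite.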
Reliability & Correctness
Roughly 160 fixes landed this release. The notable, user-impacting ones, grouped by area:
CLI, Review & e2e
A large fix pass this cycle hardened the parts users touch every day: clean shutdown and durable run state in the CLI, scoped action tokens and time-travel tools over MCP, an OIDC verification gap and atomic spend caps in the review proxy, and a rewrite of the fault e2e suite so the tests exercise real backends instead of fabricated stand-ins.
- Runs now mark themselves cancelled on Ctrl-C instead of dangling as running. setupSqliteCleanup in apps/cli/src/index.js had registered SIGINT/SIGTERM handlers that called process.exit before the graceful abort path, so the run’s status:"cancelled" write never happened and the run sat at running with a frozen heartbeat until its lease went stale. Those handlers are dropped (the process exit event still closes SQLite), and setupAbortSignal now aborts gracefully, force-exits on a second signal, and force-exits via a 5s unref’d timer backstop so a hung shutdown never needs kill -9.
- The --backend sqlite migration remediation actually works now. --backend was a registered option only on up/gateway/monitor/workflow, so read commands such as bunx smithers-orchestrator ps and inspect rejected it as an unknown flag, which was the exact remediation the migration hard-fail told users to run. A new extractBackendFlag in apps/cli/src/argv-utils.js lifts --backend <value> (and --backend=value) out of argv for every command outside the NATIVE_BACKEND_COMMANDS set and sets SMITHERS_BACKEND so the store resolver honors it everywhere.
- bunx smithers-orchestrator gateway keeps stdout clean and exits 0 on shutdown. The long-running Gateway only resolved after SIGINT/SIGTERM, at which point c.ok(result) printed a completion descriptor plus a skills CTA to stdout, breaking the empty-stdout contract that autoStartGateway relies on (it polls /health over HTTP and reports status on stderr). The shutdown path now calls process.exit(0) without emitting a trailing result frame.
- Scoped action tokens can be brokered without breaking gateway auth. apps/cli/src/token-store.js gains issueSmithersBrokerToken (issues a brokered handle alongside the main bearer, scoped to a single actionId), resolveSmithersActionTokenFromStore (validates the handle’s scopes/expiry/revocation and returns the raw bearer without storing it at rest), and revokeSmithersToken with a capped audit trail. The store bumps to version 2 with normalizeStore for safe forward/backward reads, and the CLI wires bunx smithers-orchestrator token issue --actionId/--revealToken plus a new token exec subcommand that resolves a handle and injects the bearer into a child process env var (default SMITHERS_API_KEY) (#321).
- The review proxy meters the full SSE body and enforces spend caps atomically. apps/review/src/server/proxy/handleAnthropic.ts dropped a 1 MiB accumulation cap (const cap = 1 << 20) in teeForMetering so large streamed responses are fully read before usage is parsed, and apps/review/src/server/proxy/recordUsage.ts now does a conditional UPDATE sessions SET spent_usd = spent_usd + ? WHERE hash = ? AND spent_usd + ? <= spend_cap_usd before inserting the usage_events row, returning recorded: false when the cap would be exceeded so spend can never be double-counted. A pre-request check in handleAnthropic.ts still rejects with HTTP 402 when the session is already over its cap (#439).
- OIDC verification no longer accepts a mismatched single-key JWKS. apps/review/src/server/sessions/verifyOidc.ts had a single-key fallback (?? (keys.length === 1 ? keys[0] : undefined)) that used a JWKS key even when its kid did not match the token header’s kid, so an attacker who could rotate their own JWKS to contain exactly one key could bypass the kid check. verifyOidc now strictly requires k.kid === header.kid and returns { ok: false, reason: "unknown-key" } when nothing matches (#386).
- Server rerun loads input portably instead of via raw SQL. packages/server/src/gateway.js previously reran with a hardcoded SELECT payload FROM input WHERE run_id = ? against a raw better-sqlite3 client, which broke for schema-backed workflows whose input table name differs and mishandled the payload envelope. It now derives the real table name via resolveSchema from @smithers-orchestrator/engine, fetches the row with loadInput from @smithers-orchestrator/db/snapshot, and a new normalizeRerunInput strips runId and unwraps the payload envelope before starting the new run (#330).
- The full time-travel surface is exposed over the semantic MCP, with typed errors preserved. apps/cli/src/mcp/semantic-tools.js and SemanticToolName.ts now register fork_run, replay_run, rewind_run, restore_checkpoint, list_snapshots, get_timeline, and time_travel (previously only revert_attempt was wired). Separately, the layer stopped re-wrapping typed SmithersErrors as generic INTERNAL_ERROR: executeSemanticTool switched Effect.tryPromise to its two-argument { catch: (error) => error } form so the original error passes through and codes like WORKFLOW_MISSING_DEFAULT, RUN_NOT_FOUND, and CLI_DB_NOT_FOUND reach the caller (#433, #409).
- The gui shortcut routes to the Gateway UI, and replay surfaces VCS restore errors. The gui command and the bunx smithers-orchestrator <dir> shortcut now route to the Gateway UI via runUiCommand, with rewriteGuiShortcutArgv using a statSync.isDirectory guard so only directories trigger the route (#435); and replay now surfaces VCS restore errors through a dedicated apps/cli/src/reportReplayResult.js that prints result.vcsError instead of leaving the user with a replay that silently did nothing (#388).
- Fault e2e cases now drive real backends instead of fabricated stand-ins. The mock-SQLite and hand-rolled WebSocket fakes were replaced with real engine/Gateway paths: case03 starts a real Approval run and submits via the /v1/rpc/submitApproval HTTP endpoint (#412), case14/case17 boot a real Gateway for RPC round-trips and bad-HMAC webhook signatures, dropping ~1256 lines of mock infrastructure (#415), and case12 (rewind/VCS) and case26 (diff review) were rewritten through the real product path (#416, #417). case14 also learned to wait for the node to park at waiting-approval (not just the approval row flipping to requested) to avoid a 500 race on submit.
- Smaller fixes. Agent detection walks PATH with accessSync(join(entry, binary), constants.X_OK) instead of shelling out to /bin/bash -c "command -v" (#444); the cron scheduler spawns the real entrypoint via process.execPath and an import.meta.url-resolved absolute path rather than bun run src/index.js (#422); snapshots/timeline JSON output is emitted through writeStdoutSync (so a piped reader never truncates it) and wrapped in a { timeline:... } envelope, with a -j alias for --json (#427); the TUI re-polls pending human requests when only an approval is visible so a HumanTask routes to its real question; OpenAPI tool generation imports createOpenApiToolsSync from the public smithers-orchestrator/openapi entry (#441); apps/review/src/github/runGh.ts was made CI-deterministic with a checked-in executable apps/review/tests/github/fixtures/fake-gh, env: process.env propagation, node:child_process spawnSync, and a SMITHERS_GH_BIN override; and the workflow-ui e2e resolves Playwright via require("playwright") from apps/cli’s own devDependency rather than the retired studio-2 POC.
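The OIDC fix above comes down to how a verification key is chosen from a JWKS. A minimal sketch of the strict rule, assuming a simplified key shape; selectVerificationKey, Jwk, and KeyLookup are hypothetical names and this is not the verifyOidc implementation:

```typescript
// Hypothetical, simplified JWK shape for illustration.
interface Jwk {
  kid?: string;
  kty: string;
}

type KeyLookup =
  | { ok: true; key: Jwk }
  | { ok: false; reason: "unknown-key" };

// Strict selection: a key is usable only when its kid equals the token header's kid.
// There is deliberately no "the JWKS has exactly one key, so use it" fallback.
function selectVerificationKey(keys: readonly Jwk[], headerKid: string): KeyLookup {
  for (const k of keys) {
    if (k.kid === headerKid) {
      return { ok: true, key: k };
    }
  }
  return { ok: false, reason: "unknown-key" };
}

// A single-key JWKS whose kid does not match is rejected rather than trusted.
const mismatched = selectVerificationKey([{ kid: "rotated-key", kty: "RSA" }], "expected-key");
```

Under the old fallback the mismatched lookup would have returned the lone key; with strict matching it yields the unknown-key result, so signature verification never runs against a key the token did not name.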
Gateway, DB, OpenAPI & Control-Plane
0.25.0 lands 30-plus fixes across the durable control plane. The headline repairs are data-integrity ones in packages/db and the cloud-sync path, plus trust-boundary hardening in the generated OpenAPI HTTP tools.
- Closed a concurrent sequence-allocation race in the event/signal log on the Postgres and pglite backends. The non-bun:sqlite fallback in packages/db/src/adapter.js (insertEventWithNextSeq/insertSignalWithNextSeq) read MAX(seq) then insertIgnore’d at lastSeq+1 with no serialization, so two concurrent writers collided on the (run_id, seq) primary key and insertIgnore silently dropped the loser, losing an event or signal from the ordering backbone that deterministic replay, live-stream tailing, and reconnect-after-seq all depend on. The fallback now acquires the same per-run transaction turn the bun:sqlite path uses, making read-MAX-then-insert atomic; new tests assert 50 concurrent event inserts and 40 concurrent signal inserts stay gapless with no drops or dups.
- User output schemas can no longer collide with reserved key columns. Output tables reserve run_id/node_id/iteration and input tables reserve run_id, but schema fields were appended to the same namespace unchecked, so a field named nodeId, runId, or iteration either crashed DDL with a raw “duplicate column name” (SQL path) or silently overwrote the internal NOT NULL key column and corrupted the composite primary key (drizzle path). A new context-aware assertNoReservedColumns guard (packages/db/src/assertNoReservedColumns.js), wired into zodToTable and zodToCreateTableSQL, now fails fast with an INVALID_INPUT error naming the offending field and suggesting a rename.
- Electric live deletes are now applied instead of being silently dropped. An Electric delete change message carries only the primary-key columns ({ namespace, key }), never value_json, but createElectricCollection in packages/gateway-client routed the delete value through mapRow, which requires the non-PK columns and returned undefined, so every live delete was dropped and the row was stranded in the collection forever (the cloud live tail never removed a Postgres-deleted fact). A new optional getKeyFromRaw on the collection def derives the delete key directly from its PK columns; memoryFacts supplies one over namespace/key. A companion fix reconciles the Electric snapshot on snapshot-end rather than on every up-to-date frame.
- OpenAPI generated tools no longer let the model overwrite injected auth, and non-2xx responses now surface as errors. Two trust and correctness bugs in the HTTP tool executor (packages/openapi/src/tool-factory/_helpers.js): an LLM-supplied header parameter could override the integrator’s injected auth header (the model could overwrite Authorization), and a non-2xx response was returned as a success result. Injected auth headers are now spread after the model’s header params so the operator secret always wins, and a non-2xx response throws an error carrying status and the response body (the error never embeds the RequestInit, so the injected header cannot leak).
- Several more OpenAPI parsing and execution fixes. buildUrl now uses replaceAll so a path template with a repeated param (e.g. /orgs/{id}/aliases/{id}) substitutes every occurrence (#429); non-JSON request bodies are supported in buildOperationSchema and the executor (#431); request-body argument-name collisions with other params are resolved via getRequestBodyArgName in both schema and executor (#440); jsonSchemaToZod now enforces numeric and typeless enums (#408) and validates typed additionalProperties via catchall (#420); and the curated-tools factory rejects duplicate tool names with a descriptive error instead of silently overwriting entries (#315).
- Control-plane validation now rejects junk instead of storing it. setUsageLimit and checkUsageLimit in packages/control-plane/src/index.js previously accepted any non-empty period string; a new USAGE_LIMIT_PERIODS map (daily/weekly/monthly to ms) and usageLimitPeriod validator throw INVALID_INPUT for unknown values, and checkUsageLimit now derives the window from the period (#397). Duplicate slugs map to a typed SmithersError (#393), usage and audit events validate their project (#394), and malformed metadata_json is tolerated with a warn rather than crashing reads (#328).
- Usage tracking no longer reports impossible numbers or hits the network with dead tokens. parseAnthropicRateLimitHeaders and parseOpenAiRateLimitHeaders clamp the computed used count with Math.max(0, limit - remaining) so a burst window resetting mid-flight (remaining > limit) can no longer produce negative usage (#434), and claudeOauthUsage now checks creds.expiresAt and returns a descriptive error before sending a request that would 401 with an expired token (#390).
- Gateway run-tree traversal is cycle-safe and prompts resolve from the workspace root. flattenGatewayRunNode (in gateway-client) now tracks visited node IDs in a Set and dedupes child IDs, and buildGatewayRunTree was extracted (in gateway-react) with cycle-safe recursion, guarding against infinite traversal on cyclic or duplicated node references (#436). Separately, the gateway resolves .smithers/prompts from a registered GatewayOptions.workspaceRoot (stored as this.workspaceRoot on the gateway and used by the listPrompts RPC) rather than process.cwd, so launch modes whose cwd differs from the workspace return the right prompts; and the prompts collection is keyed by entryFile so foo.md and foo.mdx no longer collapse to one id and drop a prompt.
- Gateway React and client streaming reconnection is more robust. useGatewayExtensionStream clears error once a frame arrives after a failed attempt and makes its reconnect backoff abort-aware (via an AbortController) so an unmount during backoff resolves immediately; the gateway-client gap-resync check now reads the canonical top-level frame.event (not frame.payload.event) so bare replay frames are detected correctly (#419); a failed WebSocket open closes the socket before rejecting (#387); and useGatewayApprovals/useGatewayRuns/useGatewayWorkflows include params in their refetch dependencies so a param change actually refetches (#392).
- Smaller fixes. A new waiting-event run state in deriveRunState, backed by approval-decided-resume-required and external-trigger variants on ReasonBlocked, disambiguates runs parked on an awaited event from generic blocked states (#410); the Zod-to-DDL generator maps float numbers to REAL columns instead of forcing INTEGER/TEXT (#312); the gateway RPC-contract test now types its expectedScopes map as GatewayScope so the scope assertions typecheck; eight duplicate isUniqueConstraintError/throwDuplicateSlugError definitions (four copies of each) were collapsed to one of each in the control plane; and the @tanstack/db version is pinned (exact 0.6.8) for gateway-client.
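The OpenAPI trust-boundary fix above rests on two small rules: operator-injected auth headers are merged last, and a non-2xx status becomes a thrown error carrying the status and body. The sketch below illustrates both under stated assumptions. mergeHeaders, assertOk, and HttpToolError are hypothetical names, and lower-casing header names before merging is an addition of this sketch (it makes Authorization and authorization collide as intended); the notes only say the injected headers are spread after the model's.

```typescript
type HeaderMap = Record<string, string>;

// Assumption of this sketch: normalize names so differently-cased duplicates collide.
function lowerCaseKeys(headers: HeaderMap): HeaderMap {
  const out: HeaderMap = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    if (value !== undefined) out[name.toLowerCase()] = value;
  }
  return out;
}

// Model-supplied headers first, operator-injected auth last, so the operator secret wins.
function mergeHeaders(modelHeaders: HeaderMap, injectedAuth: HeaderMap): HeaderMap {
  return { ...lowerCaseKeys(modelHeaders), ...lowerCaseKeys(injectedAuth) };
}

// The error carries only the status and response body, never the request options,
// so an injected header cannot leak through the error.
class HttpToolError extends Error {
  readonly status: number;
  readonly body: string;
  constructor(status: number, body: string) {
    super("HTTP " + status);
    this.name = "HttpToolError";
    this.status = status;
    this.body = body;
  }
}

function assertOk(status: number, body: string): string {
  if (status < 200 || status >= 300) throw new HttpToolError(status, body);
  return body;
}

const merged = mergeHeaders(
  { Authorization: "Bearer model-supplied", "X-Trace": "abc" },
  { authorization: "Bearer operator-secret" },
);
// merged.authorization is the operator's value; the model's attempt is overwritten.
```

Spreading in the opposite order is the bug the notes describe: whichever object is spread last wins, so the order of the two spreads is the whole trust boundary.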
Agents, Scorers, Engine & Components
The 0.25.0 cycle hardened the durable spine. This section curates the fixes that change correctness, data integrity, security, and debuggability across agents, scorers, the engine and scheduler, the react-reconciler, memory, and components.
- Recording an error no longer corrupts the durable log. The engine writes failed tasks with JSON.stringify(errorToJson(error)) on its durable path, but errorToJson copied cause/details/context by raw reference, so a circular cause or a BigInt detail threw inside the error-recording code. errorToJson in packages/errors/src/errorToJson.js now runs its output through a cycle-aware sanitizer (BigInt to string, non-finite numbers to null, functions/symbols/undefined dropped, throwing getters skipped, cycles broken with a [Circular] sentinel, nested Errors serialized to {name, message, stack,...}). toTaggedErrorPayload.js coerces TaskTimeout/TaskHeartbeatTimeout numerics (attempt, timeoutMs, iteration, etc.) to a finite fallback so the attempt/timeout values the retry/backoff logic reads back after a JSON round-trip are never NaN.
- Scorer aggregation closed a SQL injection vector. aggregateScores in packages/scorers/src/aggregate.js interpolated filter values (runId, nodeId, scorerId) into SQL via an escapeSql helper. It now uses an addFilter helper that pushes column = ? placeholders and collects a params array (run_id, node_id, scorer_id). rawQuery in packages/db/src/adapter.js was extended to forward params to queryAllRaw (Postgres) and stmt.all(…params) (SQLite).
- Scores are clamped, validated, and replay-deterministic. packages/scorers/src/run-scorers.js had three durability bugs. NaN/Infinity/out-of-range scores poisoned DB aggregates and are now clamped to [0,1] and validated. A missing or non-numeric score silently dropped the row while the event log said finished, and it now throws SCORER_FAILED with no split-brain. Ratio sampling used unseeded Math.random so replay or fork could flip skip versus run, and it now derives the decision from a SHA-256 over the durable identity (runId, nodeId, iteration, attempt, scorerId).
- Scorer events now flow through the durable stream. The runner emitted ScorerStarted/Finished/Failed via the bare EventEmitter.emit, which notifies only live in-process listeners and skips the persist hop, so scorer events were invisible to bunx smithers-orchestrator events, logs, tui, and the gateway even though scores landed in _smithers_scorers. They now route through emitEventWithPersist, falling back to a bare emit for a third-party bus that exposes only emit.
- The scheduler tolerates stale completions for tasks that left the graph. taskCompleted/taskFailed in packages/scheduler/src/makeWorkflowSession.js failed the whole run when a completion arrived for a conditionally-rendered task whose parent re-rendered it out while it was still running, which discarded every other in-flight item in a fan-out run. A completion for a task no longer in the graph is now a stale no-op (record the output and re-decide on the current graph); a stale failure is ignored and re-decided.
- render rejects on uncaught errors and preserves all top-level children. In packages/react-reconciler/src/reconciler.js an uncaught throw during render rethrew out-of-band (via defaultOnUncaughtError) while render resolved with a stale partial graph, so callers got a corrupt result. render now captures the error synchronously through a custom onUncaughtError callback and rejects, clearing the field so the instance is not poisoned for the next render. Separately, appendChildToContainer/insertInContainerBefore/removeChildFromContainer overwrote a single container.root, silently dropping all but the last child of a multi-root Fragment; the container now keeps an ordered roots array (HostContainer.ts) and derives container.root from it, wrapping multiple roots in a synthetic smithers:fragment.
- A bad model id surfaces the real provider error instead of a JSON-schema error. When a provider rejects a request (for example a 404 for a bad model id) the AI SDK emits an error stream part and rejects the derived promises with a generic NoOutputGeneratedError. streamResultToGenerateResult in packages/agents/src/streamResultToGenerateResult.js dropped that part, so the engine misclassified it as the agent did not return valid JSON for the declared output schema. It now captures the first stream error part (and consumeStream’s onError) and re-throws the genuine provider error, so a bad model id fails with the real 404/APICallError.
- Memory summarization is idempotent and lossless, and token limiting actually runs. saveMessage in packages/memory/src/store/MemoryStoreLive.js was a bare insert that crashed on a UNIQUE/PK conflict during replay or resume; it now upserts on id via onConflictDoUpdate. The Summarizer (packages/memory/src/Summarizer.js) deleted old messages before saving the summary, so a failed save lost them; it now saves the summary first and deletes after, so the state where the old messages are gone and no summary exists is never reachable. The TokenLimiter and Summarizer processors, previously no-op placeholders, are now implemented atop new listThreadsEffect/deleteMessagesEffect store effects.
- Token-usage events attribute the real model id. TokenUsageReported in packages/engine/src/engine.js fell back to effectiveAgent.model, which is often unset for SDK agents and resolves to a random-UUID id for CLI agents, breaking per-model cost attribution and exploding the metric model tag cardinality. It now prefers the authoritative result.response.modelId and refreshes the agent span tag (attemptMeta.agentModel) from the same resolved id.
- Smaller fixes. <Branch> now throws INVALID_INPUT (and types children?: never in BranchProps.ts) when JSX children are passed instead of silently dropping them (packages/components/src/components/Branch.js); SagaStep is exported as a value from Saga.js so import { SagaStep } works; SuperSmithers uses stable output keys (super-smithers-read/propose/apply) and its apply path now instructs the agent to edit files on disk rather than only describe the changes; the scheduler decide depth guard (guarded at 10) returns a Failed result with SCHEDULER_ERROR instead of a silent Wait (#385); the engine stops advancing EventBus seq for DB-assigned events (#398); RESUME_METADATA_MISMATCH gained actionable remediation that prints the real bunx smithers-orchestrator fork <workflow> --run-id RUN_ID --frame <n> shape (or bunx smithers-orchestrator up <workflow> to start fresh); and observability’s withCorrelationContext is documented to run via Effect.runPromise/runFork (never runSync) to avoid wedging AsyncLocalStorage.
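The scorer-aggregation fix in this section replaces string interpolation with placeholders plus a params array. A minimal sketch of that pattern follows; buildScoreQuery, the result shape, and the score column in the SELECT are assumptions for illustration, while the run_id/node_id/scorer_id filter columns and the _smithers_scorers table name come from the notes.

```typescript
interface ScoreFilter {
  runId?: string;
  nodeId?: string;
  scorerId?: string;
}

interface BuiltQuery {
  sql: string; // contains only "?" placeholders, never filter values
  params: string[]; // values in the same order as the placeholders
}

function buildScoreQuery(filter: ScoreFilter): BuiltQuery {
  const clauses: string[] = [];
  const params: string[] = [];

  // Column names are fixed literals chosen by the code; only values come from callers.
  const addFilter = (column: string, value: string | undefined): void => {
    if (value === undefined) return;
    clauses.push(column + " = ?");
    params.push(value);
  };

  addFilter("run_id", filter.runId);
  addFilter("node_id", filter.nodeId);
  addFilter("scorer_id", filter.scorerId);

  const where = clauses.length > 0 ? " WHERE " + clauses.join(" AND ") : "";
  // The selected columns are hypothetical; the real aggregate query may differ.
  const sql =
    "SELECT scorer_id, AVG(score) AS mean FROM _smithers_scorers" +
    where +
    " GROUP BY scorer_id";
  return { sql, params };
}

// A hostile value stays in params and never becomes part of the SQL text.
const built = buildScoreQuery({ runId: "r1", scorerId: "x' OR '1'='1" });
```

The resulting sql and params would then be handed to a driver call that binds parameters, which is the role the notes give to the extended rawQuery.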
Testing & Quality
This release pushed hard on the repo’s no-mocks, real-backend testing philosophy: fault cases that used to seed raw SQL rows or hand-roll WebSocket fakes now boot a real engine and a real Gateway, the exactly-once SIGKILL-survival guarantee is finally proven across processes, and a broad unit-coverage sweep plus new CI gates lock in regressions that were previously discoverable only one slow round-trip at a time.
- Real-engine SIGKILL/resume durability is now proven across processes. e2e/faults/case31-real-engine-kill-resume.test.ts spawns the engine in a separate OS process (e2e/harness/engineChildRunner.ts) against an on-disk sqlite DB, polls a sidecar marker to time the kill, SIGKILLs the real engine pid mid-node, then resumes the same run from the same DB in a fresh process. It asserts the exactly-once guarantee: the run reaches finished, each node commits exactly one output row, a node that committed before the kill is not re-executed, and the interrupted node re-runs and completes. This is the first time the #1 durability guarantee is verified against the real engine across processes (every prior crash case01-06 only simulated it with a stale heartbeat in an in-memory table), and a manually injected regression that deletes a committed checkpoint makes the test fail as designed.
- The gateway, approval, rewind, and diff-review fault cases were rewritten to drive the real product path. case14 and case17 dropped 1,256 lines of hand-rolled WebSocket/SQLite fakes to spin up a real Gateway and drive RPC round-trips (launchRun then submitApproval, plus a viewer-scope rejection) and bad-signature webhook payloads through the actual HTTP stack (#415). case03 replaced raw-SQL row seeding with createSmithers + runWorkflow, then submits the decision via the real /v1/rpc/submitApproval HTTP endpoint (#412); case12 drives the real rewind path via the time-travel jumpToFrame and rewindAudit APIs (#416) and case26 the real diff-review path via createSmithers + executeSandbox + getNodeDiffRoute (#417). These complete the no-mocks remediation in the fault-injection e2e matrix (epic 0022).
- Workspace coverage and new gate checks are wired into CI. scripts/coverage.mjs (run via pnpm coverage) produces a workspace coverage report uploaded as the coverage-workspace artifact in .github/workflows/ci.yml. The root test script now runs check-single-effect-version, check-dependency-boundaries, check-docs, check-llms, and the new check-smithers-test-script gate (scripts/check-smithers-test-script.mjs, which asserts every workspace that has runtime test files declares a test script) before pnpm -r --no-bail test, so a single CI run reports every failing package at once instead of bailing at the first.
- The dependency-boundary check now actually scans the e2e workspace. scripts/check-dependency-boundaries.mjs filesForPackage only looked under src/, so it scanned zero files for e2e (whose sources live in faults/, exports/, harness/). It now falls back to the package root when src/ is absent, which surfaced real undeclared imports (react, zod, smithers-orchestrator, @smithers-orchestrator/time-travel) that were only resolving via hoisting; those are now declared in e2e/package.json. The check passes for 38 packages and completes audit epics 0047 and 0052.
- A broad unit-coverage sweep closed branch-level blind spots across the published packages. New suites cover the engine, gateway, server, driver, scheduler, vcs, usage, sandbox, agents, review, observability, time-travel, gateway-client, gateway-react, and pi-plugin, targeting workflow-session decide and deadlock branches (packages/scheduler/tests/workflowSession-decision-depth.test.js and workflowSession-service-branches.test.js), agent stream-json interpreters (Antigravity, Gemini, Hermes, Vibe), and OTLP severity/correlation edge cases (apps/observability/tests/otel-severity.test.js and correlation.test.js). The gateway RPC scope contract test now pins getSchemaSignature and listDocs as run:read in packages/gateway/tests/rpc-contract.test.ts so new RPC methods cannot ship without a declared scope.
- A z.number fractional round-trip regression is now locked against a real column. Following the #312 fix that maps plain numbers to a SQLite REAL column (so 0.95 is no longer truncated to 0 by INTEGER affinity), packages/db/tests/db-output-roundtrip.test.js asserts getSQLType returns real and that 0.95 and 0.0123 round-trip losslessly through a real insert+select, so the fix cannot silently regress and users no longer need the z.string workaround.
- Real-CLI e2e tests were brought onto a single, honest skip convention. The agents e2e suites dropped the undocumented SMITHERS_REAL_CLI_E2E=1 opt-in guard, matching opencode-e2e and vibe-agent-e2e: they now skip only when the agent binary is absent or lacks required flags. Skipped e2e fault cases were promoted where possible and every remaining skip was given a tracking link, and the chat-create e2e was made to run hermetically by honoring OPENAI_BASE_URL in the OpenAI diagnostics path (packages/agents/src/diagnostics/getDiagnosticStrategy.js).