0.23.0
This release takes Smithers’ durable engine to a production-grade database, ships two brand-new products, and grows the agent and benchmark surface. The persistence layer now runs unchanged on PostgreSQL or an embedded PGlite through a SQL dialect seam, so the same crash-and-resume guarantees that worked on SQLite hold on Postgres. Alongside it: Smithers, a streaming Cerebras chat PWA deployable to Cloudflare in one command; UltraGrill, an open-ended real-time collaboration workflow with a live custom UI; a smithers usage command that reports consumed quota per account across every provider; four new benchmark harnesses (SWE-Bench Pro, SWE-EVO, Claw-Eval-Live, RoadmapBench) plus a defending-code vulnerability-discovery example; a Vibe (Mistral) agent and a corrected Antigravity CLI; new Studio 2 chat overlay surfaces; an interactive smithers init ceremony; and a launch article positioning Smithers as the open, durable orchestration layer. A correctness sweep hardens JSON extraction, DevTools stream recovery, agent schemas, and the test gate.
PostgreSQL & PGlite Persistence
The headline of 0.23.0 is a PostgreSQL and PGlite dialect for the persistence layer (packages/db/src/dialect.js). Smithers’ storage layer is hand-written SQL, and a new dialect seam lets that exact SQL run on SQLite or Postgres without rewriting a query. The dialect handles placeholder translation (? to $1), DDL type and autoincrement mapping (INTEGER to BIGINT, BLOB to BYTEA, AUTOINCREMENT to BIGSERIAL), information_schema introspection, BEGIN vs BEGIN IMMEDIATE transaction semantics, and json_extract to ->> rewriting.
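To make the placeholder and DDL translation concrete, here is a minimal sketch of what such a seam could do. It is not the code in packages/db/src/dialect.js: the function names toPostgresPlaceholders and toPostgresDdl are hypothetical, and the sketch covers only ? placeholders and the three type mappings named above.

```typescript
// Hypothetical sketch of a dialect seam; names and scope are assumptions.

/** Rewrite SQLite-style `?` placeholders to Postgres-style `$1, $2, ...`,
 *  leaving question marks inside single-quoted string literals untouched. */
function toPostgresPlaceholders(sql: string): string {
  let out = "";
  let index = 0;
  let inString = false;
  for (let i = 0; i < sql.length; i++) {
    const ch = sql[i];
    if (ch === "'") {
      // Inside a literal, a doubled quote ('') is an escaped quote, not a terminator.
      if (inString && sql[i + 1] === "'") {
        out += "''";
        i++;
        continue;
      }
      inString = !inString;
      out += ch;
    } else if (ch === "?" && !inString) {
      index += 1;
      out += `$${index}`;
    } else {
      out += ch;
    }
  }
  return out;
}

// Column types that map one-to-one between the two dialects.
const TYPE_MAP: Record<string, string> = { INTEGER: "BIGINT", BLOB: "BYTEA" };

/** Map SQLite DDL types to Postgres equivalents. AUTOINCREMENT is handled
 *  first because it changes the column type to BIGSERIAL. */
function toPostgresDdl(ddl: string): string {
  return ddl
    .replace(/\bINTEGER\s+PRIMARY\s+KEY\s+AUTOINCREMENT\b/gi, "BIGSERIAL PRIMARY KEY")
    .replace(/\b(INTEGER|BLOB)\b/gi, (match) => TYPE_MAP[match.toUpperCase()] ?? match);
}
```

With this sketch, a query such as SELECT * FROM runs WHERE id = ? AND status = ? (the runs table is made up for illustration) becomes SELECT * FROM runs WHERE id = $1 AND status = $2, while a ? inside a quoted literal is left alone.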

- createSmithersPostgres boots node-postgres or an embedded PGlite. A new API stands up the engine against a real Postgres server or an in-process PGlite over a local socket, so you can develop against an embedded database and deploy against managed Postgres with the same code. pg and pglite are optional dependencies, and the synchronous bun:sqlite path is untouched, so existing SQLite users see no change.
- The dialect is exercised end to end on real PGlite. New test suites boot an in-process PGlite and run the db dialect, the engine, and the time-travel fork/snapshot paths against it (packages/db/tests/db-postgres-dialect.test.js, packages/engine/tests/create-smithers-postgres.test.jsx, effect-builder-postgres.test.js, time-travel-postgres.test.js), so the snapshot, replay, and branch-listing Effects are verified on the new dialect rather than only on SQLite.
Smithers: a Cerebras Chat PWA
This release introduces Smithers (apps/smithers), a streaming chat PWA powered by Cerebras gpt-oss-120b through TanStack AI and deployable to Cloudflare via Alchemy infrastructure-as-code. The browser streams replies from a Cloudflare Worker over Server-Sent Events and never holds the API key; the Worker runs Cerebras server-side with the key bound as a Worker secret. bun dev at the repo root now boots this app (the old Studio 2 dev script moved to bun dev:studio).

- Streaming chat with a command menu, voice dictation, and a grill-workflow graph. The frontend renders streaming Markdown replies, a Cmd-K command menu, toast notifications, voice dictation, and an “ask me” grill workflow graph drawn with ReactFlow. The chat backend is a single Cloudflare Worker route (POST /api/chat) that runs Cerebras through TanStack AI’s chat() and returns a Server-Sent-Events response.
- Flash-free dark mode, tokenized end to end. Every color in the app is a semantic CSS custom property defined in :root, and dark mode ships two ways: the OS preference via @media (prefers-color-scheme: dark) and an explicit [data-theme] override. An inline script resolves the theme before first paint so there is no flash, color-scheme keeps native controls and scrollbars matched, and the workflow graph follows via ReactFlow’s colorMode.
- Renamed and rebranded. The app moved from apps/search-pwa to apps/smithers (package @smithers-orchestrator/smithers), and every remaining “Huey” string was rebranded to “Smithers” across UI text, the manifest, the Ask Me system prompt, the Alchemy app id, and the localStorage key prefix. The Alchemy deploy ships the Worker plus the built PWA as static assets with SPA routing.
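The flash-free behavior comes down to deciding the theme before the first paint. A minimal sketch of that decision follows, assuming an explicit choice is persisted as a plain string; the source does not say where the override is stored, and the function name resolveTheme is hypothetical.

```typescript
type Theme = "light" | "dark";

/** Decide the theme to apply: an explicit, valid override wins;
 *  otherwise fall back to the OS preference.
 *  `stored` is whatever was persisted earlier (assumed to be a string or null). */
function resolveTheme(stored: string | null, osPrefersDark: boolean): Theme {
  if (stored === "light" || stored === "dark") return stored;
  return osPrefersDark ? "dark" : "light";
}
```

In a browser, an inline head script could call this with the persisted value and window.matchMedia("(prefers-color-scheme: dark)").matches, then write the result to document.documentElement.dataset.theme so the [data-theme] selector already matches when the first frame is painted.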
UltraGrill: Real-Time Collaboration
UltraGrill is a new durable, open-ended collaboration product built on real Smithers primitives and tested end to end. The workflow (workflows/ultragrill.tsx) runs two concurrent planes in one never-ending run: an intake <Loop> of <WaitForEvent event="utterance"> that the UI feeds via signals, and dynamic worker dispatch that spawns one <Task> per directive, carries out the work, keeps a living Markdown spec in sync, and emits rolling clarifying questions. The run stays open until an end utterance arrives.
- A live custom UI over real gateway hooks. ui/ultragrill.tsx is a first-class Gateway UI: a composer that turns text into an utterance signal, a conversation feed with live worker activity, the living-spec Markdown panel, and the question pool, all driven by the real gateway-react run/event/node-output hooks over live RPC. The dynamic gateway.ts discovers every .smithers/workflows/*.tsx and mounts ui/<key>.tsx by convention, isolating a broken workflow so it disables only itself.
- One-command launcher. bun .smithers/scripts/ultragrill.ts ["goal"] boots the gateway, starts an open-ended session, opens the UI in the browser, and stays alive until Ctrl-C.
- Composed from reusable components, verified by a real e2e. UltraGrill is one workflow assembled from two components: VerifiableGoals turns a proposal into independently shippable tickets on disk, and ShipTickets discovers the queue and ships each ticket through research, plan, implement, validate, and review in its own git worktree before committing and merging. The e2e executes the real workflow to completion (the repo’s fake claude agent is the only stub) and then drives a headless browser to assert the real UI renders that real completed run. The generic decompose-to-ship pipeline was renamed to ship-pipeline to free the ultragrill name for the product.
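The by-convention discovery and per-workflow isolation described above can be pictured as follows. This is a hypothetical sketch, not the real gateway.ts: the mountWorkflows name, the result shapes, and the injected synchronous load function are all assumptions made so the example stays self-contained.

```typescript
interface MountedWorkflow<T> {
  key: string;     // file name without the .tsx extension
  uiPath: string;  // conventional UI module for this workflow
  workflow: T;
}

interface DisabledWorkflow {
  key: string;
  reason: string;
}

/** Load every workflow file, pairing each with ui/<key>.tsx by convention.
 *  A workflow whose load throws is recorded as disabled; the others still mount. */
function mountWorkflows<T>(
  workflowPaths: string[],
  load: (path: string) => T, // assumed loader; a real gateway may load asynchronously
): { mounted: MountedWorkflow<T>[]; disabled: DisabledWorkflow[] } {
  const mounted: MountedWorkflow<T>[] = [];
  const disabled: DisabledWorkflow[] = [];
  for (const path of workflowPaths) {
    const file = path.split("/").pop() ?? path;
    const key = file.replace(/\.tsx$/, "");
    try {
      mounted.push({ key, uiPath: `ui/${key}.tsx`, workflow: load(path) });
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      disabled.push({ key, reason });
    }
  }
  return { mounted, disabled };
}
```

The try/catch sits inside the loop on purpose: one failing file ends up in the disabled list with its reason, and iteration continues for the remaining workflows.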
smithers usage: Per-Account Quota
A new smithers usage command and a packages/usage engine report how much rate limit or subscription quota each registered account has consumed, normalized to one UsageReport / UsageWindow model. The command supports --account and --provider filters and --fresh to bypass the short usage cache, renders a human-readable table to stderr, and emits a structured envelope to stdout with --format json, matching smithers agents list.
- Real adapters per provider, credentials never leave the host. getAccountUsage dispatches per provider: claude-code reads the OAuth usage endpoint, codex reads wham/usage, and anthropic-api and openai-api read live rate-limit headers. Credentials are read host-side from each account’s config directory (Claude .credentials.json plus the macOS Keychain, Codex auth.json plus a JWT account id) and never leave the process. Google providers and Kimi report source: "none" honestly until local-estimate accounting lands. getUsageForAccounts fans out in parallel through an on-disk cache with a hard 180s floor for claude-code, whose usage endpoint rate-limits aggressively.
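The cache floor can be illustrated with a small freshness check. The sketch below is hypothetical and is not taken from packages/usage: the names effectiveTtlMs and isCacheFresh are invented, and it assumes the 180s floor applies even when the caller requests a shorter lifetime, which is one reading of "hard floor".

```typescript
// Per-provider minimum cache lifetime in milliseconds.
// Only the claude-code value (180s) comes from the release notes.
const MIN_TTL_MS: Record<string, number> = { "claude-code": 180_000 };

/** The TTL actually used: the requested TTL, raised to the provider's floor if any. */
function effectiveTtlMs(provider: string, requestedTtlMs: number): number {
  return Math.max(requestedTtlMs, MIN_TTL_MS[provider] ?? 0);
}

/** True when a cached report fetched at `fetchedAtMs` may still be served at `nowMs`. */
function isCacheFresh(
  provider: string,
  fetchedAtMs: number,
  nowMs: number,
  requestedTtlMs: number,
): boolean {
  return nowMs - fetchedAtMs < effectiveTtlMs(provider, requestedTtlMs);
}
```

Under this assumption, a request with a TTL of zero still reuses a claude-code entry younger than 180 seconds, while providers without a floor are refetched immediately.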
Benchmarks & Examples
This release adds four self-contained benchmark harnesses and a vulnerability-discovery example, each driving real Smithers workflows.
- SWE-Bench Pro (benchmarks/swe-bench-pro) runs ScaleAI’s SWE-Bench Pro end to end via a Smithers workflow where Opus 4.8 implements and Codex reviews, scoring against ScaleAI’s Docker images with gold and empty-patch integrity gating. A new Benchmarks section was added to the root README.
- SWE-EVO (examples/swe-evo) is a benchmark harness with a dataset loader, a Python scoring harness, the workflow plus prompts, and a gold verifier, also pairing Opus 4.8 with Codex through the gateway.
- Claw-Eval-Live (benchmarks/claw-eval-live) is a live evaluation harness with a mixture gateway, a Docker sandbox, batch and one-shot runners, and result aggregation.
- RoadmapBench (benchmarks/roadmapbench) is a bash and Python harness (prepare, validate, launch, collect, score, audit) with a companion .smithers/workflows/roadmapbench.tsx workflow.
- defending-code (examples/defending-code) ports Anthropic’s vulnerability-discovery reference harness to a durable Smithers workflow: build, recon, find, verify, dedupe, report, patch, with AddressSanitizer crashes as the execution-verified signal. It runs on ClaudeCodeAgent (subscription auth), and a follow-up hardened it against fan-out id collisions and silent build failures so a broken build fails the run loudly instead of reporting a misleading all-zeros success.
Agents
- Vibe (Mistral) CLI agent. A new VibeAgent (packages/agents/src/VibeAgent.js) drives Mistral’s vibe CLI with stream-json output parsing, session resume, and the standard build-command flags, plus CLI detection via VIBE_HOME/MISTRAL_API_KEY, a capability-registry entry, and a surface-manifest entry.
- Antigravity (agy) launches correctly. AntigravityAgent was forked from the Gemini agent and inherited flags the Antigravity CLI rejects, so agy exited immediately on an unknown flag. The flags now match the real CLI surface: --include-directories becomes --add-dir, --resume becomes --conversation=<id>, --output-format is dropped (the json/stream-json value only selects how Smithers parses stdout), and the nonexistent --screen-reader, --debug, --list-sessions, --delete-session, and --extensions flags are removed. A regression test asserts the emitted args never use the removed flags. (Closes #202.)
- Strict Codex output schemas. OpenAI and Codex structured output requires every object node to set additionalProperties: false. sanitizeForOpenAI now coerces loose and passthrough objects to strict instead of leaving sub-schemas open or coercing an empty {} to true, so z.looseObject output schemas are accepted by Codex.
- Hardened CLI agent surface contracts (#203). The built-in adapter capability report stays stable across the new agents.
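The strictness rule for Codex output schemas is mechanical: every node whose type is object gets additionalProperties: false, however deeply it is nested. The sketch below shows that walk over a plain JSON Schema object. It is not the real sanitizeForOpenAI; closeObjectNodes is a hypothetical name, and only the properties, items, anyOf/oneOf/allOf, and $defs/definitions keywords are traversed.

```typescript
type JsonSchema = { [keyword: string]: unknown };

function isSchemaObject(value: unknown): value is JsonSchema {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Return a copy of `schema` in which every object node is closed
 *  (additionalProperties: false). The input is not mutated. */
function closeObjectNodes(schema: JsonSchema): JsonSchema {
  const out: JsonSchema = { ...schema };

  // Keywords whose value maps a name to a sub-schema.
  for (const keyword of ["properties", "$defs", "definitions"]) {
    const map = out[keyword];
    if (!isSchemaObject(map)) continue;
    const closed: JsonSchema = {};
    for (const [name, sub] of Object.entries(map)) {
      closed[name] = isSchemaObject(sub) ? closeObjectNodes(sub) : sub;
    }
    out[keyword] = closed;
  }

  // Keywords whose value is a list of sub-schemas.
  for (const keyword of ["anyOf", "oneOf", "allOf"]) {
    const list = out[keyword];
    if (Array.isArray(list)) {
      out[keyword] = list.map((sub) => (isSchemaObject(sub) ? closeObjectNodes(sub) : sub));
    }
  }

  // Array item schema.
  const items = out.items;
  if (isSchemaObject(items)) out.items = closeObjectNodes(items);

  // Close this node if it is an object, overriding `true` or `{}`.
  const type = out.type;
  const isObjectNode =
    type === "object" || (Array.isArray(type) && type.includes("object"));
  if (isObjectNode) out.additionalProperties = false;
  return out;
}
```

Overwriting the keyword unconditionally is the point of the sketch: a loose or passthrough object that arrives with additionalProperties set to true or to an empty schema leaves the walk closed, rather than being passed through as it was.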
CLI, Studio & Workflows
- Interactive smithers init ceremony. smithers init now reports scaffold and install progress through an InitReporter hook on initWorkflowPack and renders tailored next steps after the pack lands, replacing the raw stderr skip lines with guidance on what to run next.
- Studio 2 chat overlay surfaces. The chat-first shell’s overlay system gained a Views menu and a set of surface overlays: dashboards (issues, runs, scores, search, triage, memory, workflows), tag filtering, a toast stack, a settings overlay, and a resizable split divider, all wired through the overlay store, slash actions, and sidebar. Studio 2’s product, design, and engineering specs were also added under apps/smithers-studio-2/docs/.
- The Task fork prop is now in the public types. <Task fork="..."> already worked at runtime, but the hand-maintained components type declaration omitted the prop, so TypeScript rejected it. TaskProps now declares fork?: string with a doc comment and a JSX type-test.

Docs & Messaging
- Open, durable orchestration launch article. A new article (docs/why/durable-open-orchestration.mdx, nav under “But… Why?”) positions Smithers as the open, durable orchestration layer beneath any agent topology, with supporting diagrams and GIFs.
- Harness setup guide. docs/agents/setup.mdx walks through wiring Smithers into an existing coding agent via the paste-a-prompt path or the three-command manual path, linked from the agents overview, the installation page, and the docs nav.
- Copy page / View as Markdown menu. Every docs page now renders the Copy page button with Copy, View as Markdown, Open in ChatGPT, and Open in Claude. The .md and llms.txt backing already shipped; only the UI affordance was missing.
- Version-matched smithers docs. The docs and docs-full commands now serve the llms documentation bundle that ships with the installed CLI version (apps/cli/docs/llms.txt, llms-full.txt) rather than always fetching the latest from smithers.sh, so the docs an agent reads match the binary it is driving. A scripts/check-llms.mjs gate keeps the bundled copy in sync. (#220)
- Reframed messaging around agent-authored workflows. The README and docs landing, intro, and starters copy now lead with the idea that your agent writes the workflow via an installed skill rather than you writing it by hand. The usage command, the @smithers-orchestrator/smithers and @smithers-orchestrator/usage packages, the ui/gui commands, the Antigravity flags, and the four TASK_FORK_* error codes were all documented to keep the docs-coverage gate green.
Reliability & Correctness
- Extracted the outer final JSON object after prose. extractLastBalancedJson scanned backward from the last {, which returned an inner nested object when the final JSON itself contained nested objects. It now picks the balanced object that ends latest (the true outer object), with the helpers moved to packages/engine/src/json-extraction.js and a regression test.
- Recovered DevTools stream gaps when route errors are wrapped. streamDevTools’ gap recovery detected a missing frame via error instanceof DevToolsRouteError, but captureSnapshot ran the route through runPromise(), which re-threw every failure as a SmithersError with the original DevToolsRouteError buried in its cause chain. The instanceof checks silently stopped matching, and the stream threw FrameOutOfRange instead of re-baselining. captureSnapshot now unwraps the cause chain and re-throws the original error, restoring all three gap-recovery paths.
- Stopped a global module mock from hanging the suite. A timer test mocked node:child_process via mock.module(), which is process-wide under bun’s concurrent runner and bled into other files (and is not undone by mock.restore()), hanging anything that spawned and timing out pnpm release. runRpcCommandEffect gained a spawnFn injection seam (defaulting to the real spawn) so the test passes its fake child through dependency injection without touching global state.
- Matched the canonical duplicate-id error wording. The legacy-mode duplicate-task test asserted the stale message “Duplicate static id detected” while the canonical SmithersError text is “Duplicate Task id detected.” The run correctly fails; only the assertion was out of date.
- Tolerated transient SQLite lock contention while polling a detached run. The read-only poll in the detached-runs test could hit “database is locked” or “is busy” mid-write from the detached child, not just “no such table” during early init. The poll loop now treats all three as transient and keeps polling, so a lock race under CI load no longer fails the test.
- Type declarations and checks. Regenerated the type declarations and removed the architecture line-budget check, and synced the workflow-pack tests with the shipped ui feature.
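The "ends latest" rule from the JSON extraction fix can be sketched in a few lines. This is an illustration of the idea, not the code in packages/engine/src/json-extraction.js; the names findBalancedEnd and extractOuterFinalJson are hypothetical.

```typescript
/** Index of the brace that closes the object opened at `start`, or -1 if unbalanced.
 *  Braces inside double-quoted JSON strings are ignored. */
function findBalancedEnd(text: string, start: number): number {
  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return i;
    }
  }
  return -1;
}

/** Parse the balanced, valid JSON object that ends latest in `text`.
 *  Returns undefined when no such object exists. */
function extractOuterFinalJson(text: string): unknown {
  let best: { end: number; value: unknown } | undefined;
  for (let start = 0; start < text.length; start++) {
    if (text[start] !== "{") continue;
    const end = findBalancedEnd(text, start);
    if (end === -1) continue;
    // Starts are visited left to right, so an outer object is seen before
    // anything nested in it; nested objects end earlier and are skipped here.
    if (best && end <= best.end) continue;
    try {
      best = { end, value: JSON.parse(text.slice(start, end + 1)) };
    } catch {
      // Balanced braces but not valid JSON (for example prose); ignore.
    }
  }
  return best?.value;
}
```

Scanning backward from the last { would land on the innermost trailing object. Comparing end positions instead means the enclosing object wins, because it closes after everything nested inside it.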