0.22.0
Major release: Smithers ships Smithers Studio 2, a ground-up dark agent operations console with three-tier progressive-disclosure navigation, a Cmd-K/Cmd-P command palette, a color-equals-run-state design system, and a full suite of live data surfaces (Runs, Workflows, terminal, chat, JJHub, and DevTools), all wired to real Gateway and workspace-API backends. Studio 2 now defaults to a chat-first shell: one conversation that drives issues, runs, PRs, and workflows, with the tabbed shell one toggle away. Alongside Studio 2, this release de-mocks every e2e test onto real seeded backends, adds a live Gateway Run Chronicle plus init-pack-generated per-workflow UIs and a new smithers ui command, grows the CLI with a starter gallery and GEPA-style prompt optimization, and ships a drop-in agent-facing smithers skill so a coding agent can drive Smithers without reading the whole docs site. A focused security pass closes XSS, local-RCE, path-traversal, DoS, and auth-bypass vectors, and a broad correctness sweep hardens the engine, time-travel, DB, graph, and observability stacks.
Smithers Studio 2
The headline of 0.22.0 is Smithers Studio 2 (apps/smithers-studio-2), a ground-up rebuild of the agent operations console. The original “spaceship” studio put all ~25 views on one flat sidebar with equal weight, and color everywhere meant nothing stood out. Studio 2 inverts that: a dark, near-monochrome console where saturated color only ever signals run state, and where the surfaces you reach for daily are the only ones permanently visible. It is built on Vite, React 19, and Zustand, and ships as a web app today with a desktop shell in the wings. The design and information-architecture contracts are written down first, in apps/smithers-studio-2/docs/DESIGN.md and docs/UX.md.

- Three-tier progressive-disclosure navigation. The shell exposes exactly four primary surfaces that are always visible (Home, Runs, Workspace, Workflows), backed by a single nav registry in src/shell/navRegistry.tsx. Six secondary surfaces (Issues, Landings, Workspaces, Memory, Scores, Search) live in a collapsed-by-default “More” group, and three developer surfaces (DevTools, SQL Browser, Logs) hide entirely until opted in. Every surface registers exactly one entry in buildNavRegistry, so a new view appears in the right sidebar section and the palette automatically without touching AppShell or Sidebar.
- Registry-level developer-mode gating. Developer surfaces are gated by conditional construction, not CSS. When the persisted developerMode flag (localStorage key studio.developerMode) is off, the developer items are simply not added to the registry, making them unreachable by sidebar, command palette, and deep-link; the sidebar is byte-for-byte identical to a non-developer session. Toggling the flag off while sitting on a developer surface falls the user back to Home rather than stranding them on an unregistered route.
- WELCOME → FOCUS → DETAIL altitude model. The app has three navigational altitudes you can always locate yourself in. Home (src/home/Home.tsx) is the WELCOME screen: a calm centered column with two verbs (Open Folder, launch a workflow), a recent-workspaces list, and a live Operations strip; opening a folder routes to Workspace (FOCUS) and launching a workflow routes to Runs. DETAIL always lives in an inspector pane inside a surface, never as a new top-level view, which is how the old runs/snapshots/approvals/scores/logs nodes collapse into a single “what are my agents doing” surface.
- Cmd-K / Cmd-P command palette. A universal accelerator (src/shell/CommandPalette) reaches every registered surface across all three tiers plus contextual commands, so the rail can stay tiny without trapping power users. Prefix pills switch modes (> commands, / run workflow, @ open file, ? ask AI), parsed by parseQuery.ts, and results are grouped by section (“Go to”, “More”, “Developer”, “Commands”). Developer surfaces only surface in the palette when developer mode is on, since the palette is built from the same gated registry.
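The conditional-construction gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/shell/navRegistry.tsx: the NavEntry shape, the entry ids, and the resolveView helper are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real registry may look different.
type NavTier = "primary" | "secondary" | "developer";

interface NavEntry {
  id: string;
  label: string;
  tier: NavTier;
}

// Developer entries are only constructed when developerMode is on,
// so nothing downstream (sidebar, palette, deep links) can reach them.
function buildNavRegistry(developerMode: boolean): NavEntry[] {
  const entries: NavEntry[] = [
    { id: "home", label: "Home", tier: "primary" },
    { id: "runs", label: "Runs", tier: "primary" },
    { id: "workspace", label: "Workspace", tier: "primary" },
    { id: "workflows", label: "Workflows", tier: "primary" },
    { id: "issues", label: "Issues", tier: "secondary" },
    { id: "landings", label: "Landings", tier: "secondary" },
    { id: "workspaces", label: "Workspaces", tier: "secondary" },
    { id: "memory", label: "Memory", tier: "secondary" },
    { id: "scores", label: "Scores", tier: "secondary" },
    { id: "search", label: "Search", tier: "secondary" },
  ];
  if (developerMode) {
    entries.push(
      { id: "devtools", label: "DevTools", tier: "developer" },
      { id: "sql", label: "SQL Browser", tier: "developer" },
      { id: "logs", label: "Logs", tier: "developer" },
    );
  }
  return entries;
}

// Any view id that is not registered falls back to Home.
function resolveView(requested: string, registry: NavEntry[]): string {
  return registry.some((entry) => entry.id === requested) ? requested : "home";
}
```

Because the sidebar and the palette would both read from the same array, hiding a tier needs no per-component checks in this sketch.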

- Dark design system where color equals state. All visual tokens live as CSS custom properties in src/theme.css and are mirrored as a single TypeScript export in src/theme/themeTokens.ts, so no component hardcodes a hex value. Three stacked surfaces (--bg, --surface-1, --surface-2) carry depth, text is white at three opacities, and the only saturated colors are run-state signals: --accent blue for live/running, --success green for completed/approved, --warning amber for waiting/pending-approval gates, and --danger red for failed/denied. Motion is limited to 120–150ms ease-out feedback on real state changes and collapses to zero under prefers-reduced-motion.
- Zustand store as the single shell state. src/useStudioStore.ts holds the whole top-level shell state (active view, developer mode, command-palette open/query/selection, and terminal tabs), with detail state colocated in each surface’s folder rather than the global union. The store keeps a backwards-compatible terminal → workspace view alias so existing hotkeys and tests that target the old terminal id keep resolving after the terminal moved inside the Workspace surface.
- Global hotkeys and a runs badge. The shell owns global hotkeys via useHotkey (src/useHotkey.ts): Cmd-P and Cmd-K open the palette, Cmd-T opens a new terminal. The Runs nav row carries an unread approvals badge (runsBadgeStore) so time-sensitive gates surface without digging, consistent with the IA rule that anything that is a state of a run is disclosed inside the Runs surface rather than as a sibling nav node.
- Electrobun desktop packaging direction. An Electrobun shell (electrobun.config.ts, electrobun/) wraps the already-built Vite output rather than rebuilding it: dist/ is copied verbatim as the mainview view and a Bun main process opens a single 1280×820 window pointed at views://mainview/index.html, with a SMITHERS_STUDIO_DEV_URL override that points the webview at the Vite dev server for HMR. It installs a minimal native macOS menu (app/quit, Edit roles so copy-paste works in the webview, Window controls). The Vite web app stays the source of truth; the desktop shell adds no UI of its own.
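As an illustration of the color-equals-state rule, the mapping from run state to token could look like the sketch below. The four token names come from the notes above; the RunState union and the stateColor helper are hypothetical and are not taken from src/theme/themeTokens.ts.

```typescript
// Hypothetical run-state names; only the four CSS token names are from the release notes.
type RunState =
  | "running"
  | "completed"
  | "approved"
  | "waiting"
  | "pending-approval"
  | "failed"
  | "denied";

type StateToken = "--accent" | "--success" | "--warning" | "--danger";

// One table is the only place a state is tied to a saturated color.
const STATE_TOKEN: Record<RunState, StateToken> = {
  running: "--accent",
  completed: "--success",
  approved: "--success",
  waiting: "--warning",
  "pending-approval": "--warning",
  failed: "--danger",
  denied: "--danger",
};

// Components would reference the custom property instead of a hex value.
function stateColor(state: RunState): string {
  return `var(${STATE_TOKEN[state]})`;
}
```

Typing the table as a Record over the full union means adding a new state without assigning it a token is a compile error, which is one way to keep color usage tied to state.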
Chat-First Shell
The default Studio 2 surface is now a chat-first shell (src/chat): one long conversation with the agent is the whole app, replacing tab navigation. Rather than clicking between surfaces, you tell the agent what you want and it manages issues, runs, PRs, workflows, and sandboxes for you, showing rich data inline through a sandboxed-HTML tool or in an overlay that can split beside the chat or sit full-screen over it. Slash commands map to Smithers CLI features and open a default UI as an overlay — surface overlays reuse the existing real Studio surfaces verbatim, and the terminal overlay reuses the real Ghostty PTY.
The switch is non-destructive: the classic tabbed shell (shell/AppShell) is untouched and one toggle away via the /studio command or the project-bar gear, gated by a new shellMode flag in the studio store that App switches on. Concepts the backend does not model yet — projects, per-message tags, the chat feed, the agent HTML tool, and overlays — are fed from typed seams with mock implementations behind real-ready interfaces (grep SEAM:), while everything that already has a backend (gateway, workspace API, the reused surfaces, the PTY) is wired to it directly. Pure helpers (parseSlash, resolveSlashAction, tagColor) are split out and unit-tested, and the shell has real-backend e2e coverage.
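To make the idea of a pure, unit-testable slash helper concrete, here is a minimal sketch of what a function like parseSlash might do. The signature and the SlashCommand shape are assumptions; the actual helper in src/chat may behave differently.

```typescript
// Hypothetical result shape for a parsed slash command.
interface SlashCommand {
  name: string;
  args: string[];
}

// Returns null for plain chat text, otherwise the command name and its arguments.
function parseSlash(input: string): SlashCommand | null {
  const trimmed = input.trim();
  if (!trimmed.startsWith("/")) return null;
  const [head, ...args] = trimmed.slice(1).split(/\s+/);
  if (!head) return null; // a bare "/" or "/ foo" is not a command
  return { name: head.toLowerCase(), args };
}
```

Keeping the parser free of store or DOM access is what makes it cheap to test in isolation, for example checking that "/studio" yields the command name "studio" with no arguments.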
Studio 2 Data Surfaces and Workspace
Behind the shell, every surface is a real, live view over the Gateway and workspace-API backends. Here is what each one does.
- Live Runs inspection surface. The Runs view pairs a run-history rail (with an approvals filter and a live nav badge) against a responsive tree-plus-inspector layout that splits with a draggable divider above 800px and falls back to a modal inspector sheet on narrow widths. It streams run.event frames over the Gateway’s real WebSocket protocol (a connect + streamRunEvents subscription routed through /v1/rpc), debouncing event bursts into a single getRun + getDevToolsSnapshot refresh of only the selected run so a chatty run never hammers the list RPCs. The RunToolbar exposes the lifecycle actions valid for the current state (cancel a live run, resume a terminal one) plus a frame scrubber for time-travel rewind.

- Inline approval gates. When a selected node has a pending approval, an ApprovalGate renders directly inside the inspector with an optional note and approve/deny buttons that post submitApproval. Approvals are surfaced three ways from one source of truth: the history-list filter, the inline gate, and the sidebar badge count driven by runsBadgeStore.
- Embedded workflow UIs. A run whose workflow ships its own custom UI defaults to rendering that bundle in an iframe via WorkflowRunUi, scoped with ?runId=<id>; the studio host proxies /workflows/*, /v1/rpc, and the run-event socket so the embedded UI boots same-origin. A toolbar toggle swaps between the custom UI and the generic tree/inspector.
- Workflows launcher. The Workflows surface browses launchable things across Local, Remote, Prompts, and Schedules segments, shows a selected workflow’s source and summary, and launches it with arguments. When a workflow declares typed launch fields it renders one control per field (text/number/boolean/JSON) with inline validation, otherwise a freeform JSON textarea. Launching creates a run and hands off to the Runs surface with the new run auto-selected.
- In-app workspace terminal and agent chat. The Workspace surface is segmented between a Ghostty-rendered terminal and an agent chat. Each terminal pane owns its own GhosttyCore WASM instance keyed by tab id, and the core is evicted on tab close so WASM linear memory is reclaimed instead of leaking on every open/close. Terminals are backed by a JSON-RPC-over-WebSocket PTY server (scripts/pty-server.ts) with scrollback replay on attach, a concurrent-session cap, frame-size limits, and an orphan reaper. The agent-chat segment renders streaming markdown/code responses with a model/mode indicator and a calm auto-scroll that only pins to the newest block when the viewer is already near the bottom; it stays mounted across segment switches so history is preserved.
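The per-field inline validation in the Workflows launcher can be pictured with a small sketch like the one below. The LaunchField and Validation types and the error messages are hypothetical; only the four field kinds (text, number, boolean, JSON) come from the notes above.

```typescript
// Hypothetical field descriptor; the real launch-field schema is not shown in these notes.
type FieldKind = "text" | "number" | "boolean" | "json";

interface LaunchField {
  name: string;
  kind: FieldKind;
}

type Validation = { ok: true; value: unknown } | { ok: false; error: string };

// Validates the raw string from one control and converts it to the value a run would receive.
function validateField(field: LaunchField, raw: string): Validation {
  switch (field.kind) {
    case "text":
      return { ok: true, value: raw };
    case "number": {
      const n = Number(raw);
      // Number("") is 0, so an empty input has to be rejected explicitly.
      return raw.trim() !== "" && Number.isFinite(n)
        ? { ok: true, value: n }
        : { ok: false, error: `${field.name} must be a number` };
    }
    case "boolean":
      if (raw === "true" || raw === "false") return { ok: true, value: raw === "true" };
      return { ok: false, error: `${field.name} must be true or false` };
    case "json":
      try {
        return { ok: true, value: JSON.parse(raw) };
      } catch {
        return { ok: false, error: `${field.name} must be valid JSON` };
      }
  }
}
```

Returning a tagged result rather than throwing lets a form render the error next to the offending control while the other fields stay editable.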

- Home welcome and operations strip. Home is a calm welcome column with recent workspaces and a live Operations strip that derives running / waiting / pending-approval counts from the same listRuns + listApprovals RPCs as the Runs surface, polled on a lighter 5s cadence. When the workspace backend is unreachable it shows a connect/boot panel instead of empty recents.
- JJHub issues, landings, and workspaces. The IssuesPanel lists, filters, opens, creates, and closes/reopens JJHub issues; the LandingsPanel shows a landing’s info, diff, checks, and conflicts in tabs and supports review and land actions; the WorkspacesPanel manages cloud workspaces (create, suspend, resume, fork, delete) and their snapshots alongside local workspace recents. All three read JJHub auth status and route through the real jj/jjhub CLIs behind the workspace API.
- Memory, Scores, and Search. Memory queries cross-run memory facts by namespace/key with a debounced search; Scores shows aggregate and per-run scorer results; Search is a global workspace search with a Code / Issues / Repos / Transcripts scope control, request-generation guarding to drop stale responses, and serves as the full-page fallback for the palette’s search mode. Each is wired to the real /memory, /scores, and /search workspace endpoints.
- DevTools, SQL Browser, and Logs. Behind the developer toggle, DevTools renders the raw, unfiltered DevTools snapshot tree and node props for a run picked from listRuns; the SqlBrowser is a read-only query surface over the workspace SQLite database that lists tables, shows a selected table’s schema, and runs ad-hoc queries; Logs is a dense, filterable event-log firehose (by level, category, and free text) with error/warning counts.
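The Operations strip on Home derives its counts from the same data the Runs surface already fetches. A minimal sketch of that derivation, assuming hypothetical row shapes and status strings for the listRuns and listApprovals results, might be:

```typescript
// Hypothetical row shapes and status values; the real RPC payloads are not shown in these notes.
interface RunSummary {
  id: string;
  status: string;
}

interface ApprovalSummary {
  runId: string;
  status: string;
}

interface OperationsCounts {
  running: number;
  waiting: number;
  pendingApproval: number;
}

// Pure reduction over the two lists, so the strip and the Runs surface cannot disagree.
function deriveOperations(runs: RunSummary[], approvals: ApprovalSummary[]): OperationsCounts {
  return {
    running: runs.filter((run) => run.status === "running").length,
    waiting: runs.filter((run) => run.status === "waiting").length,
    pendingApproval: approvals.filter((approval) => approval.status === "pending").length,
  };
}
```

Since the function holds no state, a 5s poll can simply recompute it from the latest responses.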

Studio 2: Real Backends, De-Mocking, and Hardening
- bun dev now boots a real Gateway and real workspace API. The Studio 2 dev stack runs an actual Gateway (server/startGatewayServer.ts) and workspace-API backend (server/createWorkspaceApiServer.ts, server/workspaceBackend.ts) against the real workspace smithers.db, wiring real SQL, scores, memory, chat-agent, and jj/jjhub operations instead of fabricated responses. No mocks remain in the product runtime path. Seeded fixtures are now strictly opt-in behind the SMITHERS_DEV_USE_FIXTURES flag for both the gateway and workspace API.
- The e2e suite drives real backends with seeded data, with no route mocks. Playwright specs across every surface (runs live-inspection, rewind/resume, home, command palette, SQL browser, DevTools tree, terminal lifecycle, landings, chat) now exercise the real Gateway and workspace backend rather than page.route/routeWebSocket stand-ins, including a real read-only SQL exec path that genuinely rejects DELETE. Per-test x-studio-session isolation makes the parallel suite deterministic, and Playwright is restricted to *.spec.ts so bun *.test.ts units aren’t node-loaded by the browser runner.
- Live run events ride the real Gateway WebSocket protocol. Run inspection now connects and subscribes over /v1/rpc (through the vite ws proxy) using the real gateway WS protocol instead of a dead raw socket, and streamRunEventsResilient reconnects on a silent 1006 close. The poll/event refresh paths were deduped so a live run refreshes only the selected run rather than hammering the gateway with full-list refetches every tick.
- Stale-response races eliminated across async surfaces. useGatewayRpc now tags each refetch with a generation counter and discards responses from superseded requests, so a slow earlier call can no longer clobber fresher data. LandingsPanel guards its diff/checks/conflicts loaders against a previously-selected landing’s late response overwriting the current one, and useGatewayRunEvents clears its events array when runId changes so switching from run A to run B no longer shows A’s events followed by B’s.
- Unmount paths settle and prune their state. usePtySession now explicitly rejects all pending RPC promises with a “terminal unmounted” error before clearing the map, instead of relying on the async ws.close() onclose to settle them (which fired after the map was already empty, leaking in-flight calls). The scheduler’s pruneUnmounted loop now also deletes stale state.approvals and state.retryCounts entries, closing a hot-reload bug where a re-mounted task silently bypassed its human-approval gate and inherited an exhausted retry budget.
- Gateway auth and registry hardening. packages/server/src/gateway.js received four surgical fixes: startRun now deletes its runRegistry entry on completion (it previously leaked one record per run forever); token-mode auth guards the lookup with Object.hasOwn so prototype keys like __proto__ or toString return 401 rather than crashing with a 500; verifyJwtToken now rejects tokens with a missing or non-numeric exp claim; and the WS error handler echoes the real frame.id instead of hardcoded "invalid"/"server" ids.
- Client-side security fixes. The PTY WebSocket server now rejects cross-origin upgrades (gating on a loopback Origin) to prevent a drive-by page from spawning a local shell; agent/LLM markdown links are sanitized to allow only http/https/mailto/relative hrefs so a javascript: URL can no longer execute on click under React 19; and the workspace API caps request bodies at 32MB (streaming, then HTTP 413) to prevent a memory-exhaustion DoS while still admitting a base64 screenshot envelope.
- Round-3 UX, accessibility, and robustness polish. Across all surfaces: run-tree keyboard navigation with roving focus, toolbar lifecycle tooltips, clearer empty states, calmer chat auto-scroll that only pins when near the bottom, ARIA status/alert roles on chat/terminal/approval and combobox/listbox roles on the command palette, SQL read-only messaging, and a ***bold-italic*** markdown fix. The 2-second live-run poll was deduped so identical ticks no longer churn the tree and reset expansion state.
- Reliability. GET /memory and /scores no longer drop their default limit (a missing ?limit had coerced to Number(null) === 0, returning a single row); the Approval decision output now populates decidedAt from decidedAtMs and evaluates autoApprove.condition/revertOn callbacks exactly once per render; Claude stream-json token counting no longer double-counts the terminal result event’s usage; the LLM-judge scorer parses balanced-brace JSON so a brace inside a reason no longer forces a score of 0; RPC total/inactivity timers are cleared on all settle paths; the bash network guard matches blocked tools as whole tokens (so echo bundle.js is no longer rejected); the reference gateway’s token store and websocket client message handlers guard JSON.parse to fail secure on corrupt input; the workspace API surfaces a clean gateway-empty error instead of a raw TypeError when the gateway answers a request with a 200 and an empty body; the cloud-workspace flow clears restoreSourceSnapshotId when the Restore modal is cancelled or a plain New Workspace modal is opened, so a Restore → Cancel → New Workspace → Create sequence no longer silently restores an intended-empty workspace from a stale snapshot; and stallSandbox.release() now awaits its filesystem restore.
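The default-limit bug in the Reliability item above comes down to how JavaScript coerces a missing query parameter. The sketch below shows the coercion and one defensive way to parse the value; parseLimit and the default of 50 are assumptions for illustration, not the workspace API's actual code or default.

```typescript
// Hypothetical default; the real endpoints' default limit is not stated in these notes.
const DEFAULT_LIMIT = 50;

// URLSearchParams.get returns null for a missing key, and Number(null) is 0,
// so a naive Number(raw) silently replaces the default with 0.
function parseLimit(raw: string | null, fallback: number = DEFAULT_LIMIT): number {
  if (raw === null || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}
```

With this shape, parseLimit(null) returns the fallback, while an explicit "10" is honored and junk such as "abc" or "-3" also falls back rather than reaching the query.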
Gateway Console & Workflow UIs
- Added the Run Chronicle to the Gateway console. Selecting a run in the operator console (packages/server/src/gatewayUi/defaultOperatorUi.js) now opens a live, three-pane detail view: a DevTools Tree of the workflow’s nodes, a Chronicle event log, and an Inspector showing the selected node’s props, output, and filesystem diff. The console opens its own Gateway WebSocket and subscribes to streamDevTools and streamRunEvents, applying incremental DevTools snapshot/delta ops (addNode, removeNode, updateProps, updateTask, replaceRoot) to keep the tree current and prepending run events to a bounded chronicle. Node output and diff are fetched on demand via rpcSocket("getNodeOutput")/getNodeDiff. Live status pills surface the DevTools and event stream state (Live, Retrying, Resynced, Needs refresh), the current frame/seq, and heartbeat age; stream failures retry with backoff and stale generations are discarded so switching runs never crosses wires. The new selectedRunId and per-run stream state reset cleanly between selections.
- Attributed runs correctly across workflows sharing a database. The Gateway (packages/server/src/gateway.js) now caches one SmithersDb adapter per underlying database (iterating each DB once rather than once per registered workflow) and attributes each run to its true workflow via the run’s gatewayWorkflowKey config (falling back to the row’s workflowName). This prevents cross-workflow runs on a shared project DB from being misattributed to whichever adapter happened to find the row, which is what makes a single Gateway hosting many workflow UIs over one DB report each run under the right workflow.
- Scaffolded a bespoke React UI for every init-pack workflow. smithers init now emits a .smithers/ui/<key>.tsx standalone React app for each workflow (apps/cli/src/workflow-pack.js, sources in apps/cli/src/workflowUiSources.js), covering plan, implement, research-plan-implement, review, research, ticket-create, tickets-create, ralph, improve-test-coverage, debug, grill-me, write-a-prd, feature-enum, audit, mission, workflow-skill, plus dedicated kanban and plan renderers. Each UI is built on the gateway-react hooks (useGatewayRuns, useGatewayRunEvents, useGatewayNodeOutput, useGatewayActions, useGatewayApprovals) and renders a live, workflow-specific view; for example, review.tsx shows a per-reviewer approval verdict bar driven by the run’s real review:N node outputs. The generated .smithers/gateway.ts mounts each workflow and its UI independently via a UI_WORKFLOWS table, logging a /workflows/<key> mount URL; a workflow that fails to import (e.g. broken MDX) disables only its own UI while the rest of the Gateway stays up.
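The run-attribution rule for shared databases (prefer the run's gatewayWorkflowKey, fall back to the row's workflowName) can be sketched as below. The RunRow shape, including where the key sits on the row, is an assumption; the real logic lives in packages/server/src/gateway.js and may differ.

```typescript
// Hypothetical row shape; only the two field names come from the release notes.
interface RunRow {
  runId: string;
  workflowName: string;
  config?: { gatewayWorkflowKey?: string };
}

// The run's own recorded key wins; the row's workflow name is the fallback.
function attributeWorkflow(row: RunRow): string {
  return row.config?.gatewayWorkflowKey ?? row.workflowName;
}

// Groups run ids by workflow in a single pass over one database's rows.
function groupRunsByWorkflow(rows: RunRow[]): Map<string, string[]> {
  const byWorkflow = new Map<string, string[]>();
  for (const row of rows) {
    const key = attributeWorkflow(row);
    const ids = byWorkflow.get(key) ?? [];
    ids.push(row.runId);
    byWorkflow.set(key, ids);
  }
  return byWorkflow;
}
```

The point of the sketch is that attribution depends only on data carried by the run itself, never on which adapter happened to read the row.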

- Added smithers ui [runId] to open a run’s workflow UI. The new command (apps/cli/src/index.js) resolves a run to its workflow and the Gateway’s uiPath over /v1/rpc (listWorkflows, listRuns, getRun), then opens <gateway><uiPath>?runId=<id> in the browser. With no runId it attaches to the most recent run; --workflow opens a workflow UI directly, --gateway/--port target a specific Gateway, and --no-open just prints the URL. It fails clearly when no Gateway is reachable or the resolved workflow has no UI mounted.
- Covered the workflow UIs with a real-browser e2e harness. apps/cli/tests/workflow-ui-all.e2e.test.js runs a real smithers init, boots the generated gateway.ts as a live server, and loads all 16 UIs in headless Chromium (no route mocking); every UI is asserted to build, boot, and mount, and the deterministic subset (driven by descriptors in workflow-ui-descriptors.json) is executed for real and asserted to render that run’s actual node output. This is the check that caught a write-a-prd MDX failure and a Gateway-wide crash. A companion smithers ui e2e verifies the command opens the right run’s UI.
- Reliability. useGatewayRunEvents now clears its events array when runId changes so switching runs no longer concatenates the previous run’s events; useGatewayRpc discards stale responses via a per-request generation counter so a slow earlier request can’t clobber fresher data; and the pi-plugin SSE events() generator now releases its stream reader on throw, return, or consumer break and guards JSON.parse so a single malformed frame is skipped rather than aborting the stream. The Gateway WebSocket frame parsing in the console is likewise guarded against invalid frames, and the run-chronicle prototype was hardened against XSS by building DOM nodes with textContent instead of innerHTML.
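The per-request generation counter mentioned for useGatewayRpc is a general pattern for dropping stale async responses. The following is a minimal sketch of the pattern only; createGenerationGuard and the surrounding names are hypothetical and do not reproduce the hook's implementation.

```typescript
// Each new request bumps the counter; a response is applied only if its
// generation is still the latest one when it arrives.
function createGenerationGuard() {
  let current = 0;
  return {
    begin(): number {
      current += 1;
      return current;
    },
    isCurrent(generation: number): boolean {
      return generation === current;
    },
  };
}

// Usage: two overlapping requests whose responses arrive out of order.
const guard = createGenerationGuard();
let latestData = "";

function onResponse(generation: number, payload: string): void {
  if (!guard.isCurrent(generation)) return; // superseded request, discard
  latestData = payload;
}

const firstRequest = guard.begin();
const secondRequest = guard.begin();
onResponse(secondRequest, "fresh"); // newer request resolves first
onResponse(firstRequest, "stale"); // slow earlier response is ignored
```

The same guard also covers selection changes: beginning a new generation when the selected item changes makes any in-flight response for the old selection a no-op.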
CLI, Evals, Agents & Workflow Tooling
- <Task fork="task-id">: immutable agent session forking. A new fork prop on <Task> starts an agent task from a copy of another task’s final session context. fork adds the source as an implicit dependency (the forked task waits for it to complete), composes with dependsOn/needs/deps/Sequence/Parallel/Branch/Loop, and copies the source’s persisted conversation snapshot into a fresh, independent session before submitting the new prompt; the source is never mutated and its native session id is never reused. This enables follow-up chains (plan → implement → verify), parallel branches that all start from the same base context, and reforking (a forked task can itself be forked). Inside a <Loop>, fork resolves to the latest completed snapshot for that task id. New validation surfaces clear typed errors for invalid fork sources: TASK_FORK_SOURCE_NOT_FOUND (id absent from the graph or only present in an unselected branch), TASK_FORK_CYCLE (direct or indirect), TASK_FORK_SESSION_UNAVAILABLE (non-agent forking task, or a source with no usable session snapshot), and TASK_FORK_SOURCE_NOT_COMPLETE (source has not finished yet; the forked task waits).
- Agent-facing smithers onboarding skill. A new drop-in skill (skills/smithers) teaches a coding agent to drive Smithers without reading the whole docs site, shortening the time to the first aha moment. Its SKILL.md on-ramp makes explicit that the agent operates Smithers on the user’s behalf (it is not a human GUI), then bundles the full docs (llms-full.txt) next to it for the exact API on demand; the docs generator now mirrors the canonical bundle into skills/smithers/llms-full.txt so the skill stays drift-free. It is linked from the README, installation, quickstart, and the docs index with a one-line install so it is discoverable from every getting-started surface.
- Per-agent support docs. A new Agent Support docs section (docs/agents/) documents how to wire Smithers into the coding agent a user already runs: one overview page plus dedicated guides for Claude Code, Codex, Cursor, GitHub Copilot, Pi, Hermes, and OpenClaw. Each page gives the exact, source-verified install for that agent’s surfaces: the skill (skills add or a one-line curl), the MCP server (mcp add, or the agent’s native config: .mcp.json/.cursor/mcp.json JSON, Codex config.toml TOML, Copilot’s two distinct VS-code-vs-cloud schemas, Hermes mcp_servers YAML, OpenClaw mcp.servers JSON), and standing instructions (AGENTS.md/CLAUDE.md/Cursor rules). The overview leads with the two fan-out commands, smithers skills add and smithers mcp add, that install the skill and register the MCP server into every detected agent (Claude Code, Codex, Cursor, Copilot, Gemini, OpenCode, Amp, Windsurf, and ~14 more) in one step, and a support matrix shows what is auto-wired versus configured by hand. The section is cross-linked from Installation and Integrations, with an /agents redirect.
- First-class wiring for Pi, Hermes, and OpenClaw. Three agents fell outside the skill/MCP framework’s built-in registry, so mcp add/skills add left them to manual setup. A supplementary wiring step (apps/cli/src/agent-wiring/) now closes that gap on success: mcp add writes the Smithers MCP server into Hermes’s ~/.hermes/config.yaml (mcp_servers, YAML) and OpenClaw’s ~/.openclaw/openclaw.json (mcp.servers, JSON) directly, the way the framework already special-cases Amp, and skills add copies the canonical .agents/skills set into Pi’s ~/.pi/agent/skills. Each writer detects the agent by its config directory, preserves existing config (and refuses to clobber an unparseable file), and honors the same --no-global, --agent, and --command flags as the underlying command. With this, one mcp add + one skills add reaches every supported agent, Pi/Hermes/OpenClaw included.
- HermesAgent: run Hermes as a workflow worker. A new agent class (packages/agents/src/HermesAgent.js) lets a <Task> drive the Hermes (Nous Research) agent. Hermes exposes an OpenAI-compatible HTTP API, so HermesAgent is a thin wrapper over OpenAIAgent: it points the provider at the Hermes server via baseURL (or the HERMES_BASE_URL env var), defaults apiKey from HERMES_API_KEY, and disables AI SDK native structured output by default since a local Hermes server may not honor JSON-schema response formats. It is re-exported from smithers-orchestrator alongside the other agent classes.
- Starter gallery for seeded workflows. A new smithers starters command (apps/cli/src/starter-gallery.js) presents plain-English outcomes with copy-paste commands for people who want a result before writing workflow code. Each detailed starter prints the expected outcome, the context to gather first, the exact workflow run command, useful follow-ups, and when not to use it. Browse the whole catalog or filter with --audience, --goal, or --workflow, and emit --format json when another tool needs the catalog. Ten canonical starters map plain outcomes (idea-to-prd, launch-checklist, customer-incident, quality-audit, ship-a-change, mission-mode, and more) onto the underlying seeded workflows.
- Guided init template selection. smithers init --template <id> scaffolds a single starter with guided next steps instead of dropping the full pack, with validation tightened so unknown template IDs are rejected. The starter gallery and init share one source of truth for IDs and aliases, so starters <id> lookups and init --template stay consistent.
- Smithers prompt optimization with provider support. A new smithers optimize command (apps/cli/src/optimize-command.js, optimize-suite.js) runs an eval suite twice (a baseline with the workflow’s current prompts and an optimized run with GEPA-generated prompt patches) and writes the winning prompt artifact only when the optimized score clears --min-improvement. Artifacts patch only agent-backed <Task> prompts by nodeId, leaving workflow structure, output schemas, retries, approvals, and persistence untouched, and can be replayed into future evals via smithers eval --optimization. The patch generator accepts the same provider vocabulary as agents and accounts (cerebras, openai/codex, anthropic/claude, gemini/antigravity, kimi/moonshot, and a generic OpenAI-compatible endpoint for opencode/pi/amp/forge) plus a deterministic heuristic provider for tests and fixtures. The optimization artifact format lives in @smithers-orchestrator/engine (optimization-artifact.js).
- OpenCode agent detection in init. smithers init now detects the OpenCode CLI alongside the existing agents, recognizing its config and data directories and provider API keys so OpenCode-based workflows scaffold without manual wiring.
- Composite components now run their summary tasks on agents. Supervisor, ScanFixVerify, and CheckSuite each had a final task with string children but no bound agent, so the summary rendered as a static task that emitted its prompt literally instead of synthesizing results. Supervisor now binds its final summary to props.boss, ScanFixVerify binds its report to props.verifier, and CheckSuite converts its verdict into a compute task that depends on every check via dependsOn and aggregates pass/fail per its strategy (all-pass / majority / any-pass). EscalationChain was also fixed to evaluate each level’s escalateIf predicate against the prior level’s real result, so a level (including the human-fallback approval) only runs when the previous level actually escalated, instead of every level firing unconditionally.
- Generated skill front matter uses the parsed workflow description. renderWorkflowSkill computed description from the workflow’s own metadata but hardcoded the generic default in the YAML front matter, so a custom smithers-description was ignored by the field tooling reads first. The front matter now reflects the parsed description.
- Reliability. Multiple correctness fixes across the CLI, agents, and supporting packages: text truncation in agent stderr, captured stdout/stderr, and CLI node-detail output now snaps to a UTF-8 codepoint boundary instead of slicing mid-character and emitting a U+FFFD replacement char, and captureProcess gives stdout and stderr independent capture budgets so noisy stderr can no longer starve real command output. The bash network guard now tokenizes commands and matches blocked tools as whole executable names (and URL schemes as token prefixes), so benign commands like echo bundle.js or cat pipeline.txt are no longer over-blocked. Claude Code stream-json token accounting stops double-counting the terminal result event’s usage on top of the incremental message_start/message_delta totals. The OpenAPI tool factory preserves a server base path (e.g. /v2) when joining absolute operation paths instead of dropping it and hitting a 404, and loadSpec no longer masks a genuine content-parse error by re-parsing the file path as inline spec text. The scheduler prunes stale approvals and retryCounts on unmount so a hot-reloaded task can’t silently bypass a human-approval gate or lose its retry budget, and CachePolicy now defaults its context type to unknown rather than any. Judge scorers parse their JSON with balanced-brace extraction so a { inside a reason no longer throws and silently scores 0. Codex session resolution scans candidate day folders derived from both UTC and local dates across adjacent days so transcripts in negative-UTC offsets are found, and Claude session fallback matches by basename for cross-platform paths. Agent RPC total/inactivity timers are now cleared on every settle path. The down command honors its --force flag (skipping runs whose heartbeat is still fresh unless forced, where it previously force-cancelled every active run unconditionally), and logs --follow actually emits its waiting-approval/event/timer CTA hints by tracking the last waiting status observed instead of testing a condition that could never be true inside the loop’s exit block. And smithers-demo wraps its provider-response JSON.parse calls to surface a descriptive Invalid JSON response from <provider> error rather than a cryptic SyntaxError on a 200-with-bad-body.
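Balanced-brace extraction, as mentioned for the judge scorers, means scanning for the matching closing brace while ignoring braces that appear inside JSON strings. The sketch below is one straightforward way to do that; extractFirstJsonObject is a hypothetical name and this is not the scorer's actual code.

```typescript
// Finds the first "{" and walks forward, tracking nesting depth and string state,
// then parses the balanced span. Returns null if nothing parseable is found.
function extractFirstJsonObject(text: string): unknown {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Inside a string, braces are data; only an unescaped quote ends the string.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          return null;
        }
      }
    }
  }
  return null; // braces never balanced
}
```

A known limitation of this sketch is that it commits to the first "{" in the text, so prose containing a stray brace before the JSON would make it return null; a more forgiving version could retry from the next brace.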
Security
-
Rejected cross-origin PTY WebSocket upgrades. The Studio terminal server (scripts/pty-server.ts) bound to 127.0.0.1 but never checked Origin, and because browsers do not enforce same-origin policy on WebSockets, any page a developer visited could open ws://127.0.0.1/terminal/ws and spawn a shell (local RCE). The upgrade now requires a loopback Origin (extensible via PTY_ALLOWED_ORIGINS) and refuses cross-origin connections with a 403 before any shell is spawned; a missing Origin, which only comes from non-browser local tooling, is still allowed since it is not the drive-by vector.
-
Sanitized markdown link schemes in agent output. MarkdownContent.tsx rendered agent/LLM-authored links with an unsanitized href, and React 19 renders javascript: URLs verbatim, so a [label](javascript:...) link executed on click (DOM XSS). The renderer now allows only http, https, mailto, and relative/anchor hrefs via isSafeHref, falling back to plain text for any unsafe scheme.
-
Hardened gateway token, JWT, and run-registry handling. Token-mode auth indexed this.auth.tokens[token] directly, so magic keys like __proto__, toString, constructor, and hasOwnProperty resolved to inherited prototype members and were treated as valid grants; the lookup is now guarded with Object.hasOwn so those tokens are rejected as UNAUTHORIZED. verifyJwtToken also accepted tokens with a missing or non-numeric exp claim, meaning they never expired; a missing/invalid exp is now rejected outright. The same change deletes leaked runRegistry entries on run completion and echoes the real frame.id in WebSocket RPC error frames.
-
Denied devtools access for explicitly-empty run subscriptions. isDevToolsRunAuthorized() treated an empty subscribedRuns set the same as no filter, letting a client that connected with subscribe:[] read any run’s devtools snapshots and streams. The check now distinguishes the two states: null/undefined means no filter (unrestricted, backward compatible), while a Set, including an empty one, means a filter was provided, so the runId must be a member. An empty set therefore denies every run.
-
Rejected path-traversal characters in account labels. Account labels become a path segment under ~/.smithers/accounts, so a label like ../../../../etc let defaultConfigDir escape the smithers root before runAgentAdd.js created the directory. Labels are now validated against the wizard’s [A-Za-z0-9._-] pattern (rejecting ., .., and empty), with a defense-in-depth assertion that the resolved path stays under the accounts directory so a future regex change can’t silently reintroduce traversal.
-
Guarded the reference gateway token store and failed secure. readTokens() in deploy/reference/reference-gateway.mjs parsed $SMITHERS_TOKEN_STORE without error handling, so a corrupt or tampered store crashed the gateway at boot. The JSON.parse is now wrapped in try/catch: on failure it logs a warning and returns an empty token set, so a malformed store can neither crash boot nor grant access.
-
Tightened the bash network guard to whole-token matching. assertNetworkAllowed joined the command and args into one string and used substring .includes() against fragments like bun, npm, pip, and git, so benign commands such as echo bundle.js or cat pipeline.txt were wrongly rejected with TOOL_NETWORK_DISABLED. The guard now tokenizes the command and matches network tools as whole executable basenames, URL schemes as token prefixes, and git remote ops as whole tokens. The blocked tool set and the allowNetwork bypass are unchanged; only the matching is corrected.
-
Capped workspace API request bodies to prevent DoS. readJsonBody() in createWorkspaceApiServer.ts buffered an entire request body with no limit, allowing a memory-exhaustion DoS. It now enforces a 32MB cap while streaming: it stops accumulating once exceeded so memory stays bounded, drains the rest of the stream, then rejects with HTTP 413. The cap sits above the largest legitimate body (a 20MB screenshot PNG sent base64-encoded, roughly 27MB, inside a JSON envelope), so the operator-screenshot endpoint is unaffected.
-
Parameterized Grafana credentials and dropped anonymous admin. The local observability stack (observability/docker-compose.otel.yml) hardcoded GF_SECURITY_ADMIN_PASSWORD=admin and granted anonymous users the Admin role, so anyone reaching the port had full admin with no login. The admin password is now overridable via ${GF_SECURITY_ADMIN_PASSWORD:-admin} and anonymous access is downgraded to read-only Viewer (overridable and disableable via GF_AUTH_ANONYMOUS_ENABLED), keeping local dev frictionless but no longer privileged.
-
Removed an XSS vector in the run-chronicle prototype. The run-chronicle-v2 POC built its feed and inspector via innerHTML with interpolated data fields (event type, summary, detail, node IDs, tool names, dependency and timeline labels). Those assignments were replaced with DOM construction using textContent, so no data value is ever parsed as HTML, removing the injection vector if the currently-hardcoded data ever becomes dynamic.
-
Reliability: defensive JSON.parse and stream-cleanup guards. Several stream and message handlers were made robust against malformed input: the pi-plugin SSE events() generator now skips unparseable frames and releases its reader on throw, return, or early consumer break (preventing a leaked connection); client-side WebSocket message handlers in the e2e fault suite guard JSON.parse to match their server-side counterparts; and stallSandbox.release() now awaits its filesystem restore to close a latent race.
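As a rough illustration of the link-scheme allowlist described for agent output, the sketch below shows one way such a check could be written. It is an assumption-laden example, not the code in MarkdownContent.tsx: the function name isSafeHrefSketch, the control-character stripping, and the decision to treat any string without a scheme as relative are choices made here for illustration.

```typescript
// Schemes permitted for agent-authored links, per the allowlist in the notes.
const SAFE_SCHEMES: ReadonlySet<string> = new Set(["http", "https", "mailto"]);

// Hypothetical helper: true when `href` is relative, an anchor, or uses an
// allowlisted scheme. Anything else (javascript:, data:, ...) is rejected.
function isSafeHrefSketch(href: string): boolean {
  // Drop whitespace and control characters first, since URL parsers ignore
  // some of them and "java\tscript:" would otherwise slip past the check.
  const normalized = href.replace(/[\u0000-\u0020]/g, "");

  // A scheme is a letter followed by letters, digits, "+", "." or "-",
  // terminated by ":". No match means a relative path or "#anchor".
  const match = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(normalized);
  if (match === null) return true;

  return SAFE_SCHEMES.has((match[1] ?? "").toLowerCase());
}

// Rendering decision: link when safe, plain text otherwise (assumed shape).
function renderLink(label: string, href: string): { kind: "link" | "text"; label: string; href?: string } {
  return isSafeHrefSketch(href)
    ? { kind: "link", label, href }
    : { kind: "text", label };
}
```

An allowlist is used rather than a blocklist of known-bad schemes, so a scheme nobody anticipated falls back to plain text by default.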
Reliability & Correctness
-
Bounded the completed-activity result cache to stop a gateway memory leak. The module-level completedActivityResults map in activity-bridge.js was keyed by composite idempotency key and never pruned, so it grew without limit across runs in a long-running gateway. It is now an insertion-ordered LRU that refreshes recency on reads and evicts the least-recently-used entry once it exceeds COMPLETED_ACTIVITY_RESULTS_MAX; distinct keys within a single run stay well under the cap, so the duplicate-execution idempotency guarantee is preserved.
-
Evicted consumed entries from the durable-deferred bridge. The process-lifetime deferredResolutions map in durable-deferred-bridge.js recorded every approval and wait-for-event resolution but never deleted them, leaking one Exit per resolved approval or signal indefinitely. Because each stored value is an Exit.succeed(...) that both consumers finalize on a successful read without re-polling, awaitBridgeDeferred now deletes the entry after consuming it.
-
Removed abort listeners on normal task completion in the driver. withAbort’s abortPromise registered an "abort" listener that was only cleaned up when an abort actually fired, so every task or poll that completed normally against the run’s single long-lived AbortSignal leaked a listener, producing MaxListenersExceededWarning and steady memory growth. abortPromise now returns a cleanup function and withAbort races inside a try/finally, removing the listener on both the resolve and reject paths.
-
Stopped leaking a process exit listener per Smithers instance. createSmithers/createExternalSmithers registered process.on("exit", closeDb) but never removed it, so repeated construction (tests, gateway, hot reload) accumulated listeners, producing a MaxListenersExceededWarning plus retained SQLite handles. The listener is now registered with process.once and detached inside closeDb after the database closes, whether closeDb runs on exit or is invoked directly through cleanup; the dbClosed guard keeps closeDb idempotent.
-
Cleared the per-check diagnostic timeout when a check resolves. runCheck raced check.run against a setTimeout-based rejection but never cleared the timer on the resolve path. Because diagnostics run on every agent invocation, this left a non-unref’d ~5s timer armed on a hot path, delaying event-loop quiescence. The handle is now captured, unref’d, and clearTimeout’d in a finally block once the race settles; the timeout duration and rejection behavior are unchanged.
-
Converted malformed-snapshot JSON into typed errors instead of crashes. Unguarded JSON.parse calls on persisted snapshot rows in parseSnapshot and forkRunEffect threw a raw SyntaxError that crashed the Effect fork/replay path. Every snapshot parse now routes through a parseSnapshotJson helper that wraps failures as a DB_QUERY_FAILED SmithersError, and forkRunEffect surfaces them as typed Effect failures (not defects) via Effect.try.
-
Read binary file content as raw bytes in the historical diff bundle. readBinaryContentAtRef reconstructed binary data from a string runGit had already decoded as UTF-8, so computeDiffBundleBetweenRefs/getNodeDiff produced corrupt base64 for images, wasm, and other binaries. A new runGitRaw variant collects raw Buffer chunks and base64-encodes them directly, matching the correctness of the sibling computeDiffBundle; the text path is unchanged.
-
Recorded added and removed Ralph loops in snapshot diffs. A one-sided Ralph loop in diffSnapshots was gated behind an impossible if (aR && bR) inside an if (!aR || !bR) branch, so a loop added or removed between snapshots produced no diff entry. SnapshotDiff gains ralphAdded/ralphRemoved arrays that are now populated and rendered in formatDiffForTui (and included in formatDiffAsJson via spread).
-
Emitted sibling addNode ops in ascending-index order. DevTools add ops were ordered only by depth, leaving equal-depth siblings in arbitrary set-iteration order; because applyDelta splices each new node at its precomputed index without re-indexing, applying them out of order corrupted child order on reorders. Add ops are now sorted by depth then ascending index so sequential splice-at-index inserts reproduce the target tree.
-
Appended events to the run log instead of rewriting it, and surfaced persist errors once. persistLog re-read and rewrote the entire stream.ndjson on every event (O(n²) IO and a clobber hazard for processes sharing a log) and is now an appendFile of just the new line. Separately, a single persist failure was delivered twice (the caller’s Effect rejected and a later flush() re-threw the stored error, spuriously aborting a healthy task); the returned Effect now awaits the same catch-cleared promise so the failure is owned solely by persistError and surfaced exactly once at flush.
-
Stopped double-counting Claude stream-json tokens. For Claude Code stream-json output, input and output tokens are accumulated incrementally from message_start/message_delta, but the terminal result event’s top-level usage summary fell through to the generic branch and was re-added, roughly doubling agentTokensTotal. extractUsageFromOutput now tracks whether incremental usage was counted and skips the result event in that case, still falling through when no incremental events were seen.
-
Parameterized the getRawNodeOutput query and failed the Effect on a missing runId column. getRawNodeOutput interpolated runId/nodeId straight into SQL via sql.raw, so a quote in an id broke the query (silently returning null) and was a latent injection footgun on a public export; it now binds parameters like its sibling. Separately, loadInput/loadInputEffect and snapshot.js threw a SmithersError synchronously during Effect construction when the input table lacked a runId column, breaking the Effect<…, SmithersError> contract; the body is now wrapped in Effect.suspend so the DB_MISSING_COLUMNS failure surfaces through the error channel.
-
Guarded last_insert_rowid() against null. SELECT last_insert_rowid() is typed Record | null, so chaining .id could dereference null. recordUsage and recordAuditEvent in the control plane now capture the row, throw a clear error when it is null, then read .id.
-
Reported the correct node kind in graph extraction. extractGraph’s addDescriptor derived duplicate-id messages from raw.__smithersKind, which is unset for Subflow, Sandbox, WaitForEvent, and Timer nodes, so those duplicates were misreported as “Duplicate Task id detected.” The kind is now passed in explicitly per task type, matching the legacy dom/extract.js. resolveOutput was also tightened to classify a value as an output table only when it is a genuine Drizzle table (via getTableName guarded by isDrizzleTable), instead of treating any non-string/non-Zod value as a table with an empty name.
-
Relocated keyed children instead of duplicating them in the reconciler. In mutation mode React moves an already-mounted child via insertBefore/appendChild with no preceding removeChild; the host config pushed into parent.children without removing the child’s current position, so a reordered keyed child appeared twice and surfaced as a DUPLICATE_ID during graph extraction. The host config now removes any existing occurrence of the child before inserting, matching DOM single-parent semantics; mounting a genuinely new child (indexOf returns -1) is unchanged.
-
Stopped isSmithersError from matching arbitrary { code, message } objects. The purely structural predicate only required code and message, so it matched Node system errors (ENOENT) and many third-party errors; toSmithersError then returned foreign errors unwrapped or copied an invalid libuv code into the wrapper, breaking retry classification. The predicate now passes only for a real instanceof SmithersError or a plain object whose code is a known SmithersErrorCode, which covers errors deserialized over the wire while excluding Node errno codes.
-
Surfaced pi-plugin RPC errors to the user. The approve, deny, and cancel commands in the pi-plugin extension awaited client.approve()/deny()/cancel() without a try/catch, so RPC failures rejected the handler silently. Each is now wrapped to notify the user via ctx.ui.notify() on failure.
-
Made withCorrelationContext visible to the imperative logger and documented the legacy shim. withCorrelationContext wrote the merged correlation context only into the Effect FiberRef, but the imperative logger reads from AsyncLocalStorage on a freshly forked fiber, so patched context was invisible to log annotations. It now propagates the merged context into the ALS store via acquireUseRelease (scoped to the effect, restored on release) while still setting the FiberRef. The legacy updateCurrentCorrelationContext shim, which mutates the current context in place via Object.assign, is now documented as an intentional compatibility path for non-Effect callers, pointing them at the Effect-based core.
-
Aligned the observability dashboards and event types with what is actually emitted. The OTel collector’s Prometheus exporter prepends smithers_ and dot→underscore-normalizes instrument names, yielding double-prefixed series like smithers_smithers_runs_total; observability/dashboards/smithers.json queried a single smithers_ prefix and returned no data, and is now aligned to the double prefix that the Grafana dashboard already used. The generated index.d.ts also regained the toolCallId field on the ToolCallStarted/ToolCallFinished event types (dropped because tsup --dts-only did not resolve the cross-file JSDoc import), and the Codex/Claude session resolvers were fixed to scan adjacent local-vs-UTC day folders and to match transcript paths by basename so they work on Windows.
-
Reliability. The driver now attaches a stdin "error" handler before writing to a child process so an EPIPE from a child that closes stdin early is logged rather than crashing the driver. The jj VCS workspace pre-create cleanup now logs failures via Effect.logWarning instead of swallowing them in a bare catch. Protocol error-code arrays gained @type {const} assertions so they are genuinely readonly at runtime, matching their declared tuple types. Empty subscribe:[] subscriptions are now treated as a real filter that denies cross-run devtools access rather than as no filter. And the fault-injection e2e harness now awaits stallSandbox.release()’s filesystem restore and guards client-side WebSocket JSON.parse calls against malformed frames.