0.22.0

Major release: Smithers ships Smithers Studio 2, a ground-up dark agent operations console with three-tier progressive-disclosure navigation, a Cmd-K/Cmd-P command palette, a color-equals-run-state design system, and a full suite of live data surfaces — Runs, Workflows, terminal, chat, JJHub, and DevTools — all wired to real Gateway and workspace-API backends. Studio 2 now defaults to a chat-first shell — one conversation that drives issues, runs, PRs, and workflows, with the tabbed shell one toggle away. Alongside Studio 2, this release de-mocks every e2e test onto real seeded backends, adds a live Gateway Run Chronicle plus init-pack-generated per-workflow UIs and a new smithers ui command, grows the CLI with a starter gallery and GEPA-style prompt optimization, and ships a drop-in agent-facing smithers skill so a coding agent can drive Smithers without reading the whole docs site. A focused security pass closes XSS, local-RCE, path-traversal, DoS, and auth-bypass vectors, and a broad correctness sweep hardens the engine, time-travel, DB, graph, and observability stacks.

Smithers Studio 2

The headline of 0.22.0 is Smithers Studio 2 (apps/smithers-studio-2), a ground-up rebuild of the agent operations console. The original “spaceship” studio put all ~25 views on one flat sidebar with equal weight, and color everywhere meant nothing stood out. Studio 2 inverts that: a dark, near-monochrome console where saturated color only ever signals run state, and where the surfaces you reach for daily are the only ones permanently visible. It is built on Vite, React 19, and Zustand, and ships as a web app today with a desktop shell in the wings. The design and information-architecture contracts are written down first, in apps/smithers-studio-2/docs/DESIGN.md and docs/UX.md.
Smithers Studio 2 home screen with recent workspaces and a live operations strip
  • Three-tier progressive-disclosure navigation. The shell exposes exactly four primary surfaces that are always visible — Home, Runs, Workspace, Workflows — backed by a single nav registry in src/shell/navRegistry.tsx. Six secondary surfaces (Issues, Landings, Workspaces, Memory, Scores, Search) live in a collapsed-by-default “More” group, and three developer surfaces (DevTools, SQL Browser, Logs) hide entirely until opted in. Every surface registers exactly one entry in buildNavRegistry, so a new view appears in the right sidebar section and the palette automatically without touching AppShell or Sidebar.
  • Registry-level developer-mode gating. Developer surfaces are gated by conditional construction, not CSS. When the persisted developerMode flag (localStorage key studio.developerMode) is off, the developer items are simply not added to the registry, making them unreachable by sidebar, command palette, and deep-link — the sidebar is byte-for-byte identical to a non-developer session. Toggling the flag off while sitting on a developer surface falls the user back to Home rather than stranding them on an unregistered route.
  • WELCOME → FOCUS → DETAIL altitude model. The app has three navigational altitudes you can always locate yourself in. Home (src/home/Home.tsx) is the WELCOME screen — a calm centered column with two verbs (Open Folder, launch a workflow), a recent-workspaces list, and a live Operations strip; opening a folder routes to Workspace (FOCUS) and launching a workflow routes to Runs. DETAIL always lives in an inspector pane inside a surface, never as a new top-level view, which is how the old runs/snapshots/approvals/scores/logs nodes collapse into a single “what are my agents doing” surface.
  • Cmd-K / Cmd-P command palette. A universal accelerator (src/shell/CommandPalette) reaches every registered surface across all three tiers plus contextual commands, so the rail can stay tiny without trapping power users. Prefix pills switch modes — > commands, / run workflow, @ open file, ? ask AI — parsed by parseQuery.ts, and results are grouped by section (“Go to”, “More”, “Developer”, “Commands”). Developer surfaces only surface in the palette when developer mode is on, since the palette is built from the same gated registry.
Opening the command palette and navigating to a surface
  • Dark design system where color equals state. All visual tokens live as CSS custom properties in src/theme.css and are mirrored as a single TypeScript export in src/theme/themeTokens.ts, so no component hardcodes a hex value. Three stacked surfaces (--bg, --surface-1, --surface-2) carry depth, text is white at three opacities, and the only saturated colors are run-state signals: --accent blue for live/running, --success green for completed/approved, --warning amber for waiting/pending-approval gates, and --danger red for failed/denied. Motion is limited to 120–150ms ease-out feedback on real state changes and collapses to zero under prefers-reduced-motion.
  • Zustand store as the single shell state. src/useStudioStore.ts holds the whole top-level shell state (active view, developer mode, command-palette open/query/selection, and terminal tabs), with detail state colocated in each surface’s folder rather than in the global union. The store keeps a backwards-compatible terminal → workspace view alias so existing hotkeys and tests that target the old terminal id keep resolving after the terminal moved inside the Workspace surface.
  • Global hotkeys and a runs badge. The shell owns global hotkeys via useHotkey (src/useHotkey.ts): Cmd-P and Cmd-K open the palette, Cmd-T opens a new terminal. The Runs nav row carries an unread approvals badge (runsBadgeStore) so time-sensitive gates surface without digging, consistent with the IA rule that anything that is a state of a run is disclosed inside the Runs surface rather than as a sibling nav node.
  • Electrobun desktop packaging direction. An Electrobun shell (electrobun.config.ts, electrobun/) wraps the already-built Vite output rather than rebuilding it: dist/ is copied verbatim as the mainview view and a Bun main process opens a single 1280×820 window pointed at views://mainview/index.html, with a SMITHERS_STUDIO_DEV_URL override that points the webview at the Vite dev server for HMR. It installs a minimal native macOS menu (app/quit, Edit roles so copy-paste works in the webview, Window controls). The Vite web app stays the source of truth — the desktop shell adds no UI of its own.
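
The registry-level gating described above can be sketched as follows. This is a minimal illustration, not the contents of src/shell/navRegistry.tsx: the NavEntry shape, the derived ids, and the names buildNavRegistrySketch and resolveActiveView are assumptions made for the example. Only the surface names and the "fall back to Home" behavior come from the notes.

```typescript
// Hypothetical sketch: entry shape and ids are assumptions, not the real registry.
type Tier = "primary" | "secondary" | "developer";

interface NavEntry {
  id: string;
  label: string;
  tier: Tier;
}

// Derive a stable id from the label, e.g. "SQL Browser" -> "sql-browser".
const entry = (tier: Tier, label: string): NavEntry => ({
  id: label.toLowerCase().replace(/\s+/g, "-"),
  label,
  tier,
});

function buildNavRegistrySketch(developerMode: boolean): NavEntry[] {
  const registry: NavEntry[] = [
    ...["Home", "Runs", "Workspace", "Workflows"].map((l) => entry("primary", l)),
    ...["Issues", "Landings", "Workspaces", "Memory", "Scores", "Search"].map((l) =>
      entry("secondary", l),
    ),
  ];
  // Conditional construction: developer entries are never created when the
  // flag is off, so no consumer of the registry can reach them.
  if (developerMode) {
    registry.push(...["DevTools", "SQL Browser", "Logs"].map((l) => entry("developer", l)));
  }
  return registry;
}

// If the active view is no longer registered, fall back to Home.
function resolveActiveView(activeView: string, registry: readonly NavEntry[]): string {
  return registry.some((e) => e.id === activeView) ? activeView : "home";
}
```

Because the sidebar and the palette would both read from the same array, gating at construction time removes the need for per-consumer visibility checks.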

Chat-First Shell

The default Studio 2 surface is now a chat-first shell (src/chat): one long conversation with the agent is the whole app, replacing tab navigation. Rather than clicking between surfaces, you tell the agent what you want and it manages issues, runs, PRs, workflows, and sandboxes for you, showing rich data inline through a sandboxed-HTML tool or in an overlay that can split beside the chat or sit full-screen over it. Slash commands map to Smithers CLI features and open a default UI as an overlay — surface overlays reuse the existing real Studio surfaces verbatim, and the terminal overlay reuses the real Ghostty PTY. The switch is non-destructive: the classic tabbed shell (shell/AppShell) is untouched and one toggle away via the /studio command or the project-bar gear, gated by a new shellMode flag in the studio store that App switches on. Concepts the backend does not model yet — projects, per-message tags, the chat feed, the agent HTML tool, and overlays — are fed from typed seams with mock implementations behind real-ready interfaces (grep SEAM:), while everything that already has a backend (gateway, workspace API, the reused surfaces, the PTY) is wired to it directly. Pure helpers (parseSlash, resolveSlashAction, tagColor) are split out and unit-tested, and the shell has real-backend e2e coverage.
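
As a rough illustration of the kind of pure helper that can be unit-tested in isolation, here is a minimal slash-command parser. It is a sketch under assumptions: the SlashCommand shape and the name parseSlashSketch are hypothetical, and the real parseSlash in src/chat may accept a different grammar.

```typescript
// Hypothetical sketch: not the real parseSlash; shape and rules are assumptions.
interface SlashCommand {
  name: string;
  args: string[];
}

function parseSlashSketch(input: string): SlashCommand | null {
  const trimmed = input.trim();
  // Plain chat messages are not commands.
  if (!trimmed.startsWith("/")) return null;
  const [head, ...args] = trimmed.slice(1).split(/\s+/);
  // A bare "/" or "/ foo" has no command name.
  if (!head) return null;
  return { name: head.toLowerCase(), args };
}
```

Keeping the parser free of React and store dependencies is what makes it cheap to cover with unit tests; the caller decides what each command name maps to.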

Studio 2 Data Surfaces and Workspace

Behind the shell, every surface is a real, live view over the Gateway and workspace-API backends. Here is what each one does.
  • Live Runs inspection surface. The Runs view pairs a run-history rail (with an approvals filter and a live nav badge) against a responsive tree-plus-inspector layout that splits with a draggable divider above 800px and falls back to a modal inspector sheet on narrow widths. It streams run.event frames over the Gateway’s real WebSocket protocol (a connect + streamRunEvents subscription routed through /v1/rpc), debouncing event bursts into a single getRun + getDevToolsSnapshot refresh of only the selected run so a chatty run never hammers the list RPCs. The RunToolbar exposes the lifecycle actions valid for the current state (cancel a live run, resume one that has reached a terminal state) plus a frame scrubber for time-travel rewind.
Studio 2 Runs surface streaming live run events into a tree and inspector
  • Inline approval gates. When a selected node has a pending approval, an ApprovalGate renders directly inside the inspector with an optional note and approve/deny buttons that post submitApproval. Approvals are surfaced three ways from one source of truth: the history-list filter, the inline gate, and the sidebar badge count driven by runsBadgeStore.
  • Embedded workflow UIs. A run whose workflow ships its own custom UI defaults to rendering that bundle in an iframe via WorkflowRunUi, scoped with ?runId=<id>; the studio host proxies /workflows/*, /v1/rpc, and the run-event socket so the embedded UI boots same-origin. A toolbar toggle swaps between the custom UI and the generic tree/inspector.
  • Workflows launcher. The Workflows surface browses launchable things across Local, Remote, Prompts, and Schedules segments, shows a selected workflow’s source and summary, and launches it with arguments. When a workflow declares typed launch fields it renders one control per field (text/number/boolean/JSON) with inline validation, otherwise a freeform JSON textarea. Launching creates a run and hands off to the Runs surface with the new run auto-selected.
  • In-app workspace terminal and agent chat. The Workspace surface is segmented between a Ghostty-rendered terminal and an agent chat. Each terminal pane owns its own GhosttyCore WASM instance keyed by tab id, and the core is evicted on tab close so WASM linear memory is reclaimed instead of leaking on every open/close. Terminals are backed by a JSON-RPC-over-WebSocket PTY server (scripts/pty-server.ts) with scrollback replay on attach, a concurrent-session cap, frame-size limits, and an orphan reaper. The agent-chat segment renders streaming markdown/code responses with a model/mode indicator and a calm auto-scroll that only pins to the newest block when the viewer is already near the bottom; it stays mounted across segment switches so history is preserved.
In-app workspace terminal running a real shell via the Ghostty renderer
  • Home welcome and operations strip. Home is a calm welcome column with recent workspaces and a live Operations strip that derives running / waiting / pending-approval counts from the same listRuns + listApprovals RPCs as the Runs surface, polled on a lighter 5s cadence. When the workspace backend is unreachable it shows a connect/boot panel instead of empty recents.
  • JJHub issues, landings, and workspaces. The IssuesPanel lists, filters, opens, creates, and closes/reopens JJHub issues; the LandingsPanel shows a landing’s info, diff, checks, and conflicts in tabs and supports review and land actions; the WorkspacesPanel manages cloud workspaces (create, suspend, resume, fork, delete) and their snapshots alongside local workspace recents. All three read JJHub auth status and route through the real jj/jjhub CLIs behind the workspace API.
  • Memory, Scores, and Search. Memory queries cross-run memory facts by namespace/key with a debounced search; Scores shows aggregate and per-run scorer results; Search is a global workspace search with a Code / Issues / Repos / Transcripts scope control, request-generation guarding to drop stale responses, and serves as the full-page fallback for the palette’s search mode. Each is wired to the real /memory, /scores, and /search workspace endpoints.
  • DevTools, SQL Browser, and Logs. Behind the developer toggle, DevTools renders the raw, unfiltered DevTools snapshot tree and node props for a run picked from listRuns; the SqlBrowser is a read-only query surface over the workspace SQLite database that lists tables, shows a selected table’s schema, and runs ad-hoc queries; Logs is a dense, filterable event-log firehose (by level, category, and free text) with error/warning counts.
Studio 2 DevTools surface showing the raw run snapshot tree and node inspector
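
The typed launch fields in the Workflows launcher (text/number/boolean/JSON with inline validation) could be validated along these lines. This is a sketch only: the FieldKind and FieldResult types, the error strings, and the function name are assumptions, and the real launcher may validate differently.

```typescript
// Hypothetical sketch of per-field validation; types and messages are assumptions.
type FieldKind = "text" | "number" | "boolean" | "json";

type FieldResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

function validateLaunchField(kind: FieldKind, raw: string): FieldResult {
  const text = raw.trim();
  switch (kind) {
    case "text":
      // Text is passed through untouched, including surrounding whitespace.
      return { ok: true, value: raw };
    case "number": {
      const n = Number(text);
      // Number("") is 0, so an empty input must be rejected explicitly.
      return text !== "" && isFinite(n)
        ? { ok: true, value: n }
        : { ok: false, error: "Expected a finite number" };
    }
    case "boolean":
      if (text === "true") return { ok: true, value: true };
      if (text === "false") return { ok: true, value: false };
      return { ok: false, error: 'Expected "true" or "false"' };
    case "json":
      try {
        return { ok: true, value: JSON.parse(text) };
      } catch {
        return { ok: false, error: "Expected valid JSON" };
      }
  }
}
```

Returning a tagged result instead of throwing lets a form render the error next to the offending control without a try/catch per field.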

Studio 2: Real Backends, De-Mocking, and Hardening

  • bun dev now boots a real Gateway and real workspace API. The Studio 2 dev stack runs an actual Gateway (server/startGatewayServer.ts) and workspace-API backend (server/createWorkspaceApiServer.ts, server/workspaceBackend.ts) against the real workspace smithers.db — wiring real SQL, scores, memory, chat-agent, and jj/jjhub operations instead of fabricated responses. No mocks remain in the product runtime path. Seeded fixtures are now strictly opt-in behind the SMITHERS_DEV_USE_FIXTURES flag for both the gateway and workspace API.
  • The e2e suite drives real backends with seeded data, with no route mocks. Playwright specs across every surface (runs live-inspection, rewind/resume, home, command palette, SQL browser, DevTools tree, terminal lifecycle, landings, chat) now exercise the real Gateway and workspace backend rather than page.route/routeWebSocket stand-ins, including a real read-only SQL exec path that genuinely rejects DELETE. Per-test x-studio-session isolation makes the parallel suite deterministic, and Playwright is restricted to *.spec.ts so bun *.test.ts units aren’t node-loaded by the browser runner.
  • Live run events ride the real Gateway WebSocket protocol. Run inspection now connects and subscribes over /v1/rpc (through the vite ws proxy) using the real gateway WS protocol instead of a dead raw socket, and streamRunEventsResilient reconnects on a silent 1006 close. The poll/event refresh paths were deduped so a live run refreshes only the selected run rather than hammering the gateway with full-list refetches every tick.
  • Stale-response races eliminated across async surfaces. useGatewayRpc now tags each refetch with a generation counter and discards responses from superseded requests, so a slow earlier call can no longer clobber fresher data. LandingsPanel guards its diff/checks/conflicts loaders against a previously-selected landing’s late response overwriting the current one, and useGatewayRunEvents clears its events array when runId changes so switching from run A to run B no longer shows A’s events followed by B’s.
  • Unmount paths settle and prune their state. usePtySession now explicitly rejects all pending RPC promises with a “terminal unmounted” error before clearing the map, instead of relying on the async ws.close() onclose to settle them (which fired after the map was already empty, leaking in-flight calls). The scheduler’s pruneUnmounted loop now also deletes stale state.approvals and state.retryCounts entries, closing a hot-reload bug where a re-mounted task silently bypassed its human-approval gate and inherited an exhausted retry budget.
  • Gateway auth and registry hardening. packages/server/src/gateway.js received four surgical fixes: startRun now deletes its runRegistry entry on completion (it previously leaked one record per run forever); token-mode auth guards the lookup with Object.hasOwn so prototype keys like __proto__ or toString return 401 rather than crashing with a 500; verifyJwtToken now rejects tokens with a missing or non-numeric exp claim; and the WS error handler echoes the real frame.id instead of hardcoded "invalid"/"server" ids.
  • Client-side security fixes. The PTY WebSocket server now rejects cross-origin upgrades (gating on a loopback Origin) to prevent a drive-by page from spawning a local shell; agent/LLM markdown links are sanitized to allow only http/https/mailto/relative hrefs so a javascript: URL can no longer execute on click under React 19; and the workspace API caps request bodies at 32MB (streaming, then HTTP 413) to prevent a memory-exhaustion DoS while still admitting a base64 screenshot envelope.
  • Round-3 UX, accessibility, and robustness polish. Across all surfaces: run-tree keyboard navigation with roving focus, toolbar lifecycle tooltips, clearer empty states, calmer chat auto-scroll that only pins when near the bottom, ARIA status/alert roles on chat/terminal/approval and combobox/listbox roles on the command palette, SQL read-only messaging, and a ***bold-italic*** markdown fix. The 2-second live-run poll was deduped so identical ticks no longer churn the tree and reset expansion state.
  • Reliability. GET /memory and /scores no longer drop their default limit (a missing ?limit had coerced to Number(null) === 0, returning a single row); the Approval decision output now populates decidedAt from decidedAtMs and evaluates autoApprove.condition/revertOn callbacks exactly once per render; Claude stream-json token counting no longer double-counts the terminal result event’s usage; the LLM-judge scorer parses balanced-brace JSON so a brace inside a reason no longer forces a score of 0; RPC total/inactivity timers are cleared on all settle paths; the bash network guard matches blocked tools as whole tokens (so echo bundle.js is no longer rejected); the reference gateway’s token store and websocket client message handlers guard JSON.parse to fail secure on corrupt input; the workspace API surfaces a clean gateway-empty error instead of a raw TypeError when the gateway answers a request with a 200 and an empty body; the cloud-workspace flow clears restoreSourceSnapshotId when the Restore modal is cancelled or a plain New Workspace modal is opened, so a Restore → Cancel → New Workspace → Create sequence no longer silently restores an intended-empty workspace from a stale snapshot; and stallSandbox.release() now awaits its filesystem restore.
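
The generation-counter technique used to drop stale responses can be sketched as a small guard class. This is an illustration of the general pattern, not the useGatewayRpc implementation; the class name LatestOnly and its methods are hypothetical.

```typescript
// Hypothetical sketch of a generation guard; names are assumptions.
class LatestOnly<T> {
  private generation = 0;
  private current: T | undefined;

  // Call before starting a request; the returned token identifies it.
  begin(): number {
    this.generation += 1;
    return this.generation;
  }

  // Call when a response arrives. Only the most recently started request
  // is allowed to write; anything older is discarded.
  commit(token: number, value: T): boolean {
    if (token !== this.generation) return false;
    this.current = value;
    return true;
  }

  get value(): T | undefined {
    return this.current;
  }
}
```

In an async hook the shape would be: take a token with begin(), await the RPC, then commit(token, result). A slow earlier call resolves with an outdated token and is ignored, whatever order the responses arrive in.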

Gateway Console & Workflow UIs

  • Added the Run Chronicle to the Gateway console. Selecting a run in the operator console (packages/server/src/gatewayUi/defaultOperatorUi.js) now opens a live, three-pane detail view: a DevTools Tree of the workflow’s nodes, a Chronicle event log, and an Inspector showing the selected node’s props, output, and filesystem diff. The console opens its own Gateway WebSocket and subscribes to streamDevTools and streamRunEvents, applying incremental DevTools snapshot/delta ops (addNode, removeNode, updateProps, updateTask, replaceRoot) to keep the tree current and prepending run events to a bounded chronicle. Node output and diff are fetched on demand via rpcSocket("getNodeOutput") / getNodeDiff. Live status pills surface the DevTools and event stream state (Live, Retrying, Resynced, Needs refresh), the current frame/seq, and heartbeat age; stream failures retry with backoff and stale generations are discarded so switching runs never crosses wires. The new selectedRunId and per-run stream state reset cleanly between selections.
  • Attributed runs correctly across workflows sharing a database. The Gateway (packages/server/src/gateway.js) now caches one SmithersDb adapter per underlying database — iterating each DB once rather than once per registered workflow — and attributes each run to its true workflow via the run’s gatewayWorkflowKey config (falling back to the row’s workflowName). This prevents cross-workflow runs on a shared project DB from being misattributed to whichever adapter happened to find the row, which is what makes a single Gateway hosting many workflow UIs over one DB report each run under the right workflow.
  • Scaffolded a bespoke React UI for every init-pack workflow. smithers init now emits a .smithers/ui/<key>.tsx standalone React app for each workflow (apps/cli/src/workflow-pack.js, sources in apps/cli/src/workflowUiSources.js), covering plan, implement, research-plan-implement, review, research, ticket-create, tickets-create, ralph, improve-test-coverage, debug, grill-me, write-a-prd, feature-enum, audit, mission, workflow-skill, plus dedicated kanban and plan renderers. Each UI is built on the gateway-react hooks (useGatewayRuns, useGatewayRunEvents, useGatewayNodeOutput, useGatewayActions, useGatewayApprovals) and renders a live, workflow-specific view — for example review.tsx shows a per-reviewer approval verdict bar driven by the run’s real review:N node outputs. The generated .smithers/gateway.ts mounts each workflow and its UI independently via a UI_WORKFLOWS table, logging a /workflows/<key> mount URL; a workflow that fails to import (e.g. broken MDX) disables only its own UI while the rest of the Gateway stays up.
Generated Review workflow UI showing per-reviewer approval verdicts
  • Added smithers ui [runId] to open a run’s workflow UI. The new command (apps/cli/src/index.js) resolves a run to its workflow and the Gateway’s uiPath over /v1/rpc (listWorkflows, listRuns, getRun), then opens <gateway><uiPath>?runId=<id> in the browser. With no runId it attaches to the most recent run; --workflow opens a workflow UI directly, --gateway/--port target a specific Gateway, and --no-open just prints the URL. It fails clearly when no Gateway is reachable or the resolved workflow has no UI mounted.
  • Covered the workflow UIs with a real-browser e2e harness. apps/cli/tests/workflow-ui-all.e2e.test.js runs a real smithers init, boots the generated gateway.ts as a live server, and loads all 16 UIs in headless Chromium (no route mocking) — every UI is asserted to build, boot, and mount, and the deterministic subset (driven by descriptors in workflow-ui-descriptors.json) is executed for real and asserted to render that run’s actual node output. This is the check that caught a write-a-prd MDX failure and a Gateway-wide crash. A companion smithers ui e2e verifies the command opens the right run’s UI.
  • Reliability. useGatewayRunEvents now clears its events array when runId changes so switching runs no longer concatenates the previous run’s events; useGatewayRpc discards stale responses via a per-request generation counter so a slow earlier request can’t clobber fresher data; and the pi-plugin SSE events() generator now releases its stream reader on throw, return, or consumer break and guards JSON.parse so a single malformed frame is skipped rather than aborting the stream. The Gateway WebSocket frame parsing in the console is likewise guarded against invalid frames, and the run-chronicle prototype was hardened against XSS by building DOM nodes with textContent instead of innerHTML.
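
Two of the ideas above, guarded frame parsing and a bounded chronicle that prepends new events, can be sketched together. The RunEvent shape (seq and type fields) and both function names are assumptions for illustration; the real frame format is not given in these notes.

```typescript
// Hypothetical sketch: the frame shape is an assumption, not the Gateway's schema.
interface RunEvent {
  seq: number;
  type: string;
}

// Returns null for malformed input so one bad frame is skipped
// instead of aborting the whole stream.
function parseRunEventFrame(raw: string): RunEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as Record<string, unknown>;
  const seq = record["seq"];
  const type = record["type"];
  if (typeof seq !== "number" || typeof type !== "string") return null;
  return { seq, type };
}

// Newest event first; the oldest entries fall off once max is reached.
function prependBounded<T>(items: readonly T[], item: T, max: number): T[] {
  return [item, ...items].slice(0, max);
}
```

Returning a new array rather than mutating in place keeps the helper safe to use with state containers that compare by reference.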

CLI, Evals, Agents & Workflow Tooling

  • <Task fork="task-id"> — immutable agent session forking. A new fork prop on <Task> starts an agent task from a copy of another task’s final session context. fork adds the source as an implicit dependency (the forked task waits for it to complete), composes with dependsOn/needs/deps/Sequence/Parallel/Branch/Loop, and copies the source’s persisted conversation snapshot into a fresh, independent session before submitting the new prompt — the source is never mutated and its native session id is never reused. This enables follow-up chains (plan → implement → verify), parallel branches that all start from the same base context, and reforking (a forked task can itself be forked). Inside a <Loop>, fork resolves to the latest completed snapshot for that task id. New validation surfaces clear typed errors for invalid fork sources: TASK_FORK_SOURCE_NOT_FOUND (id absent from the graph or only present in an unselected branch), TASK_FORK_CYCLE (direct or indirect), TASK_FORK_SESSION_UNAVAILABLE (non-agent forking task, or a source with no usable session snapshot), and TASK_FORK_SOURCE_NOT_COMPLETE (source has not finished yet — the forked task waits).
  • Agent-facing smithers onboarding skill. A new drop-in skill (skills/smithers) teaches a coding agent to drive Smithers without reading the whole docs site, shortening the time to the first aha moment. Its SKILL.md on-ramp makes explicit that the agent operates Smithers on the user’s behalf (it is not a human GUI), then bundles the full docs (llms-full.txt) next to it for the exact API on demand — the docs generator now mirrors the canonical bundle into skills/smithers/llms-full.txt so the skill stays drift-free. It is linked from the README, installation, quickstart, and the docs index with a one-line install so it is discoverable from every getting-started surface.
  • Per-agent support docs. A new Agent Support docs section (docs/agents/) documents how to wire Smithers into the coding agent a user already runs: one overview page plus dedicated guides for Claude Code, Codex, Cursor, GitHub Copilot, Pi, Hermes, and OpenClaw. Each page gives the exact, source-verified install for that agent’s surfaces: the skill (skills add or a one-line curl), the MCP server (mcp add, or the agent’s native config: .mcp.json/.cursor/mcp.json JSON, Codex config.toml TOML, Copilot’s two distinct schemas for VS Code and cloud, Hermes mcp_servers YAML, OpenClaw mcp.servers JSON), and standing instructions (AGENTS.md/CLAUDE.md/Cursor rules). The overview leads with the two fan-out commands, smithers skills add and smithers mcp add, which install the skill and register the MCP server into every detected agent (Claude Code, Codex, Cursor, Copilot, Gemini, OpenCode, Amp, Windsurf, and ~14 more) in one step, and a support matrix shows what is auto-wired versus configured by hand. The section is cross-linked from Installation and Integrations, with an /agents redirect.
  • First-class wiring for Pi, Hermes, and OpenClaw. Three agents fell outside the skill/MCP framework’s built-in registry, so mcp add / skills add left them to manual setup. A supplementary wiring step (apps/cli/src/agent-wiring/) now closes that gap on success: mcp add writes the Smithers MCP server into Hermes’s ~/.hermes/config.yaml (mcp_servers, YAML) and OpenClaw’s ~/.openclaw/openclaw.json (mcp.servers, JSON) — directly, the way the framework already special-cases Amp — and skills add copies the canonical .agents/skills set into Pi’s ~/.pi/agent/skills. Each writer detects the agent by its config directory, preserves existing config (and refuses to clobber an unparseable file), and honors the same --no-global, --agent, and --command flags as the underlying command. With this, one mcp add + one skills add reaches every supported agent, Pi/Hermes/OpenClaw included.
  • HermesAgent — run Hermes as a workflow worker. A new agent class (packages/agents/src/HermesAgent.js) lets a <Task> drive the Hermes (Nous Research) agent. Hermes exposes an OpenAI-compatible HTTP API, so HermesAgent is a thin wrapper over OpenAIAgent: it points the provider at the Hermes server via baseURL (or the HERMES_BASE_URL env var), defaults apiKey from HERMES_API_KEY, and disables AI SDK native structured output by default since a local Hermes server may not honor JSON-schema response formats. It is re-exported from smithers-orchestrator alongside the other agent classes.
  • Starter gallery for seeded workflows. A new smithers starters command (apps/cli/src/starter-gallery.js) presents plain-English outcomes with copy-paste commands for people who want a result before writing workflow code. Each detailed starter prints the expected outcome, the context to gather first, the exact workflow run command, useful follow-ups, and when not to use it. Browse the whole catalog or filter with --audience, --goal, or --workflow, and emit --format json when another tool needs the catalog. Ten canonical starters map plain outcomes (idea-to-prd, launch-checklist, customer-incident, quality-audit, ship-a-change, mission-mode, and more) onto the underlying seeded workflows.
  • Guided init template selection. smithers init --template <id> scaffolds a single starter with guided next steps instead of dropping the full pack, with validation tightened so unknown template IDs are rejected. The starter gallery and init share one source of truth for IDs and aliases, so starters <id> lookups and init --template stay consistent.
  • Smithers prompt optimization with provider support. A new smithers optimize command (apps/cli/src/optimize-command.js, optimize-suite.js) runs an eval suite twice — a baseline with the workflow’s current prompts and an optimized run with GEPA-generated prompt patches — and writes the winning prompt artifact only when the optimized score clears --min-improvement. Artifacts patch only agent-backed <Task> prompts by nodeId, leaving workflow structure, output schemas, retries, approvals, and persistence untouched, and can be replayed into future evals via smithers eval --optimization. The patch generator accepts the same provider vocabulary as agents and accounts — cerebras, openai/codex, anthropic/claude, gemini/antigravity, kimi/moonshot, and a generic OpenAI-compatible endpoint for opencode/pi/amp/forge — plus a deterministic heuristic provider for tests and fixtures. The optimization artifact format lives in @smithers-orchestrator/engine (optimization-artifact.js).
  • OpenCode agent detection in init. smithers init now detects the OpenCode CLI alongside the existing agents, recognizing its config and data directories and provider API keys so OpenCode-based workflows scaffold without manual wiring.
  • Composite components now run their summary tasks on agents. Supervisor, ScanFixVerify, and CheckSuite each had a final task with string children but no bound agent, so the summary rendered as a static task that emitted its prompt literally instead of synthesizing results. Supervisor now binds its final summary to props.boss, ScanFixVerify binds its report to props.verifier, and CheckSuite converts its verdict into a compute task that depends on every check via dependsOn and aggregates pass/fail per its strategy (all-pass / majority / any-pass). EscalationChain was also fixed to evaluate each level’s escalateIf predicate against the prior level’s real result, so a level — including the human-fallback approval — only runs when the previous level actually escalated, instead of every level firing unconditionally.
  • Generated skill front matter uses the parsed workflow description. renderWorkflowSkill computed description from the workflow’s own metadata but hardcoded the generic default in the YAML front matter, so a custom smithers-description was ignored by the field tooling reads first. The front matter now reflects the parsed description.
  • Reliability. Multiple correctness fixes across the CLI, agents, and supporting packages: text truncation in agent stderr, captured stdout/stderr, and CLI node-detail output now snaps to a UTF-8 codepoint boundary instead of slicing mid-character and emitting a U+FFFD replacement char, and captureProcess gives stdout and stderr independent capture budgets so noisy stderr can no longer starve real command output. The bash network guard now tokenizes commands and matches blocked tools as whole executable names (and URL schemes as token prefixes), so benign commands like echo bundle.js or cat pipeline.txt are no longer over-blocked. Claude Code stream-json token accounting stops double-counting the terminal result event’s usage on top of the incremental message_start/message_delta totals. The OpenAPI tool factory preserves a server base path (e.g. /v2) when joining absolute operation paths instead of dropping it and hitting a 404, and loadSpec no longer masks a genuine content-parse error by re-parsing the file path as inline spec text. The scheduler prunes stale approvals and retryCounts on unmount so a hot-reloaded task can’t silently bypass a human-approval gate or lose its retry budget, and CachePolicy now defaults its context type to unknown rather than any. Judge scorers parse their JSON with balanced-brace extraction so a { inside a reason no longer throws and silently scores 0. Codex session resolution scans candidate day folders derived from both UTC and local dates across adjacent days so transcripts in negative-UTC offsets are found, and Claude session fallback matches by basename for cross-platform paths. Agent RPC total/inactivity timers are now cleared on every settle path. 
The down command now honors its --force flag: it skips runs whose heartbeat is still fresh unless forced, where it previously force-cancelled every active run unconditionally. The logs --follow command now emits its waiting-approval/event/timer CTA hints by tracking the last waiting status observed, instead of testing a condition that could never be true inside the loop’s exit block. Finally, smithers-demo wraps its provider-response JSON.parse calls to surface a descriptive Invalid JSON response from <provider> error rather than a cryptic SyntaxError on a 200-with-bad-body.
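
The codepoint-boundary truncation described in the Reliability item above can be sketched roughly as follows. This is an illustration only, not the Smithers source: truncateUtf8 is a hypothetical name, and the sketch assumes the input bytes are valid UTF-8.

```typescript
// Hypothetical helper (not the Smithers implementation): truncate UTF-8 bytes
// to a byte budget without cutting a multi-byte character in half.
// Assumes `bytes` holds valid UTF-8.
function truncateUtf8(bytes: Uint8Array, maxBytes: number): Uint8Array {
  if (maxBytes <= 0) return bytes.subarray(0, 0);
  if (bytes.length <= maxBytes) return bytes;

  let end = maxBytes;
  // bytes[end] is the first byte that would be dropped. UTF-8 continuation
  // bytes have the form 10xxxxxx; if the cut lands on one, back up to the
  // lead byte so the whole character is dropped instead of being split.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  return bytes.subarray(0, end);
}

// "a" + "é" (0xC3 0xA9) + "b"
const sample = new Uint8Array([0x61, 0xc3, 0xa9, 0x62]);

// A naive 2-byte slice would keep 0x61 0xC3 and later decode to "a" + U+FFFD.
console.log(truncateUtf8(sample, 2).length); // prints 1
```

Working on bytes rather than on a decoded string keeps the budget exact; the cost is that the result can be up to three bytes shorter than the requested limit.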

Security

  • Rejected cross-origin PTY WebSocket upgrades. The Studio terminal server (scripts/pty-server.ts) bound to 127.0.0.1 but never checked Origin, and because browsers do not enforce same-origin policy on WebSockets, any page a developer visited could open ws://127.0.0.1/terminal/ws and spawn a shell (local RCE). The upgrade now requires a loopback Origin (extensible via PTY_ALLOWED_ORIGINS) and refuses cross-origin connections with a 403 before any shell is spawned; a missing Origin, which only comes from non-browser local tooling, is still allowed since it is not the drive-by vector.
  • Sanitized markdown link schemes in agent output. MarkdownContent.tsx rendered agent/LLM-authored links with an unsanitized href, and React 19 renders javascript: URLs verbatim, so a [label](javascript:...) link executed on click (DOM XSS). The renderer now allows only http, https, mailto, and relative/anchor hrefs via isSafeHref, falling back to plain text for any unsafe scheme.
  • Hardened gateway token, JWT, and run-registry handling. Token-mode auth indexed this.auth.tokens[token] directly, so magic keys like __proto__, toString, constructor, and hasOwnProperty resolved to inherited prototype members and were treated as valid grants; the lookup is now guarded with Object.hasOwn so those tokens are rejected as UNAUTHORIZED. verifyJwtToken also accepted tokens with a missing or non-numeric exp claim, meaning they never expired — a missing/invalid exp is now rejected outright. The same change deletes leaked runRegistry entries on run completion and echoes the real frame.id in WebSocket RPC error frames.
  • Denied devtools access for explicitly-empty run subscriptions. isDevToolsRunAuthorized() treated an empty subscribedRuns set the same as no filter, letting a client that connected with subscribe:[] read any run’s devtools snapshots and streams. The check now distinguishes the two states: null/undefined means no filter (unrestricted, backward compatible), while a Set — including an empty one — means a filter was provided, so the runId must be a member. An empty set therefore denies every run.
  • Rejected path-traversal characters in account labels. Account labels become a path segment under ~/.smithers/accounts, so a label like ../../../../etc let defaultConfigDir escape the smithers root before runAgentAdd.js created the directory. Labels are now validated against the wizard’s [A-Za-z0-9._-] pattern (rejecting ., .., and empty), with a defense-in-depth assertion that the resolved path stays under the accounts directory so a future regex change can’t silently reintroduce traversal.
  • Guarded the reference gateway token store and failed secure. readTokens() in deploy/reference/reference-gateway.mjs parsed $SMITHERS_TOKEN_STORE without error handling, so a corrupt or tampered store crashed the gateway at boot. The JSON.parse is now wrapped in try/catch: on failure it logs a warning and returns an empty token set, so a malformed store can neither crash boot nor grant access.
  • Tightened the bash network guard to whole-token matching. assertNetworkAllowed joined the command and args into one string and used substring .includes() against fragments like bun, npm, pip, and git, so benign commands such as echo bundle.js or cat pipeline.txt were wrongly rejected with TOOL_NETWORK_DISABLED. The guard now tokenizes the command and matches network tools as whole executable basenames, URL schemes as token prefixes, and git remote ops as whole tokens. The blocked tool set and the allowNetwork bypass are unchanged — only the matching is corrected.
  • Capped workspace API request bodies to prevent DoS. readJsonBody() in createWorkspaceApiServer.ts buffered an entire request body with no limit, allowing a memory-exhaustion DoS. It now enforces a 32MB cap while streaming — it stops accumulating once exceeded so memory stays bounded, drains the rest of the stream, then rejects with HTTP 413. The cap sits above the largest legitimate body (a 20MB screenshot PNG sent base64-encoded, roughly 27MB, inside a JSON envelope), so the operator-screenshot endpoint is unaffected.
  • Parameterized Grafana credentials and dropped anonymous admin. The local observability stack (observability/docker-compose.otel.yml) hardcoded GF_SECURITY_ADMIN_PASSWORD=admin and granted anonymous users the Admin role, so anyone reaching the port had full admin with no login. The admin password is now overridable via ${GF_SECURITY_ADMIN_PASSWORD:-admin} and anonymous access is downgraded to read-only Viewer (overridable and disableable via GF_AUTH_ANONYMOUS_ENABLED), keeping local dev frictionless but no longer privileged.
  • Removed an XSS vector in the run-chronicle prototype. The run-chronicle-v2 POC built its feed and inspector via innerHTML with interpolated data fields (event type, summary, detail, node IDs, tool names, dependency and timeline labels). Those assignments were replaced with DOM construction using textContent, so no data value is ever parsed as HTML, removing the injection vector if the currently-hardcoded data ever becomes dynamic.
  • Reliability: defensive JSON.parse and stream-cleanup guards. Several stream and message handlers were made robust against malformed input: the pi-plugin SSE events() generator now skips unparseable frames and releases its reader on throw, return, or early consumer break (preventing a leaked connection); client-side WebSocket message handlers in the e2e fault suite guard JSON.parse to match their server-side counterparts; and stallSandbox.release() now awaits its filesystem restore to close a latent race.
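
The markdown link-scheme allowlist described in the list above can be sketched along these lines. This is a minimal sketch under assumptions, not the actual isSafeHref from MarkdownContent.tsx: the name isSafeHrefSketch, the ALLOWED_SCHEMES constant, and the exact normalization are hypothetical.

```typescript
// Hypothetical sketch of a link-scheme allowlist; the real isSafeHref in
// Smithers may be implemented differently.
const ALLOWED_SCHEMES = ["http", "https", "mailto"];

function isSafeHrefSketch(href: string): boolean {
  // Browsers ignore tabs, newlines, and leading control characters when
  // parsing a URL, so strip them before looking for a scheme. Stripping
  // spaces everywhere is deliberately conservative (it may over-block).
  const cleaned = href.replace(/[\u0000-\u0020\u007f]/g, "");

  // A scheme is a letter followed by letters, digits, "+", ".", or "-",
  // terminated by ":" (RFC 3986 section 3.1).
  const match = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(cleaned);
  if (match === null) {
    // No scheme: a relative path, query, or #anchor.
    return true;
  }
  return ALLOWED_SCHEMES.indexOf(match[1].toLowerCase()) !== -1;
}

console.log(isSafeHrefSketch("https://smithers.sh"));      // prints true
console.log(isSafeHrefSketch("#security"));                // prints true
console.log(isSafeHrefSketch("JaVaScRiPt:alert(1)"));      // prints false
console.log(isSafeHrefSketch("java\tscript:alert(1)"));    // prints false
```

An allowlist is used rather than a blocklist of known-bad schemes so that schemes such as data: or vbscript: are rejected without having to be enumerated.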

Reliability & Correctness

  • Bounded the completed-activity result cache to stop a gateway memory leak. The module-level completedActivityResults map in activity-bridge.js was keyed by composite idempotency key and never pruned, so it grew without limit across runs in a long-running gateway. It is now an insertion-ordered LRU that refreshes recency on reads and evicts the least-recently-used entry once it exceeds COMPLETED_ACTIVITY_RESULTS_MAX; distinct keys within a single run stay well under the cap, so the duplicate-execution idempotency guarantee is preserved.
  • Evicted consumed entries from the durable-deferred bridge. The process-lifetime deferredResolutions map in durable-deferred-bridge.js recorded every approval and wait-for-event resolution but never deleted them, leaking one Exit per resolved approval or signal indefinitely. Because each stored value is an Exit.succeed(...) that both consumers finalize on a successful read without re-polling, awaitBridgeDeferred now deletes the entry after consuming it.
  • Removed abort listeners on normal task completion in the driver. withAbort’s abortPromise registered an "abort" listener that was only cleaned up when an abort actually fired, so every task or poll that completed normally against the run’s single long-lived AbortSignal leaked a listener, producing MaxListenersExceededWarning and steady memory growth. abortPromise now returns a cleanup function and withAbort races inside a try/finally, removing the listener on both the resolve and reject paths.
  • Stopped leaking a process exit listener per Smithers instance. createSmithers / createExternalSmithers registered process.on("exit", closeDb) but never removed it, so repeated construction (tests, gateway, hot reload) accumulated listeners — a MaxListenersExceededWarning plus retained SQLite handles. The listener is now registered with process.once and detached inside closeDb after the database closes, whether closeDb runs on exit or is invoked directly through cleanup; the dbClosed guard keeps closeDb idempotent.
  • Cleared the per-check diagnostic timeout when a check resolves. runCheck raced check.run against a setTimeout-based rejection but never cleared the timer on the resolve path. Because diagnostics run on every agent invocation, this left a non-unref’d ~5s timer armed on a hot path, delaying event-loop quiescence. The handle is now captured, unref’d, and clearTimeout’d in a finally block once the race settles; the timeout duration and rejection behavior are unchanged.
  • Converted malformed-snapshot JSON into typed errors instead of crashes. Unguarded JSON.parse calls on persisted snapshot rows in parseSnapshot and forkRunEffect threw a raw SyntaxError that crashed the Effect fork/replay path. Every snapshot parse now routes through a parseSnapshotJson helper that wraps failures as a DB_QUERY_FAILED SmithersError, and forkRunEffect surfaces them as typed Effect failures (not defects) via Effect.try.
  • Read binary file content as raw bytes in the historical diff bundle. readBinaryContentAtRef reconstructed binary data from a string runGit had already decoded as UTF-8, so computeDiffBundleBetweenRefs / getNodeDiff produced corrupt base64 for images, wasm, and other binaries. A new runGitRaw variant collects raw Buffer chunks and base64-encodes them directly, matching the correctness of the sibling computeDiffBundle; the text path is unchanged.
  • Recorded added and removed Ralph loops in snapshot diffs. A one-sided Ralph loop in diffSnapshots was gated behind an impossible if (aR && bR) inside an if (!aR || !bR) branch, so a loop added or removed between snapshots produced no diff entry. SnapshotDiff gains ralphAdded / ralphRemoved arrays that are now populated and rendered in formatDiffForTui (and included in formatDiffAsJson via spread).
  • Emitted sibling addNode ops in ascending-index order. DevTools add ops were ordered only by depth, leaving equal-depth siblings in arbitrary set-iteration order; because applyDelta splices each new node at its precomputed index without re-indexing, applying them out of order corrupted child order on reorders. Add ops are now sorted by depth then ascending index so sequential splice-at-index inserts reproduce the target tree.
  • Appended events to the run log instead of rewriting it, and surfaced persist errors once. persistLog re-read and rewrote the entire stream.ndjson on every event — O(n²) IO and a clobber hazard for processes sharing a log — and is now an appendFile of just the new line. Separately, a single persist failure was delivered twice (the caller’s Effect rejected and a later flush() re-threw the stored error, spuriously aborting a healthy task); the returned Effect now awaits the same catch-cleared promise so the failure is owned solely by persistError and surfaced exactly once at flush.
  • Stopped double-counting Claude stream-json tokens. For Claude Code stream-json output, input and output tokens are accumulated incrementally from message_start / message_delta, but the terminal result event’s top-level usage summary fell through to the generic branch and was re-added, roughly doubling agentTokensTotal. extractUsageFromOutput now tracks whether incremental usage was counted and skips the result event in that case, still falling through when no incremental events were seen.
  • Parameterized the getRawNodeOutput query and failed the Effect on a missing runId column. getRawNodeOutput interpolated runId / nodeId straight into SQL via sql.raw, so a quote in an id broke the query (silently returning null) and was a latent injection footgun on a public export; it now binds parameters like its sibling. Separately, loadInput / loadInputEffect and snapshot.js threw a SmithersError synchronously during Effect construction when the input table lacked a runId column, breaking the Effect<…, SmithersError> contract; the body is now wrapped in Effect.suspend so the DB_MISSING_COLUMNS failure surfaces through the error channel.
  • Guarded last_insert_rowid() against null. SELECT last_insert_rowid() is typed Record | null, so chaining .id could dereference null. recordUsage and recordAuditEvent in the control plane now capture the row, throw a clear error when it is null, then read .id.
  • Reported the correct node kind in graph extraction. extractGraph’s addDescriptor derived duplicate-id messages from raw.__smithersKind, which is unset for Subflow, Sandbox, WaitForEvent, and Timer nodes, so those duplicates were misreported as “Duplicate Task id detected.” The kind is now passed in explicitly per task type, matching the legacy dom/extract.js. resolveOutput was also tightened to classify a value as an output table only when it is a genuine Drizzle table (via getTableName guarded by isDrizzleTable), instead of treating any non-string/non-Zod value as a table with an empty name.
  • Relocated keyed children instead of duplicating them in the reconciler. In mutation mode React moves an already-mounted child via insertBefore/appendChild with no preceding removeChild; the host config pushed into parent.children without removing the child’s current position, so a reordered keyed child appeared twice and surfaced as a DUPLICATE_ID during graph extraction. The host config now removes any existing occurrence of the child before inserting, matching DOM single-parent semantics; mounting a genuinely new child (indexOf returns -1) is unchanged.
  • Stopped isSmithersError from matching arbitrary { code, message } objects. The purely structural predicate only required code and message, so it matched Node system errors (ENOENT) and many third-party errors; toSmithersError then returned foreign errors unwrapped or copied an invalid libuv code into the wrapper, breaking retry classification. The predicate now passes only for a real instanceof SmithersError or a plain object whose code is a known SmithersErrorCode — covering errors deserialized over the wire while excluding Node errno codes.
  • Surfaced pi-plugin RPC errors to the user. The approve, deny, and cancel commands in the pi-plugin extension awaited client.approve() / deny() / cancel() without a try/catch, so RPC failures rejected the handler silently. Each is now wrapped to notify the user via ctx.ui.notify() on failure.
  • Made withCorrelationContext visible to the imperative logger and documented the legacy shim. withCorrelationContext wrote the merged correlation context only into the Effect FiberRef, but the imperative logger reads from AsyncLocalStorage on a freshly forked fiber, so patched context was invisible to log annotations. It now propagates the merged context into the ALS store via acquireUseRelease (scoped to the effect, restored on release) while still setting the FiberRef. The legacy updateCurrentCorrelationContext shim — which mutates the current context in place via Object.assign — is now documented as an intentional compatibility path for non-Effect callers, pointing them at the Effect-based core.
  • Aligned the observability dashboards and event types with what is actually emitted. The OTel collector’s Prometheus exporter prepends smithers_ and dot→underscore-normalizes instrument names, yielding double-prefixed series like smithers_smithers_runs_total; observability/dashboards/smithers.json queried a single smithers_ prefix and returned no data, and is now aligned to the double prefix that the Grafana dashboard already used. The generated index.d.ts also regained the toolCallId field on the ToolCallStarted / ToolCallFinished event types (dropped because tsup --dts-only did not resolve the cross-file JSDoc import), and the Codex/Claude session resolvers were fixed to scan adjacent local-vs-UTC day folders and to match transcript paths by basename so they work on Windows.
  • Reliability. The driver now attaches a stdin "error" handler before writing to a child process so an EPIPE from a child that closes stdin early is logged rather than crashing the driver. The jj VCS workspace pre-create cleanup now logs failures via Effect.logWarning instead of swallowing them in a bare catch. Protocol error-code arrays gained @type {const} assertions so they are genuinely readonly to the type checker, matching their declared tuple types. Empty subscribe:[] subscriptions are now treated as a real filter that denies cross-run devtools access rather than as no filter. Finally, the fault-injection e2e harness now awaits stallSandbox.release()’s filesystem restore and guards client-side WebSocket JSON.parse calls against malformed frames.
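
To illustrate why the ordering of sibling addNode ops matters, here is a small sketch of splice-at-index application with and without the depth-then-index sort. The AddOp shape and the sortAddOps / applyAdds names are hypothetical and simplified to a single parent; the real DevTools delta types may differ.

```typescript
// Hypothetical, simplified shapes: one parent, children identified by string.
interface AddOp {
  nodeId: string;
  depth: number; // depth of the new node in the tree
  index: number; // precomputed position among its siblings in the target tree
}

// Order by depth first (parents before children), then by ascending index.
function sortAddOps(ops: AddOp[]): AddOp[] {
  return ops.slice().sort((a, b) => a.depth - b.depth || a.index - b.index);
}

// Insert each node at its precomputed index without re-indexing, as the
// release note describes for applyDelta.
function applyAdds(children: string[], ops: AddOp[]): string[] {
  const out = children.slice();
  for (const op of ops) {
    out.splice(op.index, 0, op.nodeId);
  }
  return out;
}

// Existing children: a, d. Target children: a, b, c, d.
const unordered: AddOp[] = [
  { nodeId: "c", depth: 1, index: 2 },
  { nodeId: "b", depth: 1, index: 1 },
];

console.log(applyAdds(["a", "d"], unordered).join(","));             // prints a,b,d,c
console.log(applyAdds(["a", "d"], sortAddOps(unordered)).join(",")); // prints a,b,c,d
```

Because each index refers to the final tree, inserting in ascending order guarantees that every earlier sibling is already in place when a later one is spliced in.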