Production Hardening

This page covers storage durability, access scoping, execution isolation, secrets handling, cache policy, and audit for production Gateway deployments. Treat the Gateway, database, workflow modules, and sandbox workers as one operational boundary.

Required Checks

Run these checks before promoting a workflow module:

pnpm check:effect
pnpm check:deps
pnpm -r typecheck
pnpm test

The default CI workflow (.github/workflows/ci.yml) runs check:effect, check:deps, typecheck, and test on pull requests and pushes to main.

Persistence

Use one SQLite database file per deployment and place it on durable storage, or run against PostgreSQL for managed, multi-connection storage (see PostgreSQL and PGlite below). The internal tables are created and migrated idempotently on startup on either backend. Recommended database practices:

On SQLite, back up the database file and its WAL files together.
Keep PRAGMA foreign_keys = ON; Smithers enables it during schema setup and uses referential checks for core run artifacts.
Keep run IDs stable across resume attempts.
Use the Gateway event stream sequence numbers for reconnects; clients should resume from the last seen seq.
Avoid manually deleting internal rows. Delete whole runs through supported administrative paths so dependent frames, node diffs, and audit rows stay consistent.
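
The reconnect rule above can be sketched as a small cursor that remembers the last seen seq and ignores anything at or below it. This is a minimal sketch, not the Gateway client: the GatewayEvent shape and the SeqCursor class are hypothetical, and it assumes seq values are positive integers that only increase within a stream.

```typescript
// Hypothetical event shape; only the `seq` field comes from the guidance above.
interface GatewayEvent {
  seq: number;
  payload: unknown;
}

// Tracks the last seen sequence number so a client can resume after a
// reconnect and skip events that the server replays.
class SeqCursor {
  private lastSeq: number;

  constructor(initialSeq: number = 0) {
    this.lastSeq = initialSeq;
  }

  // The value a client would send when asking the stream to resume.
  get resumeFrom(): number {
    return this.lastSeq;
  }

  // Returns true when the event is new and should be processed.
  accept(event: GatewayEvent): boolean {
    if (event.seq <= this.lastSeq) return false; // duplicate or replayed event
    this.lastSeq = event.seq;
    return true;
  }
}
```

Persisting resumeFrom alongside the client's own state means a restarted client can pick up where it left off instead of replaying the whole run.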

PostgreSQL and PGlite

openSmithersBackend(schemas, opts) is the async resolver used by the local Gateway path. It resolves the backend in this order: opts.backend, then SMITHERS_BACKEND=sqlite|pglite|postgres, then backend in .smithers/smithers.config.ts. Fresh workspaces default to PGlite under .smithers/pg/. An explicit backend is a Gateway boot/deployment or migration-diagnostic choice, not a flag clients should pass to ps, inspect, or other run controls. Once the Gateway is healthy, controllers use the same RPC/client API regardless of the store behind it.

If a legacy smithers.db has run data and no .smithers/migrated.json marker, boot fails with SMITHERS_MIGRATION_REQUIRED instead of silently opening an empty backend. Run smithers migrate to copy the SQLite store into PGlite, or smithers migrate --to postgres --url <pg-url> for managed Postgres. Migration opens the source SQLite database read-only, copies tables in bounded batches, verifies per-table counts, writes .smithers/migrated.json, and keeps smithers.db as a rollback backup unless you pass --keep-sqlite=false.

createSmithersPostgres(schemas, opts) runs the same durable engine, with the same crash-and-resume guarantees, on PostgreSQL or an embedded PGlite through the SQL dialect seam in packages/db/src/dialect.js. Point it at managed Postgres with { provider: "postgres", connectionString }, pass a node-postgres connection config with { provider: "postgres", connection }, or run an in-process PGlite with { provider: "pglite", dataDir }. The factory is async and returns the same createSmithers API plus a close() teardown. pg, @electric-sql/pglite, and @electric-sql/pglite-socket are optional dependencies loaded only when you take this path; the explicit synchronous bun:sqlite path needs none of them. On Postgres, rely on database-native backups and size the connection pool for your workload; the SQLite file-and-WAL backup guidance above does not apply.
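
The backend resolution order that openSmithersBackend follows can be illustrated with a pure function. This is a sketch of the precedence only, not the resolver itself: resolveBackend and its input shape are hypothetical, and rejecting an unrecognized SMITHERS_BACKEND value with an error is an assumption, since this page does not say how invalid values are handled.

```typescript
type Backend = "sqlite" | "pglite" | "postgres";

// Hypothetical input shape that gathers the three sources named above.
interface ResolveInput {
  optsBackend?: Backend;   // opts.backend
  envBackend?: string;     // raw value of SMITHERS_BACKEND
  configBackend?: Backend; // backend in .smithers/smithers.config.ts
}

const BACKENDS: readonly string[] = ["sqlite", "pglite", "postgres"];

function isBackend(value: string): value is Backend {
  return BACKENDS.indexOf(value) !== -1;
}

// Precedence: explicit option, then environment, then config file,
// then the PGlite default for fresh workspaces.
function resolveBackend(input: ResolveInput): Backend {
  if (input.optsBackend) return input.optsBackend;

  const env = input.envBackend;
  if (env !== undefined && env !== "") {
    // Assumption: an unknown value is a hard error rather than a fallback.
    if (!isBackend(env)) {
      throw new Error("Unsupported SMITHERS_BACKEND: " + env);
    }
    return env;
  }

  if (input.configBackend) return input.configBackend;
  return "pglite";
}
```

The sketch covers backend selection only; the SMITHERS_MIGRATION_REQUIRED check for a legacy smithers.db is a separate boot step.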

Electric Cloud Sync

Cloud GUI replicas use @smithers-orchestrator/electric-proxy in front of ElectricSQL. Run it only against managed PostgreSQL configured for logical replication: set wal_level=logical, create a publication that includes every _smithers_* table plus workflow output tables, and provision a replication slot for Electric. PGlite is not a valid Electric source because it runs with wal_level=replica and cannot create logical replication slots.

The proxy is the public shape endpoint. It authenticates the same scoped Gateway tokens, maps run:read to read-only shape access, injects workspace/run/user predicates from the grant, strips Authorization before forwarding to Electric, and enforces 60 shape opens per minute, 50 active shapes, and a 4 MiB per-frame payload limit.

Writes do not go through Electric shapes: clients post to /v1/electric/write, the Gateway applies the existing RPC mutation under scope checks, returns the PostgreSQL txid, and TanStack DB holds optimistic state until Electric streams that txid back.
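
To make the three proxy limits concrete, here is a minimal sketch of how such limits could be tracked in memory. It is not the electric-proxy implementation: the ShapeLimiter class is hypothetical, the sliding one-minute window is one possible reading of "per minute", and the unit the limits apply to (for example, per token) is not stated on this page, so the sketch assumes one limiter instance per client.

```typescript
// Limits quoted in the text above.
const MAX_OPENS_PER_MINUTE = 60;
const MAX_ACTIVE_SHAPES = 50;
const MAX_FRAME_BYTES = 4 * 1024 * 1024; // 4 MiB

// Hypothetical per-client limiter; the caller supplies the clock so the
// behavior is deterministic and easy to test.
class ShapeLimiter {
  private openTimestamps: number[] = [];
  private active = 0;

  // Returns true and records the open when both limits allow it.
  tryOpen(nowMs: number): boolean {
    const windowStart = nowMs - 60 * 1000;
    // Forget opens that fell out of the sliding one-minute window.
    this.openTimestamps = this.openTimestamps.filter((t) => t > windowStart);
    if (this.openTimestamps.length >= MAX_OPENS_PER_MINUTE) return false;
    if (this.active >= MAX_ACTIVE_SHAPES) return false;
    this.openTimestamps.push(nowMs);
    this.active += 1;
    return true;
  }

  // Call when a shape subscription ends.
  close(): void {
    if (this.active > 0) this.active -= 1;
  }

  // Frames above the payload limit should be rejected before forwarding.
  frameAllowed(byteLength: number): boolean {
    return byteLength <= MAX_FRAME_BYTES;
  }
}
```

Clients that fan out many shapes per view should budget against the active-shape limit first, since it stays binding even when opens are spread over time.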

Access Control

Expose the Gateway only behind TLS. Use scoped bearer grants for automation and short TTLs for human-triggered actions. Recommended scopes by client type:

Client	Typical scopes
Run dashboard	`run:read`, `approval:read`
Launch automation	`run:read`, `run:write`
Approval inbox	`run:read`, `approval:read`, `approval:submit`
Operator tools	`run:read`, `run:write`, `approval:submit`
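
The table above can be expressed as data plus a simple subset check, which is handy for linting grant requests in CI. This is an illustrative sketch: the client keys and the hasScopes helper are hypothetical names, and only the scope strings come from the table.

```typescript
type Scope = "run:read" | "run:write" | "approval:read" | "approval:submit";

// Hypothetical client keys; the scope lists mirror the table above.
const CLIENT_SCOPES: Record<string, readonly Scope[]> = {
  "run-dashboard": ["run:read", "approval:read"],
  "launch-automation": ["run:read", "run:write"],
  "approval-inbox": ["run:read", "approval:read", "approval:submit"],
  "operator-tools": ["run:read", "run:write", "approval:submit"],
};

// True when every required scope is present in the grant.
function hasScopes(granted: readonly Scope[], required: readonly Scope[]): boolean {
  return required.every((scope) => granted.indexOf(scope) !== -1);
}

// True when a requested grant stays within the recommended set for a client type.
function withinRecommended(clientType: string, requested: readonly Scope[]): boolean {
  const allowed = CLIENT_SCOPES[clientType];
  return allowed !== undefined && hasScopes(allowed, requested);
}
```

For example, withinRecommended("run-dashboard", ["run:write"]) is false, which flags a dashboard token that asks for write access.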

Rotate token grants regularly and revoke grants when a user, CI job, or integration no longer needs access. For multi-tenant deployments, see Control Plane for org, project, usage, and audit primitives.

Execution Boundary

Sandbox workers run in an isolated environment so that untrusted workflow code cannot reach the Gateway database or host filesystem. The concrete controls are:

request and result bundles are written under the run sandbox directory
bundle manifests are size-bounded
patch and artifact paths are checked against path traversal
produced diffs require review unless autoAcceptDiffs is enabled
sandbox records and events are persisted for audit
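
As an illustration of the path traversal control listed above, the following sketch normalizes a bundle-relative path and rejects anything that is absolute or climbs out of the sandbox root. It is not the Smithers implementation: safeRelativePath is a hypothetical helper, and it works on path strings only, so it does not detect symlinks that point outside the sandbox.

```typescript
// Returns the normalized relative path, or null when the candidate is
// empty, absolute, or would escape the sandbox root.
function safeRelativePath(candidate: string): string | null {
  // Treat backslashes as separators so Windows-style input is checked too.
  const unified = candidate.replace(/\\/g, "/");

  // Reject POSIX absolute paths and Windows drive-letter paths.
  if (unified.charAt(0) === "/" || /^[A-Za-z]:/.test(unified)) return null;

  const kept: string[] = [];
  for (const segment of unified.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      // Climbing above the root is a traversal attempt.
      if (kept.length === 0) return null;
      kept.pop();
      continue;
    }
    kept.push(segment);
  }
  return kept.length > 0 ? kept.join("/") : null;
}
```

A caller would join the returned value onto the run sandbox directory and refuse the patch or artifact when the result is null.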

Configure allowNetwork, container images, environment variables, ports, volumes, and CPU or memory limits per worker. Verify your chosen runtime (Docker, Kubernetes, etc.) actually enforces these limits before running untrusted code. For high-risk code generation, run sandbox workers in a separate account, namespace, or machine with no ambient production credentials.

Secrets

Never pass long-lived credentials through workflow input. Prefer short-lived tokens from the caller, scoped environment injection at the worker boundary, or a secret manager mounted only into the worker process that needs it. Operational rules:

Do not store provider keys in SQLite rows, run input, task output, or event payloads.
Redact logs before forwarding them to shared observability sinks.
Split launch permissions from approval permissions for workflows that can write files, create pull requests, or deploy.
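
The log redaction rule above could look like the following sketch, which masks values under sensitive-looking keys and bearer tokens embedded in strings before a record leaves the process. The key pattern and the redact helper are assumptions for illustration; a real deployment should extend the pattern to its own credential formats.

```typescript
// Assumed list of sensitive key names; deliberately broad, so it may
// over-redact fields such as "tokenCount".
const SENSITIVE_KEY = /(authorization|api[-_]?key|token|secret|password)/i;

// Matches "Bearer <token>" inside free-form strings.
const BEARER_VALUE = /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g;

// Returns a deep copy of the value with sensitive data masked.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  if (typeof value === "string") {
    return value.replace(BEARER_VALUE, "Bearer [REDACTED]");
  }
  return value;
}
```

Because the function copies rather than mutates, the original record can still be written to a restricted local log while the redacted copy goes to the shared sink.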

Cache Policy

Use cache policy keys deliberately:

// Example: cache a step result across runs of the same workflow for up to 1 hour
cachePolicy: {
  scope: "workflow",
  ttlMs: 60 * 60 * 1000,
  version: "v2", // bump when prompt, model, or output semantics change
}

scope: "run" keeps reuse inside one run.
scope: "workflow" shares reuse across runs of the same workflow.
scope: "global" shares reuse across workflow names.
ttlMs bounds staleness; expired cache rows are treated as misses and refreshed.
version should change whenever prompt, model, provider, tool behavior, or output semantics change.

Cached payloads are still validated against the current output schema on every hit. A schema mismatch is a cache miss.
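
The hit-or-miss rules above can be summarized in one lookup function. This is a sketch under stated assumptions rather than the engine's cache code: the CacheRow shape, the readCache helper, and the validator signature are hypothetical, and treating a version mismatch as a miss is an assumption about how version is applied.

```typescript
interface CachePolicy {
  scope: "run" | "workflow" | "global";
  ttlMs: number;
  version: string;
}

// Hypothetical stored row; real rows live in the internal tables.
interface CacheRow {
  version: string;
  storedAtMs: number;
  payload: unknown;
}

// Stand-in for validation against the current output schema.
type Validator<T> = (payload: unknown) => payload is T;

// Returns the cached payload on a hit, or undefined on any kind of miss.
function readCache<T>(
  row: CacheRow | undefined,
  policy: CachePolicy,
  nowMs: number,
  isValid: Validator<T>,
): T | undefined {
  if (!row) return undefined;                                  // nothing stored
  if (row.version !== policy.version) return undefined;        // semantics changed
  if (nowMs - row.storedAtMs > policy.ttlMs) return undefined; // expired
  const payload = row.payload;
  if (!isValid(payload)) return undefined;                     // schema mismatch
  return payload;
}
```

Every miss path returns undefined, so the caller recomputes the step and writes a fresh row, which matches the "treated as misses and refreshed" behavior described for expired rows.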

Audit Trail

For incident review, preserve:

Gateway access logs
Smithers run events
rows in _smithers_time_travel_audit, which record workflow rewind/replay events
sandbox bundle metadata and review decisions
approval decisions, notes, and actor IDs
deployment version and workflow module revision

Keep the audit log append-only from the perspective of normal operators.

Release Checklist

Before a production release:

CI is green on all required checks: check:effect, check:deps, typecheck, and tests.
Database backups have been restored in a staging environment.
Gateway tokens are scoped and have bounded TTLs.
Sandbox runtime enforcement has been tested against the intended threat model.
Approval paths have a named owner and a fallback owner (see Approval).
Logs are retained long enough to investigate delayed workflow failures.
