Smithers ships a scoring system that lets you attach evaluation functions to tasks. Scorers run after a task finishes and produce a numeric score between 0 and 1, an optional human-readable reason, and optional metadata. Scores are persisted in SQLite alongside your run data so you can query, aggregate, and visualize quality over time.

Core Concepts

Scorer

A scorer is a named function that takes a ScorerInput and returns a ScoreResult:
```typescript
import { createScorer } from "smithers-orchestrator/scorers";

const myScorer = createScorer({
  id: "length-check",
  name: "Length Check",
  description: "Checks output meets minimum length",
  score: async ({ output }) => {
    const text = String(output);
    const score = Math.min(text.length / 500, 1);
    return { score, reason: `Output is ${text.length} chars` };
  },
});
```

ScoreResult

Every scorer returns a ScoreResult:

| Field | Type | Description |
| --- | --- | --- |
| `score` | `number` (0-1) | Normalized quality score |
| `reason` | `string?` | Human-readable explanation |
| `meta` | `Record<string, unknown>?` | Arbitrary metadata for downstream use |

ScorerInput

The input passed to every scorer function:

| Field | Type | Description |
| --- | --- | --- |
| `input` | `unknown` | The original task input/prompt |
| `output` | `unknown` | The task's produced output |
| `groundTruth` | `unknown?` | Expected output for comparison |
| `context` | `unknown?` | Additional context (e.g. retrieved docs) |
| `latencyMs` | `number?` | How long the task took in milliseconds |
| `outputSchema` | `ZodObject?` | The Zod schema the output should match |
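For orientation, the two shapes above can be sketched as plain TypeScript types, together with a dependency-free scorer that uses them. This is a sketch assembled from the tables, not the library's exported types; in the real API, `outputSchema` is a Zod object, and the hypothetical `exactMatch` function is only an illustration:

```typescript
// Sketch of ScorerInput and ScoreResult, assembled from the tables above.
// outputSchema is a ZodObject in the real API; it is typed loosely here to
// keep the sketch dependency-free.
interface ScorerInput {
  input: unknown;
  output: unknown;
  groundTruth?: unknown;
  context?: unknown;
  latencyMs?: number;
  outputSchema?: unknown; // ZodObject in the real API
}

interface ScoreResult {
  score: number; // normalized 0-1
  reason?: string;
  meta?: Record<string, unknown>;
}

// Illustrative scorer over these shapes: compares output to groundTruth,
// and skips (score 1) when no ground truth is supplied.
function exactMatch({ output, groundTruth }: ScorerInput): ScoreResult {
  if (groundTruth === undefined) {
    return { score: 1, reason: "No ground truth; skipped" };
  }
  const match = String(output).trim() === String(groundTruth).trim();
  return { score: match ? 1 : 0, reason: match ? "Exact match" : "Mismatch" };
}
```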

Attaching Scorers to Tasks

Pass a scorers map to any <Task> component:
```tsx
import { latencyScorer, schemaAdherenceScorer } from "smithers-orchestrator/scorers";

<Task
  id="analyze"
  agent={claude}
  output={outputs.analysis}
  scorers={{
    latency: { scorer: latencyScorer({ targetMs: 5000, maxMs: 30000 }) },
    schema: { scorer: schemaAdherenceScorer() },
  }}
>
  Analyze the codebase and produce a summary.
</Task>
```
Scorers fire asynchronously after the task finishes. They never block the workflow.

Sampling

Not every run needs every scorer. Use sampling to control evaluation frequency:
```tsx
scorers={{
  relevancy: {
    scorer: relevancyScorer(judge),
    sampling: { type: "ratio", rate: 0.1 },  // 10% of runs
  },
  schema: {
    scorer: schemaAdherenceScorer(),
    sampling: { type: "all" },  // every run (default)
  },
}}
```

| Sampling Type | Behavior |
| --- | --- |
| `all` | Run on every task execution |
| `ratio` | Run with probability `rate` |
| `none` | Disabled (useful for toggling) |
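The sampling decision for each mode can be modeled in a few lines. This is a simplified sketch of the behavior described in the table, not the library's actual implementation; the injectable `random` parameter is added here only to make the sketch testable:

```typescript
type Sampling =
  | { type: "all" }
  | { type: "ratio"; rate: number }
  | { type: "none" };

// Decide whether a scorer should run for this task execution.
// "all" always runs, "none" never runs, "ratio" runs with probability `rate`.
function shouldScore(
  sampling: Sampling,
  random: () => number = Math.random,
): boolean {
  switch (sampling.type) {
    case "all":
      return true;
    case "none":
      return false;
    case "ratio":
      return random() < sampling.rate;
  }
}
```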

Custom Scorers

createScorer

Build a scorer from a plain configuration object:
```typescript
import { createScorer } from "smithers-orchestrator/scorers";

const myScorer = createScorer({
  id: "word-count",
  name: "Word Count",
  description: "Scores based on output word count",
  score: async ({ output }) => {
    const words = String(output).split(/\s+/).length;
    return { score: Math.min(words / 200, 1), reason: `${words} words` };
  },
});
```

llmJudge

Build an LLM-as-judge scorer that delegates evaluation to an AI agent. The judge receives a prompt constructed from promptTemplate and is expected to return JSON with score (0–1) and optional reason. If the response cannot be parsed, the scorer returns 0 with a diagnostic reason.
```typescript
import { llmJudge } from "smithers-orchestrator/scorers";

const toneScorer = llmJudge({
  id: "professional-tone",
  name: "Professional Tone",
  description: "Evaluates if the output maintains a professional tone",
  judge,
  instructions: "You evaluate whether text maintains a professional, business-appropriate tone.",
  promptTemplate: ({ input, output }) =>
    `Rate the professionalism of this response (0-1 JSON).\n\nInput: ${String(input)}\n\nOutput: ${String(output)}`,
});
```

| Field | Type | Description |
| --- | --- | --- |
| `id` | `string` | Unique scorer identifier |
| `name` | `string` | Human-readable name |
| `description` | `string` | What this scorer evaluates |
| `judge` | `AgentLike` | The agent that performs the evaluation |
| `instructions` | `string` | System-level instructions prepended to the prompt |
| `promptTemplate` | `(input: ScorerInput) => string` | Builds the prompt from the scorer input |
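The parsing contract described above — a JSON response with a 0-1 `score`, falling back to 0 with a diagnostic reason on failure — can be modeled as follows. This is a sketch of the documented behavior, not the library's parser, which may be more lenient; the clamping to [0, 1] is an assumption:

```typescript
// Model of the documented judge-response contract: parse JSON, require a
// numeric score (clamped to 0-1 here as an assumption), and fall back to
// score 0 with a diagnostic reason when parsing fails.
function parseJudgeResponse(raw: string): { score: number; reason?: string } {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.score !== "number") {
      throw new Error("missing numeric score");
    }
    const score = Math.min(Math.max(parsed.score, 0), 1);
    return {
      score,
      reason: typeof parsed.reason === "string" ? parsed.reason : undefined,
    };
  } catch (err) {
    return { score: 0, reason: `Failed to parse judge response: ${String(err)}` };
  }
}
```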

Built-in Scorers

Smithers includes five built-in scorers:

Code-based (no LLM needed)

schemaAdherenceScorer() — Validates that the output conforms to the task's Zod outputSchema. Returns 1.0 if safeParse succeeds and 0.0 if it fails (with the validation issues in the reason). If no outputSchema is set, returns 1.0 with a skip note.

latencyScorer({ targetMs, maxMs }) — Scores based on task execution time. Returns 1.0 at or below targetMs, linearly interpolates down to 0.0 at maxMs, and returns 0.0 above maxMs. If no latency data is available, returns 1.0 with a skip note.
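The latency curve described above is a simple piecewise-linear function. A sketch of that math, under the documented behavior (parameter names follow the docs; the library's exact implementation may differ):

```typescript
// latencyScorer curve as documented: 1.0 up to targetMs, then a straight
// line down to 0.0 at maxMs, and 0.0 beyond that.
function latencyScore(latencyMs: number, targetMs: number, maxMs: number): number {
  if (latencyMs <= targetMs) return 1;
  if (latencyMs >= maxMs) return 0;
  return (maxMs - latencyMs) / (maxMs - targetMs);
}
```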

LLM-based (requires a judge agent)

All three LLM-based scorers accept an AgentLike as the judge. They construct a prompt with evaluation criteria, call judge.generate(), and parse the JSON response.

relevancyScorer(judge) — Evaluates whether the output is relevant to and addresses the input prompt, considering both direct answers and related context. Scores range from 0.0 (completely irrelevant) to 1.0 (perfectly relevant).

toxicityScorer(judge) — Detects toxic, harmful, offensive, or inappropriate content, checking for hate speech, harassment, threats, discriminatory language, explicit content, and dangerous instructions. The score represents the level of toxicity: 0.0 means clean, 1.0 means highly toxic.

faithfulnessScorer(judge) — Checks whether the output is faithful to the provided context without hallucinations; every claim in the output should be supported by the context. Scores range from 0.0 (entirely fabricated) to 1.0 (completely faithful). If no context is provided, it evaluates internal consistency.

Persistence

All scores are stored in the _smithers_scorers table:

| Column | Type | Description |
| --- | --- | --- |
| `id` | TEXT | Unique score row ID |
| `run_id` | TEXT | Parent run |
| `node_id` | TEXT | Task that was scored |
| `iteration` | INTEGER | Task iteration |
| `attempt` | INTEGER | Task attempt number |
| `scorer_id` | TEXT | Scorer identifier |
| `scorer_name` | TEXT | Human-readable scorer name |
| `source` | TEXT | `live` or `batch` |
| `score` | REAL | The 0-1 score |
| `reason` | TEXT | Optional explanation |
| `meta_json` | TEXT | JSON metadata |
| `input_json` | TEXT | Serialized scorer input |
| `output_json` | TEXT | Serialized task output |
| `latency_ms` | REAL | Task execution latency |
| `scored_at_ms` | INTEGER | When the score was computed |
| `duration_ms` | REAL | How long the scorer itself took |

Execution Modes

Async (live scoring)

When scorers are attached to a <Task>, they run via runScorersAsync — fire-and-forget execution that never blocks the workflow. All scorers run concurrently with unbounded concurrency. Errors are logged but do not fail the task.

Batch (offline evaluation)

For testing and offline evaluation, call runScorersBatch directly. It runs all scorers, waits for completion, and returns a map of key to ScoreResult | null:
```typescript
import { runScorersBatch } from "smithers-orchestrator/scorers";

const results = await runScorersBatch(
  { schema: { scorer: schemaAdherenceScorer() } },
  { runId: "test", nodeId: "analyze", iteration: 0, attempt: 0, input: "...", output: { summary: "..." } },
  adapter,
);
// { schema: { score: 1, reason: "Output matches schema" } }
```
Both modes persist results to the _smithers_scorers table with a source column of "live" or "batch".
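The batch semantics — run every scorer to completion, and map a failing scorer to null for its key rather than failing the whole batch — can be modeled like this. This is a simplified sketch of the documented behavior, not the library's implementation, and the types here are illustrative:

```typescript
type ScoreResult = { score: number; reason?: string };
type Scorer = (input: unknown) => Promise<ScoreResult>;

// Run all scorers concurrently and wait for every one to settle. A scorer
// that throws yields null for its key instead of rejecting the batch,
// matching the documented ScoreResult | null result map.
async function runBatchSketch(
  scorers: Record<string, Scorer>,
  input: unknown,
): Promise<Record<string, ScoreResult | null>> {
  const entries = await Promise.all(
    Object.entries(scorers).map(async ([key, scorer]) => {
      try {
        return [key, await scorer(input)] as const;
      } catch {
        return [key, null] as const;
      }
    }),
  );
  return Object.fromEntries(entries);
}
```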

Aggregation

Query aggregate statistics across runs:
import { aggregateScores } from "smithers-orchestrator/scorers";

const stats = await aggregateScores(adapter, { runId: "run-123" });

Filter Options

| Filter | Type | Description |
| --- | --- | --- |
| `runId` | `string` | Filter to a specific run |
| `nodeId` | `string` | Filter to a specific task node |
| `scorerId` | `string` | Filter to a specific scorer |
All filters are optional and can be combined.

Returned Statistics

Each entry in the returned array contains:

| Field | Type | Description |
| --- | --- | --- |
| `scorerId` | `string` | Scorer identifier |
| `scorerName` | `string` | Human-readable scorer name |
| `count` | `number` | Total number of scores |
| `mean` | `number` | Average score |
| `min` | `number` | Lowest score |
| `max` | `number` | Highest score |
| `p50` | `number` | Median score (50th percentile) |
| `stddev` | `number` | Standard deviation (population) |
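For clarity on the statistics above, here is the math over a list of raw 0-1 scores: population standard deviation (divide by n, not n-1) and the median as p50. This is a reference sketch of the definitions, independent of however the library computes them:

```typescript
// Aggregate statistics as documented: count, mean, min, max, median (p50),
// and population standard deviation over a non-empty list of 0-1 scores.
function aggregate(scores: number[]) {
  const count = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / count;
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(count / 2);
  const p50 = count % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / count;
  return {
    count,
    mean,
    min: sorted[0],
    max: sorted[count - 1],
    p50,
    stddev: Math.sqrt(variance),
  };
}
```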

Events

Three event types are emitted during the scorer lifecycle.

ScorerStarted — Emitted when a scorer begins evaluation.

| Field | Type |
| --- | --- |
| `scorerId` | `string` |
| `scorerName` | `string` |
| `nodeId` | `string` |
| `runId` | `string` |
ScorerFinished — Emitted when a scorer completes successfully. Includes the score value.

| Field | Type |
| --- | --- |
| `scorerId` | `string` |
| `scorerName` | `string` |
| `score` | `number` |
| `nodeId` | `string` |
| `runId` | `string` |
ScorerFailed — Emitted when a scorer throws an error. Includes the error.

| Field | Type |
| --- | --- |
| `scorerId` | `string` |
| `scorerName` | `string` |
| `error` | `unknown` |
| `nodeId` | `string` |
| `runId` | `string` |

Metrics

Smithers tracks four Effect metrics for scorer execution:

| Metric | Type | Description |
| --- | --- | --- |
| `smithers.scorers.started` | Counter | Incremented when a scorer begins |
| `smithers.scorers.finished` | Counter | Incremented when a scorer completes |
| `smithers.scorers.failed` | Counter | Incremented when a scorer throws |
| `smithers.scorer.duration_ms` | Histogram | Scorer execution time (exponential buckets, ~10ms to ~80s) |
These metrics are available through the standard Effect metric system and can be exported via OTLP. See Monitoring and Logs.

CLI

View scores from the command line:
```shell
# Show all scores for a run
smithers scores <run_id>

# Show scores for a specific node
smithers scores <run_id> --node analyze
```