Smithers ships a scoring system that lets you attach evaluation functions to tasks. Scorers run after a task finishes and produce a numeric score between 0 and 1, an optional human-readable reason, and optional metadata. Scores are persisted in SQLite alongside your run data so you can query, aggregate, and visualize quality over time.

Core Concepts

Scorer

A scorer is a named function that takes a ScorerInput and returns a ScoreResult:
import { createScorer } from "smithers-orchestrator/scorers";

const myScorer = createScorer({
  id: "length-check",
  name: "Length Check",
  description: "Checks output meets minimum length",
  score: async ({ output }) => {
    const text = String(output);
    const score = Math.min(text.length / 500, 1);
    return { score, reason: `Output is ${text.length} chars` };
  },
});

ScoreResult

Every scorer returns a ScoreResult:
  • score (number, 0–1) — Normalized quality score
  • reason (string, optional) — Human-readable explanation
  • meta (Record<string, unknown>, optional) — Arbitrary metadata for downstream use

ScorerInput

The input passed to every scorer function:
  • input (unknown) — The original task input/prompt
  • output (unknown) — The task's produced output
  • groundTruth (unknown, optional) — Expected output for comparison
  • context (unknown, optional) — Additional context (e.g. retrieved docs)
  • latencyMs (number, optional) — How long the task took, in milliseconds
  • outputSchema (ZodObject, optional) — The Zod schema the output should match
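A scorer only needs to read the fields it cares about. As a sketch, here is a standalone score function that compares output against groundTruth; the exactMatchScore name and its normalization are illustrative, not part of Smithers. The function body is what you would pass as the score option to createScorer.

```typescript
// Local copies of the shapes documented above, for a self-contained sketch.
type ScoreResult = { score: number; reason?: string; meta?: Record<string, unknown> };
type ScorerInput = { input: unknown; output: unknown; groundTruth?: unknown };

// Illustrative: returns 1 when output equals groundTruth after trimming
// and lower-casing, 0 otherwise.
function exactMatchScore({ output, groundTruth }: ScorerInput): ScoreResult {
  const normalize = (v: unknown) => String(v).trim().toLowerCase();
  const matched = normalize(output) === normalize(groundTruth);
  return {
    score: matched ? 1 : 0,
    reason: matched ? "Output matches ground truth" : "Output differs from ground truth",
    meta: { matched },
  };
}
```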

Attaching Scorers to Tasks

Pass a scorers map to any <Task> component:
import { latencyScorer, schemaAdherenceScorer } from "smithers-orchestrator/scorers";

<Task
  id="analyze"
  agent={claude}
  output={outputs.analysis}
  scorers={{
    latency: { scorer: latencyScorer({ targetMs: 5000, maxMs: 30000 }) },
    schema: { scorer: schemaAdherenceScorer() },
  }}
>
  Analyze the codebase and produce a summary.
</Task>
Scorers fire asynchronously after the task finishes. They never block the workflow.

Sampling

Not every run needs every scorer. Use sampling to control evaluation frequency:
scorers={{
  relevancy: {
    scorer: relevancyScorer(judge),
    sampling: { type: "ratio", rate: 0.1 },  // 10% of runs
  },
  schema: {
    scorer: schemaAdherenceScorer(),
    sampling: { type: "all" },  // every run (default)
  },
}}
  • all — Run on every task execution (the default)
  • ratio — Run with probability rate
  • none — Disabled (useful for toggling a scorer off without deleting its config)
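The sampling gate reduces to a single decision per task execution. A minimal sketch of how such a gate could behave (an assumption for illustration, not the Smithers internals):

```typescript
// Discriminated union mirroring the sampling configs shown above.
type Sampling =
  | { type: "all" }
  | { type: "none" }
  | { type: "ratio"; rate: number };

// Decide whether a scorer should run for this execution.
// `rand` is injectable so the decision is deterministic in tests;
// it defaults to Math.random.
function shouldRunScorer(
  sampling: Sampling = { type: "all" },
  rand: () => number = Math.random,
): boolean {
  switch (sampling.type) {
    case "all":
      return true;
    case "none":
      return false;
    case "ratio":
      return rand() < sampling.rate;
  }
}
```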

Built-in Scorers

Smithers includes five built-in scorers:

Code-based (no LLM needed)

  • schemaAdherenceScorer() — Validates that the output matches the task’s Zod schema. Returns 1.0 if valid, 0.0 if not.
  • latencyScorer({ targetMs, maxMs }) — Scores based on execution time. Returns 1.0 at or below targetMs, linearly decreasing to 0.0 at maxMs.
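The latency curve described above is a linear ramp between the two thresholds. A sketch of that math (mirroring the documented behavior, not the library's source):

```typescript
// Score 1.0 at or below targetMs, falling linearly to 0.0 at maxMs and beyond.
function latencyScore(latencyMs: number, targetMs: number, maxMs: number): number {
  if (latencyMs <= targetMs) return 1;
  if (latencyMs >= maxMs) return 0;
  return (maxMs - latencyMs) / (maxMs - targetMs);
}
```

With the earlier example config ({ targetMs: 5000, maxMs: 30000 }), a 17.5-second task lands exactly halfway down the ramp.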

LLM-based (requires a judge agent)

  • relevancyScorer(judge) — Uses an LLM to evaluate whether the output is relevant to the input.
  • toxicityScorer(judge) — Uses an LLM to detect toxic or harmful content.
  • faithfulnessScorer(judge) — Uses an LLM to check if the output is faithful to the provided context.
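All three LLM-based scorers follow the same pattern: prompt the judge agent to grade the output, then map its reply to a 0–1 score. The sketch below illustrates that pattern only; the Judge interface, the prompt, and the 0–10 scale are assumptions for the example, not the Smithers API.

```typescript
// Hypothetical judge interface: any agent that can answer a grading prompt.
interface Judge {
  complete(prompt: string): Promise<string>;
}

// Illustrative relevancy check: ask for a 0-10 rating, parse the reply,
// and normalize it into the 0-1 range. Unparseable replies score 0.
async function judgeRelevancy(judge: Judge, input: unknown, output: unknown) {
  const verdict = await judge.complete(
    `On a scale of 0 to 10, how relevant is this output to the input?\n` +
      `Input: ${String(input)}\nOutput: ${String(output)}\n` +
      `Reply with just the number.`,
  );
  const raw = Number.parseFloat(verdict);
  const score = Number.isFinite(raw) ? Math.min(Math.max(raw / 10, 0), 1) : 0;
  return { score, reason: `Judge replied: ${verdict.trim()}` };
}
```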

Persistence

All scores are stored in the _smithers_scorers table:
  • id (TEXT) — Unique score row ID
  • run_id (TEXT) — Parent run
  • node_id (TEXT) — Task that was scored
  • iteration (INTEGER) — Task iteration
  • attempt (INTEGER) — Task attempt number
  • scorer_id (TEXT) — Scorer identifier
  • scorer_name (TEXT) — Human-readable scorer name
  • source (TEXT) — live or batch
  • score (REAL) — The 0–1 score
  • reason (TEXT) — Optional explanation
  • meta_json (TEXT) — JSON metadata
  • input_json (TEXT) — Serialized scorer input
  • output_json (TEXT) — Serialized task output
  • latency_ms (REAL) — Task execution latency
  • scored_at_ms (INTEGER) — When the score was computed
  • duration_ms (REAL) — How long the scorer itself took
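Because this is a plain SQLite table, the schema supports ad-hoc SQL alongside the built-in aggregation helpers. For example, a per-scorer mean for one run (the column and table names come from the schema above; the query itself is illustrative):

```typescript
// Mean score per scorer for a single run. Bind the run ID to the `?`
// placeholder with whatever SQLite client you use against the database
// file Smithers writes.
const meanPerScorerSql = `
  SELECT scorer_id, scorer_name, COUNT(*) AS n, AVG(score) AS mean_score
  FROM _smithers_scorers
  WHERE run_id = ?
  GROUP BY scorer_id, scorer_name
  ORDER BY mean_score DESC
`;
```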

Aggregation

Query aggregate statistics across runs:
import { aggregateScores } from "smithers-orchestrator/scorers";

const stats = await aggregateScores(adapter, { runId: "run-123" });
// [{ scorerId, scorerName, count, mean, min, max, p50, stddev }]
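The per-scorer statistics can be reproduced from raw 0–1 scores; aggregateScores is the supported path, but a sketch of the underlying math clarifies what each field means (the lower-median choice for p50 and population stddev are assumptions of this sketch):

```typescript
// Compute the aggregate fields reported per scorer from raw 0-1 scores.
function aggregate(scores: number[]) {
  const n = scores.length;
  const sorted = [...scores].sort((a, b) => a - b);
  const mean = scores.reduce((sum, x) => sum + x, 0) / n;
  // Population variance: mean of squared deviations from the mean.
  const variance = scores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  return {
    count: n,
    mean,
    min: sorted[0],
    max: sorted[n - 1],
    p50: sorted[Math.floor((n - 1) / 2)], // lower median
    stddev: Math.sqrt(variance),
  };
}
```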

Events

Three event types are emitted during scoring:
  • ScorerStarted — A scorer has begun evaluation
  • ScorerFinished — A scorer completed successfully
  • ScorerFailed — A scorer threw an error

CLI

View scores from the command line:
# Show all scores for a run
smithers scores <run_id>

# Show scores for a specific node
smithers scores <run_id> --node analyze