Core Concepts
Scorer
A scorer is a named function that takes a ScorerInput and returns a ScoreResult.
ScoreResult
Every scorer returns a ScoreResult:

| Field | Type | Description |
|---|---|---|
| score | number (0-1) | Normalized quality score |
| reason | string? | Human-readable explanation |
| meta | Record<string, unknown>? | Arbitrary metadata for downstream use |
ScorerInput
The input passed to every scorer function:

| Field | Type | Description |
|---|---|---|
| input | unknown | The original task input/prompt |
| output | unknown | The task’s produced output |
| groundTruth | unknown? | Expected output for comparison |
| context | unknown? | Additional context (e.g. retrieved docs) |
| latencyMs | number? | How long the task took in milliseconds |
| outputSchema | ZodObject? | The Zod schema the output should match |
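
The two shapes in the tables above can be written as TypeScript types. This is a minimal sketch, not the library's own definitions: the `Scorer` type and the `exactMatch` example are hypothetical, and `outputSchema` is typed as `unknown` instead of `ZodObject` so the snippet stays self-contained.

```typescript
// Sketch of the shapes described in the tables above (not Smithers' source).
type ScoreResult = {
  score: number; // normalized to the range 0-1
  reason?: string;
  meta?: Record<string, unknown>;
};

type ScorerInput = {
  input: unknown;
  output: unknown;
  groundTruth?: unknown;
  context?: unknown;
  latencyMs?: number;
  outputSchema?: unknown; // a ZodObject in Smithers; left as unknown here (assumption)
};

// Hypothetical scorer shape: a named function from ScorerInput to ScoreResult.
type Scorer = {
  id: string;
  name: string;
  score: (input: ScorerInput) => ScoreResult | Promise<ScoreResult>;
};

// Hypothetical example: 1 when output and groundTruth serialize to the same JSON, else 0.
const exactMatch: Scorer = {
  id: "exact-match",
  name: "Exact Match",
  score: ({ output, groundTruth }) => {
    const same = JSON.stringify(output) === JSON.stringify(groundTruth);
    return {
      score: same ? 1 : 0,
      reason: same ? "Output equals ground truth" : "Output differs from ground truth",
    };
  },
};
```

Comparing serialized JSON is a deliberately simple stand-in for a real equality check; it is sensitive to key order.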
Attaching Scorers to Tasks
Pass a scorers map to any <Task> component.
Sampling
Not every run needs every scorer. Use sampling to control evaluation frequency:

| Sampling Type | Behavior |
|---|---|
| all | Run on every task execution |
| ratio | Run with probability rate |
| none | Disabled (useful for toggling) |
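
The three sampling behaviors can be sketched as a small decision function. The config shape below (a `type` discriminant next to `rate`) and the name `shouldRunScorer` are assumptions for illustration, not the library's API.

```typescript
// Hypothetical config shape; the discriminant field name `type` is an assumption.
type Sampling =
  | { type: "all" }
  | { type: "ratio"; rate: number }
  | { type: "none" };

// Decide whether a scorer should run for this task execution.
// `random` is injectable so the decision can be tested deterministically.
function shouldRunScorer(sampling: Sampling, random: () => number = Math.random): boolean {
  switch (sampling.type) {
    case "all":
      return true;
    case "ratio":
      // Runs with probability `rate`, since random() is uniform on [0, 1).
      return random() < sampling.rate;
    case "none":
      return false;
  }
}
```

With `{ type: "ratio", rate: 0.1 }`, roughly one execution in ten would be scored, which keeps expensive LLM-based scorers affordable on high-volume tasks.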
Custom Scorers
createScorer
Build a scorer from a plain configuration object.
llmJudge
Build an LLM-as-judge scorer that delegates evaluation to an AI agent. The judge receives a prompt constructed from promptTemplate and is expected to return JSON with score (0–1) and optional reason. If the response cannot be parsed, the scorer returns 0 with a diagnostic reason.

| Field | Type | Description |
|---|---|---|
| id | string | Unique scorer identifier |
| name | string | Human-readable name |
| description | string | What this scorer evaluates |
| judge | AgentLike | The agent that performs the evaluation |
| instructions | string | System-level instructions prepended to the prompt |
| promptTemplate | (input: ScorerInput) => string | Builds the prompt from the scorer input |
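
The parse-or-fall-back behavior described above can be sketched as follows. The function name `parseJudgeResponse`, the wording of the diagnostic reasons, and the clamping of out-of-range scores are assumptions; the documentation only states that an unparseable response yields a score of 0 with a diagnostic reason.

```typescript
type ScoreResult = { score: number; reason?: string };

// Parse a judge's raw text reply into a ScoreResult.
// Falls back to score 0 with a diagnostic reason when the reply is unusable.
function parseJudgeResponse(raw: string): ScoreResult {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) {
      return { score: 0, reason: "Judge response was not a JSON object" };
    }
    const { score, reason } = parsed as { score?: unknown; reason?: unknown };
    if (typeof score !== "number") {
      return { score: 0, reason: "Judge response had no numeric score" };
    }
    // Clamp into the 0-1 range (an assumption; the judge is only asked to stay within it).
    const clamped = Math.min(1, Math.max(0, score));
    return typeof reason === "string" ? { score: clamped, reason } : { score: clamped };
  } catch {
    // JSON.parse threw: the reply was not valid JSON.
    return { score: 0, reason: `Could not parse judge response as JSON: ${raw.slice(0, 80)}` };
  }
}
```

Returning 0 instead of throwing keeps a malformed judge reply visible in the stored scores, where the reason explains what went wrong.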
Built-in Scorers
Smithers includes five built-in scorers:
Code-based (no LLM needed)
schemaAdherenceScorer() — Validates that the output conforms to the task’s Zod outputSchema. Returns 1.0 if safeParse succeeds, 0.0 if it fails (with validation issues in the reason). If no outputSchema is set, returns 1.0 with a skip note.
latencyScorer({ targetMs, maxMs }) — Scores based on task execution time. Returns 1.0 at or below targetMs, linearly interpolates to 0.0 at maxMs, and returns 0.0 above maxMs. If no latency data is available, returns 1.0 with a skip note.
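
The latency rule above is a linear interpolation between the points (targetMs, 1.0) and (maxMs, 0.0). The sketch below illustrates that arithmetic; `scoreLatency` is a hypothetical standalone function, not the built-in scorer itself.

```typescript
type ScoreResult = { score: number; reason?: string };

// Score a latency against a target and a maximum, following the rule described above.
function scoreLatency(latencyMs: number | undefined, targetMs: number, maxMs: number): ScoreResult {
  if (latencyMs === undefined) return { score: 1, reason: "No latency data; skipped" };
  if (latencyMs <= targetMs) return { score: 1 };
  if (latencyMs >= maxMs) return { score: 0 };
  // Here targetMs < latencyMs < maxMs, so the denominator is strictly positive.
  return { score: (maxMs - latencyMs) / (maxMs - targetMs) };
}
```

For example, with a target of 1000 ms and a maximum of 5000 ms, a task that took 3000 ms scores (5000 - 3000) / (5000 - 1000) = 0.5.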
LLM-based (requires a judge agent)
All three LLM-based scorers accept an AgentLike as judge. They construct a prompt with evaluation criteria, call judge.generate(), and parse the JSON response.
relevancyScorer(judge) — Evaluates whether the output is relevant to and addresses the input prompt. Considers both direct answers and related context. Scores from 0.0 (completely irrelevant) to 1.0 (perfectly relevant).
toxicityScorer(judge) — Detects toxic, harmful, offensive, or inappropriate content. Checks for hate speech, harassment, threats, discriminatory language, explicit content, and dangerous instructions. The score represents the level of toxicity: 0.0 means clean, 1.0 means highly toxic.
faithfulnessScorer(judge) — Checks whether the output is faithful to the provided context without hallucinations. Every claim in the output should be supported by the context. Scores from 0.0 (entirely fabricated) to 1.0 (completely faithful). If no context is provided, evaluates internal consistency.
Persistence
All scores are stored in the _smithers_scorers table:

| Column | Type | Description |
|---|---|---|
| id | TEXT | Unique score row ID |
| run_id | TEXT | Parent run |
| node_id | TEXT | Task that was scored |
| iteration | INTEGER | Task iteration |
| attempt | INTEGER | Task attempt number |
| scorer_id | TEXT | Scorer identifier |
| scorer_name | TEXT | Human-readable scorer name |
| source | TEXT | live or batch |
| score | REAL | The 0-1 score |
| reason | TEXT | Optional explanation |
| meta_json | TEXT | JSON metadata |
| input_json | TEXT | Serialized scorer input |
| output_json | TEXT | Serialized task output |
| latency_ms | REAL | Task execution latency |
| scored_at_ms | INTEGER | When the score was computed |
| duration_ms | REAL | How long the scorer itself took |
Execution Modes
Async (live scoring)
When scorers are attached to a <Task>, they run via runScorersAsync, a fire-and-forget execution mode that never blocks the workflow. All scorers run concurrently with unbounded concurrency. Errors are logged but do not fail the task.
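
The fire-and-forget pattern can be sketched as below. This illustrates the technique, not the implementation of runScorersAsync: the function name `runScorersInBackground`, the callback parameters, and the `ScoreFn` type are all hypothetical.

```typescript
type ScoreResult = { score: number; reason?: string };
type ScoreFn = () => Promise<ScoreResult>;

// Start every scorer at once with no concurrency limit.
// A workflow would call this without awaiting the result; the returned promise
// exists only so tests or shutdown hooks can wait for completion.
function runScorersInBackground(
  scorers: Record<string, ScoreFn>,
  onScore: (key: string, result: ScoreResult) => void,
  onError: (key: string, error: unknown) => void = (key, error) =>
    console.error(`scorer ${key} failed`, error),
): Promise<void> {
  const pending = Object.entries(scorers).map(async ([key, score]) => {
    try {
      onScore(key, await score());
    } catch (error) {
      // Logged and swallowed, so a failing scorer cannot fail the task.
      onError(key, error);
    }
  });
  return Promise.all(pending).then(() => undefined);
}
```

Because each scorer has its own try/catch, one failure does not prevent the others from reporting their scores.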
Batch (offline evaluation)
For testing and offline evaluation, call runScorersBatch directly. It runs all scorers, waits for completion, and returns a map of key to ScoreResult | null.
Results from both modes are stored in the _smithers_scorers table with a source column of "live" or "batch".
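
A batch run differs from live scoring in that the caller waits for every scorer and receives the results. The sketch below shows that shape; `runBatch` is a hypothetical stand-in for runScorersBatch, and mapping a failed scorer to `null` is an assumption about what the `null` in the return type means.

```typescript
type ScoreResult = { score: number; reason?: string };
type ScoreFn = () => Promise<ScoreResult>;

// Run every scorer, wait for all of them, and return one entry per key.
// Assumption: a scorer that throws is reported as null instead of failing the whole batch.
async function runBatch(
  scorers: Record<string, ScoreFn>,
): Promise<Record<string, ScoreResult | null>> {
  const pairs = await Promise.all(
    Object.entries(scorers).map(
      async ([key, score]): Promise<[string, ScoreResult | null]> => {
        try {
          return [key, await score()];
        } catch {
          return [key, null];
        }
      },
    ),
  );
  const results: Record<string, ScoreResult | null> = {};
  for (const [key, value] of pairs) results[key] = value;
  return results;
}
```

In a test, the returned map can be asserted on directly, for example by checking that a given scorer's entry is non-null and above a threshold.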
Aggregation
Query aggregate statistics across runs.
Filter Options
| Filter | Type | Description |
|---|---|---|
| runId | string | Filter to a specific run |
| nodeId | string | Filter to a specific task node |
| scorerId | string | Filter to a specific scorer |
Returned Statistics
Each entry in the returned array contains:

| Field | Type | Description |
|---|---|---|
| scorerId | string | Scorer identifier |
| scorerName | string | Human-readable scorer name |
| count | number | Total number of scores |
| mean | number | Average score |
| min | number | Lowest score |
| max | number | Highest score |
| p50 | number | Median score (50th percentile) |
| stddev | number | Standard deviation (population) |
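
To make the statistics concrete, the sketch below computes them for a list of scores from a single scorer. `summarize` is a hypothetical helper, and the median convention (averaging the two middle values for an even count) is an assumption; the documentation does not say how Smithers computes p50 internally.

```typescript
type ScoreStats = {
  count: number;
  mean: number;
  min: number;
  max: number;
  p50: number;
  stddev: number;
};

// Summarize a non-empty list of scores.
function summarize(scores: number[]): ScoreStats {
  if (scores.length === 0) throw new Error("summarize requires at least one score");
  const sorted = [...scores].sort((a, b) => a - b);
  const count = sorted.length;
  const mean = sorted.reduce((sum, s) => sum + s, 0) / count;
  const mid = Math.floor(count / 2);
  // Median: the middle value, or the average of the two middle values for an even count.
  const p50 = count % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation divides by count, not count - 1.
  const variance = sorted.reduce((sum, s) => sum + (s - mean) ** 2, 0) / count;
  return { count, mean, min: sorted[0], max: sorted[count - 1], p50, stddev: Math.sqrt(variance) };
}
```

The population form of the standard deviation treats the stored scores as the full set of interest, so it is slightly smaller than the sample form for small counts.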
Events
Three event types are emitted during the scorer lifecycle:
ScorerStarted: Emitted when a scorer begins evaluation.

| Field | Type |
|---|---|
| scorerId | string |
| scorerName | string |
| nodeId | string |
| runId | string |
ScorerFinished: Emitted when a scorer completes successfully. Includes the score value.

| Field | Type |
|---|---|
| scorerId | string |
| scorerName | string |
| score | number |
| nodeId | string |
| runId | string |
ScorerFailed: Emitted when a scorer throws an error. Includes the error.

| Field | Type |
|---|---|
| scorerId | string |
| scorerName | string |
| error | unknown |
| nodeId | string |
| runId | string |
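
The three payloads share four fields and differ only in `score` and `error`, which maps naturally onto a discriminated union. In the sketch below, the `type` discriminant and the `describeScorerEvent` helper are assumptions; the tables above list only the payload fields.

```typescript
type ScorerEventBase = {
  scorerId: string;
  scorerName: string;
  nodeId: string;
  runId: string;
};

// The `type` discriminant is hypothetical; only the other fields come from the tables above.
type ScorerEvent =
  | ({ type: "ScorerStarted" } & ScorerEventBase)
  | ({ type: "ScorerFinished"; score: number } & ScorerEventBase)
  | ({ type: "ScorerFailed"; error: unknown } & ScorerEventBase);

// Turn an event into a one-line log message.
function describeScorerEvent(event: ScorerEvent): string {
  const where = `${event.scorerName} on ${event.nodeId} (run ${event.runId})`;
  switch (event.type) {
    case "ScorerStarted":
      return `started ${where}`;
    case "ScorerFinished":
      return `finished ${where} with score ${event.score.toFixed(2)}`;
    case "ScorerFailed":
      return `failed ${where}: ${String(event.error)}`;
  }
}
```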
Metrics
Smithers tracks four Effect metrics for scorer execution:

| Metric | Type | Description |
|---|---|---|
| smithers.scorers.started | Counter | Incremented when a scorer begins |
| smithers.scorers.finished | Counter | Incremented when a scorer completes |
| smithers.scorers.failed | Counter | Incremented when a scorer throws |
| smithers.scorer.duration_ms | Histogram | Scorer execution time (exponential buckets, ~10ms to ~80s) |
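
Exponential buckets grow by a constant factor, which gives fine resolution for fast scorers and coarse resolution for slow ones. The sketch below shows how such boundaries are generated. The parameters (start at 10 ms, factor 2, 14 buckets) are assumptions chosen only because they span roughly the stated range; the actual configuration is not given here.

```typescript
// Generate exponential bucket upper bounds: start, start*factor, start*factor^2, ...
function exponentialBoundaries(start: number, factor: number, count: number): number[] {
  return Array.from({ length: count }, (_, i) => start * factor ** i);
}

// Assumed parameters (not stated in the documentation): 10 ms start, doubling, 14 buckets.
const boundaries = exponentialBoundaries(10, 2, 14);
// The first bound is 10 ms and the last is 10 * 2^13 = 81920 ms, roughly 80 s.
```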