Core Concepts
Scorer
A scorer is a named function that takes a `ScorerInput` and returns a `ScoreResult`.
ScoreResult
Every scorer returns a `ScoreResult`:

| Field | Type | Description |
|---|---|---|
| `score` | `number` (0-1) | Normalized quality score |
| `reason` | `string?` | Human-readable explanation |
| `meta` | `Record<string, unknown>?` | Arbitrary metadata for downstream use |
ScorerInput
The input passed to every scorer function:

| Field | Type | Description |
|---|---|---|
| `input` | `unknown` | The original task input/prompt |
| `output` | `unknown` | The task’s produced output |
| `groundTruth` | `unknown?` | Expected output for comparison |
| `context` | `unknown?` | Additional context (e.g. retrieved docs) |
| `latencyMs` | `number?` | How long the task took, in milliseconds |
| `outputSchema` | `ZodObject?` | The Zod schema the output should match |
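
The two tables above can be read as a pair of TypeScript shapes. The sketch below is illustrative only: the type declarations mirror the documented fields, but the `Scorer` wrapper and `exactMatchScorer` are hypothetical names rather than part of the Smithers API, the scorer is synchronous for simplicity, and `outputSchema` is typed as `unknown` to keep the snippet free of a Zod dependency.

```typescript
// Shapes mirroring the ScoreResult and ScorerInput tables
// (a sketch, not the library's own declarations).
type ScoreResult = {
  score: number; // normalized to the 0-1 range
  reason?: string;
  meta?: Record<string, unknown>;
};

type ScorerInput = {
  input: unknown;
  output: unknown;
  groundTruth?: unknown;
  context?: unknown;
  latencyMs?: number;
  outputSchema?: unknown; // a Zod object schema in the real library
};

// Hypothetical wrapper: a scorer is a named function.
type Scorer = {
  name: string;
  score: (input: ScorerInput) => ScoreResult;
};

// Hypothetical example scorer: 1 when the output equals the ground truth, else 0.
const exactMatchScorer: Scorer = {
  name: "exact-match",
  score: ({ output, groundTruth }) => {
    if (groundTruth === undefined) {
      return { score: 0, reason: "No ground truth provided" };
    }
    // Structural comparison via JSON; sensitive to key order, which is fine for a sketch.
    const matches = JSON.stringify(output) === JSON.stringify(groundTruth);
    return {
      score: matches ? 1 : 0,
      reason: matches ? "Output equals ground truth" : "Output differs from ground truth",
    };
  },
};
```

Calling `exactMatchScorer.score({ input: "q", output: { a: 1 }, groundTruth: { a: 1 } })` yields a score of 1 with an explanatory `reason`.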
Attaching Scorers to Tasks
Pass a `scorers` map to any `<Task>` component.
Sampling
Not every run needs every scorer. Use sampling to control evaluation frequency:

| Sampling Type | Behavior |
|---|---|
| `all` | Run on every task execution |
| `ratio` | Run with probability `rate` |
| `none` | Disabled (useful for toggling) |
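
The three sampling behaviors amount to a small decision function. The following is a minimal sketch, assuming a sampling config shaped as a discriminated union with a `type` field and, for `ratio`, a `rate` between 0 and 1. The `Sampling` type and `shouldRunScorer` are hypothetical names used for illustration, not Smithers exports.

```typescript
// Assumed config shape; only the three type names and `rate` come from the table above.
type Sampling =
  | { type: "all" }
  | { type: "ratio"; rate: number }
  | { type: "none" };

// Decide whether a scorer should run for this task execution.
// `random` is injectable so the decision can be tested deterministically.
function shouldRunScorer(
  sampling: Sampling,
  random: () => number = Math.random,
): boolean {
  switch (sampling.type) {
    case "all":
      return true;
    case "ratio":
      // random() is in [0, 1), so rate 0 never runs and rate 1 always runs.
      return random() < sampling.rate;
    case "none":
      return false;
  }
}
```

With `{ type: "ratio", rate: 0.1 }`, roughly one execution in ten would be scored, which keeps expensive LLM-based scorers affordable on high-volume tasks.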
Built-in Scorers
Smithers includes five built-in scorers:

Code-based (no LLM needed)

- `schemaAdherenceScorer()`: Validates that the output matches the task’s Zod schema. Returns 1.0 if valid, 0.0 if not.
- `latencyScorer({ targetMs, maxMs })`: Scores based on execution time. Returns 1.0 at or below `targetMs`, linearly decreasing to 0.0 at `maxMs`.
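
The latency rule described above is a linear interpolation between `targetMs` and `maxMs`. The function below is a sketch of that arithmetic only; `latencyScore` is a hypothetical helper, not the built-in scorer's implementation.

```typescript
// Linear latency score: 1 at or below targetMs, 0 at or above maxMs,
// and a straight line in between.
function latencyScore(latencyMs: number, targetMs: number, maxMs: number): number {
  if (latencyMs <= targetMs) return 1;
  if (latencyMs >= maxMs) return 0;
  // Reached only when targetMs < latencyMs < maxMs, so the divisor is positive.
  return (maxMs - latencyMs) / (maxMs - targetMs);
}
```

For example, with `targetMs = 1000` and `maxMs = 5000`, a task that takes 3000 ms sits halfway between the two bounds and scores 0.5.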
LLM-based (requires a judge agent)
- `relevancyScorer(judge)`: Uses an LLM to evaluate whether the output is relevant to the input.
- `toxicityScorer(judge)`: Uses an LLM to detect toxic or harmful content.
- `faithfulnessScorer(judge)`: Uses an LLM to check whether the output is faithful to the provided context.
Persistence
All scores are stored in the `_smithers_scorers` table:

| Column | Type | Description |
|---|---|---|
| `id` | TEXT | Unique score row ID |
| `run_id` | TEXT | Parent run |
| `node_id` | TEXT | Task that was scored |
| `iteration` | INTEGER | Task iteration |
| `attempt` | INTEGER | Task attempt number |
| `scorer_id` | TEXT | Scorer identifier |
| `scorer_name` | TEXT | Human-readable scorer name |
| `source` | TEXT | `live` or `batch` |
| `score` | REAL | The 0-1 score |
| `reason` | TEXT | Optional explanation |
| `meta_json` | TEXT | JSON metadata |
| `input_json` | TEXT | Serialized scorer input |
| `output_json` | TEXT | Serialized task output |
| `latency_ms` | REAL | Task execution latency, in milliseconds |
| `scored_at_ms` | INTEGER | When the score was computed |
| `duration_ms` | REAL | How long the scorer itself took, in milliseconds |
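
Rows with these columns lend themselves to per-scorer summaries. The sketch below works on rows already loaded into memory and uses only the `scorer_id` and `score` columns. `ScoreRow`, `ScoreStats`, and `aggregateByScorer` are hypothetical names for illustration and say nothing about how Smithers itself computes or exposes aggregates.

```typescript
// Subset of the _smithers_scorers columns needed for this summary.
type ScoreRow = { scorer_id: string; score: number };

type ScoreStats = { count: number; mean: number; min: number; max: number };

// Group rows by scorer and compute count, mean, min, and max in one pass.
function aggregateByScorer(rows: ScoreRow[]): Map<string, ScoreStats> {
  const stats = new Map<string, ScoreStats>();
  for (const row of rows) {
    const s = stats.get(row.scorer_id);
    if (!s) {
      stats.set(row.scorer_id, {
        count: 1,
        mean: row.score,
        min: row.score,
        max: row.score,
      });
      continue;
    }
    s.count += 1;
    s.mean += (row.score - s.mean) / s.count; // incremental mean update
    s.min = Math.min(s.min, row.score);
    s.max = Math.max(s.max, row.score);
  }
  return stats;
}
```

The same grouping could be expressed in SQL against the table directly; the in-memory form is shown here only because it runs without a database.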
Aggregation
Aggregate statistics can be queried across runs.
Events
Three event types are emitted during scoring:

- `ScorerStarted`: A scorer has begun evaluation
- `ScorerFinished`: A scorer completed successfully
- `ScorerFailed`: A scorer threw an error
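
One way to consume these events is to model them as a discriminated union and tally them by type. In the sketch below only the three event names come from the list above; the payload fields (`scorerId`, `score`, `error`) and the `countEvents` helper are assumptions made for illustration, and the real event payloads may differ.

```typescript
// Event names are from the documentation; payload fields are hypothetical.
type ScorerEvent =
  | { type: "ScorerStarted"; scorerId: string }
  | { type: "ScorerFinished"; scorerId: string; score: number }
  | { type: "ScorerFailed"; scorerId: string; error: string };

// Tally how many events of each type were observed.
function countEvents(events: ScorerEvent[]): Record<ScorerEvent["type"], number> {
  const counts: Record<ScorerEvent["type"], number> = {
    ScorerStarted: 0,
    ScorerFinished: 0,
    ScorerFailed: 0,
  };
  for (const event of events) {
    counts[event.type] += 1;
  }
  return counts;
}
```

Comparing the `ScorerStarted` count with the sum of `ScorerFinished` and `ScorerFailed` gives a quick check for scorers that are still running or that never reported back.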