Scorers API - Smithers

A scorer grades a task’s output and returns a number between 0 and 1. Attach scorers to a Task via its scorers prop; they run after the task completes and never block the workflow. Each result is persisted to the _smithers_scorers table so you can aggregate scores across runs. All scorer values and types are re-exported from the smithers-orchestrator facade, which is canonical. The smithers-orchestrator/scorers subpath exports the same surface.

import {
  createScorer,
  llmJudge,
  faithfulnessScorer,
  schemaAdherenceScorer,
  latencyScorer,
  aggregateScores,
  runScorersBatch,
} from "smithers-orchestrator";
import type { Scorer, ScoreResult, ScorersMap } from "smithers-orchestrator";

The component that hosts scorers (Task) is returned by the factory, not imported. See the Components reference for the scorers prop and ScorersMap for its shape.

Concepts

A Scorer is a named, self-describing evaluator. Its score function is a ScorerFn: given a ScorerInput, it returns a Promise<ScoreResult>.

Scorer

object

A named scorer.

id

string

required

Unique identifier. Persisted as scorerId and used to filter aggregates.

name

string

required

Human-readable name. Persisted as scorerName.

description

string

required

Description of what the scorer evaluates.

score

ScorerFn

required

The scoring function.

ScorerInput

object

The argument passed to a ScorerFn. Built from the task’s input, output, and metadata at scoring time.

input

unknown

required

The original task input or prompt.

output

unknown

required

The task’s produced output.

groundTruth

unknown

Expected output for comparison.

context

unknown

Additional context such as retrieved documents.

latencyMs

number

How long the task took, in milliseconds.

outputSchema

ZodObject

The Zod schema the output should match.

ScoreResult

object

What a ScorerFn returns.

score

number

required

Normalized quality score between 0 and 1.

reason

string

Human-readable explanation of the score.
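Read together, the three shapes above map onto a handful of TypeScript types. The sketch below restates them locally and writes one ScorerFn by hand. It is an illustration rather than the library's source: the real types ship with smithers-orchestrator, `outputSchema` is typed as `unknown` here to avoid a Zod dependency, and `exactMatch` / `exactMatchScore` are hypothetical names with an assumed comparison rule.

```typescript
// Local restatement of the documented shapes; not the library's own definitions.
interface ScoreResult {
  score: number; // normalized, 0..1
  reason?: string;
}

interface ScorerInput {
  input: unknown;
  output: unknown;
  groundTruth?: unknown;
  context?: unknown;
  latencyMs?: number;
  outputSchema?: unknown; // a ZodObject in the real type
}

type ScorerFn = (input: ScorerInput) => Promise<ScoreResult>;

interface Scorer {
  id: string;
  name: string;
  description: string;
  score: ScorerFn;
}

// Hypothetical helper: 1 when output and groundTruth serialize identically, else 0.
function exactMatchScore(output: unknown, groundTruth: unknown): number {
  return JSON.stringify(output) === JSON.stringify(groundTruth) ? 1 : 0;
}

// A hand-written scorer that only looks at output and groundTruth.
const exactMatch: Scorer = {
  id: "exact-match",
  name: "Exact Match",
  description: "1 when the output equals groundTruth, otherwise 0",
  score: async ({ output, groundTruth }) => {
    if (groundTruth === undefined) {
      return { score: 0, reason: "no groundTruth supplied" };
    }
    const score = exactMatchScore(output, groundTruth);
    return { score, reason: score === 1 ? "match" : "mismatch" };
  },
};
```

Only `input` and `output` are always present on a ScorerInput, so a scorer that depends on an optional field should decide what to return when that field is missing.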

createScorer

Build a custom Scorer from a plain config object. The returned scorer is just its config; the work lives in your score function.

function createScorer(config: CreateScorerConfig): Scorer;

config

CreateScorerConfig

required

id

string

required

Unique identifier.

name

string

required

Human-readable name.

description

string

required

What the scorer evaluates.

score

ScorerFn

required

Async function returning a ScoreResult.

Scorer

object

The named scorer, ready to bind to a task.

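const wordCount = createScorer({
  id: "word-count",
  name: "Word Count",
  description: "Scores toward 1.0 as output approaches 200 words",
  score: async ({ output }) => {
    // Drop empty tokens so blank or whitespace-padded output is not counted as a word.
    const words = String(output).split(/\s+/).filter(Boolean).length;
    return { score: Math.min(words / 200, 1) };
  },
});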

Source createScorer.js · Tests create-scorer.test.js · See also llmJudge

llmJudge

Build an LLM-as-judge scorer that delegates evaluation to an agent. The judge is prompted with your instructions plus the output of promptTemplate, and is expected to reply with JSON { "score": <0-1>, "reason": "<text>" }. The reply is parsed leniently (a bare number works, and braces inside reason do not truncate the match), the score is clamped to 0–1, and an unparseable reply scores 0.

function llmJudge(config: LlmJudgeConfig): Scorer;

config

LlmJudgeConfig

required

id

string

required

Unique identifier.

name

string

required

Human-readable name.

description

string

required

What the judge evaluates.

judge

AgentLike

required

The agent that performs the evaluation.

instructions

string

required

System-level instructions prepended to every prompt.

promptTemplate

(input: ScorerInput) => string

required

Builds the judge prompt from the scorer input. Instruct the judge to respond with the { score, reason } JSON object.

Scorer

object

A scorer whose score calls judge.generate(...) and parses the reply.

// `judge` is any AgentLike instance created elsewhere.
const tone = llmJudge({
  id: "tone",
  name: "Professional Tone",
  description: "Evaluates professional tone",
  judge,
  instructions: "You evaluate text for professional tone.",
  promptTemplate: ({ output }) =>
    `Rate the professionalism (0-1 JSON):\n\n${String(output)}`,
});
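The lenient reply parsing described for llmJudge can be pictured roughly as follows. This is a standalone sketch, not the code in llmJudge.js: `parseJudgeReply` and `clamp01` are hypothetical names, and the order of the fallbacks is an assumption made for illustration.

```typescript
type ScoreResult = { score: number; reason?: string };

// Keep a score inside the documented 0..1 range.
const clamp01 = (n: number): number => Math.min(1, Math.max(0, n));

function parseJudgeReply(reply: string): ScoreResult {
  const text = reply.trim();

  // Take the widest {...} span so braces inside "reason" stay within the match.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      const parsed: unknown = JSON.parse(text.slice(start, end + 1));
      if (typeof parsed === "object" && parsed !== null) {
        const { score, reason } = parsed as { score?: unknown; reason?: unknown };
        if (typeof score === "number" && Number.isFinite(score)) {
          return typeof reason === "string"
            ? { score: clamp01(score), reason }
            : { score: clamp01(score) };
        }
      }
    } catch {
      // Not valid JSON; fall through to the bare-number check.
    }
  }

  // A bare number such as "0.7" is accepted as the score.
  if (/^-?\d+(\.\d+)?$/.test(text)) {
    return { score: clamp01(Number(text)) };
  }

  // Anything else is treated as unparseable and scores 0.
  return { score: 0, reason: "unparseable judge reply" };
}
```

Because an unparseable reply scores 0 rather than failing, a run of zeros from a judge-based scorer can mean the judge ignored the JSON format, not that the output was poor.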

Source llmJudge.js · Tests create-scorer.test.js · See also Built-in scorers

Built-in scorers

Each built-in is a factory that returns a Scorer. The three judge-based scorers take an AgentLike judge; the two deterministic ones do not call an agent.

Scorer	What it measures	Factory
`faithfulnessScorer`	Output is grounded in `context`, no hallucinations	`faithfulnessScorer(judge)`
`relevancyScorer`	Output addresses the `input`	`relevancyScorer(judge)`
`toxicityScorer`	Toxic, harmful, or inappropriate content (higher = more toxic)	`toxicityScorer(judge)`
`schemaAdherenceScorer`	Output passes the task’s `outputSchema` (1 valid, 0 invalid)	`schemaAdherenceScorer()`
`latencyScorer`	Execution time vs. budget (1 at/below target, 0 at/above max)	`latencyScorer({ targetMs, maxMs })`

const grounded = faithfulnessScorer(judge);
const onSchema = schemaAdherenceScorer();
const fast = latencyScorer({ targetMs: 5000, maxMs: 20000 });

schemaAdherenceScorer no-ops (scores 1) when the input has no outputSchema, and latencyScorer does the same when latencyMs is missing. toxicityScorer scores the level of toxicity, so clean text scores near 0.
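To make the latency budget concrete, here is a minimal sketch of a mapping with the documented endpoints: 1 at or below `targetMs`, 0 at or above `maxMs`, and 1 when `latencyMs` is missing. The shape of the curve between the two endpoints is not stated above, so the linear falloff is an assumption, and `latencyToScore` is a hypothetical helper rather than part of the library.

```typescript
interface LatencyBudget {
  targetMs: number;
  maxMs: number;
}

// Hypothetical helper mirroring the documented endpoints of latencyScorer.
function latencyToScore(
  latencyMs: number | undefined,
  { targetMs, maxMs }: LatencyBudget,
): number {
  if (latencyMs === undefined) return 1; // documented no-op when latencyMs is missing
  if (latencyMs <= targetMs) return 1; // at or below target
  if (latencyMs >= maxMs) return 0; // at or above max
  // Assumption: linear falloff between target and max.
  return 1 - (latencyMs - targetMs) / (maxMs - targetMs);
}

const budget: LatencyBudget = { targetMs: 5000, maxMs: 20000 };
const halfway = latencyToScore(12500, budget); // midpoint of the budget under the linear assumption
```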

smithersScorers

smithersScorers is the Drizzle table backing scorer persistence (_smithers_scorers). Every scorer result is inserted here as a ScoreRow; aggregateScores reads from it. Use it for direct queries against your store.

Source faithfulnessScorer.js · schema.js · Tests builtins.test.js · See also ScoreRow

Running scorers

Bound scorers run automatically when a task completes, so you rarely call these directly. They are exported for custom hosts, batch evaluation, and tooling.

runScorersAsync

Fire-and-forget execution for live scoring. Runs every binding concurrently via Effect.runFork and returns immediately, so scoring never blocks the workflow. Failures are logged, not thrown.

function runScorersAsync(
  scorers: ScorersMap,
  ctx: ScorerContext,
  adapter: SmithersDb | null,
  eventBus?: EventBus | null,
): void;

scorers

ScorersMap

required

The keyed bindings to run.

ctx

ScorerContext

required

Run/node coordinates plus the data the scorers grade. See ScorerContext.

adapter

SmithersDb | null

required

Database adapter to persist results, or null to skip persistence.

eventBus

EventBus | null

Optional bus that receives ScorerStarted / ScorerFinished / ScorerFailed events.

runScorersBatch

Blocking execution for batch and test evaluation. Runs every binding concurrently and resolves to a map of binding key to ScoreResult (or null when a scorer is sampled out or fails).

function runScorersBatch(
  scorers: ScorersMap,
  ctx: ScorerContext,
  adapter: SmithersDb | null,
  eventBus?: EventBus | null,
): Promise<Record<string, ScoreResult | null>>;

Promise<Record<string, ScoreResult | null>>

object

One entry per binding key, in the order the scorers were declared.

const results = await runScorersBatch(
  { quality: { scorer: tone } },
  {
    runId: "RUN_ID",
    nodeId: "NODE_ID",
    iteration: 0,
    attempt: 1,
    input: "Summarize the article.",
    output: "...",
  },
  null,
);
// results.quality?.score
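Because a binding resolves to null when its scorer is sampled out or fails, code that consumes the result map has to handle both cases. One way to gate a test on the results is sketched below; `bindingsBelow` and the sample data are hypothetical, and counting a null entry as a miss is a design choice you may not want.

```typescript
type ScoreResult = { score: number; reason?: string };
type BatchResults = Record<string, ScoreResult | null>;

// Hypothetical helper: binding keys that have no score or score below the threshold.
function bindingsBelow(results: BatchResults, threshold: number): string[] {
  return Object.entries(results)
    .filter(([, result]) => result === null || result.score < threshold)
    .map(([key]) => key);
}

// Hand-written sample shaped like a runScorersBatch result.
const sample: BatchResults = {
  quality: { score: 0.9, reason: "clear and professional" },
  grounded: { score: 0.4, reason: "two unsupported claims" },
  fast: null, // sampled out or failed
};

const failing = bindingsBelow(sample, 0.7);
```

If sampled-out scorers should not fail a test, filter the null entries out before comparing against the threshold.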

aggregateScores

Compute per-scorer statistics across persisted results: count, mean, min, max, p50, and stddev. Filter to a run, node, or scorer.

function aggregateScores(
  adapter: SmithersDb,
  opts?: AggregateOptions,
): Promise<AggregateScore[]>;

adapter

SmithersDb

required

Database adapter to read scorer rows from.

opts

AggregateOptions

runId

string

Filter to a specific run.

nodeId

string

Filter to a specific node.

scorerId

string

Filter to a specific scorer.

Promise<AggregateScore[]>

object

One row per scorer, ordered by scorer name.

scorerId

string

scorerName

string

count

number

Number of scores included.

mean

number

min

number

max

number

p50

number

Median, computed in memory.

stddev

number

const stats = await aggregateScores(adapter, { runId: "RUN_ID" });
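For intuition about the numeric fields on each AggregateScore row, the sketch below computes the same six statistics over an in-memory array of scores. It is not the implementation in aggregate.js: `summarize` and `ScoreStats` are hypothetical names, and both the population form of the standard deviation and the averaging of the two middle values for an even-sized sample are assumptions.

```typescript
interface ScoreStats {
  count: number;
  mean: number;
  min: number;
  max: number;
  p50: number;
  stddev: number;
}

// Hypothetical helper; expects at least one score.
function summarize(scores: number[]): ScoreStats {
  if (scores.length === 0) throw new Error("summarize needs at least one score");
  const sorted = [...scores].sort((a, b) => a - b);
  const count = sorted.length;
  const mean = sorted.reduce((sum, s) => sum + s, 0) / count;
  const mid = Math.floor(count / 2);
  // Assumption: even-sized samples average the two middle values.
  const p50 = count % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Assumption: population standard deviation (divide by count, not count - 1).
  const variance = sorted.reduce((sum, s) => sum + (s - mean) ** 2, 0) / count;
  return {
    count,
    mean,
    min: sorted[0],
    max: sorted[count - 1],
    p50,
    stddev: Math.sqrt(variance),
  };
}

const demoStats = summarize([0, 0.5, 1, 0.5]);
```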

Scores for a run are also viewable from the CLI:

bunx smithers-orchestrator scores RUN_ID

Source run-scorers.js · aggregate.js · Tests run-scorers.test.js · aggregate.test.js · See also ScorerContext, AggregateScore

To wire scorers into a workflow and read them back, see the Evals quickstart. For the full type surface, see the Types reference. For the scorers prop on Task, see the Components reference.
