Not all tasks are created equal. An implementation task needs a model that writes correct code. A review task needs a model that reasons about architecture. A simple file-reading task needs a model that is fast and cheap. Choosing the right model for each task is the difference between a workflow that works and one that burns money on overkill — or worse, fails because you used a cheap model for a hard job.

Codex (gpt-5.3-codex) — Implementation

Codex is the strongest model for writing and modifying code. Use it for:
  • Implementing features
  • Fixing bugs
  • Running and interpreting tests
  • Refactoring code
  • Fixing review issues
Reasoning effort: set model_reasoning_effort to high by default, and bump it to xhigh for especially complex tasks such as architectural refactors or multi-file changes with tricky dependencies.

Claude Opus (claude-opus-4-6) — Planning and Review

Claude Opus is the strongest model for reasoning about architecture and evaluating code quality. Use it for:
  • Research and codebase exploration
  • Planning implementation steps
  • Code review
  • Report generation
  • Orchestration logic and tool calling

Claude Sonnet (claude-sonnet-4-5-20250929) — Simple Tasks

Sonnet is fast, cheap, and good enough for straightforward work. Use it for:
  • Simple tool calling (reading files, running commands)
  • Lightweight reviews where deep reasoning is not needed
  • Report aggregation from structured data
  • Tasks where a more expensive model would be wasteful

Summary Table

| Task Type | Recommended Model | Why |
| --- | --- | --- |
| Implementing code | Codex | Strongest at code generation |
| Reviewing code | Claude Opus + Codex (parallel) | Two models catch more issues |
| Research and planning | Claude Opus | Strongest at architectural reasoning |
| Running tests / validation | Codex | Good at interpreting build output |
| Simple tool calls | Claude Sonnet | Fast, cheap, sufficient |
| Report generation | Claude Sonnet or Opus | Depends on complexity |
| Ticket discovery | Codex or Claude Opus | Both work well for codebase analysis |
The parallel review row deserves special attention. Running two different models on the same review catches more bugs than running one model twice. They have different blind spots.
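The fan-out itself is a one-liner. A minimal sketch of the pattern, assuming each reviewer exposes an async `review(diff)` method returning a list of issues — the `Reviewer` interface here is a hypothetical stand-in for whatever your Smithers agents expose, not the library's API:

```typescript
// Hypothetical reviewer interface; stands in for your actual agent wrappers.
interface Reviewer {
  name: string;
  review(diff: string): Promise<string[]>; // returns issue descriptions
}

// Run every reviewer on the same diff concurrently and merge the findings,
// deduplicating issues that both models report.
async function parallelReview(
  diff: string,
  reviewers: Reviewer[],
): Promise<string[]> {
  const perModel = await Promise.all(reviewers.map((r) => r.review(diff)));
  return [...new Set(perModel.flat())];
}
```

The dedup step matters: overlap between the two models is common for the obvious bugs, and the unique findings on each side are exactly the blind-spot coverage you are paying for.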

CLI Agents vs AI SDK Agents

Smithers supports two ways to run each model. The choice depends on how you pay.

CLI Agents (subscription-based)

Use ClaudeCodeAgent, CodexAgent, and KimiAgent when you have a subscription to the respective service. The agent runs as a subprocess using the CLI binary, which provides its native tool ecosystem — file editing, shell access, and everything else the CLI supports.
```ts
import { ClaudeCodeAgent, CodexAgent, KimiAgent } from "smithers-orchestrator";

const claude = new ClaudeCodeAgent({
  model: "claude-opus-4-6",
  systemPrompt: SYSTEM_PROMPT,
  dangerouslySkipPermissions: true,
  timeoutMs: 30 * 60 * 1000,
});

const codex = new CodexAgent({
  model: "gpt-5.3-codex",
  systemPrompt: SYSTEM_PROMPT,
  yolo: true,
  config: { model_reasoning_effort: "high" },
  timeoutMs: 30 * 60 * 1000,
});

const kimi = new KimiAgent({
  model: "kimi-latest",
  systemPrompt: SYSTEM_PROMPT,
  thinking: true,
  timeoutMs: 30 * 60 * 1000,
});
```

AI SDK Agents (API billing)

Use AnthropicAgent and OpenAIAgent when you want per-token billing instead of a subscription, or when you want sandboxed tools from Smithers:
```ts
import { stepCountIs } from "ai";
import { AnthropicAgent, OpenAIAgent, tools } from "smithers-orchestrator";

const claude = new AnthropicAgent({
  model: "claude-opus-4-6",
  tools,
  instructions: SYSTEM_PROMPT,
  stopWhen: stepCountIs(100),
});

const codex = new OpenAIAgent({
  model: "gpt-5.3-codex",
  tools,
  instructions: SYSTEM_PROMPT,
  stopWhen: stepCountIs(100),
});
```

Dual-Agent Setup

In practice, you want the flexibility to switch between CLI and API agents without rewriting your workflow. Define both and let an environment variable decide:
```ts
// agents.ts
import { stepCountIs, type ToolSet } from "ai";
import {
  AnthropicAgent,
  ClaudeCodeAgent,
  CodexAgent,
  KimiAgent,
  OpenAIAgent,
  tools as smithersTools,
} from "smithers-orchestrator";
import { SYSTEM_PROMPT } from "./system-prompt";

const tools = smithersTools as ToolSet;
const USE_CLI = process.env.USE_CLI_AGENTS !== "0" && process.env.USE_CLI_AGENTS !== "false";
const UNSAFE = process.env.SMITHERS_UNSAFE === "1";

// --- Codex ---
const CODEX_MODEL = process.env.CODEX_MODEL ?? "gpt-5.3-codex";

const codexApi = new OpenAIAgent({
  model: CODEX_MODEL,
  tools,
  instructions: SYSTEM_PROMPT,
  stopWhen: stepCountIs(100),
  maxOutputTokens: 8192,
});

const codexCli = new CodexAgent({
  model: CODEX_MODEL,
  systemPrompt: SYSTEM_PROMPT,
  yolo: UNSAFE,
  config: { model_reasoning_effort: "high" },
  timeoutMs: 30 * 60 * 1000,
});

export const codex = USE_CLI ? codexCli : codexApi;

// --- Claude ---
const CLAUDE_MODEL = process.env.CLAUDE_MODEL ?? "claude-opus-4-6";

const claudeApi = new AnthropicAgent({
  model: CLAUDE_MODEL,
  tools,
  instructions: SYSTEM_PROMPT,
  stopWhen: stepCountIs(100),
  maxOutputTokens: 8192,
});

const claudeCli = new ClaudeCodeAgent({
  model: CLAUDE_MODEL,
  systemPrompt: SYSTEM_PROMPT,
  dangerouslySkipPermissions: UNSAFE,
  timeoutMs: 30 * 60 * 1000,
});

export const claude = USE_CLI ? claudeCli : claudeApi;

// --- Kimi ---
const KIMI_MODEL = process.env.KIMI_MODEL ?? "kimi-latest";

const kimiCli = new KimiAgent({
  model: KIMI_MODEL,
  systemPrompt: SYSTEM_PROMPT,
  thinking: true,
  timeoutMs: 30 * 60 * 1000,
});

export const kimi = kimiCli; // Kimi is CLI-only
```
Switch at launch time:
```sh
# Use CLI agents (subscription)
USE_CLI_AGENTS=1 SMITHERS_UNSAFE=1 bunx smithers up workflow.tsx

# Use API agents
USE_CLI_AGENTS=0 bunx smithers up workflow.tsx
```
Your workflow code never changes. Only the agent wiring does.
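The flag parsing is the one piece worth getting exactly right: an unset variable must default to CLI, and both `"0"` and `"false"` must select the API path. Here is the same logic from the `USE_CLI` line in agents.ts, extracted as a pure function so it can be tested in isolation (the function name is illustrative):

```typescript
type Backend = "cli" | "api";

// Mirrors the USE_CLI check in agents.ts: default to CLI agents unless the
// flag is explicitly set to "0" or "false".
function chooseBackend(env: Record<string, string | undefined>): Backend {
  const flag = env.USE_CLI_AGENTS;
  return flag === "0" || flag === "false" ? "api" : "cli";
}
```

Passing `process.env` to this function at module load gives the same behavior as the inline check.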

Assigning Models to Steps

In a typical workflow with a review loop, assign models by what they are good at:
| Step | Agent | Reasoning |
| --- | --- | --- |
| Discover | codex | Good at codebase analysis and structured output |
| Research | claude | Strong at finding patterns and synthesizing information |
| Plan | claude | Best at architectural reasoning |
| Implement | codex | Strongest at writing code |
| Validate | codex | Good at running and interpreting tests |
| Review (parallel) | claude + codex | Two models catch different issue types |
| ReviewFix | codex | Fixing code is implementation work |
| Report | claude | Good at summarization |
Notice the pattern: Codex does the hands-on coding, Claude does the thinking and judging. The review step uses both because that is where coverage matters most.
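The table translates directly into a step-to-agent map. A sketch, assuming the agents exported from agents.ts above; the `Agent` shape and step names are illustrative placeholders, not Smithers types:

```typescript
// Placeholder agent shape; in a real workflow these would be the agents
// exported from agents.ts.
interface Agent {
  name: string;
}

const claude: Agent = { name: "claude" };
const codex: Agent = { name: "codex" };

type StepName =
  | "discover" | "research" | "plan" | "implement"
  | "validate" | "review" | "reviewFix" | "report";

// Review fans out to both models; every other step gets a single agent.
const stepAgents: Record<StepName, Agent[]> = {
  discover: [codex],
  research: [claude],
  plan: [claude],
  implement: [codex],
  validate: [codex],
  review: [claude, codex],
  reviewFix: [codex],
  report: [claude],
};
```

Keeping the assignment in one map makes the Codex-implements / Claude-judges split visible at a glance, and makes swapping a model for one step a one-line change.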

Codex Reasoning Effort

The model_reasoning_effort config controls how much thinking Codex does before it generates. Higher effort produces better results but costs more time and tokens.
```ts
const codex = new CodexAgent({
  model: "gpt-5.3-codex",
  config: { model_reasoning_effort: "high" },  // default recommendation
});
```
| Level | Use when |
| --- | --- |
| medium | Simple, well-defined changes with clear instructions |
| high | Default. Most implementation and review tasks |
| xhigh | Complex architectural changes, multi-file refactors, tricky edge cases |
When in doubt, use high. You can always bump it to xhigh for the tasks that keep failing.
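If you want the escalation rule in code rather than in your head, the table can be approximated by a small heuristic. This is purely illustrative: the traits and thresholds below are assumptions for the sketch, not Smithers or Codex behavior.

```typescript
type Effort = "medium" | "high" | "xhigh";

// Hypothetical task traits; pick whatever signals your workflow tracks.
interface TaskTraits {
  filesTouched: number;
  architectural: boolean;     // changes module boundaries or interfaces
  retryAfterFailure: boolean; // a previous attempt at this task failed
  wellSpecified: boolean;     // clear instructions, well-defined scope
}

function effortFor(t: TaskTraits): Effort {
  // Escalate for architectural work, retries, or wide blast radius.
  if (t.architectural || t.retryAfterFailure || t.filesTouched > 5) return "xhigh";
  // Only drop to medium for small, clearly specified changes.
  if (t.wellSpecified && t.filesTouched <= 1) return "medium";
  return "high"; // when in doubt, high
}
```

The result plugs straight into the agent config as `config: { model_reasoning_effort: effortFor(task) }`. Note the retry trait: it implements the "bump it to xhigh for the tasks that keep failing" advice automatically.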

Next Steps