triage-run takes a single run id, pulls that run’s state and recent event log,
diagnoses the most likely root cause, and recommends the exact next move: fix,
rewind, retry, or escalate. Reach for it when a run failed or stalled and you
want a grounded read plus a concrete command instead of eyeballing the logs.
targetRunId, not runId: the engine reserves input.runId for
a run’s own id, so a workflow that inspects another run needs a distinct name.
Stages
- gather: deterministic (no agent): shells
inspect <targetRunId> --format jsonandevents <targetRunId>, returning the runstate, its run-levelrunError, itsfailingNodes, anypendingApprovalsthe run is suspended on, the tail oflastEvents, anokflag (false when inspect could not be read), and a one-linesummary. - diagnose: an agent reads the gathered evidence and names the single most likely
rootCauseHypothesis, with supportingevidenceand aconfidencelevel. Pending approvals are read as a stuck state, not a failure; recurring identical errors point at a persistent defect, one-off network resets at a transient one. - recommend: an agent proposes one
recommendedAction(fix|rewind|retry|escalate), the exactcommandto run next, and arationale. A run waiting on an approval is always escalated, never rewound or retried.
Inputs
| Input | Type | Default |
|---|---|---|
targetRunId | string | (required) the failed or stuck run to triage |