Backstage as an LLM agent: a 5-cap tool-calling loop with propose-only tools
Building the AgentLoop for incident-copilot-backend — what earns the 'agent' framing, and the five stop conditions that keep it cheap.
Tags: backstage, ai, llm, agents, claude, typescript, mcp
Post 4 sketched a four-phase plan for AI plugins on Backstage. The fourth phase — an incident-investigation co-pilot — was the diagonal of the 2x2: read the catalog and change the world. The “change the world” part is what earns the word “agent” vs. “co-pilot.” This post is about that part.
The artifact is an AgentLoop class that ships in
@internal/plugin-incident-copilot-backend (in the
Naga15/backstage-corp
private repo). It’s an LLM in a tool-calling loop with five
independent stop conditions, a tool surface that’s split into
read-only and propose-only kinds, and a citation validator that drops
hypotheses whose evidence IDs don’t resolve. None of those design
choices are accidental.
What “agent” should mean
The word is overloaded. The cheap definition is “an LLM that calls a function.” By that bar every chatbot with a Python interpreter is an agent.
The bar I want for an on-call SRE assistant is higher. Specifically:
1. The model decides which signals to pull and in what order, instead of the backend pre-fetching everything.
2. The model iterates: result → next-tool → result → next-tool → …
3. There’s a stop criterion the model can meet by emitting no more tool calls; i.e., the model decides it’s done.
4. There’s a budget the operator sets that overrides #3: wallclock, token, dollar, call-count.
5. Destructive actions are not actions at all. They are proposals the model emits. A human clicks them.
If you don’t have #5 specifically, you have an automation system, not an agent. The distinction matters because the prompt-injection threat model for “agent that can spend money / page on-call / rollback prod” is qualitatively different from one that can’t.
The tool surface, in three kinds
// src/agent/Tool.ts
import { z } from 'zod';

// AgentToolContext and ToolResult are defined elsewhere in the plugin.
export interface Tool<TInput = unknown> {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: z.ZodSchema<TInput>;
  readonly kind: 'read' | 'propose' | 'record';
  handler(input: TInput, ctx: AgentToolContext): Promise<ToolResult>;
}
Three things to notice:
- kind is part of the type. Not a string the handler returns later. Not something the loop infers. The class of action is fixed at registration time, so you can audit your tool registry to see exactly which tools can propose anything, full stop.
- inputSchema is a z.ZodSchema, not a string description. The ToolRegistry.invoke() method validates input against it before calling the handler. A bad-shape tool call becomes a tool error fed back to the LLM, not a crash:
async invoke(name, rawInput, ctx) {
  const tool = this.map.get(name);
  if (!tool) return { error: `unknown tool: '${name}'` };
  const parsed = tool.inputSchema.safeParse(rawInput);
  if (!parsed.success) {
    return { error: `invalid arguments for tool '${name}': ${parsed.error.message}` };
  }
  try {
    return await tool.handler(parsed.data, ctx);
  } catch (e) {
    return { error: `tool '${name}' threw: ${(e as Error).message}` };
  }
}
- The propose tools (propose_rollback, propose_page_on_call, etc.) return SuggestedAction objects with destructive: true as a flag. The frontend (post 7, coming) enforces a confirm dialog before the operator’s click can trigger anything.
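Because kind is fixed at registration, the audit mentioned above can be a few lines of code. The following is a minimal sketch, not the plugin's actual registry API: ToolSummary, auditByKind, and the record_hypothesis tool name are hypothetical, introduced only for illustration.

```typescript
// Hypothetical, trimmed-down view of a registered tool: only what an audit needs.
type ToolKind = 'read' | 'propose' | 'record';

interface ToolSummary {
  readonly name: string;
  readonly kind: ToolKind;
}

// Group tool names by kind so a reviewer can scan the propose surface at a glance.
function auditByKind(tools: readonly ToolSummary[]): Record<ToolKind, string[]> {
  const report: Record<ToolKind, string[]> = { read: [], propose: [], record: [] };
  for (const tool of tools) {
    report[tool.kind].push(tool.name);
  }
  return report;
}

// Example registry contents. record_hypothesis is an assumed name, not from the plugin.
const exampleTools: ToolSummary[] = [
  { name: 'query_datadog_logs', kind: 'read' },
  { name: 'propose_rollback', kind: 'propose' },
  { name: 'propose_page_on_call', kind: 'propose' },
  { name: 'record_hypothesis', kind: 'record' },
];

const audit = auditByKind(exampleTools);
console.log(audit.propose); // [ 'propose_rollback', 'propose_page_on_call' ]
```

A check like this could run in CI and fail when the propose list changes without review.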
The propose_rollback handler is one of the smallest files in the
plugin:
async handler(input): Promise<ToolResult> {
  const action: SuggestedAction = {
    label: `Rollback ${input.application} to ${input.targetRevision}`,
    kind: 'deep-link',
    href: `/argocd/applications/${encodeURIComponent(input.application)}?revision=${encodeURIComponent(input.targetRevision)}`,
    destructive: true,
  };
  return { suggestedAction: action, text: `Drafted rollback proposal: ${action.label}` };
}
There’s nothing that actually rolls back. The handler builds a deep link. The LLM’s “tool call success” message is “I drafted a proposal,” not “I rolled back.”
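The encodeURIComponent calls in that handler matter because the model chooses both values, possibly under the influence of injected text. A small sketch of the same href construction as a standalone function (rollbackHref is a hypothetical name) shows what happens to characters that would otherwise change the URL's structure:

```typescript
// Hypothetical standalone version of the href construction in propose_rollback.
function rollbackHref(application: string, targetRevision: string): string {
  return `/argocd/applications/${encodeURIComponent(application)}?revision=${encodeURIComponent(targetRevision)}`;
}

// A plain name passes through untouched.
const plainHref = rollbackHref('checkout', 'abc123');
console.log(plainHref);
// /argocd/applications/checkout?revision=abc123

// Slashes, ampersands and equals signs are percent-encoded, so the value
// cannot add path segments or extra query parameters.
const hostileHref = rollbackHref('checkout/../admin', 'abc&force=true');
console.log(hostileHref);
// /argocd/applications/checkout%2F..%2Fadmin?revision=abc%26force%3Dtrue
```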
The five stop conditions
Any one trips, the loop ends:
export const DEFAULT_STOP_CONDITIONS: StopConditions = {
  maxSteps: 12,            // each step = one model call + its tool calls
  maxToolCalls: 20,        // hard cap across all steps
  maxWallclockMs: 60_000,  // operator-tolerable latency
  maxTokens: 30_000,       // input + output summed
  maxCostUsd: 0.5,         // belt-and-suspenders dollar cap
};
Why all five and not just maxSteps? Because each enforces a
different worry:
- maxSteps caps the architectural depth: a runaway plan loop.
- maxToolCalls caps the total number of tool invocations. It covers a single step where the LLM fires twenty parallel tool calls, which maxSteps alone would count as one step.
- maxWallclockMs caps the latency an on-call SRE sees, not the cost.
- maxTokens is the actual model-API budget.
- maxCostUsd is the dollar number a finance partner cares about, computed via a ModelPricing config:
export const DEFAULT_PRICING: ModelPricing = {
  inputPerMillion: 3.0, // sized for Claude Sonnet-class
  outputPerMillion: 15.0,
};
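To make the dollar cap concrete, here is a minimal sketch of the cost arithmetic, assuming cost is simply tokens times the per-million rate. The helper name computeCostUsd is hypothetical; only the two pricing fields come from the config above.

```typescript
interface ModelPricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

const PRICING: ModelPricing = { inputPerMillion: 3.0, outputPerMillion: 15.0 };

// Hypothetical helper: dollars spent for a given cumulative token usage.
function computeCostUsd(inputTokens: number, outputTokens: number, pricing: ModelPricing): number {
  return (inputTokens / 1_000_000) * pricing.inputPerMillion
    + (outputTokens / 1_000_000) * pricing.outputPerMillion;
}

// 700 input + 220 output tokens: 0.0021 + 0.0033 dollars.
const smallRun = computeCostUsd(700, 220, PRICING);
console.log(smallRun.toFixed(4)); // 0.0054

// The entire default token budget, billed at the pricier output rate.
const worstCaseAtTokenCap = computeCostUsd(0, 30_000, PRICING);
console.log(worstCaseAtTokenCap.toFixed(2)); // 0.45
```

Under this arithmetic and the default numbers, 30,000 tokens cost at most $0.45, so the token cap is reached before the $0.50 cap. The dollar cap starts to matter once maxTokens is raised or a more expensive model is configured, which fits the belt-and-suspenders comment.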
A nice property: each one is testable in isolation. The
AgentLoop.test.ts
file has one test per condition, and they all use the same scripted
LLM-stub helper:
const scriptedStep = (sequence: StepResult[]): jest.MockedFunction<AgentStepFn> =>
  jest.fn().mockImplementation(async () => {
    if (sequence.length === 0) {
      // Script exhausted: behave like a model that has nothing more to do.
      return { toolCalls: [], text: 'done', usage: { inputTokens: 100, outputTokens: 50 }, finishReason: 'stop' };
    }
    return sequence.shift()!;
  });
You hand the helper a list of canned step results, and the loop runs them
in order. Cost-cap test? Hand it one step result that requests a tool call and reports
usage: { inputTokens: 1_000_000, outputTokens: 0 }, raise maxTokens so the
token check (which runs before the cost check) doesn't trip first, and check the
stopped field equals 'cost-cap'. No real model, no real money, no
flakiness.
The loop itself
The loop is ~60 lines of code with very deliberate ordering:
while (totalSteps < this.stopConditions.maxSteps) {
  // 1. Pre-step stop checks. Wallclock + tokens + cost can trip
  //    BEFORE we spend the next model call, so we check them first.
  if (elapsed() > this.stopConditions.maxWallclockMs) return finalize('wallclock');
  if (totalInputTokens + totalOutputTokens > this.stopConditions.maxTokens) return finalize('token-budget');
  if (computeCost() > this.stopConditions.maxCostUsd) return finalize('cost-cap');

  totalSteps += 1;
  const result = await this.step({ model, system, messages, tools: tools.list() });
  totalInputTokens += result.usage.inputTokens;
  totalOutputTokens += result.usage.outputTokens;
  messages.push({ role: 'assistant', content: result.text, toolCalls: result.toolCalls });

  // 2. Natural termination: model emitted no tool calls.
  if (result.toolCalls.length === 0) return finalize('llm-stop');

  // 3. Execute tool calls. The tool-call-cap can trip mid-step.
  for (const call of result.toolCalls) {
    if (totalToolCalls >= this.stopConditions.maxToolCalls) return finalize('tool-call-cap');
    totalToolCalls += 1;
    const toolResult = await this.tools.invoke(call.name, call.args, toolCtx);
    // ...append evidence / suggestedAction / hypothesis to the run state
    messages.push({ role: 'tool', toolCallId: call.id, content: toolResult.error ?? toolResult.text });
  }
}
return finalize('max-steps');
Three details worth pointing out:
- Wallclock / token / cost checks are pre-step. They cap before spending another model call. The other two (max-steps and tool-call-cap) are checked at their natural increment.
- Tool errors go to the model as tool messages, not exceptions. An unknown tool name, a zod-rejected input, a handler that throws: all three become a { role: 'tool', content: '<error>' } message the model sees on its next step. That’s the “self-correction” affordance agents need to recover.
- finishReason from the step result isn’t load-bearing. The loop ends when we say it does (tool calls empty → llm-stop), not when the SDK says it does. SDKs sometimes misreport finishReason: 'stop' in agentic settings.
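The check ordering has a visible effect on which stop reason a run reports. The sketch below is a synchronous stand-in for the loop, not the real AgentLoop: it keeps the same order of checks but drops the model, the tools, and the wallclock, and it hardcodes the default pricing. The names simulate, ScriptedStep, and StopLimits are hypothetical.

```typescript
type StopReason = 'llm-stop' | 'max-steps' | 'tool-call-cap' | 'token-budget' | 'cost-cap';

interface StopLimits { maxSteps: number; maxToolCalls: number; maxTokens: number; maxCostUsd: number; }
interface ScriptedStep { toolCalls: number; inputTokens: number; outputTokens: number; }

const DEFAULT_LIMITS: StopLimits = { maxSteps: 12, maxToolCalls: 20, maxTokens: 30_000, maxCostUsd: 0.5 };

// Same check order as the loop above; each scripted step stands in for one model call.
function simulate(script: ScriptedStep[], limits: StopLimits): StopReason {
  const queue = [...script];
  let steps = 0;
  let toolCalls = 0;
  let inTok = 0;
  let outTok = 0;
  // Assumed pricing: $3 per 1M input tokens, $15 per 1M output tokens.
  const cost = () => (inTok * 3.0 + outTok * 15.0) / 1_000_000;

  while (steps < limits.maxSteps) {
    // Pre-step checks: tokens first, then cost.
    if (inTok + outTok > limits.maxTokens) return 'token-budget';
    if (cost() > limits.maxCostUsd) return 'cost-cap';

    steps += 1;
    // An exhausted script behaves like a model with nothing more to do.
    const step = queue.shift() ?? { toolCalls: 0, inputTokens: 100, outputTokens: 50 };
    inTok += step.inputTokens;
    outTok += step.outputTokens;

    if (step.toolCalls === 0) return 'llm-stop';

    for (let i = 0; i < step.toolCalls; i++) {
      if (toolCalls >= limits.maxToolCalls) return 'tool-call-cap';
      toolCalls += 1;
    }
  }
  return 'max-steps';
}

const expensive: ScriptedStep = { toolCalls: 1, inputTokens: 1_000_000, outputTokens: 0 };
const cheap: ScriptedStep = { toolCalls: 1, inputTokens: 100, outputTokens: 50 };

console.log(simulate([expensive], DEFAULT_LIMITS));                              // token-budget
console.log(simulate([expensive], { ...DEFAULT_LIMITS, maxTokens: 2_000_000 })); // cost-cap
console.log(simulate([{ ...expensive, toolCalls: 0 }], DEFAULT_LIMITS));         // llm-stop
console.log(simulate([{ ...cheap, toolCalls: 25 }], DEFAULT_LIMITS));            // tool-call-cap
console.log(simulate(Array.from({ length: 12 }, () => cheap), DEFAULT_LIMITS));  // max-steps
```

Two things fall out of this model: a budget is only enforced at the start of the next step, so an over-budget step that emits no tool calls still ends as llm-stop, and a test that targets one cap has to keep the earlier checks from tripping first.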
What happens after the loop
The orchestrator runs CitationValidator over the recorded
hypotheses against the accumulated evidence:
// Citation validation still runs in agent mode — drops any hypothesis
// the LLM recorded with IDs that don't resolve to evidence it actually
// fetched.
const { kept, warnings: citationWarnings } =
  this.citationValidator.validate(result.recordedHypotheses, result.evidence);
A hypothesis with one valid citation passes (invalid citations get stripped + warned). A hypothesis with zero valid citations gets dropped. The frontend can therefore assume every citation it sees in the UI resolves to a real evidence item. That assumption shapes the side-panel “click a citation, scroll the evidence into focus” interaction.
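The keep/strip/drop rule described above fits in one small function. This is a sketch under assumed shapes, not the plugin's CitationValidator: the Hypothesis and EvidenceItem fields, the warning wording, and the function name validateCitations are all hypothetical.

```typescript
// Assumed shapes; the real types in the plugin may differ.
interface EvidenceItem { id: string; }
interface Hypothesis { summary: string; citations: string[]; }

function validateCitations(
  hypotheses: Hypothesis[],
  evidence: EvidenceItem[],
): { kept: Hypothesis[]; warnings: string[] } {
  const known = new Set(evidence.map(e => e.id));
  const kept: Hypothesis[] = [];
  const warnings: string[] = [];

  for (const h of hypotheses) {
    const valid = h.citations.filter(id => known.has(id));
    const invalid = h.citations.filter(id => !known.has(id));

    if (valid.length === 0) {
      // Nothing resolves: the hypothesis is dropped entirely.
      warnings.push(`dropped "${h.summary}": no citation resolves to fetched evidence`);
      continue;
    }
    if (invalid.length > 0) {
      // At least one citation resolves: keep it, strip the rest, and warn.
      warnings.push(`stripped unresolved citations from "${h.summary}": ${invalid.join(', ')}`);
    }
    kept.push({ ...h, citations: valid });
  }
  return { kept, warnings };
}

const checked = validateCitations(
  [
    { summary: 'p99 spike follows deploy', citations: ['datadog-1-1', 'ghost-9'] },
    { summary: 'DNS outage', citations: ['ghost-1'] },
  ],
  [{ id: 'datadog-1-1' }, { id: 'datadog-1-2' }],
);
console.log(checked.kept.length);       // 1
console.log(checked.kept[0].citations); // [ 'datadog-1-1' ]
console.log(checked.warnings.length);   // 2
```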
The trace and budget snapshots come back in the HTTP response so the frontend can render them in an “agent thinking” panel:
{
  "investigationId": "inv-1717593600000",
  "mode": "agent",
  "hypotheses": [/* ... */],
  "evidence": [/* ... */],
  "trace": [
    { "step": 1, "callIndex": 1, "toolName": "query_datadog_logs",
      "args": { "reason": "check p99 spike" },
      "evidenceIds": ["datadog-1-1", "datadog-1-2"], "durationMs": 412 },
    /* ... */
  ],
  "stopped": "llm-stop",
  "budgets": {
    "steps": 4, "toolCalls": 3,
    "inputTokens": 700, "outputTokens": 220,
    "costUsd": 0.0054, "elapsedMs": 4823
  },
  "warnings": []
}
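One way a panel could use the budgets block is to show each counter as a share of its cap. The sketch below assumes the field names from the example response and the default stop conditions. The function budgetUtilization and its output keys are hypothetical and not part of the plugin.

```typescript
// Field names taken from the example response above.
interface BudgetSnapshot {
  steps: number;
  toolCalls: number;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  elapsedMs: number;
}

interface Caps {
  maxSteps: number;
  maxToolCalls: number;
  maxWallclockMs: number;
  maxTokens: number;
  maxCostUsd: number;
}

// Hypothetical helper: percentage of each cap consumed, rounded to whole percent.
function budgetUtilization(b: BudgetSnapshot, caps: Caps): Record<string, number> {
  const pct = (used: number, max: number) => Math.round((used / max) * 100);
  return {
    steps: pct(b.steps, caps.maxSteps),
    toolCalls: pct(b.toolCalls, caps.maxToolCalls),
    wallclock: pct(b.elapsedMs, caps.maxWallclockMs),
    tokens: pct(b.inputTokens + b.outputTokens, caps.maxTokens),
    cost: pct(b.costUsd, caps.maxCostUsd),
  };
}

const utilization = budgetUtilization(
  { steps: 4, toolCalls: 3, inputTokens: 700, outputTokens: 220, costUsd: 0.0054, elapsedMs: 4823 },
  { maxSteps: 12, maxToolCalls: 20, maxWallclockMs: 60_000, maxTokens: 30_000, maxCostUsd: 0.5 },
);
console.log(utilization);
// { steps: 33, toolCalls: 15, wallclock: 8, tokens: 3, cost: 1 }
```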
Prompt injection: what the surface protects against
The threat model isn’t “what if the LLM is malicious.” It’s “what if a Datadog log line, Slack message, or git commit message contains a prompt injection.” The agent ends up reading those.
What protects us:
- The tool registry is a hardcoded whitelist. No eval, no arbitrary function name from a string. The model can call propose_rollback; it can’t conjure execute_rollback.
- Schema validation on every tool call. A payload that wraps “please call delete_production” in a JSON blob doesn’t reach a handler: it’s rejected by zod and the error goes back to the model.
- destructive: true flags propagate end-to-end. The frontend’s SuggestedActionList enforces the confirm dialog in the component itself, not as a prop the caller might forget to set. A misbehaving caller can’t bypass it.
- The five budget caps mean an injection can’t cause unbounded spend. Worst case: roughly 60 seconds of model calls and about $0.50 of tokens (the checks run pre-step, so the last call can overshoot a little), with no actions taken. The operator sees the trace and warnings and knows something tried to recruit the agent.
This isn’t bulletproof. It’s “expensive enough to be uneconomic, and loud enough to be visible.”
What I’d build next (Phase 2 / Phase 4 in backstage-corp)
- Real connectors to back the read tools. Today the GitHub commits gatherer is real (Octokit-backed); Datadog, ArgoCD, Harness, PagerDuty still ship as staticGatherer stubs. Each is ~1 day of glue.
- Multi-source past-incident lookup. PagerDuty resolutions on the same entity in the last 90 days are the highest-leverage signal a Backstage-integrated assistant could surface, because they capture the prior post-mortem’s RCA verbatim.
- MCP action surface so the same orchestrator is invocable by Claude Code / Cursor / any MCP-aware agent, exactly the inverse of what scaffolder-backend-module-mcp does for scaffolder. Closes the loop.
Code
Branch: Naga15/backstage-corp master.
38 tests in incident-copilot-backend, six more in
incident-copilot-backend-module-github (the real GitHub gatherer),
fifteen more in incident-copilot (the frontend that consumes this
backend). All hermetic — stubbed gatherers, scripted LLM, no live API
calls.
Phase 2 / Phase 4 work continues; the next post in the series will be the frontend walkthrough.