Catalog-aware LLM Q&A for Backstage (grounded, with citations)
Building catalog-assistant-backend — natural-language questions over your Software Catalog, deliberately wired for the AI Provider Service when it lands.
Tags: backstage, ai, llm, claude, software-catalog, typescript
“Who owns the payments service?” “What depends on auth-db?” “Which components are tagged tier-1 and use Postgres?”
These are the questions developers ask in Slack channels every day. The answers are already in the Backstage catalog, but to find them you need to know the entity ref, navigate to the right page, and eyeball the right tab. The catalog has the data; getting it out conversationally is on you.
LLMs are obvious for this. The risk is equally obvious: a hallucinated ownership claim, told confidently, is worse than no answer at all. In an internal developer portal — where “who owns this” determines who you wake up at 3am — that’s not a hypothetical.
This post is about building a small backend plugin that does grounded catalog Q&A: retrieval over real entities, a prompt that constrains the model to use only what was retrieved, and entity-ref citations in the response. No conversation memory, no embeddings, no streaming yet. Just the load-bearing core.
The shape
One HTTP endpoint, two services:
Request:
{ "question": "who owns the payments service?" }
Response:
{
  "answer": "The payments service is owned by group:default/platform.",
  "citations": [
    "component:default/payments-api",
    "api:default/payments"
  ]
}
The citations are the entity refs the LLM was given as context for that answer. The caller can verify them, link to the catalog page, or post-process to filter.
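As a rough illustration of the "link to the catalog page" option, the sketch below splits a citation into its parts and builds a relative catalog URL. The helper names `parseCitation` and `catalogPath` are made up for this example, and the `/catalog/:namespace/:kind/:name` path assumes the catalog UI is mounted at its default route; adjust if your instance differs.

```typescript
// Hypothetical helpers; real plugin code would likely use parseEntityRef
// from @backstage/catalog-model instead of a hand-rolled regex.
interface CitationParts {
  kind: string;
  namespace: string;
  name: string;
}

// Parses a full entity ref such as "component:default/payments-api".
// Returns undefined for anything that is not in kind:namespace/name form.
function parseCitation(ref: string): CitationParts | undefined {
  const match = /^([^:/]+):([^:/]+)\/([^:/]+)$/.exec(ref);
  if (!match) return undefined;
  const [, kind, namespace, name] = match;
  return { kind: kind.toLowerCase(), namespace, name };
}

// Assumes the catalog UI is mounted at /catalog (the default route).
function catalogPath(ref: string): string | undefined {
  const parts = parseCitation(ref);
  if (!parts) return undefined;
  const { kind, namespace, name } = parts;
  return `/catalog/${namespace}/${kind}/${name}`;
}

const links = ['component:default/payments-api', 'api:default/payments']
  .map(catalogPath);
```

Returning `undefined` for malformed refs lets a caller drop anything that does not look like a full entity ref before rendering links.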
Why keyword retrieval (for now)
The seductive choice is to throw embeddings at this — vectorise every entity, do cosine similarity against the question embedding, retrieve top-K. That’s the right answer eventually. It’s the wrong answer for v1 because:
- It needs an embedding model (extra dep), an embedding store (Postgres with pgvector or a sidecar), and a backfill job to vectorise existing entities. Three new operational surfaces before the first response.
- For most catalog questions, keyword overlap is already good enough. “Who owns payments?” lands on payments-api, payments-db, and payments-worker immediately. The LLM does the rest.
So the first cut is deterministic and dependency-free:
function scoreEntity(entity: Entity, tokens: string[]): number {
  const corpus = [
    { text: entity.metadata.name, weight: 4 },
    { text: entity.metadata.title ?? '', weight: 3 },
    { text: entity.metadata.description ?? '', weight: 2 },
    { text: entity.kind, weight: 1 },
    { text: String(entity.spec?.type ?? ''), weight: 1 },
    { text: (entity.metadata.tags ?? []).join(' '), weight: 2 },
  ];
  let total = 0;
  for (const token of tokens) {
    for (const { text, weight } of corpus) {
      if (!text) continue;
      const haystack = text.toLowerCase();
      if (haystack === token) {
        total += weight * 3; // exact match
      } else if (haystack.split(/[\s-]+/).includes(token)) {
        total += weight * 2; // whole-word match inside a string
      } else if (haystack.includes(token)) {
        total += weight; // substring match
      }
    }
  }
  return total;
}
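`scoreEntity` lower-cases each haystack before comparing, so it expects lower-cased tokens. How the question becomes tokens is not shown above. One plausible approach, with a purely illustrative stopword list and a hypothetical `tokenize` helper, might look like this:

```typescript
// Illustrative stopword list; an assumption, not the plugin's actual list.
const STOPWORDS = new Set([
  'a', 'an', 'the', 'is', 'are', 'who', 'what', 'which', 'and', 'or',
  'of', 'on', 'in', 'by', 'to', 'use', 'uses', 'does', 'do',
]);

// Lower-cases the question, splits on anything that is not a letter,
// digit, or hyphen, then drops stopwords, one-character tokens, and
// duplicates. Hyphens are kept so names like "auth-db" stay intact.
function tokenize(question: string): string[] {
  const raw = question
    .toLowerCase()
    .split(/[^a-z0-9-]+/)
    .filter(token => token.length > 1 && !STOPWORDS.has(token));
  return Array.from(new Set(raw));
}

const ownerTokens = tokenize('Who owns the payments service?');
const dependencyTokens = tokenize('What depends on auth-db?');
```

Keeping the hyphen means "auth-db" can hit the exact-match branch on an entity name, which carries the highest weight in the scorer.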
The retriever is a class with a tiny interface:
export class CatalogContextRetriever {
  constructor(
    private readonly catalog: Pick<CatalogApi, 'getEntities'>,
    private readonly limit: number,
  ) {}
  async retrieve(question, options): Promise<ScoredEntity[]> { ... }
}
That interface is the seam. When embeddings land, a SemanticCatalogContextRetriever
implements the same contract and the call site doesn’t change.
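To make the seam concrete, here is a minimal sketch of what a retriever behind that contract could look like. It is not the plugin's implementation: the `ContextRetriever` interface, the `EntityLike` and `CatalogLike` stand-in types, and the name-only scorer are all assumptions chosen so the example runs without Backstage packages, and the `options` parameter is left out.

```typescript
// Minimal local stand-ins so the sketch runs without Backstage packages.
interface EntityLike {
  kind: string;
  metadata: { name: string; namespace?: string };
}
interface ScoredEntity {
  entity: EntityLike;
  score: number;
}
interface CatalogLike {
  getEntities(): Promise<{ items: EntityLike[] }>;
}

// The seam: any retriever, keyword or semantic, satisfies this contract.
interface ContextRetriever {
  retrieve(question: string): Promise<ScoredEntity[]>;
}

class KeywordRetriever implements ContextRetriever {
  constructor(
    private readonly catalog: CatalogLike,
    private readonly limit: number,
  ) {}

  async retrieve(question: string): Promise<ScoredEntity[]> {
    const tokens = question.toLowerCase().split(/[^a-z0-9-]+/).filter(Boolean);
    const { items } = await this.catalog.getEntities();
    return items
      // Stand-in scorer: one point per token found in the entity name.
      .map(entity => ({
        entity,
        score: tokens.filter(t => entity.metadata.name.toLowerCase().includes(t)).length,
      }))
      .filter(scored => scored.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, this.limit);
  }
}

// A hand-written stub replaces the catalog backend entirely.
const stubCatalog: CatalogLike = {
  getEntities: async () => ({
    items: [
      { kind: 'Component', metadata: { name: 'payments-api' } },
      { kind: 'Component', metadata: { name: 'auth-db' } },
    ],
  }),
};

const retriever: ContextRetriever = new KeywordRetriever(stubCatalog, 5);
```

Because the call site only sees `ContextRetriever`, a semantic implementation could be swapped in by changing the constructor call alone. This sketch fetches every entity on each query; a real retriever would probably filter by kind or cache the result.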
The prompt
Two pieces: a tight system prompt that constrains the model, and a user prompt that dumps the retrieved entities.
private static readonly SYSTEM_PROMPT = `You answer questions about a Backstage software catalog.
Use only the catalog entities provided in the user message as your source of
truth. If the entities do not contain the answer, say so plainly — do not
fabricate ownership, dependencies, or relationships.
When you cite an entity, refer to it by its entity reference
(e.g. "component:default/payments-api"). Be concise.`;
The user prompt summarises each retrieved entity into a compact text block:
function summarize(entity: Entity): string {
  const spec = (entity.spec ?? {}) as Record<string, unknown>;
  const lines = [
    `kind: ${entity.kind}`,
    `name: ${entity.metadata.name}`,
  ];
  if (spec.owner) lines.push(`owner: ${spec.owner}`);
  if (spec.lifecycle) lines.push(`lifecycle: ${spec.lifecycle}`);
  if (spec.system) lines.push(`system: ${spec.system}`);
  if (Array.isArray(spec.dependsOn) && spec.dependsOn.length > 0) {
    lines.push(`dependsOn: ${spec.dependsOn.join(', ')}`);
  }
  // ... providesApis, consumesApis, tags, etc.
  return lines.join('\n');
}
I deliberately do not dump raw JSON. The line-prefixed key: value format
is cheaper in tokens, easier for the model to scan, and easier to extend.
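How the summaries and the question are stitched into one user prompt is not shown here, so the following is only a guess at the shape. The `PromptEntity` type, the `--- ref ---` separators, and the fallback text for an empty retrieval are all assumptions. The trailing `Question:` line matches what the unit test later in this post asserts on.

```typescript
// Local stand-in for the retriever's output; field names are assumptions.
interface PromptEntity {
  ref: string;     // e.g. "component:default/payments-api"
  summary: string; // the key: value block produced by summarize()
}

// Joins the entity blocks under their refs and appends the question
// on its own line so the model sees context first, question last.
function buildPrompt(question: string, entities: PromptEntity[]): string {
  const context =
    entities.length === 0
      ? '(no matching catalog entities were found)'
      : entities
          .map(e => `--- ${e.ref} ---\n${e.summary}`)
          .join('\n\n');
  return `Catalog entities:\n${context}\n\nQuestion: ${question}`;
}

const prompt = buildPrompt('who owns payments?', [
  {
    ref: 'component:default/payments-api',
    summary: 'kind: Component\nname: payments-api\nowner: group:platform',
  },
]);
```

Heading each block with its entity ref gives the model the exact string to cite, and spelling out the empty case gives it something concrete to fall back on when retrieval finds nothing.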
Why the LLM call looks like it does
This is the bit that matters for the future:
import { createAnthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';
const anthropic = createAnthropic({ apiKey });
const model = anthropic('claude-sonnet-4-6');
const { text } = await generateText({
  model,
  system: SYSTEM_PROMPT,
  prompt: buildPrompt(question, scored),
  maxOutputTokens: 1024,
});
That’s the Vercel AI SDK shape. Three reasons:
- It’s a stable contract. The LanguageModelV3 interface (defined in @ai-sdk/provider) is the same regardless of which provider you swap in. OpenAI, Anthropic, Bedrock, anything that ships an @ai-sdk/* package: same generateText({ model, system, prompt }) call.
- Backstage is heading there. BEP-0015 (AI Model Provider Service) is the proposed Backstage-wide abstraction, and it’s built on this same SDK. When the service lands, swapping generateText for provider.getLanguageModelFactory()(modelId) is mechanical. No prompt-construction code changes.
- It’s testable without API keys. The generateText function is passed in as a dependency (GenerateTextFn), so tests stub it:
const generateText = jest.fn().mockResolvedValue({
  text: 'group:platform owns payments-api',
});
const svc = new QueryService(retriever, 'mock-model', generateText, logger, 256);
No network, no flaky tests, no ANTHROPIC_API_KEY in CI.
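The injection pattern from the third point can be sketched in isolation. This is a simplified stand-in, not the real QueryService: the `QueryServiceSketch` class, its two-argument constructor, and the narrow `GenerateTextFn` type below are assumptions that mirror only the fields the sketch uses.

```typescript
// Narrow local type for the injected function: an assumption that mirrors
// only the fields this sketch uses, not the full AI SDK signature.
type GenerateTextFn = (args: {
  system: string;
  prompt: string;
  maxOutputTokens: number;
}) => Promise<{ text: string }>;

class QueryServiceSketch {
  constructor(
    private readonly generateText: GenerateTextFn,
    private readonly maxOutputTokens: number,
  ) {}

  // Citations are the refs supplied as context, not refs parsed from the answer.
  async query(question: string, contextRefs: string[]) {
    const { text } = await this.generateText({
      system: 'Use only the catalog entities provided.',
      prompt: `${contextRefs.join('\n')}\n\nQuestion: ${question}`,
      maxOutputTokens: this.maxOutputTokens,
    });
    return { answer: text, citations: contextRefs };
  }
}

// A hand-written fake that records its calls, standing in for jest.fn().
const calls: Parameters<GenerateTextFn>[0][] = [];
const fakeGenerateText: GenerateTextFn = async args => {
  calls.push(args);
  return { text: 'group:platform owns payments-api' };
};

const svc = new QueryServiceSketch(fakeGenerateText, 256);
```

In production wiring, the injected function could be a thin closure over the SDK call, for example `args => generateText({ model, ...args })`, so the service itself never imports a provider package.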
The router and the Express 4 async gotcha
Standard Backstage router shape — coreServices.httpAuth for credentials,
coreServices.httpRouter for mounting — except for one thing:
router.post(
  '/v1/query',
  asyncHandler(async (req, res) => {
    const body = req.body as { question?: unknown } | undefined;
    if (!body || typeof body.question !== 'string') {
      throw new InputError('Request body must include a string `question`');
    }
    // ... call queryService, return JSON
  }),
);
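// Forwards a rejected promise from an async handler to next(), so the
// error middleware sees it. Types come from 'express'.
function asyncHandler(
  fn: (req: Request, res: Response, next: NextFunction) => Promise<unknown>,
): RequestHandler {
  return (req, res, next) => {
    Promise.resolve(fn(req, res, next)).catch(next);
  };
}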
Express 4 does not propagate rejected promises from async handlers to
error middleware, so a throw inside an async handler never reaches next()
and the request hangs. The fix is a small wrapper. Tests caught this within
30 seconds; the real Backstage httpRouter service wraps everything for
you in production. Still good to keep the wrapper around because the unit
test app uses a bare express instance.
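The behaviour is easy to check without Express at all. The demo below re-declares the wrapper with minimal local types (`Next` and `Handler` are stand-ins, not Express types) and uses a hypothetical `captureNextError` helper to observe what reaches `next`.

```typescript
// Minimal local types so the demo runs without Express installed.
type Next = (err?: unknown) => void;
type Handler = (req: unknown, res: unknown, next: Next) => unknown;

// Same wrapper as above, typed against the local stand-ins.
function asyncHandler(fn: Handler): Handler {
  return (req, res, next) => {
    Promise.resolve(fn(req, res, next)).catch(next);
  };
}

// Calls a handler with empty req/res objects and resolves with
// whatever value the handler passes to next().
function captureNextError(handler: Handler): Promise<unknown> {
  return new Promise(resolve => {
    handler({}, {}, err => resolve(err));
  });
}

const failing = asyncHandler(async () => {
  throw new Error('Request body must include a string `question`');
});

captureNextError(failing).then(err => {
  console.log(err instanceof Error ? err.message : 'no error forwarded');
});
```

Passing the bare async function to `captureNextError` instead would leave the returned promise pending forever, since nothing calls `next`. That is the harness equivalent of a request that hangs.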
What I’m deliberately not doing yet
- No frontend. That’s a separate plugin (@backstage/plugin-catalog-assistant) for a chat UI. The backend is the load-bearing piece; the UI follows.
- No conversation memory. Each request is one-shot. Multi-turn requires a session store, which is more product than core. Worth doing only when the UI exists.
- No tool use. The model can’t fetch additional entities mid-answer. Once BEP-0015 ships and exposes tool-use over the REST surface, the assistant graduates from “retrieve once, answer” to “follow relations as needed.” That’s the moment compound questions like “services tagged X that depend on Y” start working well.
- No streaming. v1 returns the full response in one body. SSE / WebSocket streaming lands with the UI.
Tests as the spec
13 unit tests across three files. Two patterns worth lifting:
- The retriever is tested without the catalog backend. A jest.fn() stubs getEntities and returns a list of fixture entities. The retriever’s scoring logic is exercised in isolation. No catalog server needed.
- The query service is tested without the LLM. generateText is injected as a function dependency, mocked per test. Assertions look at the prompt that would have been sent to the LLM. That’s where the grounded-prompt contract lives, and it’s where regressions would show up first.
it('passes a grounded prompt to generateText and returns the answer', async () => {
  const generateText = jest.fn()
    .mockResolvedValue({ text: 'group:platform owns payments-api' });
  const svc = new QueryService(/* ... */);
  const result = await svc.query('who owns payments?');
  expect(result.citations).toEqual(['component:default/payments-api']);
  const call = generateText.mock.calls[0][0];
  expect(call.system).toMatch(/Backstage software catalog/);
  expect(call.prompt).toContain('component:default/payments-api');
  expect(call.prompt).toContain('owner: group:platform');
  expect(call.prompt).toContain('Question: who owns payments?');
});
Code
Branch lives at
Naga15/backstage feat/catalog-assistant-backend.
13 tests, lint clean. Not PR’d upstream yet — the right time is after
BEP-0015 lands, so the plugin can ship referencing the official provider
service instead of @ai-sdk/anthropic directly.
Where this fits
Two plugins shipped in this series so far:
- scaffolder-backend-module-mcp: Backstage as MCP client for scaffolder templates.
- catalog-assistant-backend: grounded LLM Q&A over the Software Catalog.
These are the producer-side and consumer-side of the same idea: AI is a new modality for interacting with Backstage data. Different audiences, shared infrastructure.
Next post (the last in this series): the four-part AI plugin roadmap, where each piece slots into Backstage’s RFC track, and what the next two plugins will do.