Catalog-aware LLM Q&A for Backstage (grounded, with citations)

Building catalog-assistant-backend — natural-language questions over your Software Catalog, deliberately wired for the AI Provider Service when it lands.

Jun 3, 2026 · Backstage AI plugins, part 3

backstage · ai · llm · claude · software-catalog · typescript

“Who owns the payments service?” “What depends on auth-db?” “Which components are tagged tier-1 and use Postgres?”

These are the questions developers ask in Slack channels every day. The answers are already in the Backstage catalog, but you need to know the entity ref, navigate to the right page, and eyeball the right tab. The catalog has the data; getting it out conversationally is on you.

LLMs are obvious for this. The risk is equally obvious: a hallucinated ownership claim, told confidently, is worse than no answer at all. In an internal developer portal — where “who owns this” determines who you wake up at 3am — that’s not a hypothetical.

This post is about building a small backend plugin that does grounded catalog Q&A: retrieval over real entities, a prompt that constrains the model to use only what was retrieved, and entity-ref citations in the response. No conversation memory, no embeddings, no streaming yet. Just the load-bearing core.

The shape

One HTTP endpoint, two services:

Architecture diagram

Request:

{ "question": "who owns the payments service?" }

Response:

{
  "answer": "The payments service is owned by group:default/platform.",
  "citations": [
    "component:default/payments-api",
    "api:default/payments"
  ]
}

The citations are the entity refs the LLM was given as context for that answer. The caller can verify them, link to the catalog page, or post-process to filter.
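As one illustration of that post-processing, here is a minimal sketch that turns citations into catalog page paths and filters them by kind. The helper names (parseRef, catalogPath, filterByKind) are hypothetical, and the path assumes the catalog plugin's default /catalog/:namespace/:kind/:name route. In a real plugin, parseEntityRef from @backstage/catalog-model would be the more natural choice for the parsing step.

```typescript
interface RefParts {
  kind: string;
  namespace: string;
  name: string;
}

// Parse a full entity ref of the form "kind:namespace/name".
// Returns undefined for anything that is not a complete ref.
function parseRef(ref: string): RefParts | undefined {
  const match = /^([^:/]+):([^:/]+)\/([^:/]+)$/.exec(ref);
  if (!match) return undefined;
  return {
    kind: match[1].toLowerCase(),
    namespace: match[2].toLowerCase(),
    name: match[3],
  };
}

// Build a relative catalog URL (assumes the default catalog route layout).
function catalogPath(ref: string): string | undefined {
  const parts = parseRef(ref);
  if (!parts) return undefined;
  return `/catalog/${parts.namespace}/${parts.kind}/${parts.name}`;
}

// Keep only citations of the given kinds, e.g. components but not APIs.
function filterByKind(citations: string[], kinds: string[]): string[] {
  return citations.filter(ref => {
    const parts = parseRef(ref);
    return parts !== undefined && kinds.includes(parts.kind);
  });
}
```

A caller could map each citation through catalogPath to render links next to the answer, dropping any ref that fails to parse.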

Why keyword retrieval (for now)

The seductive choice is to throw embeddings at this — vectorise every entity, do cosine similarity against the question embedding, retrieve top-K. That’s the right answer eventually. It’s the wrong answer for v1 because:

It needs an embedding model (extra dep), an embedding store (Postgres with pgvector or a sidecar), and a backfill job to vectorise existing entities. Three new operational surfaces before the first response.
For most catalog questions, keyword overlap is already good enough. “Who owns payments?” lands on payments-api, payments-db, payments-worker immediately. The LLM does the rest.

So the first cut is deterministic and dependency-free:

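import { Entity } from '@backstage/catalog-model';

// Tokens are expected to be lowercased; each token is checked against every field.
function scoreEntity(entity: Entity, tokens: string[]): number {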
  const corpus = [
    { text: entity.metadata.name,                      weight: 4 },
    { text: entity.metadata.title ?? '',               weight: 3 },
    { text: entity.metadata.description ?? '',         weight: 2 },
    { text: entity.kind,                               weight: 1 },
    { text: String(entity.spec?.type ?? ''),           weight: 1 },
    { text: (entity.metadata.tags ?? []).join(' '),    weight: 2 },
  ];

  let total = 0;
  for (const token of tokens) {
    for (const { text, weight } of corpus) {
      if (!text) continue;
      const haystack = text.toLowerCase();
      if (haystack === token) {
        total += weight * 3;          // exact match
      } else if (haystack.split(/[\s-]+/).includes(token)) {
        total += weight * 2;          // whole-word match inside a string
      } else if (haystack.includes(token)) {
        total += weight;              // substring match
      }
    }
  }
  return total;
}
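scoreEntity compares each token against lowercased text, so the tokens need to arrive lowercased and free of punctuation. The post does not show that step. The following is a minimal sketch of one way to do it; the tokenize name and the stop-word list are assumptions for illustration, not the plugin's actual code.

```typescript
// Hypothetical stop-word list; tune it to the questions your users ask.
const STOP_WORDS = new Set<string>([
  'who', 'what', 'which', 'the', 'a', 'an', 'is', 'are',
  'of', 'and', 'on', 'to', 'do', 'does',
]);

// Lowercase the question, extract alphanumeric runs (keeping inner hyphens
// so names like "auth-db" survive as one token), then drop duplicates and
// stop words.
function tokenize(question: string): string[] {
  const raw: string[] = question.toLowerCase().match(/[a-z0-9][a-z0-9-]*/g) ?? [];
  return Array.from(new Set(raw)).filter(token => !STOP_WORDS.has(token));
}
```

With these tokens, "payments" scored against an entity named payments-api misses the exact-match branch but hits the whole-word branch, contributing 4 × 2 = 8 from the name field alone.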

The retriever is a class with a tiny interface:

export class CatalogContextRetriever {
  constructor(
    private readonly catalog: Pick<CatalogApi, 'getEntities'>,
    private readonly limit: number,
  ) {}

  async retrieve(question, options): Promise<ScoredEntity[]> { ... }
}

That interface is the seam. When embeddings land, a SemanticCatalogContextRetriever implements the same contract and the call site doesn’t change.
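To make the seam concrete, here is a minimal sketch of a keyword retriever behind a one-method interface, exercised against a stubbed catalog. EntityLike, CatalogLike, ContextRetriever and createKeywordRetriever are local stand-ins invented for this example so it runs without any @backstage packages; the real CatalogApi.getEntities takes request and options arguments that are omitted here.

```typescript
// Local stand-ins so the sketch runs without @backstage packages.
interface EntityLike {
  kind: string;
  metadata: { name: string };
}
interface CatalogLike {
  getEntities(): Promise<{ items: EntityLike[] }>;
}
interface ScoredEntityLike {
  entity: EntityLike;
  score: number;
}

// The seam: anything with this method can feed the query service.
interface ContextRetriever {
  retrieve(question: string): Promise<ScoredEntityLike[]>;
}

function createKeywordRetriever(
  catalog: CatalogLike,
  limit: number,
  score: (entity: EntityLike, tokens: string[]) => number,
): ContextRetriever {
  return {
    async retrieve(question: string): Promise<ScoredEntityLike[]> {
      // Naive tokenisation, enough for the sketch.
      const tokens = question.toLowerCase().split(/\W+/).filter(Boolean);
      const { items } = await catalog.getEntities();
      return items
        .map(entity => ({ entity, score: score(entity, tokens) }))
        .filter(scored => scored.score > 0)   // drop entities with no overlap
        .sort((a, b) => b.score - a.score)    // best match first
        .slice(0, limit);
    },
  };
}
```

A later embeddings-based retriever would only need to return the same ScoredEntityLike[] shape from retrieve; the caller never learns how the ranking was produced.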

The prompt

Two pieces: a tight system prompt that constrains the model, and a user prompt that dumps the retrieved entities.

private static readonly SYSTEM_PROMPT = `You answer questions about a Backstage software catalog.
Use only the catalog entities provided in the user message as your source of
truth. If the entities do not contain the answer, say so plainly — do not
fabricate ownership, dependencies, or relationships.

When you cite an entity, refer to it by its entity reference
(e.g. "component:default/payments-api"). Be concise.`;

The user prompt summarises each retrieved entity into a compact text block:

function summarize(entity: Entity): string {
  const spec = (entity.spec ?? {}) as Record<string, unknown>;
  const lines = [
    `kind: ${entity.kind}`,
    `name: ${entity.metadata.name}`,
  ];
  if (spec.owner)     lines.push(`owner: ${spec.owner}`);
  if (spec.lifecycle) lines.push(`lifecycle: ${spec.lifecycle}`);
  if (spec.system)    lines.push(`system: ${spec.system}`);
  if (Array.isArray(spec.dependsOn) && spec.dependsOn.length > 0) {
    lines.push(`dependsOn: ${spec.dependsOn.join(', ')}`);
  }
  // ... providesApis, consumesApis, tags, etc.
  return lines.join('\n');
}

I deliberately do not dump raw JSON. The line-prefixed key: value format is cheaper in tokens, easier for the model to scan, and easier to extend.
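The post does not show how the summaries are assembled into the user prompt. A minimal sketch, assuming each block is prefixed with its entity ref and the question goes last on a "Question:" line (which is what the test later in this post asserts on). The PromptEntity shape and the buildPromptSketch name are hypothetical.

```typescript
// Hypothetical input shape: an entity ref plus the text block that a
// function like summarize() would produce for that entity.
interface PromptEntity {
  ref: string;      // e.g. "component:default/payments-api"
  summary: string;  // line-prefixed "key: value" block
}

function buildPromptSketch(question: string, entities: PromptEntity[]): string {
  // One block per entity, led by its ref so the model can cite it verbatim.
  const blocks = entities.map(e => `ref: ${e.ref}\n${e.summary}`);
  // Say so explicitly when retrieval found nothing, rather than sending an
  // empty context the model might try to fill in.
  const context = blocks.length > 0 ? blocks.join('\n\n') : '(no matching entities)';
  return `Catalog entities:\n\n${context}\n\nQuestion: ${question}`;
}
```

Putting the ref on the first line of every block is what lets the system prompt's citation instruction work: the model only ever sees refs that exist.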

Why the LLM call looks like it does

This is the bit that matters for the future:

import { createAnthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';

const anthropic = createAnthropic({ apiKey });
const model = anthropic('claude-sonnet-4-6');

const { text } = await generateText({
  model,
  system: SYSTEM_PROMPT,
  prompt: buildPrompt(question, scored),
  maxOutputTokens: 1024,
});

That’s the Vercel AI SDK shape. Three reasons:

It’s a stable contract. The LanguageModelV3 interface (defined in @ai-sdk/provider) is the same regardless of which provider you swap in. OpenAI, Anthropic, Bedrock, anything that ships an @ai-sdk/* package — same generateText({ model, system, prompt }) call.
Backstage is heading there. BEP-0015 (AI Model Provider Service) is the proposed Backstage-wide abstraction, and it’s built on this same SDK. When the service lands, swapping the createAnthropic model construction for provider.getLanguageModelFactory()(modelId) is mechanical: the generateText call and the prompt-construction code stay as they are.

It’s testable without API keys. The generateText function is passed in as a dependency (GenerateTextFn), so tests stub it:

const generateText = jest.fn().mockResolvedValue({
  text: 'group:platform owns payments-api',
});
const svc = new QueryService(retriever, 'mock-model', generateText, logger, 256);

No network, no flaky tests, no ANTHROPIC_API_KEY in CI.
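The post names GenerateTextFn but does not show how it is defined or consumed. Here is a minimal sketch of the injection pattern, using a factory function instead of the post's QueryService class to keep it short. The GenerateTextFn shape, QueryDeps and createQueryService are assumptions for illustration, and the system prompt is abbreviated.

```typescript
// Only the fields this sketch uses; the real generateText from the `ai`
// package accepts many more options.
type GenerateTextFn = (args: {
  model: unknown;
  system: string;
  prompt: string;
  maxOutputTokens: number;
}) => Promise<{ text: string }>;

interface RetrievedContext {
  ref: string;      // entity ref, reused as the citation
  summary: string;  // "key: value" block for the prompt
}

interface QueryDeps {
  retrieve: (question: string) => Promise<RetrievedContext[]>;
  model: unknown;
  generateText: GenerateTextFn;
  maxOutputTokens: number;
}

function createQueryService(deps: QueryDeps) {
  return {
    async query(question: string): Promise<{ answer: string; citations: string[] }> {
      const context = await deps.retrieve(question);
      const prompt =
        context.map(c => `ref: ${c.ref}\n${c.summary}`).join('\n\n') +
        `\n\nQuestion: ${question}`;
      const { text } = await deps.generateText({
        model: deps.model,
        system: 'Use only the catalog entities provided.',
        prompt,
        maxOutputTokens: deps.maxOutputTokens,
      });
      // Citations are the refs that were retrieved, not refs parsed from the answer.
      return { answer: text, citations: context.map(c => c.ref) };
    },
  };
}
```

Because both the retriever and the model call arrive as plain functions, a test can replace each with a stub and assert on the exact prompt that would have been sent.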

Run it on free, open-source models

That stable-contract point isn’t theoretical — it’s a cost lever. Because the provider is just an @ai-sdk/* factory, the plugin takes a provider and an optional baseURL, so you can point it at any OpenAI-API-compatible endpoint. The one I care about most: Ollama running Gemma 3 locally — free, open-weights, on your own hardware.

ollama pull gemma3

catalogAssistant:
  provider: openai
  model: gemma3                        # open weights; or gemma3:27b
  baseURL: http://localhost:11434/v1   # Ollama — $0/token, no vendor, no key
  apiKey: ollama                       # any non-empty value; Ollama ignores it

No API key, no per-token bill, no data leaving the box. Groq and OpenRouter work the same way (just a different baseURL) if you’d rather not self-host, and Google’s gemini-2.5-flash has a generous free tier.

The reason a small free model is enough here specifically: retrieval is deterministic and the prompt is already grounded in real catalog entities, so the model isn’t being asked to know anything — just to phrase an answer from the entities in front of it. That’s well within Gemma 3’s range. Reserve a frontier model for the genuinely hard cases; for “who owns payments-api,” free and local is the right default.
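How the provider and baseURL keys might turn into a model instance is not shown in the post. A minimal sketch, assuming a config shape that mirrors the YAML above and provider factories that accept apiKey and baseURL settings. AssistantConfig, ProviderFactory and resolveModel are hypothetical names, and the factories are injected so the sketch runs without the SDK installed.

```typescript
// Hypothetical config shape mirroring the YAML keys shown above.
interface AssistantConfig {
  provider: 'anthropic' | 'openai';
  model: string;
  baseURL?: string;
  apiKey?: string;
}

interface ProviderSettings {
  apiKey?: string;
  baseURL?: string;
}

// A provider factory takes settings and returns a function from model id to model.
type ProviderFactory = (settings: ProviderSettings) => (modelId: string) => unknown;

function resolveModel(
  config: AssistantConfig,
  factories: Record<AssistantConfig['provider'], ProviderFactory>,
): unknown {
  const settings: ProviderSettings = {};
  if (config.apiKey) settings.apiKey = config.apiKey;
  // Only override the endpoint when configured, so hosted providers keep
  // their default URL.
  if (config.baseURL) settings.baseURL = config.baseURL;
  return factories[config.provider](settings)(config.model);
}
```

In the plugin, the factories map would presumably hold createAnthropic and createOpenAI from the @ai-sdk/* packages; pointing at Ollama is then only a matter of the baseURL value in config.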

The router and the Express 4 async gotcha

Standard Backstage router shape — coreServices.httpAuth for credentials, coreServices.httpRouter for mounting — except for one thing:

router.post(
  '/v1/query',
  asyncHandler(async (req, res) => {
    const body = req.body as { question?: unknown } | undefined;
    if (!body || typeof body.question !== 'string') {
      throw new InputError('Request body must include a string `question`');
    }
    // ... call queryService, return JSON
  }),
);

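import type { RequestHandler } from 'express';

function asyncHandler(fn: RequestHandler): RequestHandler {
  return (req, res, next) => {
    // Forward rejections to Express error middleware instead of dropping them.
    Promise.resolve(fn(req, res, next)).catch(next);
  };
}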

Express 4 does not propagate rejected promises from async handlers to error middleware, so without the wrapper a thrown InputError never reaches the error handler and the request gets no response. The fix is a small wrapper. Tests caught this within 30 seconds; the real Backstage httpRouter service wraps everything for you in production. It is still worth keeping the wrapper because the unit test app uses a bare express instance.

What I’m deliberately not doing yet

No frontend. That’s a separate plugin (@backstage/plugin-catalog-assistant) for a chat UI. The backend is the load-bearing piece — UI follows.
No conversation memory. Each request is one-shot. Multi-turn requires a session store, which is more product than core. Worth doing only when the UI exists.
No tool use. The model can’t fetch additional entities mid-answer. Once BEP-0015 ships and exposes tool-use over the REST surface, the assistant graduates from “retrieve once, answer” to “follow relations as needed.” That’s the moment compound questions like “services tagged X that depend on Y” start working well.
No streaming. v1 returns the full response in one body. SSE / WebSocket streaming lands with the UI.

Tests as the spec

13 unit tests across three files. Two patterns worth lifting:

The retriever is tested without the catalog backend. A jest.fn() stubs getEntities and returns a list of fixture entities. The retriever’s scoring logic is exercised in isolation. No catalog server needed.
The query service is tested without the LLM. generateText is injected as a function dependency, mocked per test. Assertions look at the prompt that would have been sent to the LLM — that’s where the grounded-prompt contract lives, and it’s where regressions would show up first.

it('passes a grounded prompt to generateText and returns the answer', async () => {
  const generateText = jest.fn()
    .mockResolvedValue({ text: 'group:platform owns payments-api' });
  const svc = new QueryService(/* ... */);

  const result = await svc.query('who owns payments?');

  expect(result.citations).toEqual(['component:default/payments-api']);

  const call = generateText.mock.calls[0][0];
  expect(call.system).toMatch(/Backstage software catalog/);
  expect(call.prompt).toContain('component:default/payments-api');
  expect(call.prompt).toContain('owner: group:platform');
  expect(call.prompt).toContain('Question: who owns payments?');
});

Install

yarn --cwd packages/backend add @theplatformlog/catalog-assistant-backend

// packages/backend/src/index.ts
backend.add(import('@theplatformlog/catalog-assistant-backend'));

# app-config.yaml
catalogAssistant:
  anthropicApiKey: ${ANTHROPIC_API_KEY}

The upstream Backstage PR is #34491. On merge it’ll publish as @backstage/plugin-catalog-assistant-backend — swap then if you want to track upstream.

Code

Branch lives at Naga15/backstage feat/catalog-assistant-backend. 13 tests, lint clean. Upstream draft PR #34491 is open for review.

Where this fits

Two plugins shipped in this series so far:

scaffolder-backend-module-mcp — Backstage as MCP client for scaffolder templates.
catalog-assistant-backend — grounded LLM Q&A over the Software Catalog.

These are the producer side and the consumer side of the same idea: AI is a new modality for interacting with Backstage data. Different audiences, shared infrastructure.

Next post (the last in this series): the four-part AI plugin roadmap, where each piece slots into Backstage’s RFC track, and what the next two plugins will do.