theplatformlog

Catalog-aware LLM Q&A for Backstage (grounded, with citations)

Building catalog-assistant-backend — natural-language questions over your Software Catalog, deliberately wired for the AI Provider Service when it lands.

· Backstage AI plugins, part 3

Tags: backstage, ai, llm, claude, software-catalog, typescript

“Who owns the payments service?” “What depends on auth-db?” “Which components are tagged tier-1 and use Postgres?”

These are the questions developers ask in Slack channels every day. The answers are already in the Backstage catalog, but you need to know the entity ref, navigate to the right page, and eyeball the right tab. The catalog has the data; getting it out conversationally is on you.

LLMs are obvious for this. The risk is equally obvious: a hallucinated ownership claim, told confidently, is worse than no answer at all. In an internal developer portal — where “who owns this” determines who you wake up at 3am — that’s not a hypothetical.

This post is about building a small backend plugin that does grounded catalog Q&A: retrieval over real entities, a system prompt that constrains the model to use only what was retrieved, and entity-ref citations in the response. No conversation memory, no embeddings, no streaming yet. Just the load-bearing core.

The shape

One HTTP endpoint, two services:

[Figure: architecture diagram]

Request:

{ "question": "who owns the payments service?" }

Response:

{
  "answer": "The payments service is owned by group:default/platform.",
  "citations": [
    "component:default/payments-api",
    "api:default/payments"
  ]
}

The citations are the entity refs the LLM was given as context for that answer. The caller can verify them, link to the catalog page, or post-process to filter.
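A minimal sketch of that contract as TypeScript types, assuming the citations are built from the retrieved entities rather than parsed out of the model's text. `QueryRequest`, `QueryResponse`, `EntityLike`, `isQueryRequest`, `toRef`, and `toCitations` are illustrative names, not taken from the plugin. In a real backend, `stringifyEntityRef` from `@backstage/catalog-model` would do the formatting that `toRef` imitates here.

```typescript
// Hypothetical types for the endpoint's JSON contract (names are assumptions).
interface QueryRequest {
  question: string;
}

interface QueryResponse {
  answer: string;
  citations: string[]; // entity refs handed to the model as context
}

// Local stand-in for the parts of a catalog entity needed to form a ref.
interface EntityLike {
  kind: string;
  metadata: { name: string; namespace?: string };
}

// Narrow an untyped request body to the expected shape.
function isQueryRequest(body: unknown): body is QueryRequest {
  return (
    typeof body === 'object' &&
    body !== null &&
    typeof (body as { question?: unknown }).question === 'string'
  );
}

// Imitates the lowercased `kind:namespace/name` entity ref format.
function toRef(entity: EntityLike): string {
  const namespace = entity.metadata.namespace ?? 'default';
  return `${entity.kind}:${namespace}/${entity.metadata.name}`.toLowerCase();
}

// Citations come from what was retrieved: deduplicated, order preserved.
function toCitations(entities: EntityLike[]): string[] {
  const refs = entities.map(toRef);
  return refs.filter((ref, index) => refs.indexOf(ref) === index);
}

const response: QueryResponse = {
  answer: 'The payments service is owned by group:default/platform.',
  citations: toCitations([
    { kind: 'Component', metadata: { name: 'payments-api' } },
    { kind: 'API', metadata: { name: 'payments' } },
  ]),
};
```

Deriving citations from the retrieval step means every cited ref is guaranteed to exist in the catalog, even if the model's prose mentions something else.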

Why keyword retrieval (for now)

The seductive choice is to throw embeddings at this — vectorise every entity, do cosine similarity against the question embedding, retrieve top-K. That’s the right answer eventually. It’s the wrong answer for v1 because:

So the first cut is deterministic and dependency-free:

function scoreEntity(entity: Entity, tokens: string[]): number {
  const corpus = [
    { text: entity.metadata.name,                      weight: 4 },
    { text: entity.metadata.title ?? '',               weight: 3 },
    { text: entity.metadata.description ?? '',         weight: 2 },
    { text: entity.kind,                               weight: 1 },
    { text: String(entity.spec?.type ?? ''),           weight: 1 },
    { text: (entity.metadata.tags ?? []).join(' '),    weight: 2 },
  ];

  let total = 0;
  for (const token of tokens) {
    for (const { text, weight } of corpus) {
      if (!text) continue;
      const haystack = text.toLowerCase();
      if (haystack === token) {
        total += weight * 3;          // exact match
      } else if (haystack.split(/[\s-]+/).includes(token)) {
        total += weight * 2;          // whole-word match inside a string
      } else if (haystack.includes(token)) {
        total += weight;              // substring match
      }
    }
  }
  return total;
}
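`scoreEntity` compares each token against a lowercased haystack, so the tokens it receives need to be lowercase already. The tokenizer is not shown in this post; the following is a sketch of one way to produce them. The `tokenize` name, the stopword list, and the choice to keep hyphenated terms such as `tier-1` whole are all assumptions.

```typescript
// Hypothetical stopword list; the plugin's actual list (if any) is not shown.
const STOPWORDS = [
  'a', 'an', 'and', 'are', 'is', 'of', 'on', 'the', 'to', 'what', 'which', 'who',
];

// Lowercase the question, pull out alphanumeric words (keeping inner hyphens),
// then drop stopwords and duplicates.
function tokenize(question: string): string[] {
  const words: string[] =
    question.toLowerCase().match(/[a-z0-9]+(?:-[a-z0-9]+)*/g) ?? [];
  const kept = words.filter(word => STOPWORDS.indexOf(word) === -1);
  return kept.filter((word, index) => kept.indexOf(word) === index);
}

const tokens = tokenize('Which components are tagged tier-1 and use Postgres?');
// tokens is ['components', 'tagged', 'tier-1', 'use', 'postgres']
```

One limitation carries over to scoring: a plural token such as `components` will not match the kind `component`, because the substring check runs from haystack to token and there is no stemming.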

The retriever is a class with a tiny interface:

export class CatalogContextRetriever {
  constructor(
    private readonly catalog: Pick<CatalogApi, 'getEntities'>,
    private readonly limit: number,
  ) {}

  async retrieve(question, options): Promise<ScoredEntity[]> { ... }
}

That interface is the seam. When embeddings land, a SemanticCatalogContextRetriever implements the same contract and the call site doesn’t change.
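To make the seam concrete, here is a sketch of the contract written as an explicit interface, with a keyword implementation behind it. Everything in it is an assumption made for illustration: the `ContextRetriever`, `KeywordRetriever`, `rankEntities`, and `Scorer` names, the `{ entity, score }` shape of `ScoredEntity`, the whitespace-only tokenization, and the omission of the `options` argument. The local `EntityLike` and `CatalogLike` types stand in for the Backstage `Entity` and `CatalogApi` types so the block runs on its own.

```typescript
// Minimal local stand-ins for the Backstage catalog types.
interface EntityLike {
  kind: string;
  metadata: { name: string };
}
interface ScoredEntity {
  entity: EntityLike;
  score: number;
}
interface CatalogLike {
  getEntities(): Promise<{ items: EntityLike[] }>;
}

// The contract a keyword retriever and a future semantic one would both satisfy.
interface ContextRetriever {
  retrieve(question: string): Promise<ScoredEntity[]>;
}

type Scorer = (entity: EntityLike, tokens: string[]) => number;

// Score every entity, drop non-matches, sort best-first, keep the top `limit`.
function rankEntities(
  entities: EntityLike[],
  tokens: string[],
  limit: number,
  score: Scorer,
): ScoredEntity[] {
  return entities
    .map(entity => ({ entity, score: score(entity, tokens) }))
    .filter(scored => scored.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

class KeywordRetriever implements ContextRetriever {
  constructor(
    private readonly catalog: CatalogLike,
    private readonly limit: number,
    private readonly score: Scorer,
  ) {}

  async retrieve(question: string): Promise<ScoredEntity[]> {
    const { items } = await this.catalog.getEntities();
    const tokens = question.toLowerCase().split(/\s+/).filter(Boolean);
    return rankEntities(items, tokens, this.limit, this.score);
  }
}

// Call sites depend only on the interface, so the implementation can be swapped.
const retriever: ContextRetriever = new KeywordRetriever(
  { getEntities: async () => ({ items: [{ kind: 'Component', metadata: { name: 'payments' } }] }) },
  5,
  (entity, tokens) => (tokens.indexOf(entity.metadata.name) === -1 ? 0 : 1),
);
```

Keeping the ranking in a pure function means the top-K logic can be tested without any catalog stub at all.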

The prompt

Two pieces: a tight system prompt that constrains the model, and a user prompt that dumps the retrieved entities.

private static readonly SYSTEM_PROMPT = `You answer questions about a Backstage software catalog.
Use only the catalog entities provided in the user message as your source of
truth. If the entities do not contain the answer, say so plainly — do not
fabricate ownership, dependencies, or relationships.

When you cite an entity, refer to it by its entity reference
(e.g. "component:default/payments-api"). Be concise.`;

The user prompt summarises each retrieved entity into a compact text block:

function summarize(entity: Entity): string {
  const spec = (entity.spec ?? {}) as Record<string, unknown>;
  const lines = [
    `kind: ${entity.kind}`,
    `name: ${entity.metadata.name}`,
  ];
  if (spec.owner)     lines.push(`owner: ${spec.owner}`);
  if (spec.lifecycle) lines.push(`lifecycle: ${spec.lifecycle}`);
  if (spec.system)    lines.push(`system: ${spec.system}`);
  if (Array.isArray(spec.dependsOn) && spec.dependsOn.length > 0) {
    lines.push(`dependsOn: ${spec.dependsOn.join(', ')}`);
  }
  // ... providesApis, consumesApis, tags, etc.
  return lines.join('\n');
}

I deliberately do not dump raw JSON. The line-prefixed key: value format is cheaper in tokens, easier for the model to scan, and easier to extend.
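A sketch of how the summaries might be assembled into the user prompt. The real `buildPrompt` is not shown in this post, so the layout below is an assumption, as are the `PromptEntity` and `buildPromptSketch` names. The only details taken from the post are that the prompt includes each entity ref, the `key: value` summary lines, and a `Question:` line.

```typescript
// Hypothetical input shape: an entity ref plus its already-summarised block.
interface PromptEntity {
  ref: string; // e.g. "component:default/payments-api"
  summary: string; // output of a summarize()-style function
}

// Lay out one block per entity, separated by blank lines, then the question.
function buildPromptSketch(question: string, entities: PromptEntity[]): string {
  const context =
    entities.length === 0
      ? '(no matching catalog entities were found)'
      : entities.map(e => `ref: ${e.ref}\n${e.summary}`).join('\n\n');
  return ['Catalog entities:', context, '', `Question: ${question}`].join('\n');
}

const prompt = buildPromptSketch('who owns payments?', [
  {
    ref: 'component:default/payments-api',
    summary: 'kind: Component\nname: payments-api\nowner: group:platform',
  },
]);
```

Stating explicitly that nothing was retrieved gives the model something concrete to act on when the system prompt tells it to say so plainly.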

Why the LLM call looks like it does

This is the bit that matters for the future:

import { createAnthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';

const anthropic = createAnthropic({ apiKey });
const model = anthropic('claude-sonnet-4-6');

const { text } = await generateText({
  model,
  system: SYSTEM_PROMPT,
  prompt: buildPrompt(question, scored),
  maxOutputTokens: 1024,
});

That’s the Vercel AI SDK shape. Three reasons:

  1. It’s a stable contract. The LanguageModelV3 interface (defined in @ai-sdk/provider) is the same regardless of which provider you swap in. OpenAI, Anthropic, Bedrock, anything that ships an @ai-sdk/* package — same generateText({ model, system, prompt }) call.

  2. Backstage is heading there. BEP-0015 (AI Model Provider Service) is the proposed Backstage-wide abstraction, and it’s built on this same SDK. When the service lands, swapping generateText for provider.getLanguageModelFactory()(modelId) is mechanical. No prompt-construction code changes.

  3. It’s testable without API keys. The generateText function is passed in as a dependency (GenerateTextFn), so tests stub it:

    const generateText = jest.fn().mockResolvedValue({
      text: 'group:platform owns payments-api',
    });
    const svc = new QueryService(retriever, 'mock-model', generateText, logger, 256);

    No network, no flaky tests, no ANTHROPIC_API_KEY in CI.
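A dependency-free sketch of the injection pattern from point 3. The `GenerateTextFn` shape here is an assumption covering only the options used in this post, and `AnswerService` is a hypothetical, narrower stand-in for `QueryService`. The model is typed as `unknown` to keep the block self-contained; in the plugin it would be the SDK's model type.

```typescript
// Assumed shape of the injected function: the subset of generateText's
// options and result that this sketch relies on.
type GenerateTextFn = (options: {
  model: unknown; // stand-in; the real type would come from the AI SDK
  system: string;
  prompt: string;
  maxOutputTokens?: number;
}) => Promise<{ text: string }>;

// Hypothetical service: owns the model and token budget, not the SDK import.
class AnswerService {
  constructor(
    private readonly model: unknown,
    private readonly generateText: GenerateTextFn,
    private readonly maxOutputTokens: number,
  ) {}

  async answer(system: string, prompt: string): Promise<string> {
    const { text } = await this.generateText({
      model: this.model,
      system,
      prompt,
      maxOutputTokens: this.maxOutputTokens,
    });
    return text.trim();
  }
}

// A hand-rolled stub standing in for jest.fn(); it records what it was sent.
const calls: Array<{ system: string; prompt: string }> = [];
const stub: GenerateTextFn = async ({ system, prompt }) => {
  calls.push({ system, prompt });
  return { text: ' group:platform owns payments-api ' };
};

const service = new AnswerService('mock-model', stub, 256);
```

Because the service only sees a function, swapping the SDK call for a provider-service call later touches the wiring, not the class.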

The router and the Express 4 async gotcha

Standard Backstage router shape — coreServices.httpAuth for credentials, coreServices.httpRouter for mounting — except for one thing:

router.post(
  '/v1/query',
  asyncHandler(async (req, res) => {
    const body = req.body as { question?: unknown } | undefined;
    if (!body || typeof body.question !== 'string') {
      throw new InputError('Request body must include a string `question`');
    }
    // ... call queryService, return JSON
  }),
);

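// Request, Response, NextFunction and RequestHandler are types from 'express'.
// Forwards a rejected promise from an async handler to the error middleware.
function asyncHandler(
  fn: (req: Request, res: Response, next: NextFunction) => Promise<unknown>,
): RequestHandler {
  return (req, res, next) => {
    Promise.resolve(fn(req, res, next)).catch(next);
  };
}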

Express 4 does not propagate rejected promises from async handlers to error middleware. The fix is a small wrapper that catches the rejection and hands it to next. Tests caught this within 30 seconds; the real Backstage httpRouter service wraps everything for you in production. It is still worth keeping the wrapper, because the unit test app uses a bare Express instance.
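To see the wrapper's effect without starting a server, the sketch below calls a wrapped handler directly and collects whatever reaches `next`. The `Next` and `AsyncHandler` types are simplified stand-ins for Express's own types, used only so the block runs without the `express` package.

```typescript
// Minimal stand-ins for Express's types so the sketch has no dependencies.
type Next = (err?: unknown) => void;
type AsyncHandler<Req, Res> = (req: Req, res: Res, next: Next) => Promise<unknown>;

// Same idea as the wrapper above: forward any rejection to `next`.
function asyncHandler<Req, Res>(fn: AsyncHandler<Req, Res>) {
  return (req: Req, res: Res, next: Next): void => {
    Promise.resolve(fn(req, res, next)).catch(next);
  };
}

// A handler that rejects, as the InputError throw in the route would.
const failing = asyncHandler(async () => {
  throw new Error('Request body must include a string `question`');
});

// Collect whatever reaches `next`, the way error middleware would receive it.
const forwarded: unknown[] = [];
failing({}, {}, err => {
  forwarded.push(err);
});
```

The rejection arrives asynchronously, so `forwarded` is still empty on the line after the call and is populated once pending promise callbacks have run. Without the wrapper, the same handler would produce an unhandled rejection and `next` would never be called.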

What I’m deliberately not doing yet

Tests as the spec

13 unit tests across three files. Two patterns worth lifting:

  1. The retriever is tested without the catalog backend. A jest.fn() stubs getEntities and returns a list of fixture entities. The retriever’s scoring logic is exercised in isolation. No catalog server needed.

  2. The query service is tested without the LLM. generateText is injected as a function dependency, mocked per test. Assertions look at the prompt that would have been sent to the LLM — that’s where the grounded-prompt contract lives, and it’s where regressions would show up first.

it('passes a grounded prompt to generateText and returns the answer', async () => {
  const generateText = jest.fn()
    .mockResolvedValue({ text: 'group:platform owns payments-api' });
  const svc = new QueryService(/* ... */);

  const result = await svc.query('who owns payments?');

  expect(result.citations).toEqual(['component:default/payments-api']);

  const call = generateText.mock.calls[0][0];
  expect(call.system).toMatch(/Backstage software catalog/);
  expect(call.prompt).toContain('component:default/payments-api');
  expect(call.prompt).toContain('owner: group:platform');
  expect(call.prompt).toContain('Question: who owns payments?');
});

Code

Branch lives at Naga15/backstage feat/catalog-assistant-backend. 13 tests, lint clean. Not PR’d upstream yet — the right time is after BEP-0015 lands, so the plugin can ship referencing the official provider service instead of @ai-sdk/anthropic directly.

Where this fits

Two plugins shipped in this series so far:

These are the producer-side and consumer-side of the same idea: AI is a new modality for interacting with Backstage data. Different audiences, shared infrastructure.

Next post (the last in this series): the four-part AI plugin roadmap, where each piece slots into Backstage’s RFC track, and what the next two plugins will do.