Evals — regression-test skills with deterministic LLM stubs

Tweaking a PromptSkill, adjusting agent instructions, or adding a tool can silently break cases that used to work. Evals freeze the expected behaviour in a spec and run on every build, so a regression surfaces in the build that introduced it.

Design

defineEval(spec): declare an eval.
runEvals(specs, { llm }): run a batch and return an EvalReport.
MockLLM: deterministic LLM provider; give it an array of responses, which it returns in order, or a function that computes each response.
mockToolCall(tool, args): produce a CompleteResult that emits a tool call.

Evals use a trace pattern: while running the skill, every LLM call, tool call, subtitle, surface, and error is recorded; assertions run against the trace.
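
To make the trace pattern concrete, here is a minimal sketch of a recorder that appends to a trace instead of driving a real UI. It is not the library's implementation: SketchTrace, createSketchTrace, createRecorder, and the hook names are hypothetical, and the rule that a show_subtitle call also lands in subtitles is an assumption.

```typescript
// Hypothetical trace shape, mirroring the fields this page lists for EvalTrace.
interface SketchTrace {
  text: string;
  toolCalls: Array<{ tool: string; args: Record<string, unknown> }>;
  subtitles: string[];
  surfaces: unknown[];
  errors: string[];
  llmCalls: number;
}

function createSketchTrace(): SketchTrace {
  return { text: '', toolCalls: [], subtitles: [], surfaces: [], errors: [], llmCalls: 0 };
}

// Hypothetical recorder: every hook only appends to the trace.
function createRecorder(trace: SketchTrace) {
  return {
    onLlmCall(content: string): void {
      trace.llmCalls += 1;
      trace.text += content;
    },
    onToolCall(tool: string, args: Record<string, unknown>): void {
      trace.toolCalls.push({ tool, args });
      // Assumption: the `text` argument of show_subtitle is also kept as a subtitle.
      const text = args['text'];
      if (tool === 'show_subtitle' && typeof text === 'string') {
        trace.subtitles.push(text);
      }
    },
    onError(err: unknown): void {
      trace.errors.push(err instanceof Error ? err.message : String(err));
    },
  };
}

// Simulate one skill run; assertions would then read demoTrace.
const demoTrace = createSketchTrace();
const demoRecorder = createRecorder(demoTrace);
demoRecorder.onLlmCall('Hello');
demoRecorder.onToolCall('show_subtitle', { text: 'Translation complete: Hello' });
```

Because the skill only ever talks to the recorder, the same run can be asserted on after the fact without a browser or a real LLM.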

EvalSpec

import { defineEval, runEvals, MockLLM, mockToolCall } from '@perhapxin/dddk';
import type { EvalSpec, EvalAssertion, EvalTrace } from '@perhapxin/dddk';

interface EvalSpec {
  name: string;
  skill: Skill;
  userInput: string;
  vars?: Record<string, string>;
  assertions: EvalAssertion[];
  skip?: string;
}

Assertion kind	Passes when
`{ kind: 'includes', substring }`	`trace.text` or any subtitle contains the substring
`{ kind: 'matches', pattern: RegExp }`	`trace.text` or any subtitle matches the regex
`{ kind: 'callsTool', tool }`	A call to `tool` appears in `trace.toolCalls`
`{ kind: 'doesNotCallTool', tool }`	No call to `tool` appears in `trace.toolCalls`
`{ kind: 'meetsCriteria', description, check: (trace) => boolean }`	The custom predicate `check` returns `true` for the trace

EvalTrace exposes:

interface EvalTrace {
  text: string;
  toolCalls: Array<{ tool: string; args: Record<string, unknown> }>;
  subtitles: string[];
  surfaces: unknown[];
  errors: string[];
  llmCalls: number;
}
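
The assertion kinds can be read as small predicates over the trace. The sketch below shows one way such a checker might look, including a custom meetsCriteria predicate. TraceLike, AssertionLike, and checkAssertion are hypothetical local names, not exports of the package, and the runner's real matching rules may differ.

```typescript
// Local copies of the shapes described above, so the sketch is self-contained.
interface TraceLike {
  text: string;
  toolCalls: Array<{ tool: string; args: Record<string, unknown> }>;
  subtitles: string[];
  surfaces: unknown[];
  errors: string[];
  llmCalls: number;
}

type AssertionLike =
  | { kind: 'includes'; substring: string }
  | { kind: 'matches'; pattern: RegExp }
  | { kind: 'callsTool'; tool: string }
  | { kind: 'doesNotCallTool'; tool: string }
  | { kind: 'meetsCriteria'; description: string; check: (trace: TraceLike) => boolean };

// Hypothetical checker: returns true when the assertion passes.
function checkAssertion(trace: TraceLike, a: AssertionLike): boolean {
  const haystacks = [trace.text, ...trace.subtitles];
  switch (a.kind) {
    case 'includes': {
      const needle = a.substring;
      return haystacks.some((s) => s.includes(needle));
    }
    case 'matches': {
      const pattern = a.pattern;
      // Reset lastIndex so a /g or /y regex carries no state between strings.
      return haystacks.some((s) => {
        pattern.lastIndex = 0;
        return pattern.test(s);
      });
    }
    case 'callsTool': {
      const wanted = a.tool;
      return trace.toolCalls.some((c) => c.tool === wanted);
    }
    case 'doesNotCallTool': {
      const unwanted = a.tool;
      return !trace.toolCalls.some((c) => c.tool === unwanted);
    }
    case 'meetsCriteria':
      return a.check(trace);
  }
}

const sampleTrace: TraceLike = {
  text: 'Hello',
  toolCalls: [{ tool: 'show_subtitle', args: { text: 'Translation complete: Hello' } }],
  subtitles: ['Translation complete: Hello'],
  surfaces: [],
  errors: [],
  llmCalls: 1,
};

// A custom predicate for a property the built-in kinds do not cover.
const singleCleanCall: AssertionLike = {
  kind: 'meetsCriteria',
  description: 'makes exactly one LLM call and records no errors',
  check: (t) => t.llmCalls === 1 && t.errors.length === 0,
};
```

A meetsCriteria check is the escape hatch for anything that depends on several trace fields at once, such as call counts combined with error state.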

Example: three evals guarding one skill

Suppose we have a translate PromptSkill:

import type { PromptSkill } from '@perhapxin/dddk';

const translate: PromptSkill = {
  id: 'translate',
  type: 'prompt',
  name: 'Translate',
  prompt: 'Translate the user message into {{language}}. When done, call show_subtitle to inform the user that the translation is complete. If the user message is empty, call show_subtitle to ask what they want translated.',
};

Three evals pin that behaviour down:

const evals: EvalSpec[] = [
  defineEval({
    name: 'translate: empty input prompts the user',
    skill: translate,
    userInput: '',
    vars: { language: 'English' },
    assertions: [
      { kind: 'callsTool', tool: 'show_subtitle' },
      { kind: 'includes',  substring: 'translate' },
    ],
  }),

  defineEval({
    name: 'translate: normal input emits "done" subtitle',
    skill: translate,
    userInput: '你好',
    vars: { language: 'English' },
    assertions: [
      { kind: 'callsTool', tool: 'show_subtitle' },
      { kind: 'includes',  substring: 'complete' },
    ],
  }),

  defineEval({
    name: 'translate: should not invoke other tools',
    skill: translate,
    userInput: '你好',
    vars: { language: 'English' },
    assertions: [
      { kind: 'doesNotCallTool', tool: 'navigate' },
      { kind: 'doesNotCallTool', tool: 'agent' },
    ],
  }),
];

Run them:

const llm = new MockLLM({
  responses: [
    // eval #1: empty input → prompt
    { ...mockToolCall('show_subtitle', { text: 'Please type what to translate' }), content: '' },
    // eval #2: "你好" translated to English → done subtitle
    { ...mockToolCall('show_subtitle', { text: 'Translation complete: Hello' }), content: 'Hello' },
    // eval #3: same as #2
    { ...mockToolCall('show_subtitle', { text: 'Translation complete: Hello' }), content: 'Hello' },
  ],
});

const report = await runEvals(evals, { llm, verbose: true });
console.log(`${report.passed}/${report.total} passed`);
if (report.failures > 0) process.exit(1);

MockLLM

new MockLLM({
  responses: [
    'plain text',                                       // → { content: 'plain text', finishReason: 'stop' }
    { content: 'hi', usage: { input: 10, output: 2 } }, // partial CompleteResult
    mockToolCall('search_catalog', { query: 'shirt' }), // a tool call
  ],
});

Functions are also supported for conditional responses:

new MockLLM({
  responses: (opts, callIndex) => {
    const lastUserMsg = opts.messages.findLast((m) => m.role === 'user')?.content ?? '';
    if (typeof lastUserMsg === 'string' && lastUserMsg.includes('refund')) {
      return mockToolCall('qa_lookup', { question: lastUserMsg });
    }
    return { content: `OK ${callIndex}` };
  },
});

In array mode the last entry repeats: once the array is exhausted, every further call returns the last item.
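
The two response modes and the repeat-last rule can be sketched as follows. SketchMockLLM is a hypothetical stand-in, not the real MockLLM: its result shape is simplified, its function mode receives only the call index, and complete() is synchronous to keep the example short.

```typescript
// Hypothetical, simplified result shape; the real CompleteResult may carry more fields.
interface SketchResult {
  content: string;
  finishReason?: string;
}
type SketchResponse = string | SketchResult;
type SketchResponses = SketchResponse[] | ((callIndex: number) => SketchResponse);

class SketchMockLLM {
  private callIndex = 0;
  private readonly responses: SketchResponses;

  constructor(responses: SketchResponses) {
    this.responses = responses;
  }

  complete(): SketchResult {
    const i = this.callIndex++;
    const source = this.responses;
    // Array mode clamps the index, so the last entry repeats once the array is exhausted.
    const raw = typeof source === 'function' ? source(i) : source[Math.min(i, source.length - 1)];
    if (raw === undefined) {
      throw new Error('SketchMockLLM needs at least one response');
    }
    // Plain strings are promoted to a full result, as in the array example above.
    return typeof raw === 'string' ? { content: raw, finishReason: 'stop' } : raw;
  }
}

// Three calls against a two-item array: the third call repeats the last item.
const arrayMock = new SketchMockLLM(['first', { content: 'last' }]);
const seenFromArray = [
  arrayMock.complete().content,
  arrayMock.complete().content,
  arrayMock.complete().content,
];

// Function mode computes the response per call.
const fnMock = new SketchMockLLM((callIndex) => `OK ${callIndex}`);
const seenFromFn = [fnMock.complete().content, fnMock.complete().content];
```

One consequence of the repeat-last rule is that a single-item array behaves like a constant response, which is convenient when every eval in a batch expects the same output.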

Runner

const report = await runEvals(evals, {
  llm,                          // required
  toolMocks: {                  // optional: mock ctx.llm and other ambient helpers
    llm: (prompt) => `mocked llm: ${prompt}`,
  },
  verbose: true,                // default true; prints each result
});

interface EvalReport {
  total: number;
  passed: number;
  failed: number;
  skipped: number;
  failures: number;             // alias of failed; convenient for CI
  results: EvalResult[];        // each EvalResult carries trace + failures[]
}

The `failures` alias exists for CI ergonomics: `if (report.failures > 0) process.exit(1)` reads more naturally than the same check written against `failed`.
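
For CI logs it can help to flatten the report into a few lines before exiting. The sketch below assumes a result shape the page does not spell out: the `name` field on each result and the use of plain strings in `failures[]` are hypothetical, as are ReportLike, ResultLike, and summarizeReport.

```typescript
// Assumed shapes: the page only says each EvalResult carries a trace and failures[].
interface ResultLike {
  name: string; // hypothetical field
  failures: string[]; // assumed to be human-readable messages
}
interface ReportLike {
  total: number;
  passed: number;
  failed: number;
  skipped: number;
  failures: number;
  results: ResultLike[];
}

// Builds one headline plus one line per failed assertion.
function summarizeReport(rep: ReportLike): string[] {
  const lines = [`${rep.passed}/${rep.total} passed, ${rep.failed} failed, ${rep.skipped} skipped`];
  for (const result of rep.results) {
    for (const failure of result.failures) {
      lines.push(`FAIL ${result.name}: ${failure}`);
    }
  }
  return lines;
}

const sampleReport: ReportLike = {
  total: 3,
  passed: 1,
  failed: 1,
  skipped: 1,
  failures: 1,
  results: [
    { name: 'empty input', failures: [] },
    { name: 'normal input', failures: ['missing substring "complete"'] },
    { name: 'not implemented yet', failures: [] },
  ],
};

const summaryLines = summarizeReport(sampleReport);
```

In a CI script the lines would be printed before the `report.failures > 0` check, so the log shows which assertion broke and not only the exit code.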

How each skill type is driven

skill.type	What runEvals does
`script`	No LLM call; collects each step's `subtitle` into the trace.
`prompt`	Substitutes `vars` into `prompt`, then makes one `llm.complete({ system, user: userInput })` call.
`action`	Calls `handler(stubCtx)`; `palette.replace` / `navigate` / `agent` calls are captured as `trace.toolCalls`.
`surface`	Calls `build(stubCtx)`; the result lands in `trace.surfaces[]`.
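
The substitution step for prompt skills can be pictured with a small helper. renderPrompt is a hypothetical name, and leaving unknown placeholders untouched is an assumption of this sketch; the runner may treat a missing variable differently.

```typescript
// Hypothetical helper: replaces each {{key}} in the template with vars[key].
// Assumption: placeholders without a matching var are left as-is, so they are easy to spot.
function renderPrompt(template: string, vars: Record<string, string> = {}): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, key: string) => {
    // Own-property check avoids picking up inherited keys such as "constructor".
    const value = Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : undefined;
    return value ?? whole;
  });
}

const renderedSystem = renderPrompt('Translate the user message into {{language}}.', {
  language: 'English',
});
```

With the translate skill above, `vars: { language: 'English' }` would turn the `{{language}}` placeholder into English before the single LLM call is made.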

PanelSkill isn't driven directly by the runner — its lifecycle (onEnter → onInput → onAction) needs a host-supplied test harness.
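
A host-side harness for that lifecycle might look like the sketch below. Only the three hook names come from the text above. PanelLike, PanelStep, drivePanel, and the hook signatures are hypothetical and will need adapting to the real PanelSkill type.

```typescript
// Hypothetical panel shape: parameters and return types are assumptions of this sketch.
interface PanelLike<S> {
  onEnter(): S;
  onInput(state: S, input: string): S;
  onAction(state: S, action: string): S;
}

type PanelStep = { input: string } | { action: string };

// Drives onEnter once, then each step in order, recording every intermediate state.
function drivePanel<S>(panel: PanelLike<S>, steps: PanelStep[]): S[] {
  let state = panel.onEnter();
  const states: S[] = [state];
  for (const step of steps) {
    state = 'input' in step ? panel.onInput(state, step.input) : panel.onAction(state, step.action);
    states.push(state);
  }
  return states;
}

// Toy panel used only to exercise the harness.
const searchPanel: PanelLike<{ query: string; confirmed: boolean }> = {
  onEnter: () => ({ query: '', confirmed: false }),
  onInput: (s, input) => ({ ...s, query: input }),
  onAction: (s, action) => ({ ...s, confirmed: action === 'confirm' }),
};

const panelStates = drivePanel(searchPanel, [{ input: 'shirt' }, { action: 'confirm' }]);
```

Keeping the recorded states in an array lets an ordinary test framework assert on each transition, which is the role the trace plays for the skill types the runner does drive.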

Skip

defineEval({ name: 'not implemented yet', skill, userInput: '...', assertions: [], skip: 'API not deployed' });

Skipped evals don't run and don't fail, but they still appear in the report, counted under skipped. The skip string records the reason.

Eval vs integration test

Use eval	Use integration test
Skill / prompt / tool behaviour	Full UI flow (palette open, key press, render)
Pure LLM-driven decisions	DOM interaction, CSS, routing
Want many cases in seconds	A few happy paths over minutes
Run without a real LLM key (CI)	Run end-to-end fixtures
Mock deterministic input → output	Real user agent

You need both — evals catch "how the LLM uses tools," integration tests catch browser behaviour.